Model Selection

In document Doctor of Philosophy (Page 61-65)

One of the main motivations of this thesis is to perform a multi-scenario analysis on the experimentally available binned data, in order to obtain a data-based selection of a best case along with a ranking and weighting of the remaining cases. For this purpose, different models can be compared with regard to their model fit by computing a ∆χ2 test. This test allows one to decide whether a given model fits significantly better or worse than a competing model.

2.5.1 ∆χ2 test

A ∆χ2 test is useful when the competing models are nested. Two models are considered “nested” if one is a subset or extension of the other, i.e. one of the models could be obtained by fixing or eliminating parameters in the other model. When the model with the fewer free parameters (the null model, in many cases) is true, and when certain regularity conditions are satisfied, Wilks’ theorem [137] states that the difference ∆χ2 should follow a χ2 distribution with a number of degrees of freedom equal to the difference in the number of free parameters between the two models. This lets one compute a p-value and compare it to a critical value to decide whether to reject the null model in favor of the alternative model.
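The p-value computation described above can be sketched as follows. This is an illustrative example, not code from the thesis; the closed-form survival functions below cover only 1 and 2 degrees of freedom (the general case needs the regularized incomplete gamma function), and the ∆χ2 value is a made-up number.

```python
import math

def chi2_sf(x, dof):
    """Survival function P(X > x) for a chi-squared variable.

    Closed forms exist for dof = 1 and dof = 2; larger integer dof
    would need the incomplete gamma function.
    """
    if dof == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if dof == 2:
        return math.exp(-x / 2.0)
    raise NotImplementedError("only dof = 1 or 2 in this sketch")

# Hypothetical Delta chi^2 between two nested fits; the alternative
# model has one extra free parameter, so dof = 1 by Wilks' theorem.
delta_chi2 = 6.25
p_value = chi2_sf(delta_chi2, dof=1)
print(round(p_value, 4))
```

A p-value below the chosen critical value (e.g. 0.05) would lead one to reject the null model in favor of the alternative.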

Despite the simplicity of this technique, one should remember that, unlike the Akaike information criterion (AICc) or the Schwarz-Bayesian criterion (BIC) [138], which incorporate the concept of parsimony and can be applied to nested as well as non-nested models, the ∆χ2 test can only be applied to nested models.

One of the most powerful and reliable methods for model comparison is cross validation. The most straightforward (and also most expensive) flavor of cross validation is leave-one-out cross validation (LOOCV). It tests the predictive power of the model while keeping bias and variance under control together. In LOOCV, one data point is left out and the model is fitted to the rest of the sample (the training set). The fitted model is then used to compute the predicted residual for the left-out data point. This process is repeated for all data points, yielding a mean-squared error (MSE). For model selection, this MSE is minimized. Unfortunately, LOOCV is computationally very expensive, so we need to find another reasonable method for model selection.
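The LOOCV procedure above can be sketched for a toy model. The straight-line model, the closed-form fit, and the data are all illustrative assumptions, not anything used in the thesis; the point is only the leave-one-out loop itself.

```python
import statistics

def fit_line(xs, ys):
    """Closed-form least-squares fit of y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def loocv_mse(xs, ys):
    """Leave-one-out cross-validation MSE for the straight-line model."""
    sq_residuals = []
    for i in range(len(xs)):
        # Fit on all points except the i-th (the training set) ...
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        # ... and record the squared predicted residual for the left-out point.
        sq_residuals.append((ys[i] - (a + b * xs[i])) ** 2)
    return statistics.fmean(sq_residuals)

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 2.1, 3.9, 6.2, 7.9, 10.1]   # hypothetical data, roughly y = 2x
print(loocv_mse(xs, ys))
```

Competing models would each get their own LOOCV MSE, and the model with the smallest MSE would be selected.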

To that end, we make use of information-theoretic approaches, in particular the second-order Akaike information criterion (AICc), in the analysis of empirical data. It has been shown that LOOCV is asymptotically equivalent to minimizing the AIC [139].

2.5.2 Introduction to AICc

The ‘concept of parsimony’ [140] dictates that a model representing the truth should be obtained with “... the smallest possible number of parameters for adequate representation of the data.” In general, bias decreases and variance increases as the dimension of the model increases. Often, the number of parameters in a model is used as a measure of the degree of structure inferred from the data. The fit of any model can be improved by increasing the number of parameters. Parsimonious models achieve a proper trade-off between bias and variance. All model selection methods are based to some extent on the principle of parsimony [141].

In information theory, the Kullback-Leibler (K-L) information or measure I(f, g) denotes the information lost when g is used to approximate f. Here f denotes full reality or truth, and g denotes an approximating model in terms of a probability distribution. I(f, g) can also be defined between the ‘best’ approximating model and a competing one. Akaike, in his seminal paper [142], proposed the use of the K-L information as a fundamental basis for model selection. However, the K-L distance cannot be computed without full knowledge of both f (full reality) and the parameters (Θ) in each of the candidate models gi(x|Θ) (a model gi with parameter set Θ explaining data x). Akaike found a rigorous way to estimate the K-L information based on the empirical log-likelihood function at its maximum point.

‘Akaike’s information criterion’ (AIC) with respect to our analysis can be defined as

AIC = χ²_min + 2K    (2.15)

where K is the number of estimable parameters. In application, one computes AIC for each of the candidate models and selects the model with the smallest value of AIC. It is this model that is estimated to be “closest” to the unknown reality that generated the data, from among the candidate models considered.
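This selection rule can be sketched directly. The candidate names and the χ²_min and K values below are purely hypothetical placeholders, not results from this analysis.

```python
# Hypothetical chi^2_min values and parameter counts K for three candidate fits.
candidates = {
    "SM only":        {"chi2_min": 58.3, "K": 2},
    "SM + one NP op": {"chi2_min": 49.1, "K": 3},
    "SM + two NP op": {"chi2_min": 48.8, "K": 4},
}

# AIC = chi^2_min + 2K for each candidate (Eq. 2.15).
aic = {name: m["chi2_min"] + 2 * m["K"] for name, m in candidates.items()}

# The model with the smallest AIC is estimated to be closest to reality.
best = min(aic, key=aic.get)
print(best, round(aic[best], 1))
```

Note that the second NP operator lowers χ²_min only slightly, so the 2K penalty makes the one-operator model the preferred one here.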

Although Akaike derived an estimator of the K-L information, AIC may perform poorly if there are too many parameters relative to the size of the sample. Sugiura [143] derived a second-order variant of AIC,

AICc = χ²_min + 2K + 2K(K + 1)/(n − K − 1)    (2.16)

where n is the sample size. As a rule of thumb, the use of AICc is preferred in the literature when n/K < 40. Various other such information criteria have been defined since, e.g. QAIC, QAICc, TIC, etc. In this analysis, we consistently use AICc.
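A minimal sketch of the AICc formula and its small-sample behavior, using made-up numbers rather than values from this analysis:

```python
def aicc(chi2_min, k, n):
    """Second-order AIC (Eq. 2.16): chi^2_min + 2K + 2K(K+1)/(n - K - 1)."""
    return chi2_min + 2 * k + 2 * k * (k + 1) / (n - k - 1)

# With a small sample (here n/K = 20/3 < 40) the correction term matters:
# 49.1 + 6 + 24/16 = 56.6
print(round(aicc(49.1, 3, 20), 2))
```

As n grows, the correction term vanishes and AICc reduces to the plain AIC of Eq. (2.15).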

Whereas AICc values are on a relative (or interval) scale and are strongly dependent on sample size, simple differences of AICc values (∆AIC_i = AIC_c,i − AIC_c,min) allow estimates of the relative expected K-L differences between f and gi(x|Θ). This allows a quick comparison and ranking of candidate models. The model estimated to be best has ∆AIC_i ≡ ∆AIC_min = 0. The larger ∆AIC_i is, the less plausible it is that the fitted model gi(x|Θ) is the K-L best model, given the data x. Table 2.1 lists rough rule-of-thumb values of ∆AIC_i for the analysis of nested models.

∆AIC_i    Level of Empirical Support for Model i
0−2       Substantial
4−7       Considerably Less
>10       Essentially None

Table 2.1: Rough rule-of-thumb values of ∆AIC_i for the analysis of nested models.
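The ∆AIC_i ranking can be sketched as follows; the three model labels and AICc values are hypothetical, chosen only so that each row of Table 2.1 is illustrated.

```python
# Hypothetical AICc values for three candidate models.
aicc_values = {"A": 55.3, "B": 60.5, "C": 67.7}

# Delta_i = AICc_i - AICc_min; the best model has Delta = 0 by construction.
aicc_min = min(aicc_values.values())
delta = {name: v - aicc_min for name, v in aicc_values.items()}

# Ranking by increasing Delta_i.
ranking = sorted(delta, key=delta.get)
print(ranking, {n: round(d, 1) for n, d in delta.items()})
```

By the rule of thumb in Table 2.1, model A (∆ = 0) has substantial support, B (∆ = 5.2) considerably less, and C (∆ = 12.4) essentially none.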

While the ∆AIC_i are useful in ranking the models, it is also possible to quantify the plausibility of each model being the actual K-L best model. This can be done by extending the concept of the likelihood of the parameters given both the data and the model, i.e. L(Θ|x, gi), to the concept of the likelihood of the model given the data, hence L(gi|x):

L(gi|x) ∝ e^(−∆AIC_i/2).    (2.17)

Such likelihoods represent the relative strength of evidence for each model [144].

To better interpret the relative likelihood of a model, given the data and the set of R models, we normalize the L(gi|x) to obtain a set of positive Akaike weights w_i adding up to 1:

w_i = e^(−∆AIC_i/2) / Σ_{r=1}^{R} e^(−∆AIC_r/2)    (2.18)

A given w_i is considered the weight of evidence in favor of model i being the actual K-L best model for the situation at hand, given that one of the R models must be the K-L best model of that set. The w_i depend on the entire set; therefore, if a model is added or dropped during a post hoc analysis, the w_i must be recomputed for all models in the newly defined set.
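The weight computation of Eq. (2.18), including the recomputation required when the model set changes, can be sketched with hypothetical AICc values:

```python
import math

def akaike_weights(aicc_values):
    """Normalize the relative likelihoods exp(-Delta_i/2) of Eq. (2.17)
    into Akaike weights summing to 1 (Eq. 2.18)."""
    amin = min(aicc_values)
    rel_lik = [math.exp(-(a - amin) / 2.0) for a in aicc_values]
    total = sum(rel_lik)
    return [lik / total for lik in rel_lik]

# Hypothetical AICc values for a set of R = 3 models.
weights = akaike_weights([55.3, 60.5, 67.7])
print([round(w, 3) for w in weights])

# Dropping a model changes every weight, so they must all be recomputed:
print([round(w, 3) for w in akaike_weights([55.3, 60.5])])
```

The second call shows why the weights cannot be reused after a post hoc change of the model set: the best model's weight shifts once the third model is removed.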
