Lecture 12: Model selection
Institute for Theoretical Physics, Heidelberg University
Please complete by Jan 31, 2024.
So far, we’ve talked about building and parametrizing models.
Say we have two candidate models: which one is better?
Hypothesis testing
Two hypotheses or models: the null hypothesis \(M_0\) and the alternative \(M_1\).
Which one is more likely to be true?
Pick the alternative hypothesis, \(M_1\), iff \[ p(M_1 | \mathcal{D}) > p(M_0 | \mathcal{D}), \quad \Leftrightarrow \frac{p(M_1 | \mathcal{D})}{p(M_0 | \mathcal{D})} > 1 \]
Using a uniform prior, \(p(M_0) = p(M_1) = 0.5\), we have (from Bayes' theorem): \[ \text{pick} \ M_1 \ \text{iff} \ \frac{p(\mathcal{D} | M_1)}{p(\mathcal{D} | M_0)} > 1 \]
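Spelling out the intermediate step, Bayes' theorem gives \[ \frac{p(M_1 | \mathcal{D})}{p(M_0 | \mathcal{D})} = \frac{p(\mathcal{D} | M_1)\, p(M_1) / p(\mathcal{D})}{p(\mathcal{D} | M_0)\, p(M_0) / p(\mathcal{D})} = \frac{p(\mathcal{D} | M_1)}{p(\mathcal{D} | M_0)} \cdot \frac{p(M_1)}{p(M_0)} \] and the prior odds \(p(M_1)/p(M_0)\) equal 1 under the uniform prior.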
This quantity is the ratio of the marginal likelihoods, also known as the Bayes factor: \[ \boxed{ B_{1,0} = \frac{p(\mathcal{D} | M_1)}{p(\mathcal{D} | M_0)} } \]
If \(B_{1,0} > 1\), then we prefer model 1, otherwise we prefer model 0.
Example: two outcomes (e.g., coin flips) with a Bernoulli likelihood and conjugate prior \(\text{Beta}(\alpha_1, \alpha_0)\) (i.e., the beta-Bernoulli model). Under \(M_0\), \(\theta\) is fixed to \(\frac 12\); under \(M_1\), \(\theta\) is free with the Beta prior.
The marginal likelihood under \(M_0\) is simple because \(\theta\) is fixed at \(\frac 12\) (a point prior), so the integral over \(\theta\) collapses to a single term: \[ p(\mathcal{D} | M_0) = \left(\frac 12\right)^{N_1}\left(1-\frac 12\right)^{N-N_1}= \left( \frac 12 \right)^N \] where \(N_1\) counts one of the two outcomes and \(N\) is the total number of observations.
For \(M_1\) it is a bit more involved; with the \(\text{Beta}(\alpha_1, \alpha_0)\) prior the integral has a closed form in terms of the beta function \(B(\cdot,\cdot)\), with \(N_0 = N - N_1\): \[ p(\mathcal{D} | M_1) = \int {\rm d}\theta \, p(\mathcal{D} | \theta) p(\theta) = \frac{B(\alpha_1 + N_1, \alpha_0 + N_0)}{B(\alpha_1, \alpha_0)} \]
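As a quick numerical check, here is a minimal Python sketch (the helper name `log_bayes_factor` and the example counts are our own choices) that evaluates this Bayes factor via `scipy.special.betaln`:

```python
import numpy as np
from scipy.special import betaln  # log of the beta function B(a, b)

def log_bayes_factor(N1, N0, alpha1=1.0, alpha0=1.0):
    """Log Bayes factor B_{1,0} for the beta-Bernoulli example.

    M0: theta fixed at 1/2.  M1: theta ~ Beta(alpha1, alpha0).
    N1, N0: counts of the two outcomes (N = N1 + N0).
    """
    N = N1 + N0
    log_ml_m0 = N * np.log(0.5)                                    # (1/2)^N
    log_ml_m1 = betaln(alpha1 + N1, alpha0 + N0) - betaln(alpha1, alpha0)
    return log_ml_m1 - log_ml_m0

# 9 vs 1 outcomes: the data look unbalanced, so M1 is preferred (B_{1,0} > 1)
print(np.exp(log_bayes_factor(N1=9, N0=1)))   # ~9.3
```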
If all settings of \(\pmb \theta\) assign high probability to the data, this is probably a good model.
[Figure: 5 data points. Green: true function; red + blue: prediction. \(m_\text{MAP}=1\).]
[Figure: 30 data points. Green: true function; red + blue: prediction. \(m_\text{MAP}=2\).]
Occam’s razor
Consider two models: a simple one, \(m_1\), and a more complex one, \(m_2\) (e.g., with more parameters).
Both explain the data, i.e., \(p(\mathcal{D} | \pmb{\hat \theta}_1, m_1)\) and \(p(\mathcal{D} | \pmb{\hat \theta}_2, m_2)\) are both large.
Intuitively we should prefer \(m_1\), since it is simpler and explains the data just as well as \(m_2\).
Compare their marginal likelihoods. The complex model must spread its prior probability mass thinly over a large parameter volume, so the “good” parameter settings that explain the data (i.e., have high likelihood) make up only a small proportion of it. The simpler model, on the other hand, has fewer parameters, so its prior is concentrated over a smaller volume and a larger proportion of its parameter settings are “good”.
Marginal likelihood for model \(m\), written in sequential form \[ p(\mathcal{D} | m) = \prod_{n=1}^N p(y_n | y_{1:n-1}, \pmb x_{1:N}, m) = \prod_{n=1}^N p(y_n | \pmb x_n , \mathcal{D}_{1:n-1}, m) \]
We write the individual terms \[ p(y_n | \pmb x_n , \mathcal{D}_{1:n-1}, m) = \int {\rm d}\pmb \theta \, p(y_n | \pmb x_n, \pmb \theta) p(\pmb \theta | \mathcal{D}_{1:n-1}, m) \] and use a plug-in approximation (i.e., replace the posterior by a point mass at its empirical estimate) \[ \begin{align*} p(y_n | \pmb x_n , \mathcal{D}_{1:n-1}, m) &\approx \int {\rm d}\pmb \theta \, p(y_n | \pmb x_n, \pmb \theta)\, \delta ( \pmb \theta - \pmb{\hat\theta}_m(\mathcal{D}_{1:n-1})) \\ &= p(y_n | \pmb x_n, \pmb{\hat\theta}_m(\mathcal{D}_{1:n-1})) \end{align*} \]
Then we get \[ \log p(\mathcal{D} | m) \approx \sum_{n=1}^N \log p(y_n | \pmb x_n, \pmb{\hat \theta}_m(\mathcal{D}_{1:n-1})) \] which is similar to a leave-one-out cross-validation estimate of the likelihood!
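A small sketch of this sequential plug-in estimate for the Bernoulli model from earlier (the toy data are our own; we use the smoothed posterior-mean plug-in \(\hat\theta = (N_1+\alpha_1)/(N+\alpha_1+\alpha_0)\) rather than the raw MLE so that the first prediction is well defined):

```python
import numpy as np

def sequential_log_marglik(x, alpha1=1.0, alpha0=1.0):
    """Plug-in (sequential) estimate of log p(D | m) for a Bernoulli model.

    Each observation x_n in {0, 1} is predicted with a point estimate of
    theta fitted only on x_{1:n-1} (posterior mean under Beta(alpha1, alpha0)).
    """
    log_p = 0.0
    n1 = n0 = 0
    for xn in x:
        theta_hat = (n1 + alpha1) / (n1 + n0 + alpha1 + alpha0)
        log_p += xn * np.log(theta_hat) + (1 - xn) * np.log(1 - theta_hat)
        n1 += xn
        n0 += 1 - xn
    return log_p

x = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 1])
print(sequential_log_marglik(x))
```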
Intuition: an overly complex model will overfit the “early” examples and then predict the remaining ones poorly, and thus also get a low cross-validation score.
So far we’ve mostly discussed the marginal likelihood \[ p(\mathcal{D} | m) = \int {\rm d}\pmb \theta\, p(\mathcal{D} | \pmb \theta, m) p(\pmb \theta) \]
which can be difficult to compute, and quite sensitive to the choice of prior.
Define the energy \(\varepsilon(\pmb\theta) = -\log p(\pmb\theta, \mathcal{D})\), i.e., the negative log joint, and Taylor expand it around the mode \(\pmb{\hat \theta}\) (the lowest-energy state): \[ \varepsilon(\pmb \theta) \approx \varepsilon(\pmb{\hat \theta}) + {\color{grey}(\pmb \theta - \pmb{\hat\theta})^\intercal \pmb g } + \frac 12 (\pmb \theta - \pmb{\hat\theta})^\intercal {\bf H}(\pmb \theta - \pmb{\hat\theta}) \] Here \(\pmb g\) is the gradient, which vanishes at the mode (hence greyed out), and \({\bf H}\) is the Hessian of the negative log joint, \(-\log p(\pmb \theta, \mathcal{D})\). This yields approximations to the joint distribution and the posterior (the Laplace approximation): \[ \begin{align*} \hat p(\pmb \theta, \mathcal{D}) &= {\rm e}^{-\varepsilon(\pmb{\hat\theta})} \exp\left[- \frac 12 (\pmb \theta - \pmb{\hat\theta})^\intercal {\bf H}(\pmb \theta - \pmb{\hat\theta})\right] \\ \hat p(\pmb \theta | \mathcal{D}) &= \frac 1Z \hat p(\pmb \theta, \mathcal{D}) = \mathcal{N}(\pmb \theta | \pmb{\hat\theta}, {\bf H}^{-1}) \end{align*} \] which means that the posterior is approximated by a multivariate Gaussian.
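Integrating this Gaussian over \(\pmb\theta\) (a standard multivariate Gaussian integral) gives the normalization constant, i.e., the marginal likelihood itself: \[ Z = p(\mathcal{D}) = \int {\rm d}\pmb\theta \, \hat p(\pmb \theta, \mathcal{D}) = {\rm e}^{-\varepsilon(\pmb{\hat\theta})} (2\pi)^{k/2} \vert {\bf H}\vert^{-1/2}, \quad \text{i.e.} \quad \log p(\mathcal{D}) \approx -\varepsilon(\pmb{\hat\theta}) + \frac k2 \log 2\pi - \frac 12 \log \vert {\bf H}\vert \] with \(k = \dim(\pmb\theta)\). Since \(-\varepsilon(\pmb{\hat\theta}) = \log p(\mathcal{D} | \pmb{\hat\theta}) + \log p(\pmb{\hat\theta})\), this is the expression used below.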
Making use of the quadratic approximation to the posterior, \(\hat p(\pmb\theta | \mathcal{D}) = \mathcal{N}(\pmb{\theta} | \hat{\pmb{\theta}}_\text{map}, \textbf{H}^{-1})\), and dropping the constant \(\frac k2 \log 2\pi\) term, we obtain \[ \log p(\mathcal{D} | m) \approx \log p(\mathcal{D} | \pmb{\hat\theta}_\text{map}) + \log p(\pmb{\hat \theta}_\text{map}) - \frac 12 \log \vert {\bf H} \vert \]
Assuming a uniform prior, the MAP estimate coincides with the MLE, \(\pmb{\hat \theta}\), and the prior term is a constant we can drop: \[ \log p(\mathcal{D} | m) \approx \log p(\mathcal{D} | \pmb{\hat\theta}) - \frac 12 \log \vert {\bf H}\vert \]
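As a sanity check, here is a minimal sketch (the counts `N1, N0`, the helper `energy`, and the finite-difference Hessian are our own illustrative choices) comparing this Laplace estimate of \(\log p(\mathcal{D} | m)\) with the exact beta-Bernoulli marginal likelihood from earlier, keeping the \(\frac 12 \log 2\pi\) constant:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import betaln

# Bernoulli data with a flat Beta(1, 1) prior on theta, so the exact
# log marginal likelihood is available for comparison: betaln(1 + N1, 1 + N0).
N1, N0 = 7, 3

def energy(theta):
    # negative log joint -log p(D, theta); log p(theta) = 0 on (0, 1)
    return -(N1 * np.log(theta) + N0 * np.log(1 - theta))

# mode of the posterior (here equal to the MLE, since the prior is flat)
theta_hat = minimize_scalar(energy, bounds=(1e-6, 1 - 1e-6), method="bounded").x

# Hessian (here a scalar) of the energy via central finite differences
eps = 1e-4
H = (energy(theta_hat + eps) - 2 * energy(theta_hat) + energy(theta_hat - eps)) / eps**2

# Laplace: log p(D) ≈ -energy(theta_hat) + (k/2) log 2*pi - (1/2) log |H|, with k = 1
log_Z_laplace = -energy(theta_hat) + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(H)
print(log_Z_laplace, betaln(1 + N1, 1 + N0))   # ~ -7.12 vs ~ -7.19
```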
Approximating \(\log \vert {\bf H}\vert\)
Without knowing much about \({\bf H}\) (which encodes the volume of the posterior distribution), we can argue that \(\log \vert {\bf H}\vert \approx \text{const} + k \log N\), where \(k = \text{dim}(\pmb\theta)\) is the number of model parameters and \(N\) is the number of datapoints.
Write \({\bf H} = \sum_{i=1}^N {\bf H}_i\), where \({\bf H}_i = -\nabla \nabla \log p(\mathcal{D}_i | \pmb{\theta})\) is the contribution of the \(i\)-th datapoint (with a uniform prior, the negative log joint is a sum over datapoints). Approximate each \({\bf H}_i\) by a fixed matrix \(\hat {\bf H}\). Then we have \[ \log \vert {\bf H} \vert = \log \vert N\hat{\bf H}\vert = \log(N^k \vert \hat {\bf H}\vert ) = k \log N + \log \vert \hat{\bf H}\vert \] We can drop the \(\log \vert \hat{\bf H}\vert\) term, because it does not depend on \(N\) (it will be overwhelmed by the likelihood term).
We define the BIC (Bayesian information criterion) loss \[ \mathcal{L}_\text{BIC}(m) = -2 \log p(\mathcal{D} | \pmb{\hat\theta}, m) + k \log N \] and pick the model with the smallest \(\mathcal{L}_\text{BIC}\).
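A small sketch of BIC-based model selection, echoing the polynomial-regression figures above (the quadratic true function, noise level, and helper `bic_polynomial` are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data from a quadratic true function with Gaussian noise
N = 30
x = rng.uniform(-1, 1, N)
y = 1.0 - 2.0 * x + 1.5 * x**2 + rng.normal(0.0, 0.2, N)

def bic_polynomial(x, y, degree):
    """BIC loss for a polynomial fit of given degree with Gaussian noise."""
    X = np.vander(x, degree + 1)                 # design matrix [x^d, ..., x, 1]
    w, *_ = np.linalg.lstsq(X, y, rcond=None)    # MLE of the weights
    sigma2 = np.mean((y - X @ w) ** 2)           # MLE of the noise variance
    k = degree + 2                               # weights + noise variance
    log_lik = -0.5 * len(y) * (np.log(2 * np.pi * sigma2) + 1)
    return -2.0 * log_lik + k * np.log(len(y))

for m in range(5):
    print(m, round(bic_polynomial(x, y, m), 1))  # BIC is typically smallest at m = 2
```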
Recall early stopping in lecture 8