Computational Statistics & Data Analysis (MVComp2)

Lecture 12: Model selection

Tristan Bereau

Institute for Theoretical Physics, Heidelberg University

Course evaluation

Please complete by Jan 31, 2024.

https://uni-heidelberg.evasys.de/evasys/online.php?p=DTHA4

Introduction

Literature

Murphy (2022)
Chapter 5 on Decision Theory
Mehta et al. (2019)
Several figures on bias-variance tradeoff

Recap from last time

Kernel methods
Nonparametric methods do not assume a fixed parametric form; instead, they rely on similarity with training points
Kernel trick
Input features are implicitly compared in higher-dimensional (Hilbert) feature space
Gaussian process regression
Posterior distribution. Extends kernel ridge regression to yield prediction errors
Support vector machines
Classification (following large margin principle), but also regression, to make predictions with fewer data points (support vectors)

Decision Theory

Motivation

So far, we’ve talked about building and parametrizing models.

Say we have two candidate models, which one is better?

Bayesian hypothesis testing

Bayes factor

Hypothesis testing

Two hypotheses or models

  1. Null hypothesis, \(M_0\)
  2. Alternative hypothesis, \(M_1\)

Which one is more likely to be true?

Optimal decision

Pick the alternative hypothesis, \(M_1\), iff \[ p(M_1 | \mathcal{D}) > p(M_0 | \mathcal{D}) \quad \Leftrightarrow \quad \frac{p(M_1 | \mathcal{D})}{p(M_0 | \mathcal{D})} > 1 \]

Uniform prior

Using a uniform prior, \(p(M_0) = p(M_1) = 0.5\), we have (from Bayes' theorem): \[ \text{pick} \ M_1 \ \text{iff} \ \frac{p(\mathcal{D} | M_1)}{p(\mathcal{D} | M_0)} > 1 \]

Bayes factor

This quantity is the ratio of the marginal likelihoods, also known as the Bayes factor: \[ \boxed{ B_{1,0} = \frac{p(\mathcal{D} | M_1)}{p(\mathcal{D} | M_0)} } \]

Jeffreys scale of evidence

If \(B_{1,0} > 1\) we prefer model 1, otherwise we prefer model 0. Jeffreys proposed a finer scale for interpreting the strength of the evidence:

  • \(B_{1,0} > 100\): decisive evidence for \(M_1\)
  • \(10 < B_{1,0} < 100\): strong evidence for \(M_1\)
  • \(3 < B_{1,0} < 10\): substantial evidence for \(M_1\)
  • \(1 < B_{1,0} < 3\): barely worth mentioning

with the reciprocal ranges favoring \(M_0\).

Example: Testing if a coin is fair

Observations
A coin is tossed \(N\) times. It results in \(N_1\) heads and \(N_0 = N-N_1\) tails. We want to decide if the data was generated by a fair coin, \(\theta = 0.5\), or a biased coin, \(\theta \in [0, 1]\).
  • Fair coin: model \(M_0\)
  • Biased coin: model \(M_1\)

Two outcomes: Bernoulli likelihood and conjugate prior \(\text{Beta}(\alpha_1, \alpha_0)\) (i.e., the beta-Bernoulli model).

The marginal likelihood under \(M_0\) is simple, because there is no free parameter to integrate over (\(\theta\) is fixed at \(\frac 12\)): \[ p(\mathcal{D} | M_0) = \left(\frac 12\right)^{N_1}\left(1-\frac 12\right)^{N-N_1}= \left( \frac 12 \right)^N \]

For \(M_1\) it’s a bit more complicated: \[ p(\mathcal{D} | M_1) = \int {\rm d}\theta \, p(\mathcal{D} | \theta) p(\theta) = \frac{B(\alpha_1 + N_1, \alpha_0 + N_0)}{B(\alpha_1, \alpha_0)} \] where \(B(\cdot, \cdot)\) denotes the beta function (not to be confused with the Bayes factor).

Example: Testing if a coin is fair

Experiment
\(N=5\) and a uniform prior, \(\alpha_1 = \alpha_0 = 1\).
Fair coin
\(p(\mathcal{D} | M_0) = (\frac 12)^5 = 0.03125\).
Biased coin
\(p(\mathcal{D} | M_1)\) depends on the number of heads, \(N_1\) (see the results below).

Results

  • 0 or 5 Heads: \(B_{1,0} \approx 5.3\)
  • 1 or 4 Heads: \(B_{1,0} \approx 1.1\)
  • 2 or 3 Heads: \(B_{1,0} \approx 0.5\)

Interpretation
We would probably choose \(M_0\) for 2 or 3 heads and \(M_1\) for 0 or 5 heads; for 1 or 4 heads the evidence is too weak to decide (barely worth mentioning on the Jeffreys scale).
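As a sanity check, a minimal sketch in Python that evaluates both marginal likelihoods in log space and reproduces the Bayes factors above (the function name and the toy loop are illustrative, not from the lecture):

```python
import numpy as np
from scipy.special import betaln  # log of the beta function

def bayes_factor(N1, N0, alpha1=1.0, alpha0=1.0):
    """Bayes factor B_{1,0} for the beta-Bernoulli coin-tossing model."""
    N = N1 + N0
    log_m0 = N * np.log(0.5)  # log p(D|M0) = N log(1/2)
    # log p(D|M1) = log B(alpha1+N1, alpha0+N0) - log B(alpha1, alpha0)
    log_m1 = betaln(alpha1 + N1, alpha0 + N0) - betaln(alpha1, alpha0)
    return np.exp(log_m1 - log_m0)

for N1 in range(6):  # N = 5 tosses, uniform Beta(1, 1) prior
    print(N1, "heads:", round(bayes_factor(N1, 5 - N1), 2))
# 0 heads: 5.33, 1: 1.07, 2: 0.53, 3: 0.53, 4: 1.07, 5: 5.33
```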

Bayesian model selection

Model selection
Consider a set of models \(\mathcal{M}\). Pick the most likely!
Decision theory
The action space consists of choosing one model, \(m \in \mathcal{M}\). Pick the most probable one: \[ \hat m = \underset{m \in \mathcal{M}}{\arg \max} \, p(m | \mathcal{D}) = \underset{m \in \mathcal{M}}{\arg \max} \, \frac{p(\mathcal{D} | m)p(m)}{\sum_{m' \in \mathcal{M}} p(\mathcal{D} | m') p(m')} \] i.e., maximize the posterior over models.
MAP model for uniform priors
For uniform priors, \(p(m) = \frac 1{\vert\mathcal{M}\vert}\), the MAP estimate simplifies to \[ \hat m = \underset{m \in \mathcal{M}}{\arg \max} \, p(\mathcal{D} | m) = \underset{m \in \mathcal{M}}{\arg \max} \, \int {\rm d}\pmb \theta \, p(\mathcal{D} | \pmb \theta, m) p(\pmb \theta | m) \] where \(p(\mathcal{D} | m)\) is known as the marginal likelihood, or the evidence for model \(m\).

If all settings of \(\pmb \theta\) assign high probability to the data, this is probably a good model.
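In practice one compares log marginal likelihoods; a minimal sketch, reusing the evidences from the coin example for \(N_1 = 5\) heads:

```python
import numpy as np

# Log evidences from the coin example with N1 = 5 heads out of N = 5:
# log p(D|M0) = 5 log(1/2), log p(D|M1) = log(1/6)
log_evidence = {"M0": 5 * np.log(0.5), "M1": np.log(1 / 6)}

# Uniform prior over models: the MAP model maximizes the evidence
m_hat = max(log_evidence, key=log_evidence.get)
print(m_hat)  # "M1" -- matches the Bayes factor B_{1,0} ~ 5.3 > 1
```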

Bayesian model selection

Figure: 5 data points; green: true function; red and blue: prediction. \(m_\text{MAP}=1\)

Bayesian model selection

Figure: 30 data points; green: true function; red and blue: prediction. \(m_\text{MAP}=2\)

Occam’s razor

Occam’s razor

Consider two models

  1. Simple model, \(m_1\)
  2. Complex model, \(m_2\)

Both explain the data, i.e., \(p(\mathcal{D} | \pmb{\hat \theta}_1, m_1)\) and \(p(\mathcal{D} | \pmb{\hat \theta}_2, m_2)\) are both large.

Intuitively we should prefer \(m_1\), since it is simpler and just as good as \(m_2\).

Analyze their marginal likelihoods. The complex model spreads its prior probability mass thinly over a large parameter volume, so the “good” parameters that explain the data (i.e., yield high likelihood) represent only a small fraction of it. The simpler model, on the other hand, has fewer parameters; its prior is concentrated over a smaller volume, with a larger proportion of “good” parameters.
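This can be made concrete with the beta-Bernoulli model: a prior that spreads its mass thinly over \(\theta\) yields lower evidence than one concentrated near the data-explaining region. A minimal sketch with assumed toy counts:

```python
import numpy as np
from scipy.special import betaln

def log_evidence(N1, N0, a, b):
    """Log marginal likelihood of the beta-Bernoulli model, Beta(a, b) prior."""
    return betaln(a + N1, b + N0) - betaln(a, b)

N1, N0 = 6, 4  # data: 6 heads, 4 tails (theta near 0.5 explains it well)
print(log_evidence(N1, N0, 20, 20))  # prior concentrated near 0.5: ~ -7.00
print(log_evidence(N1, N0, 1, 1))    # diffuse prior, mass spread thinly: ~ -7.75
```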

Connection between cross-validation and marginal likelihood

Marginal likelihood for model \(m\), written in sequential form \[ p(\mathcal{D} | m) = \prod_{n=1}^N p(y_n | y_{1:n-1}, \pmb x_{1:N}, m) = \prod_{n=1}^N p(y_n | \pmb x_n , \mathcal{D}_{1:n-1}, m) \]

We write the individual terms \[ p(y_n | \pmb x_n , \mathcal{D}_{1:n-1}, m) = \int {\rm d}\pmb \theta \, p(y_n | \pmb x_n, \pmb \theta) p(\pmb \theta | \mathcal{D}_{1:n-1}, m) \] and use a plug-in approximation (i.e., replace the posterior by its empirical point estimate): \[ \begin{align*} p(y_n | \pmb x_n , \mathcal{D}_{1:n-1}, m) &\approx \int {\rm d}\pmb \theta \, p(y_n | \pmb x_n, \pmb \theta) \delta ( \pmb \theta - \pmb{\hat\theta}_m(\mathcal{D}_{1:n-1})) \\ &= p(y_n | \pmb x_n, \pmb{\hat\theta}_m(\mathcal{D}_{1:n-1})) \end{align*} \]

Then we get \[ \log p(\mathcal{D} | m) \approx \sum_{n=1}^N \log p(y_n | \pmb x_n, \pmb{\hat \theta}_m(\mathcal{D}_{1:n-1})) \] which is similar to a leave-one-out cross-validation estimate of the likelihood!

Intuition: an overly complex model will overfit the “early” examples and predict the remaining ones poorly, and will thus also get a low cross-validation score.
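A rough sketch of this sequential plug-in score for polynomial models; the model class, toy data, known noise variance, and starting index \(n_0\) are all illustrative assumptions, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 30
x = rng.uniform(-1, 1, N)
y = np.sin(np.pi * x) + 0.2 * rng.normal(size=N)  # toy ground truth + noise
sigma2 = 0.2**2   # assume the noise variance is known
n0 = 15           # common starting point so the scores are comparable

def sequential_score(degree):
    """Plug-in approximation to log p(D|m): fit the MLE on D_{1:n-1},
    then score the predictive log-likelihood of the n-th point."""
    score = 0.0
    for n in range(n0, N):
        coef = np.polyfit(x[:n], y[:n], degree)   # MLE theta_m(D_{1:n-1})
        resid = y[n] - np.polyval(coef, x[n])
        score += -0.5 * np.log(2 * np.pi * sigma2) - resid**2 / (2 * sigma2)
    return score

for m in (1, 3, 5, 10):
    print(f"degree {m}: {sequential_score(m):.1f}")
# an overly complex model typically overfits the early points and scores worse
```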

Information criteria

Motivation

So far we’ve mostly discussed the marginal likelihood \[ p(\mathcal{D} | m) = \int {\rm d}\pmb \theta\, p(\mathcal{D} | \pmb \theta, m) p(\pmb \theta) \]

which can be difficult to compute, and quite sensitive to the choice of prior.

Alternative
Related metrics for model selection, known as information criteria.

Laplace approximation to the posterior

Quadratic (Laplace) approximation
Write the posterior as follows \[ p(\pmb \theta | \mathcal{D}) = \frac 1Z {\rm e}^{-\varepsilon(\pmb \theta)} \] where \(\varepsilon(\pmb \theta) = -\log p(\pmb \theta, \mathcal{D})\) is called an energy function, and \(Z = p(\mathcal{D})\) is the normalization constant.

Now Taylor expand around the mode \(\pmb{\hat \theta}\) (i.e., the lowest-energy state) \[ \varepsilon(\pmb \theta) \approx \varepsilon(\pmb{\hat \theta}) + {\color{grey}(\pmb \theta - \pmb{\hat\theta})^\intercal \pmb g } + \frac 12 (\pmb \theta - \pmb{\hat\theta})^\intercal {\bf H}(\pmb \theta - \pmb{\hat\theta}) \] where the gradient \(\pmb g\) vanishes at the mode, and \({\bf H}\) is the Hessian of the negative log joint, \(-\log p(\pmb \theta, \mathcal{D})\). We then get for the joint distribution and the posterior: \[ \begin{align*} \hat p(\pmb \theta, \mathcal{D}) &= {\rm e}^{-\varepsilon(\pmb{\hat\theta})} \exp\left[- \frac 12 (\pmb \theta - \pmb{\hat\theta})^\intercal {\bf H}(\pmb \theta - \pmb{\hat\theta})\right] \\ \hat p(\pmb \theta | \mathcal{D}) &= \frac 1Z \hat p(\pmb \theta, \mathcal{D}) = \mathcal{N}(\pmb \theta | \pmb{\hat\theta}, {\bf H}^{-1}) \end{align*} \] i.e., the posterior is approximated by a multivariate Gaussian.
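As a 1D illustration, a minimal sketch applying the Laplace approximation to the coin-tossing posterior, where the exact answer (a Beta distribution) is known; the counts and the finite-difference step are illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Coin example: Beta(1, 1) prior, N1 = 7 heads, N0 = 3 tails
N1, N0 = 7, 3

def energy(t):
    """Energy eps(theta) = -log p(theta, D), up to an additive constant."""
    return -(N1 * np.log(t) + N0 * np.log(1 - t))

# Mode (lowest-energy state)
t_hat = minimize_scalar(energy, bounds=(1e-6, 1 - 1e-6), method="bounded").x

# Hessian of the energy at the mode, by central finite differences
h = 1e-5
H = (energy(t_hat + h) - 2 * energy(t_hat) + energy(t_hat - h)) / h**2

print(f"Laplace: N(theta | {t_hat:.3f}, {1/H:.4f})")
# mode 0.700, variance 0.0210; exact Beta(8, 4): mean 0.667, variance 0.0171
```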

Bayesian information criterion (BIC)

Bayesian information criterion (BIC)
Recall the marginal likelihood \[ p(\mathcal{D} | m) = \int {\rm d}\pmb \theta\, \underbrace{p(\mathcal{D} | \pmb \theta, m) p(\pmb \theta)}_{\propto p(\pmb \theta|\mathcal{D})} \]

Make use of the quadratic (Laplace) approximation to the posterior, \(\hat p(\pmb\theta | \mathcal{D}) = \mathcal{N}(\pmb{\theta} | \pmb{\hat\theta}_\text{map}, {\bf H}^{-1})\), when integrating (dropping constant terms): \[ \log p(\mathcal{D} | m) \approx \log p(\mathcal{D} | \pmb{\hat\theta}_\text{map}) + \log p(\pmb{\hat \theta}_\text{map}) - \frac 12 \log \vert {\bf H} \vert \]

Assume a uniform prior and replace the MAP estimate with the MLE, \(\pmb{\hat \theta}\): \[ \log p(\mathcal{D} | m) \approx \log p(\mathcal{D} | \pmb{\hat\theta}) - \frac 12 \log \vert {\bf H}\vert \]

Approximating \(\log \vert {\bf H}\vert\)

Without knowing much about \({\bf H}\) (i.e., the volume of the posterior distribution), we can argue that \(\log \vert {\bf H}\vert \approx \text{const} + k \log N\), where \(k = \text{dim}(\pmb\theta)\) is the number of model parameters and \(N\) is the number of data points.

Demonstration

\({\bf H} = \sum_{i=1}^N {\bf H}_i\), where \({\bf H}_i = -\nabla \nabla \log p(\mathcal{D}_i | \pmb{\theta})\) is the contribution of the \(i\)-th data point. Approximate each \({\bf H}_i\) by a fixed matrix \(\hat {\bf H}\). Then we have \[ \log \vert {\bf H} \vert = \log \vert N\hat{\bf H}\vert = \log(N^k \vert \hat {\bf H}\vert ) = k \log N + \log \vert \hat{\bf H}\vert \] We can drop the \(\log \vert \hat{\bf H}\vert\) term, because it does not depend on \(N\) (it will get overwhelmed by the likelihood term).

We define the BIC loss \[ \mathcal{L}_\text{BIC}(m) = -2 \log p(\mathcal{D} | \pmb{\hat\theta}, m) + k \log N \]
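A minimal sketch of selecting a polynomial degree with the BIC loss under a Gaussian likelihood; the toy data and the parameter count \(k = \text{degree} + 2\) (coefficients plus noise variance) are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 30
x = rng.uniform(-1, 1, N)
y = np.sin(np.pi * x) + 0.2 * rng.normal(size=N)  # toy data

def bic_loss(degree):
    """BIC loss for polynomial regression with a Gaussian likelihood,
    plugging in the MLE of the coefficients and the noise variance."""
    coef = np.polyfit(x, y, degree)
    resid = y - np.polyval(coef, x)
    sigma2 = np.mean(resid**2)                    # MLE of the noise variance
    loglik = -0.5 * N * (np.log(2 * np.pi * sigma2) + 1)
    k = degree + 2                                # coefficients + noise variance
    return -2 * loglik + k * np.log(N)

for m in (1, 3, 5, 10):
    print(f"degree {m}: BIC = {bic_loss(m):.1f}")
# the k log N penalty disfavors needlessly complex polynomials
```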

Bias-variance tradeoff

Bias-variance

Recall the bias-variance tradeoff
A fundamental tradeoff when picking a method for parameter estimation, assuming we wish to minimize the mean-squared error (MSE). With bias \(b = E[\hat\theta] - \theta\), \[ \begin{align} \text{MSE} &= E[(\hat\theta-\theta)^2] \\ &= E[(\hat \theta - E[\hat\theta] + E[\hat\theta] - \theta)^2] \\ &= E[(\hat\theta - E[\hat\theta])^2] + 2\underbrace{E[\hat \theta - E[\hat\theta]]}_{=0}\, b + b^2 \\ &= \text{Var}[\hat\theta] + b^2 \end{align} \]
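A quick numerical check of this decomposition on a toy estimator, a shrunk sample mean \(\hat\theta = \lambda \bar x\) (all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
theta, sigma, n, lam = 1.0, 1.0, 10, 0.8       # true parameter, noise, shrinkage

# Sample means over many repeated experiments, then shrink them
xbar = rng.normal(theta, sigma / np.sqrt(n), size=200_000)
est = lam * xbar                               # biased, lower-variance estimator

mse = np.mean((est - theta)**2)
var = np.var(est)
bias2 = (np.mean(est) - theta)**2
print(f"MSE = {mse:.4f}, Var + bias^2 = {var + bias2:.4f}")  # the two agree
```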

In-sample vs out-of-sample error

Bias-variance tradeoff and model complexity

Recall early stopping in lecture 8

Bias-variance tradeoff

Summary

Summary

Decision theory
How do we choose between multiple candidate models?
Bayes factor
Compare models according to their marginal likelihoods (Bayes factor)
Occam’s razor
When unsure, choose the simpler model (fewer parameters, fewer assumptions, etc.)
Bayesian Information Criterion (BIC)
Tradeoff between maximizing likelihood and minimizing model complexity

References

Amendola, Luca. 2021. “Lecture Notes on Statistical Methods.” https://www.thphys.uni-heidelberg.de/%7Eamendola/teaching/compstat-hd.pdf.
Mehta, Pankaj, Marin Bukov, Ching-Hao Wang, Alexandre GR Day, Clint Richardson, Charles K Fisher, and David J Schwab. 2019. “A High-Bias, Low-Variance Introduction to Machine Learning for Physicists.” Physics Reports 810: 1–124.
Murphy, Kevin P. 2022. Probabilistic Machine Learning: An Introduction. MIT press.