Lecture 12: Model selection
Institute for Theoretical Physics, Heidelberg University
Please complete by Jan 31, 2024.
So far, we’ve talked about building and parametrizing models.
Say we have two candidate models: which one is better?
Hypothesis testing
Two hypotheses or models: the null hypothesis \(M_0\) and the alternative \(M_1\).
Which one is more likely to be true?
Pick the alternative hypothesis, \(M_1\), iff \[ p(M_1 | \mathcal{D}) > p(M_0 | \mathcal{D}), \quad \Leftrightarrow \frac{p(M_1 | \mathcal{D})}{p(M_0 | \mathcal{D})} > 1 \]
Using a uniform prior, \(p(M_0) = p(M_1) = 0.5\), we have (from Bayes' theorem): \[ \text{pick} \ M_1 \ \text{iff} \ \frac{p(\mathcal{D} | M_1)}{p(\mathcal{D} | M_0)} > 1 \]
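Spelling out the intermediate step, Bayes' theorem gives \[ \frac{p(M_1 | \mathcal{D})}{p(M_0 | \mathcal{D})} = \frac{p(\mathcal{D} | M_1)\, p(M_1) / p(\mathcal{D})}{p(\mathcal{D} | M_0)\, p(M_0) / p(\mathcal{D})} = \frac{p(\mathcal{D} | M_1)}{p(\mathcal{D} | M_0)} \cdot \frac{p(M_1)}{p(M_0)} \] and the prior odds \(p(M_1)/p(M_0)\) equal 1 under the uniform prior.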
This quantity is the ratio of the marginal likelihoods, also known as the Bayes factor: \[ \boxed{ B_{1,0} = \frac{p(\mathcal{D} | M_1)}{p(\mathcal{D} | M_0)} } \]
If \(B_{1,0} > 1\), then we prefer model 1, otherwise we prefer model 0.
Example: two outcomes (e.g., coin flips) with a Bernoulli likelihood and conjugate prior \(\text{Beta}(\alpha_1, \alpha_0)\) (i.e., the beta-Bernoulli model). Under \(M_0\), \(\theta\) is fixed to \(\frac 12\); under \(M_1\), \(\theta\) is free with the Beta prior.
The marginal likelihood under \(M_0\) is simple because \(\theta\) is fixed at \(\frac 12\) (a point prior), so the integral over \(\theta\) collapses to a single term: \[ p(\mathcal{D} | M_0) = \left(\frac 12\right)^{N_1}\left(1-\frac 12\right)^{N-N_1}= \left( \frac 12 \right)^N \] where \(N_1\) counts one of the two outcomes and \(N\) is the total number of observations.
For \(M_1\) it is a bit more involved; with the \(\text{Beta}(\alpha_1, \alpha_0)\) prior the integral has a closed form in terms of the beta function \(B(\cdot,\cdot)\), with \(N_0 = N - N_1\): \[ p(\mathcal{D} | M_1) = \int {\rm d}\theta \, p(\mathcal{D} | \theta) p(\theta) = \frac{B(\alpha_1 + N_1, \alpha_0 + N_0)}{B(\alpha_1, \alpha_0)} \]
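As a quick numerical check, here is a minimal Python sketch (the helper name `log_bayes_factor` and the example counts are our own choices) that evaluates this Bayes factor via `scipy.special.betaln`:

```python
import numpy as np
from scipy.special import betaln  # log of the beta function B(a, b)

def log_bayes_factor(N1, N0, alpha1=1.0, alpha0=1.0):
    """Log Bayes factor B_{1,0} for the beta-Bernoulli example.

    M0: theta fixed at 1/2.  M1: theta ~ Beta(alpha1, alpha0).
    N1, N0: counts of the two outcomes (N = N1 + N0).
    """
    N = N1 + N0
    log_ml_m0 = N * np.log(0.5)                                    # (1/2)^N
    log_ml_m1 = betaln(alpha1 + N1, alpha0 + N0) - betaln(alpha1, alpha0)
    return log_ml_m1 - log_ml_m0

# 9 vs 1 outcomes: the data look unbalanced, so M1 is preferred (B_{1,0} > 1)
print(np.exp(log_bayes_factor(N1=9, N0=1)))   # ~9.3
```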
If all settings of \(\pmb \theta\) assign high probability to the data, this is probably a good model.
[Figure: 5 data points. Green: true function; red + blue: prediction. \(m_\text{MAP}=1\).]
[Figure: 30 data points. Green: true function; red + blue: prediction. \(m_\text{MAP}=2\).]
Occam’s razor
Consider two models: a simple one, \(m_1\), and a more complex one, \(m_2\) (e.g., with more parameters).
Both explain the data, i.e., \(p(\mathcal{D} | \pmb{\hat \theta}_1, m_1)\) and \(p(\mathcal{D} | \pmb{\hat \theta}_2, m_2)\) are both large.
Intuitively we should prefer \(m_1\), since it is simpler and explains the data just as well as \(m_2\).
Compare their marginal likelihoods. The complex model must spread its prior probability mass thinly over a large parameter volume, so the “good” parameter settings that explain the data (i.e., have high likelihood) make up only a small proportion of it. The simpler model, on the other hand, has fewer parameters, so its prior is concentrated over a smaller volume and a larger proportion of its parameter settings are “good”.
Marginal likelihood for model \(m\), written in sequential form \[ p(\mathcal{D} | m) = \prod_{n=1}^N p(y_n | y_{1:n-1}, \pmb x_{1:N}, m) = \prod_{n=1}^N p(y_n | \pmb x_n , \mathcal{D}_{1:n-1}, m) \]
We write the individual terms \[ p(y_n | \pmb x_n , \mathcal{D}_{1:n-1}, m) = \int {\rm d}\pmb \theta \, p(y_n | \pmb x_n, \pmb \theta) p(\pmb \theta | \mathcal{D}_{1:n-1}, m) \] and use a plug-in approximation (i.e., replace the posterior by a point mass at its empirical estimate) \[ \begin{align*} p(y_n | \pmb x_n , \mathcal{D}_{1:n-1}, m) &\approx \int {\rm d}\pmb \theta \, p(y_n | \pmb x_n, \pmb \theta)\, \delta ( \pmb \theta - \pmb{\hat\theta}_m(\mathcal{D}_{1:n-1})) \\ &= p(y_n | \pmb x_n, \pmb{\hat\theta}_m(\mathcal{D}_{1:n-1})) \end{align*} \]
Then we get \[ \log p(\mathcal{D} | m) \approx \sum_{n=1}^N \log p(y_n | \pmb x_n, \pmb{\hat \theta}_m(\mathcal{D}_{1:n-1})) \] which is similar to a leave-one-out cross-validation estimate of the likelihood!
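A small sketch of this sequential plug-in estimate for the Bernoulli model from earlier (the toy data are our own; we use the smoothed posterior-mean plug-in \(\hat\theta = (N_1+\alpha_1)/(N+\alpha_1+\alpha_0)\) rather than the raw MLE so that the first prediction is well defined):

```python
import numpy as np

def sequential_log_marglik(x, alpha1=1.0, alpha0=1.0):
    """Plug-in (sequential) estimate of log p(D | m) for a Bernoulli model.

    Each observation x_n in {0, 1} is predicted with a point estimate of
    theta fitted only on x_{1:n-1} (posterior mean under Beta(alpha1, alpha0)).
    """
    log_p = 0.0
    n1 = n0 = 0
    for xn in x:
        theta_hat = (n1 + alpha1) / (n1 + n0 + alpha1 + alpha0)
        log_p += xn * np.log(theta_hat) + (1 - xn) * np.log(1 - theta_hat)
        n1 += xn
        n0 += 1 - xn
    return log_p

x = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 1])
print(sequential_log_marglik(x))
```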
Intuition: an overly complex model will overfit the “early” examples and then predict the remaining ones poorly, and thus also get a low cross-validation score.
So far we’ve mostly discussed the marginal likelihood \[ p(\mathcal{D} | m) = \int {\rm d}\pmb \theta\, p(\mathcal{D} | \pmb \theta, m) p(\pmb \theta) \]
which can be difficult to compute, and quite sensitive to the choice of prior.
Define the energy \(\varepsilon(\pmb\theta) = -\log p(\pmb\theta, \mathcal{D})\), i.e., the negative log joint, and Taylor expand it around the mode \(\pmb{\hat \theta}\) (the lowest-energy state): \[ \varepsilon(\pmb \theta) \approx \varepsilon(\pmb{\hat \theta}) + {\color{grey}(\pmb \theta - \pmb{\hat\theta})^\intercal \pmb g } + \frac 12 (\pmb \theta - \pmb{\hat\theta})^\intercal {\bf H}(\pmb \theta - \pmb{\hat\theta}) \] Here \(\pmb g\) is the gradient, which vanishes at the mode (hence greyed out), and \({\bf H}\) is the Hessian of the negative log joint, \(-\log p(\pmb \theta, \mathcal{D})\). This yields approximations to the joint distribution and the posterior (the Laplace approximation): \[ \begin{align*} \hat p(\pmb \theta, \mathcal{D}) &= {\rm e}^{-\varepsilon(\pmb{\hat\theta})} \exp\left[- \frac 12 (\pmb \theta - \pmb{\hat\theta})^\intercal {\bf H}(\pmb \theta - \pmb{\hat\theta})\right] \\ \hat p(\pmb \theta | \mathcal{D}) &= \frac 1Z \hat p(\pmb \theta, \mathcal{D}) = \mathcal{N}(\pmb \theta | \pmb{\hat\theta}, {\bf H}^{-1}) \end{align*} \] which means that the posterior is approximated by a multivariate Gaussian.
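Integrating this Gaussian over \(\pmb\theta\) (a standard multivariate Gaussian integral) gives the normalization constant, i.e., the marginal likelihood itself: \[ Z = p(\mathcal{D}) = \int {\rm d}\pmb\theta \, \hat p(\pmb \theta, \mathcal{D}) = {\rm e}^{-\varepsilon(\pmb{\hat\theta})} (2\pi)^{k/2} \vert {\bf H}\vert^{-1/2}, \quad \text{i.e.} \quad \log p(\mathcal{D}) \approx -\varepsilon(\pmb{\hat\theta}) + \frac k2 \log 2\pi - \frac 12 \log \vert {\bf H}\vert \] with \(k = \dim(\pmb\theta)\). Since \(-\varepsilon(\pmb{\hat\theta}) = \log p(\mathcal{D} | \pmb{\hat\theta}) + \log p(\pmb{\hat\theta})\), this is the expression used below.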
Making use of the quadratic approximation to the posterior, \(\hat p(\pmb\theta | \mathcal{D}) = \mathcal{N}(\pmb{\theta} | \hat{\pmb{\theta}}_\text{map}, \textbf{H}^{-1})\), and dropping the constant \(\frac k2 \log 2\pi\) term, we obtain \[ \log p(\mathcal{D} | m) \approx \log p(\mathcal{D} | \pmb{\hat\theta}_\text{map}) + \log p(\pmb{\hat \theta}_\text{map}) - \frac 12 \log \vert {\bf H} \vert \]
Assuming a uniform prior, the MAP estimate coincides with the MLE, \(\pmb{\hat \theta}\), and the prior term is a constant we can drop: \[ \log p(\mathcal{D} | m) \approx \log p(\mathcal{D} | \pmb{\hat\theta}) - \frac 12 \log \vert {\bf H}\vert \]
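As a sanity check, here is a minimal sketch (the counts `N1, N0`, the helper `energy`, and the finite-difference Hessian are our own illustrative choices) comparing this Laplace estimate of \(\log p(\mathcal{D} | m)\) with the exact beta-Bernoulli marginal likelihood from earlier, keeping the \(\frac 12 \log 2\pi\) constant:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import betaln

# Bernoulli data with a flat Beta(1, 1) prior on theta, so the exact
# log marginal likelihood is available for comparison: betaln(1 + N1, 1 + N0).
N1, N0 = 7, 3

def energy(theta):
    # negative log joint -log p(D, theta); log p(theta) = 0 on (0, 1)
    return -(N1 * np.log(theta) + N0 * np.log(1 - theta))

# mode of the posterior (here equal to the MLE, since the prior is flat)
theta_hat = minimize_scalar(energy, bounds=(1e-6, 1 - 1e-6), method="bounded").x

# Hessian (here a scalar) of the energy via central finite differences
eps = 1e-4
H = (energy(theta_hat + eps) - 2 * energy(theta_hat) + energy(theta_hat - eps)) / eps**2

# Laplace: log p(D) ≈ -energy(theta_hat) + (k/2) log 2*pi - (1/2) log |H|, with k = 1
log_Z_laplace = -energy(theta_hat) + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(H)
print(log_Z_laplace, betaln(1 + N1, 1 + N0))   # ~ -7.12 vs ~ -7.19
```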
Approximating \(\log \vert {\bf H}\vert\)
Without knowing much about \({\bf H}\) (which encodes the volume of the posterior distribution), we can argue that \(\log \vert {\bf H}\vert \approx \text{const} + k \log N\), where \(k = \text{dim}(\pmb\theta)\) is the number of model parameters and \(N\) is the number of datapoints.
Write \({\bf H} = \sum_{i=1}^N {\bf H}_i\), where \({\bf H}_i = -\nabla \nabla \log p(\mathcal{D}_i | \pmb{\theta})\) is the contribution of the \(i\)-th datapoint (with a uniform prior, the negative log joint is a sum over datapoints). Approximate each \({\bf H}_i\) by a fixed matrix \(\hat {\bf H}\). Then we have \[ \log \vert {\bf H} \vert = \log \vert N\hat{\bf H}\vert = \log(N^k \vert \hat {\bf H}\vert ) = k \log N + \log \vert \hat{\bf H}\vert \] We can drop the \(\log \vert \hat{\bf H}\vert\) term, because it does not depend on \(N\) (it will be overwhelmed by the likelihood term).
We define the BIC (Bayesian information criterion) loss \[ \mathcal{L}_\text{BIC}(m) = -2 \log p(\mathcal{D} | \pmb{\hat\theta}, m) + k \log N \] and pick the model with the smallest \(\mathcal{L}_\text{BIC}\).
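A small sketch of BIC-based model selection, echoing the polynomial-regression figures above (the quadratic true function, noise level, and helper `bic_polynomial` are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data from a quadratic true function with Gaussian noise
N = 30
x = rng.uniform(-1, 1, N)
y = 1.0 - 2.0 * x + 1.5 * x**2 + rng.normal(0.0, 0.2, N)

def bic_polynomial(x, y, degree):
    """BIC loss for a polynomial fit of given degree with Gaussian noise."""
    X = np.vander(x, degree + 1)                 # design matrix [x^d, ..., x, 1]
    w, *_ = np.linalg.lstsq(X, y, rcond=None)    # MLE of the weights
    sigma2 = np.mean((y - X @ w) ** 2)           # MLE of the noise variance
    k = degree + 2                               # weights + noise variance
    log_lik = -0.5 * len(y) * (np.log(2 * np.pi * sigma2) + 1)
    return -2.0 * log_lik + k * np.log(len(y))

for m in range(5):
    print(m, round(bic_polynomial(x, y, m), 1))  # BIC is typically smallest at m = 2
```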
Recall early stopping in lecture 8