Computational Statistics & Data Analysis (MVComp2)

Lecture 8: Regularization

Tristan Bereau

Institute for Theoretical Physics, Heidelberg University

Introduction

Literature

  • Chapter 4 in Murphy (2022)

Recap from last time

Least-squares
Linear regression is linear in the parameters, not necessarily in the input features
Ordinary least squares
Linear algebra provides a closed-form solution, also for heteroskedastic problems
Goodness of fit
Measure accuracy of fit using RSS or RMSE; the coefficient of determination is an intuitive alternative
Generalized linear models
Exponential-family distributions have their expected value depend on a linear function of the inputs (through a link function); the MLE is unique

Regularization in statistics

Motivation: Generalization

Overfitting
MLE will minimize loss on the training set, but this may not result in a model with low loss on unseen data.

Example: Probability of heads when tossing a coin

Experiment
Toss it \(N=3\) times, observe 3 heads.
Maximum likelihood estimate (Bernoulli distribution)
\(\hat\theta_\text{mle} = N_1 / (N_0 + N_1) = 3 / (0+3) = 1\)
Model
Making predictions with \(\text{Ber}(y | \hat\theta_\text{mle})\) will predict that all future coin tosses are also heads!

Motivation: Generalization

What happened?

The model assumes that the empirical (training) distribution is the same as the true distribution. Putting all the probability mass on the observed set of \(N\) examples leaves no probability for novel data in the future.

Solution: Regularization

Add a penalty term to the NLL \[ \mathcal{L}({\bf\theta}; \lambda) = \left[ \frac 1N \sum_{n=1}^N \ell(y_n, {\bf \theta}; {\bf x}_n) \right] + \lambda C(\theta) \] where \(\lambda \geq 0\) is a regularization parameter, and \(C(\theta)\) is some form of complexity penalty.

Motivation: Polynomial fitting

Maximum-a-posteriori (MAP) estimation

Complexity penalty
A common choice is \(C(\theta) = -\log p(\theta)\), the negative log of a prior over the parameters.
Regularized objective
Combination of the log-likelihood and the complexity penalty \[ \begin{align*} \mathcal{L}({\bf \theta}; \lambda) &= - \frac 1N \sum_{n=1}^N \log p({\bf y}_n | {\bf x}_n, \theta) - \lambda \log p(\theta) \\ &\stackrel{\lambda=1}{=} - \left[ \log p(\mathcal{D}|{\bf \theta}) + \log p({\bf \theta}) \right] \end{align*} \] Minimizing \(\mathcal{L}\) is equivalent to maximizing the log posterior \[ \begin{align*} \hat {\bf \theta} &= \underset{\bf \theta}{\arg\max} \log p({\bf \theta} | \mathcal{D}) \\ &= \underset{\bf \theta}{\arg\max} [\log p(\mathcal{D}|{\bf \theta}) + \log p({\bf \theta}) - \text{const}] \end{align*} \]

Maximum-a-posteriori (MAP) estimation can be seen as a regularization of MLE.

Determine regularizer using a validation set

Difficulties in choosing the strength of the regularizer, \(\lambda\)

  • A small value of \(\lambda\) focuses on minimizing the NLL / empirical risk: overfitting
  • A large value of \(\lambda\) keeps the parameters close to the prior: underfitting

Split the data into two disjoint sets!

  1. Training set: \(\mathcal{D}_\text{train}\)
  2. Validation set: \(\mathcal{D}_\text{valid}\)

Define regularized empirical risk

\[ R_\lambda (\theta, \mathcal{D}) = \frac 1{\vert \mathcal{D} \vert} \sum_{({\bf x}, {\bf y}) \in \mathcal{D}} \ell({\bf y}, f({\bf x}; {\bf \theta})) + \lambda C(\theta) \]

Optimize \(\lambda\) on training, evaluate on validation

  1. Optimal parameters on training set: \[ \hat \theta_\lambda(\mathcal{D}_\text{train}) = \underset{\bf \theta}{\arg\min} R_\lambda(\theta, \mathcal{D}_\text{train}) \]
  2. Compute the validation risk (i.e., estimate of the population risk): \[ R_\lambda^\text{val} = R_0(\hat\theta_\lambda(\mathcal{D}_\text{train}), \mathcal{D}_\text{valid}) \]
  3. Pick the regularizer with best validation performance: \[ \lambda^* = \underset{\lambda \in \mathcal{S}}{\arg\min} R_\lambda^\text{val} \]
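A minimal sketch of the three steps above, assuming a ridge (\(\ell_2\)) penalty and synthetic data; the function names and the grid of \(\lambda\) values are illustrative, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = np.array([1.0, 0.0, -2.0, 0.5, 0.0])
y = X @ w_true + 0.3 * rng.normal(size=100)

# 1. Disjoint training / validation split
X_tr, y_tr, X_va, y_va = X[:70], y[:70], X[70:], y[70:]

def fit_ridge(X, y, lam):
    """Minimize the l2-regularized RSS: w = (X^T X + lam I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def risk(w, X, y):
    """Unpenalized empirical risk R_0 (mean squared error)."""
    return np.mean((y - X @ w) ** 2)

# 2.-3. Validation risk for each candidate lambda; keep the best one
lambdas = np.logspace(-4, 2, 13)
val_risks = [risk(fit_ridge(X_tr, y_tr, lam), X_va, y_va) for lam in lambdas]
lam_star = lambdas[int(np.argmin(val_risks))]
print(f"lambda* = {lam_star:.3g}")
```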

Cross-validation

  • Leaving out part of the data for validation is painful when the dataset is small
  • Simple and popular solution: cross validation (CV)
  • Split the data into \(K\) folds; train on all the folds but the \(k\)’th, and test on the \(k\)’th, in a round-robin fashion (see the sketch below)
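A sketch of \(K\)-fold CV for choosing \(\lambda\), here using scikit-learn's Ridge and cross_val_score on synthetic data; the library choice and the data are assumptions, not from the lecture.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 8))
y = X[:, 0] - 2.0 * X[:, 2] + 0.3 * rng.normal(size=60)

lambdas = np.logspace(-4, 2, 13)
cv_mse = []
for lam in lambdas:
    # 5 folds: train on 4, evaluate on the held-out fold, round-robin
    scores = cross_val_score(Ridge(alpha=lam), X, y,
                             scoring="neg_mean_squared_error", cv=5)
    cv_mse.append(-scores.mean())

lam_star = lambdas[int(np.argmin(cv_mse))]
print(f"lambda* = {lam_star:.3g}")
```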

Early stopping

Very easy: monitor performance on the validation set and stop training when the validation loss shows signs of overfitting.
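A minimal early-stopping sketch, assuming gradient descent on a linear model with synthetic data; the patience of 10 steps and the learning rate are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
D = 20
w_true = np.zeros(D); w_true[:3] = [2.0, -1.0, 0.5]
X_tr = rng.normal(size=(80, D)); y_tr = X_tr @ w_true + 0.5 * rng.normal(size=80)
X_va = rng.normal(size=(40, D)); y_va = X_va @ w_true + 0.5 * rng.normal(size=40)

w = np.zeros(D)
best_w, best_val = w.copy(), np.inf
patience, since_best, lr = 10, 0, 1e-3

for step in range(10_000):
    grad = 2 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)  # gradient of the training MSE
    w -= lr * grad
    val = np.mean((y_va - X_va @ w) ** 2)              # monitor validation loss
    if val < best_val:
        best_val, best_w, since_best = val, w.copy(), 0
    else:
        since_best += 1                                # no improvement this step
    if since_best >= patience:                         # signs of overfitting: stop
        break
```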

Regularization in linear regression

Polynomial regression example

\[ f(x; {\bf w}) = \sum_{d=0}^D w_d x^d = {\bf w}^\intercal[1, x, x^2, \dots, x^D] \]
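A small sketch of the corresponding design matrix \([1, x, x^2, \dots, x^D]\); the helper name is illustrative.

```python
import numpy as np

def poly_features(x, D):
    """Columns x^0, x^1, ..., x^D for a 1-D input array."""
    return np.vander(x, N=D + 1, increasing=True)

x = np.linspace(-1, 1, 20)
Phi = poly_features(x, D=3)            # shape (20, 4)
w = np.array([0.5, 1.0, -2.0, 0.3])
f = Phi @ w                            # f(x; w) = sum_d w_d x^d
```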

Ridge (\(\ell_2\)) regularization

Strategy
Penalize the magnitude of the weights (regression coefficients). Use a zero-mean Gaussian prior, \(p({\bf w}) = \mathcal{N}({\bf w} | {\bf 0}, \lambda^{-1} \mathbb{I})\). The MAP estimate of the weights is given by \[ \boxed{ \hat {\bf w}_\text{map} = \underset{\bf w}{\arg\min} \left[\text{NLL}({\bf w}) + \lambda \Vert{\bf w}\Vert_2^2\right] } \] This is called \(\ell_2\) regularization or weight decay.

In the case of linear regression, this scheme is called ridge regression.

\[ \Vert {\bf w} \Vert_2 = \sqrt{\sum_{d=1}^D \vert w_d\vert^2} = \sqrt{{\bf w}^\intercal {\bf w}} \] where we do not penalize the offset term \(w_0\), since that only affects the global mean and does not contribute to overfitting.

Ridge regression

MAP estimate is given by the penalized objective \[ \mathcal{L}_\lambda^\text{ridge}({\bf w}) = ({\bf y - Xw})^\intercal({\bf y - Xw}) + \lambda \Vert {\bf w} \Vert_2^2 \]

The gradient is given by \[ \nabla_{\bf w} \mathcal{L}_\lambda^\text{ridge}({\bf w}) = 2({\bf X^\intercal Xw - X^\intercal y} + \lambda {\bf w}) \] Setting \(\nabla_{\bf w} \mathcal{L}_\lambda^\text{ridge}({\bf w}) = 0\) and solving yields the optimal weights \[ \boxed{ \hat {\bf w}_\text{map} = ({\bf X^\intercal X + \lambda \mathbb{I}})^{-1} {\bf X^\intercal y} } \]
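A direct sketch of the boxed estimate on synthetic data; the cross-check against scikit-learn's Ridge (with fit_intercept=False, so it solves the same penalized objective) is an assumption about equivalent parameterization, not part of the lecture.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, -1.0, 0.0, 2.0]) + 0.2 * rng.normal(size=50)

lam = 0.5
# Closed-form MAP estimate: (X^T X + lam I)^{-1} X^T y
w_map = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Cross-check against scikit-learn's ridge solver (no intercept)
w_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
assert np.allclose(w_map, w_sklearn, atol=1e-6)
```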

Ridge (\(\ell_2\)) regularization

Recall Bayesian linear regression with Gaussian prior \[ \begin{align*} p({\bf w} | {\bf X}, {\bf y}, \sigma^2) &\propto \mathcal{N}({\bf y} | {\bf Xw}, \sigma^2 {\bf I}_N) \mathcal{N}({\bf w} | {\bf \breve{w}}, {\bf \breve{\Sigma}}) = \mathcal{N}({\bf w} | \tilde{\bf w}, \tilde{\bf \Sigma}) \\ \tilde{\bf w} &= \tilde{\bf \Sigma} ({\bf \breve{\Sigma}}^{-1}{\bf \breve{w}} + \frac 1{\sigma^2} {\bf X}^\intercal{\bf y}) \\ \tilde{\bf \Sigma} &= \left({\bf \breve{\Sigma}}^{-1} + \frac 1{\sigma^2} {\bf X}^\intercal{\bf X}\right)^{-1} \end{align*} \]

Set simple prior: \({\bf \breve{w}} = {\bf 0}\), \({\bf \breve{\Sigma}} = \tau^2 {\bf I}\). Define \(\lambda = \sigma^2 / \tau^2\).

Posterior mean becomes: \[ \begin{align*} \tilde{\bf w} &= \tilde{\bf \Sigma} ({\bf \breve{\Sigma}}^{-1}{\bf \breve{w}} + \frac 1{\sigma^2} {\bf X}^\intercal{\bf y})\\ &= \frac 1{\sigma^2} \tilde{\bf \Sigma} {\bf X}^\intercal{\bf y} \\ &= ({\bf X}^\intercal{\bf X} + \lambda {\bf I})^{-1} {\bf X}^\intercal{\bf y} \end{align*} \]

Solving RSS using QR decomposition

Recall the OLS solution \[ {\bf \hat w} = {\bf X}^\dagger{\bf y} = ({\bf X}^\intercal{\bf X})^{-1} {\bf X}^\intercal {\bf y} \] Inverting \({\bf X}^\intercal{\bf X}\) is dangerous: it may be ill-conditioned or singular. Instead, let \({\bf X} = {\bf QR}\), where \({\bf Q}^\intercal{\bf Q} = {\bf I}\). OLS is equivalent to solving the system of linear equations \({\bf Xw} = {\bf y}\), such that \[ \begin{align*} ({\bf QR}){\bf w} &= {\bf y}\\ {\bf Q^\intercal QR}{\bf w} &= {\bf Q^\intercal y} \\ {\bf w} &= {\bf R}^{-1} ({\bf Q}^\intercal {\bf y}) \end{align*} \] Because \({\bf R}\) is upper triangular, we can solve this last set of equations using back-substitution, thus avoiding matrix inversion.
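A sketch of the QR route on synthetic data, assuming scipy is available for the triangular solve.

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)

Q, R = np.linalg.qr(X)                         # reduced QR: Q is N x D, R is D x D upper triangular
w = solve_triangular(R, Q.T @ y, lower=False)  # back-substitution for R w = Q^T y

# Agrees with the normal-equations solution (up to numerical error)
assert np.allclose(w, np.linalg.solve(X.T @ X, X.T @ y))
```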

Solving ridge using QR decomposition

Emulate the prior by adding “virtual data” to the training set \[ {\bf \tilde X} = \begin{pmatrix} {\bf X} / \sigma \\ \sqrt{{\bf \Lambda}} \end{pmatrix}, \quad {\bf \tilde y} = \begin{pmatrix} {\bf y} / \sigma \\ {\bf 0}_{D \times 1} \end{pmatrix} \] where \({\bf \Lambda} = (1/\tau^2) {\bf I}\).

One can show that the RSS on this expanded data is equivalent to the penalized RSS on the original data: \[ \begin{align*} f({\bf w}) &= ({\bf \tilde y} - {\bf \tilde{X}}{\bf w})^\intercal ({\bf \tilde y} - {\bf \tilde{X}}{\bf w})\\ &= \left(\begin{pmatrix} {\bf y} / \sigma \\ {\bf 0}\end{pmatrix} - \begin{pmatrix} {\bf X} / \sigma \\ \sqrt{{\bf \Lambda}} \end{pmatrix} {\bf w} \right)^\intercal \left(\begin{pmatrix} {\bf y} / \sigma \\ {\bf 0}\end{pmatrix} - \begin{pmatrix} {\bf X} / \sigma \\ \sqrt{{\bf \Lambda}} \end{pmatrix} {\bf w} \right) \\ &= \frac 1{\sigma^2} ({\bf y} - {\bf X}{\bf w})^\intercal ({\bf y} - {\bf X}{\bf w}) + {\bf w}^\intercal {\bf \Lambda}{\bf w} \end{align*} \] This can be solved using standard OLS methods: compute the QR decomposition of \({\bf \tilde{X}}\), which takes \(O((N+D)D^2)\) time.
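A sketch of the virtual-data trick on synthetic data, checked against the closed-form ridge solution with \(\lambda = \sigma^2/\tau^2\); the specific \(\sigma\), \(\tau\) values are illustrative.

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(5)
N, D = 100, 5
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + 0.1 * rng.normal(size=N)

sigma, tau = 1.0, 2.0
Lam = (1.0 / tau**2) * np.eye(D)

# Expanded ("virtual") data
X_tilde = np.vstack([X / sigma, np.sqrt(Lam)])          # (N + D) x D
y_tilde = np.concatenate([y / sigma, np.zeros(D)])

# Plain OLS on the expanded data via QR
Q, R = np.linalg.qr(X_tilde)
w_ridge = solve_triangular(R, Q.T @ y_tilde, lower=False)

# Equivalent to the penalized closed form with lambda = sigma^2 / tau^2
lam = sigma**2 / tau**2
w_closed = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
assert np.allclose(w_ridge, w_closed)
```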

Lasso regression

Ridge
Gaussian prior encouraged the norm of the parameters to be small
Lasso
We seek a sparse solution, i.e., most weights should be exactly 0

Lasso

Minimizing the \(\ell_0\)-norm \[ \Vert{\bf w}\Vert_0 = \sum_{d=1}^D \mathbb{I} (\vert w_d \vert > 0) \] performs feature selection, but unfortunately the resulting objective is not convex; the lasso therefore relaxes it to the convex \(\ell_1\) norm.

MAP estimation with a Laplace prior

Laplace distribution

Laplace distribution for the prior \[ p({\bf w} | \lambda) = \prod_{d=1}^D \frac{\lambda}{2}\, {\rm e}^{-\lambda \vert w_d \vert} \] puts more density at 0 than \(\mathcal{N}({\bf w} | 0, \sigma^2)\).

MAP estimate with a Laplace prior is called \(\ell_1\) regularization \[ \mathcal{L}_\lambda^\text{lasso}({\bf w}) = ({\bf y - Xw})^\intercal({\bf y - Xw}) + \lambda \Vert {\bf w} \Vert_1 \]
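A sketch using scikit-learn's Lasso on synthetic sparse data; note that sklearn minimizes \(\frac{1}{2N}\Vert{\bf y - Xw}\Vert_2^2 + \alpha \Vert{\bf w}\Vert_1\), so its \(\alpha\) corresponds to a rescaled \(\lambda\). The data and \(\alpha\) value are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 10))
w_true = np.zeros(10); w_true[[0, 3]] = [2.0, -1.5]   # only two relevant features
y = X @ w_true + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)
print(lasso.coef_)   # most coefficients are driven exactly to zero (sparsity)
```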

Why \(\ell_1\) regularization yields sparse solutions

  • Left: \(\ell_1\) regularization of a least-squares problem
  • Right: \(\ell_2\) regularization of a least-squares problem

Elastic net (ridge and lasso combined)

Combination of ridge and lasso can offer an implicit grouping of correlated variables \[ \mathcal{L}({\bf w}, \lambda_1, \lambda_2) = \Vert {\bf y - Xw}\Vert^2 + \lambda_2 \Vert {\bf w} \Vert_2^2 + \lambda_1 \Vert {\bf w} \Vert_1 \]
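A sketch with scikit-learn's ElasticNet on two strongly correlated features; sklearn parameterizes the penalty with a total strength alpha and a mixing ratio l1_ratio instead of \((\lambda_1, \lambda_2)\), and the data here are synthetic.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 10))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=100)   # two strongly correlated columns
y = X[:, 0] + 0.1 * rng.normal(size=100)

enet = ElasticNet(alpha=0.1, l1_ratio=0.5, fit_intercept=False).fit(X, y)
print(enet.coef_[:2])   # correlated features tend to receive similar weights
```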

Robust linear regression: Huber loss

Define the Huber loss \[ \ell_\text{huber}(r, \delta) = \begin{cases} r^2/2 & \text{if } |r| \leq \delta \\ \delta |r| - \delta^2 / 2 & \text{if } |r| > \delta \end{cases} \] which is equivalent to \(\ell_2\) for errors smaller than \(\delta\) and to \(\ell_1\) for larger errors.

Advantage: everywhere differentiable. Faster than using the Laplace likelihood, because we can replace linear programming by smooth optimization methods (e.g., SGD).
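A direct implementation of the loss above (a minimal sketch; the default \(\delta\) and the example residuals are illustrative).

```python
import numpy as np

def huber_loss(r, delta=1.0):
    """Quadratic for |r| <= delta, linear (up to a constant) beyond that."""
    r = np.asarray(r, dtype=float)
    quadratic = 0.5 * r**2
    linear = delta * np.abs(r) - 0.5 * delta**2
    return np.where(np.abs(r) <= delta, quadratic, linear)

print(huber_loss(np.array([-3.0, -0.5, 0.0, 0.5, 3.0]), delta=1.0))
```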

Flavors of linear regression

Note

  • Likelihood: \(p(y | {\bf x}, {\bf w}, \sigma^2)\); Prior: \(p({\bf w})\); Posterior: \(p({\bf w} | \mathcal{D})\)
  • Point: degenerate distribution, \(\delta({\bf w} - \hat{\bf w})\), where \(\hat {\bf w}\) is the MAP estimate
  • MLE is equivalent to using a point posterior and a uniform prior

Summary

Summary

Generalization
Dangers of overfitting and underfitting are always present
Maximum-a-posteriori (MAP) estimate
Regularizes the negative log-likelihood by the log of the prior
Cross-validation
Common technique to choose the regularization parameter
Regularization in linear regression
Ridge (\(\ell_2\)) and Lasso (\(\ell_1\), as a convex surrogate for \(\ell_0\)) constrain the magnitude of the weights and the number of nonzero weights, respectively

References

Murphy, Kevin P. 2022. Probabilistic Machine Learning: An Introduction. MIT press.