Lecture 8: Regularization
Institute for Theoretical Physics, Heidelberg University
Example: Probability of heads when tossing a coin
Maximum likelihood implicitly assumes that the empirical distribution (the one the model is fit to) is the same as the true distribution (the one encountered at test time). Putting all the probability mass on the observed set of \(N\) examples leaves no probability for novel data in the future.
Add a penalty term to the NLL \[ \mathcal{L}({\bf\theta}; \lambda) = \left[ \frac 1N \sum_{n=1}^N \ell(y_n, {\bf \theta}; {\bf x}_n) \right] + \lambda C(\theta) \] where \(\lambda \geq 0\) is a regularization parameter, and \(C(\theta)\) is some form of complexity penalty.
Maximum-a-posteriori (MAP) estimation can be seen as a regularization of MLE.
Difficulties in choosing the strength of the regularizer, \(\lambda\)
\[ R_\lambda (\theta, \mathcal{D}) = \frac 1{\vert \mathcal{D} \vert} \sum_{({\bf x}, {\bf y}) \in \mathcal{D}} \ell({\bf y}, f({\bf x}; {\bf \theta})) + \lambda C(\theta) \]
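As a minimal sketch of this objective in code (the names `loss_fn` and `penalty_fn` are placeholders for whichever loss and complexity penalty are chosen, not part of the lecture):

```python
import numpy as np

def regularized_risk(theta, X, y, loss_fn, penalty_fn, lam):
    """Empirical risk plus complexity penalty, R_lambda(theta, D)."""
    data_term = np.mean([loss_fn(y_n, x_n, theta) for x_n, y_n in zip(X, y)])
    return data_term + lam * penalty_fn(theta)

# Example choices: squared error and an L2 penalty that skips the offset term.
sq_loss = lambda y_n, x_n, theta: (y_n - x_n @ theta) ** 2
l2_penalty = lambda theta: np.sum(theta[1:] ** 2)
```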
Very easy: monitor performance on the validation set and stop training as soon as the validation loss shows signs of overfitting (early stopping).
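A minimal sketch of such an early-stopping loop (the `train_step` and `val_loss` callables are placeholders, not defined in the lecture):

```python
def fit_with_early_stopping(params, train_step, val_loss, max_epochs=100, patience=5):
    """Train until the validation loss stops improving for `patience` epochs."""
    best_val, best_params, wait = float("inf"), params, 0
    for _ in range(max_epochs):
        params = train_step(params)      # one optimization epoch on the training set
        current = val_loss(params)       # held-out loss used for monitoring
        if current < best_val:           # improvement: remember these parameters
            best_val, best_params, wait = current, params, 0
        else:
            wait += 1
            if wait >= patience:         # validation loss stopped improving: stop
                break
    return best_params
```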
\[ f(x; {\bf w}) = \sum_{d=0}^D w_d x^d = {\bf w}^\intercal[1, x, x^2, \dots, x^D] \]
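A small numpy helper for building this polynomial design matrix (just one way to do it, not code from the lecture):

```python
import numpy as np

def poly_features(x, D):
    """Map scalars x to design-matrix rows [1, x, x^2, ..., x^D]."""
    return np.vander(np.asarray(x), N=D + 1, increasing=True)  # shape (len(x), D + 1)

# The model prediction is then f(x; w) = poly_features(x, D) @ w
```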
Penalize the \(\ell_2\) norm of the weights; in the case of linear regression, this scheme is called ridge regression.
\[ \Vert {\bf w} \Vert_2 = \sqrt{\sum_{d=1}^D \vert w_d\vert^2} = \sqrt{{\bf w}^\intercal {\bf w}} \] where we do not penalize the offset term \(w_0\), since that only affects the global mean and does not contribute to overfitting.
The MAP estimate is obtained by minimizing the penalized objective \[ \mathcal{L}_\lambda^\text{ridge}({\bf w}) = ({\bf y - Xw})^\intercal({\bf y - Xw}) + \lambda \Vert {\bf w} \Vert_2^2 \]
The gradient is given by \[ \nabla_{\bf w} \mathcal{L}_\lambda^\text{ridge}({\bf w}) = 2({\bf X^\intercal Xw - X^\intercal y} + \lambda {\bf w}) \] Solving \(\nabla_{\bf w} \mathcal{L}_\lambda^\text{ridge}({\bf w}) = 0\) yields the optimal weights \[ \boxed{ \hat {\bf w}_\text{map} = ({\bf X^\intercal X + \lambda \mathbb{I}})^{-1} {\bf X^\intercal y} } \]
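A direct numpy sketch of the boxed solution, solving the linear system instead of forming the inverse explicitly (for simplicity it penalizes all weights, including the offset):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge / MAP weights: solve (X^T X + lam * I) w = X^T y."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
```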
Recall Bayesian linear regression with Gaussian prior \[ \begin{align*} p({\bf w} | {\bf X}, {\bf y}, \sigma^2) &\propto \mathcal{N}({\bf y} | {\bf Xw}, \sigma^2 {\bf I}_N) \mathcal{N}({\bf w} | {\bf \breve{w}}, {\bf \breve{\Sigma}}) = \mathcal{N}({\bf w} | \tilde{\bf w}, \tilde{\bf \Sigma}) \\ \tilde{\bf w} &= \tilde{\bf \Sigma} ({\bf \breve{\Sigma}}^{-1}{\bf \breve{w}} + \frac 1{\sigma^2} {\bf X}^\intercal{\bf y}) \\ \tilde{\bf \Sigma} &= \left({\bf \breve{\Sigma}}^{-1} + \frac 1{\sigma^2} {\bf X}^\intercal{\bf X}\right)^{-1} \end{align*} \]
Set simple prior: \({\bf \breve{w}} = {\bf 0}\), \({\bf \breve{\Sigma}} = \tau^2 {\bf I}\). Define \(\lambda = \sigma^2 / \tau^2\).
Posterior mean becomes: \[ \begin{align*} \tilde{\bf w} &= \tilde{\bf \Sigma} ({\bf \breve{\Sigma}}^{-1}{\bf \breve{w}} + \frac 1{\sigma^2} {\bf X}^\intercal{\bf y})\\ &= \frac 1{\sigma^2} \tilde{\bf \Sigma} {\bf X}^\intercal{\bf y} \\ &= ({\bf X}^\intercal{\bf X} + \lambda {\bf I})^{-1} {\bf X}^\intercal{\bf y} \end{align*} \]
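A quick numerical check of this equivalence on synthetic data (all dimensions and noise levels below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, sigma, tau = 50, 3, 0.5, 2.0
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + sigma * rng.normal(size=N)

# Ridge / MAP estimate with lambda = sigma^2 / tau^2
lam = sigma**2 / tau**2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# Posterior mean of Bayesian linear regression with prior N(0, tau^2 I)
Sigma_post = np.linalg.inv(np.eye(D) / tau**2 + X.T @ X / sigma**2)
w_post = Sigma_post @ (X.T @ y) / sigma**2

assert np.allclose(w_ridge, w_post)  # the two estimates coincide
```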
Recall the OLS solution \[ {\bf \hat w} = {\bf X}^\dagger{\bf y} = ({\bf X}^\intercal{\bf X})^{-1} {\bf X}^\intercal {\bf y} \] Inverting \({\bf X}^\intercal{\bf X}\) is dangerous: it may be ill-conditioned or singular. Instead, let \({\bf X} = {\bf QR}\), where \({\bf Q}^\intercal{\bf Q} = {\bf I}\) and \({\bf R}\) is upper triangular. OLS is equivalent to solving the system of linear equations \({\bf Xw} = {\bf y}\), such that \[ \begin{align*} ({\bf QR}){\bf w} &= {\bf y}\\ {\bf Q^\intercal QR}{\bf w} &= {\bf Q^\intercal y} \\ {\bf w} &= {\bf R}^{-1} ({\bf Q}^\intercal {\bf y}) \end{align*} \] Because \({\bf R}\) is upper triangular, we can solve this last set of equations using backsubstitution, thus avoiding matrix inversion.
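A sketch of this QR-based solve with numpy/scipy (assuming the reduced QR factorization, so \({\bf R}\) is \(D \times D\)):

```python
import numpy as np
from scipy.linalg import solve_triangular

def ols_qr(X, y):
    """OLS via QR: factor X = QR, then solve R w = Q^T y by backsubstitution."""
    Q, R = np.linalg.qr(X)                          # reduced QR: Q^T Q = I, R upper triangular
    return solve_triangular(R, Q.T @ y, lower=False)
```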
Emulate the prior by adding “virtual data” to the training set \[ {\bf \tilde X} = \begin{pmatrix} {\bf X} / \sigma \\ \sqrt{{\bf \Lambda}} \end{pmatrix}, \quad {\bf \tilde y} = \begin{pmatrix} {\bf y} / \sigma \\ {\bf 0}_{D \times 1} \end{pmatrix} \] where \({\bf \Lambda} = (1/\tau^2) {\bf I}\).
One can show that the RSS on this expanded data is equivalent to the penalized RSS on the original data: \[ \begin{align*} f({\bf w}) &= ({\bf \tilde y} - {\bf \tilde{X}}{\bf w})^\intercal ({\bf \tilde y} - {\bf \tilde{X}}{\bf w})\\ &= \left(\begin{pmatrix} {\bf y} / \sigma \\ {\bf 0}\end{pmatrix} - \begin{pmatrix} {\bf X} / \sigma \\ \sqrt{{\bf \Lambda}} \end{pmatrix} {\bf w} \right)^\intercal \left(\begin{pmatrix} {\bf y} / \sigma \\ {\bf 0}\end{pmatrix} - \begin{pmatrix} {\bf X} / \sigma \\ \sqrt{{\bf \Lambda}} \end{pmatrix} {\bf w} \right) \\ &= \frac 1{\sigma^2} ({\bf y} - {\bf X}{\bf w})^\intercal ({\bf y} - {\bf X}{\bf w}) + {\bf w}^\intercal {\bf \Lambda}{\bf w} \end{align*} \] This can be solved with standard OLS methods, e.g. by computing the QR decomposition of \({\bf \tilde{X}}\), which takes \(O((N+D)D^2)\) time.
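A sketch of the virtual-data construction (any standard least-squares solver can then be applied to the augmented system; here numpy's `lstsq` stands in for the QR solve described above):

```python
import numpy as np

def ridge_via_virtual_data(X, y, sigma, tau):
    """Ridge regression by augmenting the data and running ordinary least squares."""
    N, D = X.shape
    sqrt_Lam = np.eye(D) / tau                          # sqrt(Lambda) with Lambda = (1/tau^2) I
    X_tilde = np.vstack([X / sigma, sqrt_Lam])          # shape (N + D, D)
    y_tilde = np.concatenate([y / sigma, np.zeros(D)])
    w, *_ = np.linalg.lstsq(X_tilde, y_tilde, rcond=None)
    return w
```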
Lasso
Minimizing the \(\ell_0\)-norm \[ \Vert{\bf w}\Vert_0 = \sum_{d=1}^D \mathbb{I} (\vert w_d \vert > 0) \] performs feature selection (unfortunately, it’s not convex).
Laplace distribution
Using a Laplace distribution for the prior \[ p({\bf w} | \lambda) \propto \prod_{d=1}^D {\rm e}^{-\lambda \vert w_d \vert} \] puts more density near 0 than \(\mathcal{N}({\bf w} | 0, \sigma^2)\).
MAP estimation with a Laplace prior leads to \(\ell_1\) regularization (the lasso) \[ \mathcal{L}_\lambda^\text{lasso}({\bf w}) = ({\bf y - Xw})^\intercal({\bf y - Xw}) + \lambda \Vert {\bf w} \Vert_1 \]
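Because the \(\ell_1\) term is not differentiable at zero, the lasso objective has no closed-form minimizer; it is typically solved with coordinate descent or proximal methods. A minimal proximal-gradient (ISTA-style) sketch, not the specific algorithm from the lecture:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1: shrink towards zero, setting small entries exactly to 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=1000):
    """Minimize ||y - Xw||^2 + lam * ||w||_1 by proximal gradient descent."""
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant of the RSS gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ w - y)               # gradient of the smooth RSS term
        w = soft_threshold(w - step * grad, step * lam)
    return w
```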
A combination of the ridge and lasso penalties (the elastic net) can offer an implicit grouping of correlated variables \[ \mathcal{L}({\bf w}, \lambda_1, \lambda_2) = \Vert {\bf y - Xw}\Vert^2 + \lambda_2 \Vert {\bf w} \Vert_2^2 + \lambda_1 \Vert {\bf w} \Vert_1 \]
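A simple evaluator of this combined objective (a sketch only; since the \(\ell_2\) part is smooth, it can be folded into the gradient and the objective minimized with the same proximal scheme sketched above):

```python
import numpy as np

def elastic_net_objective(w, X, y, lam1, lam2):
    """RSS + lam2 * ||w||_2^2 + lam1 * ||w||_1."""
    r = y - X @ w
    return r @ r + lam2 * (w @ w) + lam1 * np.sum(np.abs(w))
```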
Define the Huber loss \[ \ell_\text{huber}(r, \delta) = \begin{cases} r^2/2 & \text{if } |r| \leq \delta \\ \delta |r| - \delta^2 / 2 & \text{if } |r| > \delta \end{cases} \] which behaves like \(\ell_2\) for errors smaller than \(\delta\) and like \(\ell_1\) for larger errors.
Advantage: it is everywhere differentiable. Optimization is faster than with the Laplace likelihood, because we can replace linear programming with smooth optimization methods (e.g., SGD).
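A direct numpy transcription of the definition above:

```python
import numpy as np

def huber_loss(r, delta):
    """Quadratic for |r| <= delta, linear (with matching value and slope) beyond."""
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) <= delta,
                    0.5 * r**2,
                    delta * np.abs(r) - 0.5 * delta**2)
```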