Lecture 8: Regularization
Institute for Theoretical Physics, Heidelberg University
Example: Probability of heads when tossing a coin
Maximum likelihood implicitly assumes that the empirical distribution (the one the model is fit to) is the same as the true distribution (the one encountered at test time). Putting all the probability mass on the observed set of \(N\) examples leaves no probability for novel data in the future.
Add a penalty term to the NLL \[ \mathcal{L}({\bf\theta}; \lambda) = \left[ \frac 1N \sum_{n=1}^N \ell(y_n, {\bf \theta}; {\bf x}_n) \right] + \lambda C(\theta) \] where \(\lambda \geq 0\) is a regularization parameter, and \(C(\theta)\) is some form of complexity penalty.
Maximum-a-posteriori (MAP) estimation can be seen as a regularization of MLE.
Difficulties in choosing the strength of the regularizer, \(\lambda\)
\[ R_\lambda (\theta, \mathcal{D}) = \frac 1{\vert \mathcal{D} \vert} \sum_{({\bf x}, {\bf y}) \in \mathcal{D}} \ell({\bf y}, f({\bf x}; {\bf \theta})) + \lambda C(\theta) \]
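As a minimal sketch of this objective in code (the names `loss_fn` and `penalty_fn` are placeholders for whichever loss and complexity penalty are chosen, not part of the lecture):

```python
import numpy as np

def regularized_risk(theta, X, y, loss_fn, penalty_fn, lam):
    """Empirical risk plus complexity penalty, R_lambda(theta, D)."""
    data_term = np.mean([loss_fn(y_n, x_n, theta) for x_n, y_n in zip(X, y)])
    return data_term + lam * penalty_fn(theta)

# Example choices: squared error and an L2 penalty that skips the offset term.
sq_loss = lambda y_n, x_n, theta: (y_n - x_n @ theta) ** 2
l2_penalty = lambda theta: np.sum(theta[1:] ** 2)
```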
Very easy: monitor performance on the validation set and stop training as soon as the validation loss shows signs of overfitting (early stopping).
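A minimal sketch of such an early-stopping loop (the `train_step` and `val_loss` callables are placeholders, not defined in the lecture):

```python
def fit_with_early_stopping(params, train_step, val_loss, max_epochs=100, patience=5):
    """Train until the validation loss stops improving for `patience` epochs."""
    best_val, best_params, wait = float("inf"), params, 0
    for _ in range(max_epochs):
        params = train_step(params)      # one optimization epoch on the training set
        current = val_loss(params)       # held-out loss used for monitoring
        if current < best_val:           # improvement: remember these parameters
            best_val, best_params, wait = current, params, 0
        else:
            wait += 1
            if wait >= patience:         # validation loss stopped improving: stop
                break
    return best_params
```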
\[ f(x; {\bf w}) = \sum_{d=0}^D w_d x^d = {\bf w}^\intercal[1, x, x^2, \dots, x^D] \]
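A small numpy helper for building this polynomial design matrix (just one way to do it, not code from the lecture):

```python
import numpy as np

def poly_features(x, D):
    """Map scalars x to design-matrix rows [1, x, x^2, ..., x^D]."""
    return np.vander(np.asarray(x), N=D + 1, increasing=True)  # shape (len(x), D + 1)

# The model prediction is then f(x; w) = poly_features(x, D) @ w
```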
Penalize the \(\ell_2\) norm of the weights; in the case of linear regression, this scheme is called ridge regression.
\[ \Vert {\bf w} \Vert_2 = \sqrt{\sum_{d=1}^D \vert w_d\vert^2} = \sqrt{{\bf w}^\intercal {\bf w}} \] where we do not penalize the offset term \(w_0\), since that only affects the global mean and does not contribute to overfitting.
The MAP estimate is obtained by minimizing the penalized objective \[ \mathcal{L}_\lambda^\text{ridge}({\bf w}) = ({\bf y - Xw})^\intercal({\bf y - Xw}) + \lambda \Vert {\bf w} \Vert_2^2 \]
The gradient is given by \[ \nabla_{\bf w} \mathcal{L}_\lambda^\text{ridge}({\bf w}) = 2({\bf X^\intercal Xw - X^\intercal y} + \lambda {\bf w}) \] Solving \(\nabla_{\bf w} \mathcal{L}_\lambda^\text{ridge}({\bf w}) = 0\) yields the optimal weights \[ \boxed{ \hat {\bf w}_\text{map} = ({\bf X^\intercal X + \lambda \mathbb{I}})^{-1} {\bf X^\intercal y} } \]
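A direct numpy sketch of the boxed solution, solving the linear system instead of forming the inverse explicitly (for simplicity it penalizes all weights, including the offset):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge / MAP weights: solve (X^T X + lam * I) w = X^T y."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
```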
Recall Bayesian linear regression with Gaussian prior \[ \begin{align*} p({\bf w} | {\bf X}, {\bf y}, \sigma^2) &\propto \mathcal{N}({\bf y} | {\bf Xw}, \sigma^2 {\bf I}_N) \mathcal{N}({\bf w} | {\bf \breve{w}}, {\bf \breve{\Sigma}}) = \mathcal{N}({\bf w} | \tilde{\bf w}, \tilde{\bf \Sigma}) \\ \tilde{\bf w} &= \tilde{\bf \Sigma} ({\bf \breve{\Sigma}}^{-1}{\bf \breve{w}} + \frac 1{\sigma^2} {\bf X}^\intercal{\bf y}) \\ \tilde{\bf \Sigma} &= \left({\bf \breve{\Sigma}}^{-1} + \frac 1{\sigma^2} {\bf X}^\intercal{\bf X}\right)^{-1} \end{align*} \]
Set simple prior: \({\bf \breve{w}} = {\bf 0}\), \({\bf \breve{\Sigma}} = \tau^2 {\bf I}\). Define \(\lambda = \sigma^2 / \tau^2\).
Posterior mean becomes: \[ \begin{align*} \tilde{\bf w} &= \tilde{\bf \Sigma} ({\bf \breve{\Sigma}}^{-1}{\bf \breve{w}} + \frac 1{\sigma^2} {\bf X}^\intercal{\bf y})\\ &= \frac 1{\sigma^2} \tilde{\bf \Sigma} {\bf X}^\intercal{\bf y} \\ &= ({\bf X}^\intercal{\bf X} + \lambda {\bf I})^{-1} {\bf X}^\intercal{\bf y} \end{align*} \]
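A quick numerical check of this equivalence on synthetic data (all dimensions and noise levels below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, sigma, tau = 50, 3, 0.5, 2.0
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + sigma * rng.normal(size=N)

# Ridge / MAP estimate with lambda = sigma^2 / tau^2
lam = sigma**2 / tau**2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# Posterior mean of Bayesian linear regression with prior N(0, tau^2 I)
Sigma_post = np.linalg.inv(np.eye(D) / tau**2 + X.T @ X / sigma**2)
w_post = Sigma_post @ (X.T @ y) / sigma**2

assert np.allclose(w_ridge, w_post)  # the two estimates coincide
```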
Recall the OLS solution \[ {\bf \hat w} = {\bf X}^\dagger{\bf y} = ({\bf X}^\intercal{\bf X})^{-1} {\bf X}^\intercal {\bf y} \] Inverting \({\bf X}^\intercal{\bf X}\) is dangerous: it may be ill-conditioned or singular. Instead, let \({\bf X} = {\bf QR}\), where \({\bf Q}^\intercal{\bf Q} = {\bf I}\) and \({\bf R}\) is upper triangular. OLS is equivalent to solving the system of linear equations \({\bf Xw} = {\bf y}\), such that \[ \begin{align*} ({\bf QR}){\bf w} &= {\bf y}\\ {\bf Q^\intercal QR}{\bf w} &= {\bf Q^\intercal y} \\ {\bf w} &= {\bf R}^{-1} ({\bf Q}^\intercal {\bf y}) \end{align*} \] Because \({\bf R}\) is upper triangular, we can solve this last set of equations using backsubstitution, thus avoiding matrix inversion.
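A sketch of this QR-based solve with numpy/scipy (assuming the reduced QR factorization, so \({\bf R}\) is \(D \times D\)):

```python
import numpy as np
from scipy.linalg import solve_triangular

def ols_qr(X, y):
    """OLS via QR: factor X = QR, then solve R w = Q^T y by backsubstitution."""
    Q, R = np.linalg.qr(X)                          # reduced QR: Q^T Q = I, R upper triangular
    return solve_triangular(R, Q.T @ y, lower=False)
```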
Emulate the prior by adding “virtual data” to the training set \[ {\bf \tilde X} = \begin{pmatrix} {\bf X} / \sigma \\ \sqrt{{\bf \Lambda}} \end{pmatrix}, \quad {\bf \tilde y} = \begin{pmatrix} {\bf y} / \sigma \\ {\bf 0}_{D \times 1} \end{pmatrix} \] where \({\bf \Lambda} = (1/\tau^2) {\bf I}\).
One can show that the RSS on this expanded data is equivalent to the penalized RSS on the original data: \[ \begin{align*} f({\bf w}) &= ({\bf \tilde y} - {\bf \tilde{X}}{\bf w})^\intercal ({\bf \tilde y} - {\bf \tilde{X}}{\bf w})\\ &= \left(\begin{pmatrix} {\bf y} / \sigma \\ {\bf 0}\end{pmatrix} - \begin{pmatrix} {\bf X} / \sigma \\ \sqrt{{\bf \Lambda}} \end{pmatrix} {\bf w} \right)^\intercal \left(\begin{pmatrix} {\bf y} / \sigma \\ {\bf 0}\end{pmatrix} - \begin{pmatrix} {\bf X} / \sigma \\ \sqrt{{\bf \Lambda}} \end{pmatrix} {\bf w} \right) \\ &= \frac 1{\sigma^2} ({\bf y} - {\bf X}{\bf w})^\intercal ({\bf y} - {\bf X}{\bf w}) + {\bf w}^\intercal {\bf \Lambda}{\bf w} \end{align*} \] This can be solved with standard OLS methods, e.g. by computing the QR decomposition of \({\bf \tilde{X}}\), which takes \(O((N+D)D^2)\) time.
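A sketch of the virtual-data construction (any standard least-squares solver can then be applied to the augmented system; here numpy's `lstsq` stands in for the QR solve described above):

```python
import numpy as np

def ridge_via_virtual_data(X, y, sigma, tau):
    """Ridge regression by augmenting the data and running ordinary least squares."""
    N, D = X.shape
    sqrt_Lam = np.eye(D) / tau                          # sqrt(Lambda) with Lambda = (1/tau^2) I
    X_tilde = np.vstack([X / sigma, sqrt_Lam])          # shape (N + D, D)
    y_tilde = np.concatenate([y / sigma, np.zeros(D)])
    w, *_ = np.linalg.lstsq(X_tilde, y_tilde, rcond=None)
    return w
```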
Lasso
Minimizing the \(\ell_0\)-norm \[ \Vert{\bf w}\Vert_0 = \sum_{d=1}^D \mathbb{I} (\vert w_d \vert > 0) \] performs feature selection (unfortunately, it’s not convex).
Laplace distribution
Using a Laplace distribution for the prior \[ p({\bf w} | \lambda) \propto \prod_{d=1}^D {\rm e}^{-\lambda \vert w_d \vert} \] puts more density near 0 than \(\mathcal{N}({\bf w} | 0, \sigma^2)\).
MAP estimation with a Laplace prior leads to \(\ell_1\) regularization (the lasso) \[ \mathcal{L}_\lambda^\text{lasso}({\bf w}) = ({\bf y - Xw})^\intercal({\bf y - Xw}) + \lambda \Vert {\bf w} \Vert_1 \]
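Because the \(\ell_1\) term is not differentiable at zero, the lasso objective has no closed-form minimizer; it is typically solved with coordinate descent or proximal methods. A minimal proximal-gradient (ISTA-style) sketch, not the specific algorithm from the lecture:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1: shrink towards zero, setting small entries exactly to 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=1000):
    """Minimize ||y - Xw||^2 + lam * ||w||_1 by proximal gradient descent."""
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant of the RSS gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ w - y)               # gradient of the smooth RSS term
        w = soft_threshold(w - step * grad, step * lam)
    return w
```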
A combination of the ridge and lasso penalties (the elastic net) can offer an implicit grouping of correlated variables \[ \mathcal{L}({\bf w}, \lambda_1, \lambda_2) = \Vert {\bf y - Xw}\Vert^2 + \lambda_2 \Vert {\bf w} \Vert_2^2 + \lambda_1 \Vert {\bf w} \Vert_1 \]
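A simple evaluator of this combined objective (a sketch only; since the \(\ell_2\) part is smooth, it can be folded into the gradient and the objective minimized with the same proximal scheme sketched above):

```python
import numpy as np

def elastic_net_objective(w, X, y, lam1, lam2):
    """RSS + lam2 * ||w||_2^2 + lam1 * ||w||_1."""
    r = y - X @ w
    return r @ r + lam2 * (w @ w) + lam1 * np.sum(np.abs(w))
```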
Define the Huber loss \[ \ell_\text{huber}(r, \delta) = \begin{cases} r^2/2 & \text{if } |r| \leq \delta \\ \delta |r| - \delta^2 / 2 & \text{if } |r| > \delta \end{cases} \] which behaves like \(\ell_2\) for errors smaller than \(\delta\) and like \(\ell_1\) for larger errors.
Advantage: it is everywhere differentiable. Optimization is faster than with the Laplace likelihood, because we can replace linear programming with smooth optimization methods (e.g., SGD).
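A direct numpy transcription of the definition above:

```python
import numpy as np

def huber_loss(r, delta):
    """Quadratic for |r| <= delta, linear (with matching value and slope) beyond."""
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) <= delta,
                    0.5 * r**2,
                    delta * np.abs(r) - 0.5 * delta**2)
```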