Lecture 7: Linear regression
Institute for Theoretical Physics, Heidelberg University
Why?
Expected value of the output is assumed to be a linear function of the input \[ E[y|{\bf x}] = w_0 + {\bf w}^\intercal {\bf x} \]
To estimate the weights, note that minimizing the NLL (for fixed noise variance) is equivalent to minimizing the residual sum of squares \[ \text{RSS}({\bf w}) = \frac 12 \sum_{n=1}^N (y_n-{\bf w}^\intercal {\bf x}_n)^2 = \frac 12 \Vert{\bf Xw - y}\Vert_2^2 \]
In the following we discuss how to optimize this.
Functions that map vectors to scalars
\[ \frac{\partial({\bf a}^\intercal {\bf x})}{\partial {\bf x}} = {\bf a} \] \[ \frac{\partial({\bf x}^\intercal {\bf A}{\bf x})}{\partial {\bf x}} = ({\bf A} + {\bf A}^\intercal) {\bf x} \]
\[ \text{RSS}({\bf w}) = \frac 12 \Vert{\bf Xw - y}\Vert_2^2 = \frac 12 ({\bf Xw - y})^\intercal ({\bf Xw - y}) \] Let’s compute the gradient by making use of the two identities \[ \begin{align*} \nabla_{{\bf w}}\text{RSS}({\bf w}) &= \frac 12 \nabla_{{\bf w}} \left[ ({\bf Xw})^\intercal ({\bf Xw}) - 2({\bf Xw})^\intercal {\bf y} \right] \\ &= \frac 12 \nabla_{{\bf w}} {\bf w^\intercal( X^\intercal X) w} - \nabla_{\bf w} {\bf ( (X^\intercal)^\intercal w)^\intercal y} \\ &= \frac 12 ({\bf X^\intercal X} + ({\bf X^\intercal X})^\intercal){\bf w} - {\bf X^\intercal y} \\ &= {\bf X^\intercal Xw - X^\intercal y} \end{align*} \] Setting the gradient to zero leads to the normal equations \[ \boxed{ {\bf X^\intercal Xw = X^\intercal y} } \]
The solution to the normal equations \({\bf X^\intercal Xw = X^\intercal y}\) leads to the ordinary least squares (OLS) solution \[ \hat{\bf w} = \underbrace{{\bf (X^\intercal X)}^{-1}{\bf X^\intercal}}_{{\bf X}^\dagger} {\bf y} \] where \({\bf X}^\dagger\) is the (left) pseudo inverse of the (non-square) matrix \({\bf X}\).
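A minimal NumPy sketch of the OLS solution via the normal equations; the synthetic data, noise level, and variable names below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: y = w0 + w1*x + Gaussian noise
N = 100
X = np.column_stack([np.ones(N), rng.uniform(-3, 3, size=N)])  # design matrix with bias column
w_true = np.array([1.0, 2.5])
y = X @ w_true + rng.normal(0.0, 0.5, size=N)

# Solve the normal equations X^T X w = X^T y
# (solving the linear system is preferred over forming the inverse explicitly)
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, numerically more stable routes:
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # SVD-based least squares
w_pinv = np.linalg.pinv(X) @ y                    # explicit pseudo-inverse X^+

print(w_hat, w_lstsq, w_pinv)  # all three estimates should agree
```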
\[ {\bf H(w)} = \frac{\partial^2}{\partial {\bf w}^2}\text{RSS}({\bf w}) = {\bf X^\intercal X} \] If \({\bf X}\) is full rank (i.e., the columns of \({\bf X}\) are linearly independent), then for any \({\bf v} \neq {\bf 0}\) \[ {\bf v}^\intercal ({\bf X^\intercal X}){\bf v} = ({\bf Xv})^\intercal ({\bf Xv}) = \Vert{\bf Xv}\Vert^2 > 0 \] In the full-rank case, \({\bf H}\) is positive definite, so the OLS objective has a unique global minimum.
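A quick numerical check of this statement (the random design matrix is illustrative; strictly positive eigenvalues confirm positive definiteness):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))      # full-rank design matrix (illustrative)
H = X.T @ X                       # Hessian of the RSS
print(np.linalg.eigvalsh(H))      # all eigenvalues are strictly positive
```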
Contours of the RSS error surface. Blue cross is the MLE.
Sometimes the variance may depend on the input. In this case we may want to associate a specific weight with each example \[ \begin{align*} p(y | {\bf x}, {\bf \theta}) &= \mathcal{N}(y | {\bf w}^\intercal {\bf x}, \sigma^2({\bf x})) \\ &= \frac 1{\sqrt{2\pi\sigma^2({\bf x})}} \exp\left(-\frac 1{2\sigma^2({\bf x})}(y - {\bf w}^\intercal {\bf x})^2\right) \end{align*} \] so that, for the whole dataset, \[ p({\bf y} | {\bf X}, {\bf \theta}) = \mathcal{N}({\bf y} | {\bf Xw}, {\bf \Lambda}^{-1}) \] where \({\bf \Lambda} = \text{diag}(1/\sigma^2({\bf x}_n))\).
The maximum likelihood estimate yields the weighted least-squares estimate \[ \boxed{ \hat{\bf w} = ({\bf X^\intercal \Lambda X})^{-1} {\bf X^\intercal \Lambda y} } \]
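A direct transcription of this formula into NumPy (the heteroscedastic data and the helper name `weighted_least_squares` are illustrative):

```python
import numpy as np

def weighted_least_squares(X, y, sigma2):
    """Closed-form WLS estimate w = (X^T Lam X)^{-1} X^T Lam y,
    with Lam = diag(1 / sigma2_n) built from per-example noise variances."""
    Lam = np.diag(1.0 / sigma2)
    return np.linalg.solve(X.T @ Lam @ X, X.T @ Lam @ y)

# Illustrative heteroscedastic data: the noise variance grows with |x|
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=200)
X = np.column_stack([np.ones_like(x), x])
sigma2 = 0.1 + 0.5 * x**2
y = X @ np.array([1.0, 2.5]) + rng.normal(0.0, np.sqrt(sigma2))

print(weighted_least_squares(X, y, sigma2))
```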
After estimating \(\hat{\bf w}\) using MLE, we can estimate the noise variance. \[ \text{NLL} = \frac 1{2\sigma^2} \sum_{n=1}^N (y_n-{\bf w}^\intercal {\bf x}_n)^2 + \frac N2 \log(2\pi\sigma^2) \]
\[ \nabla_{\sigma} \text{NLL} \overset{!}{=}0 \]
\[ \begin{align*} -\frac 1{\hat \sigma^3} \sum_{n=1}^N (y_n - {\bf w}^\intercal {\bf x}_n)^2 + \frac N{\hat \sigma} &= 0 \\ \hat \sigma^2 &= \frac 1N \sum_{n=1}^N (y_n - {\bf w}^\intercal {\bf x}_n)^2 \end{align*} \] The MLE \(\hat\sigma^2\) is the MSE of the residuals, an intuitive result.
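In code, this estimate is just the mean of the squared residuals after an OLS fit (a small self-contained sketch with illustrative data):

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(100), rng.uniform(-3, 3, size=100)])
y = X @ np.array([1.0, 2.5]) + rng.normal(0.0, 0.5, size=100)

w_hat = np.linalg.solve(X.T @ X, X.T @ y)   # OLS estimate
residuals = y - X @ w_hat
sigma2_hat = np.mean(residuals**2)          # MLE of the noise variance = MSE of residuals
print(sigma2_hat)                           # close to 0.25 for noise std 0.5
```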
Residuals \(r_n = y_n - \hat y_n\) vs the input \(x_n\). Should follow \(\mathcal{N}(0, \sigma^2)\).
Predictions \(\hat y_n\) vs the true output \(y_n\). Points should be on the diagonal.
Compute the residual sum of squares on the dataset (the smaller the better) \[ \text{RSS}({\bf w}) = \sum_{n=1}^N (y_n - {\bf w^\intercal x}_n)^2 \]
RMSE is a more common measure to quantify accuracy \[ \text{RMSE}({\bf w}) = \sqrt{\frac 1N \text{RSS}({\bf w})} = \sqrt{\frac 1N \sum_{n=1}^N (y_n - {\bf w^\intercal x}_n)^2} \]
A more interpretable measure is \[ R^2 = 1 - \frac{\sum_{n=1}^N (\hat y_n - y_n)^2}{\sum_{n=1}^N (\bar y - y_n)^2} = 1 - \frac{\text{RSS}}{\text{TSS}} \] where \(\bar y = \frac 1N \sum_{n=1}^N y_n\) is the empirical mean, RSS is the residual sum of squares, and TSS is the total sum of squares. \(R^2\) measures how well the predictions explain the variance in the data relative to the simple constant prediction \(\bar y\).
Note
\(0 \leq R^2 \leq 1\), and larger values imply better fit (see figure in previous slide).
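The three metrics above (RSS, RMSE, \(R^2\)) are straightforward to compute; a minimal helper, with the name `regression_metrics` chosen for illustration:

```python
import numpy as np

def regression_metrics(y, y_hat):
    """RSS, RMSE and R^2 as defined above."""
    rss = np.sum((y - y_hat) ** 2)
    rmse = np.sqrt(rss / len(y))
    tss = np.sum((y - y.mean()) ** 2)   # total sum of squares around the empirical mean
    r2 = 1.0 - rss / tss
    return rss, rmse, r2

# Example usage with predictions from any fitted model:
# rss, rmse, r2 = regression_metrics(y, X @ w_hat)
```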
What happens when the data contains outliers?
\[ p(y | {\bf x}, {\bf w}, b) = \text{Laplace}(y | {\bf w}^\intercal{\bf x}, b) \propto \exp\left( -\frac 1b |y - {\bf w}^\intercal{\bf x}|\right) \] The MLE can be computed using linear programming.
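A sketch of this linear-programming formulation with `scipy.optimize.linprog`: the Laplace MLE is the least-absolute-deviations fit, obtained by introducing slack variables \(t_n \geq |y_n - {\bf w}^\intercal{\bf x}_n|\) and minimizing \(\sum_n t_n\). The function name and the outlier demo are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def lad_regression(X, y):
    """Least-absolute-deviations fit (Laplace-likelihood MLE) as a linear program.
    Decision variables z = [w (D entries), t (N entries)]."""
    N, D = X.shape
    c = np.concatenate([np.zeros(D), np.ones(N)])       # minimize sum of slacks t
    # Encode |y - Xw| <= t as:  Xw - t <= y  and  -Xw - t <= -y
    A_ub = np.block([[X, -np.eye(N)],
                     [-X, -np.eye(N)]])
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * D + [(0, None)] * N        # w free, t >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:D]

# Illustrative data with a few gross outliers
rng = np.random.default_rng(6)
X = np.column_stack([np.ones(50), rng.uniform(-3, 3, 50)])
y = X @ np.array([1.0, 2.5]) + rng.normal(0, 0.3, 50)
y[:3] += 20.0                        # corrupt a few targets
print(lad_regression(X, y))          # stays close to [1.0, 2.5] despite the outliers
```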
\[ p(y | {\bf x}, {\bf w}, \sigma^2, \nu) \propto \left[ 1 + \frac 1\nu \left(\frac{y-\mu}\sigma \right)^2 \right]^{-\frac{\nu+1}2}, \qquad \mu = {\bf w}^\intercal{\bf x} \] The model can be fit using stochastic gradient descent.
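A hedged sketch of such a fit: instead of stochastic gradient descent, a generic quasi-Newton optimizer from SciPy is used here for brevity, minimizing the Student-t NLL with constants independent of \({\bf w}\) dropped. The data and fixed \(\sigma, \nu\) values are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def student_t_nll(w, X, y, sigma=1.0, nu=4.0):
    """Student-t negative log-likelihood, up to constants independent of w."""
    r = (y - X @ w) / sigma
    return 0.5 * (nu + 1.0) * np.sum(np.log1p(r**2 / nu))

rng = np.random.default_rng(5)
X = np.column_stack([np.ones(100), rng.uniform(-3, 3, 100)])
y = X @ np.array([1.0, 2.5]) + rng.standard_t(df=4, size=100)   # heavy-tailed noise

res = minimize(student_t_nll, x0=np.zeros(2), args=(X, y))
print(res.x)   # robust estimate of the weights
```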
A mix of the \(\ell_2\) and \(\ell_1\) losses, applied to small and large residuals, respectively.
Recall the multivariate normal likelihood (iid) with known variance \[ p(\mathcal{D} | {\bf w}, \sigma^2) = \prod_{n=1}^N p(y_n | {\bf w}^\intercal{\bf x}_n, \sigma^2) = \mathcal{N}({\bf y} | {\bf Xw}, \sigma^2 {\bf I}_N) \] Compute the posterior over the parameters, \(p({\bf w} | \mathcal{D}, \sigma^2)\). Assume further a Gaussian prior \[ p({\bf w}) = \mathcal{N}({\bf w} | {\bf \breve{w}}, {\bf \breve{\Sigma}}) \] Use Bayes’ rule \[ \begin{align*} p({\bf w} | {\bf X}, {\bf y}, \sigma^2) &\propto \mathcal{N}({\bf y} | {\bf Xw}, \sigma^2 {\bf I}_N) \mathcal{N}({\bf w} | {\bf \breve{w}}, {\bf \breve{\Sigma}}) = \mathcal{N}({\bf w} | \tilde{\bf w}, \tilde{\bf \Sigma}) \\ \tilde{\bf w} &= \tilde{\bf \Sigma} ({\bf \breve{\Sigma}}^{-1}{\bf \breve{w}} + \frac 1{\sigma^2} {\bf X}^\intercal{\bf y}) \\ \tilde{\bf \Sigma} &= \left({\bf \breve{\Sigma}}^{-1} + \frac 1{\sigma^2} {\bf X}^\intercal{\bf X}\right)^{-1} \end{align*} \] where \(\tilde{\bf w}\) and \(\tilde{\bf \Sigma}\) are the posterior mean and covariance, respectively.
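A direct transcription of these update equations into NumPy (the helper name `posterior_w` and its arguments are illustrative):

```python
import numpy as np

def posterior_w(X, y, sigma2, w0, Sigma0):
    """Posterior N(w | w_tilde, Sigma_tilde) for a Gaussian prior N(w0, Sigma0)
    and known noise variance sigma2, following the formulas above."""
    Sigma0_inv = np.linalg.inv(Sigma0)
    Sigma_tilde = np.linalg.inv(Sigma0_inv + (X.T @ X) / sigma2)
    w_tilde = Sigma_tilde @ (Sigma0_inv @ w0 + (X.T @ y) / sigma2)
    return w_tilde, Sigma_tilde
```

With a zero-mean isotropic prior (`w0 = 0`, `Sigma0 = tau2 * I`), the posterior mean coincides with the ridge-regression estimate.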
The figure illustrates how the model “learns” about the parameters as it sees more data.
Note
Generalized linear model (GLM)
The GLM generalizes this linearity property: a GLM is a conditional version of an exponential-family distribution in which the natural parameters are a linear function of the input \[ p(y_n | {\bf x}_n, {\bf w}, \sigma^2) = \exp\left[ \frac{y_n \eta_n - A(\eta_n)}{\sigma^2} + \log h(y_n, \sigma^2) \right] \] where \(\eta_n = {\bf w}^\intercal{\bf x}_n\) is the natural parameter, \(A(\eta_n)\) is the log partition function, \(\sigma^2\) is the dispersion parameter, and \(h(y_n, \sigma^2)\) is a normalization term.
For GLMs, the sufficient statistic is \(T(y) = y\). One can then show \[ \begin{align*} E[y_n | {\bf x}_n, {\bf w}, \sigma^2] &= A'(\eta_n) = \ell^{-1}(\eta_n) \\ \text{Var}[y_n | {\bf x}_n, {\bf w}, \sigma^2] &= A''(\eta_n) \sigma^2 \end{align*} \] where \(\ell^{-1}\), the inverse of the link function \(\ell\), is called the mean function.
Linear regression
\[ \begin{align*} p(y_n | {\bf x}_n, {\bf w}, \sigma^2) &= \frac 1{\sqrt{2\pi\sigma^2}} \exp\left(-\frac 1{2\sigma^2}(y_n - {\bf w}^\intercal {\bf x}_n)^2\right) \\ \log p(y_n | {\bf x}_n, {\bf w}, \sigma^2) &= -\frac 1{2\sigma^2}(y_n-\eta_n)^2 - \frac 12 \log(2\pi\sigma^2) \end{align*} \]
where \(\eta_n = {\bf w}^\intercal {\bf x}_n\). This can be rewritten in GLM form \[ \log p(y_n | {\bf x}_n, {\bf w}, \sigma^2) = \frac{y_n \eta_n - \frac{\eta_n^2}{2}}{\sigma^2} - \frac 12\left(\frac{y_n^2}{\sigma^2} + \log(2\pi\sigma^2) \right) \] We see that \(A(\eta_n) = \eta_n^2/2\) and thus \[ \begin{align*} E[y_n | {\bf x}_n, {\bf w}, \sigma^2] &= \eta_n = {\bf w^\intercal x}_n \\ \text{Var}[y_n | {\bf x}_n, {\bf w}, \sigma^2] &= \sigma^2 \end{align*} \]
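A quick numerical sanity check that the GLM form above matches the Gaussian log density (the numerical values are arbitrary):

```python
import numpy as np
from scipy.stats import norm

y, eta, sigma2 = 1.3, 0.7, 0.5
lhs = norm.logpdf(y, loc=eta, scale=np.sqrt(sigma2))                     # Gaussian log density
rhs = (y * eta - eta**2 / 2) / sigma2 \
      - 0.5 * (y**2 / sigma2 + np.log(2 * np.pi * sigma2))               # GLM form
print(np.isclose(lhs, rhs))   # True: the two parameterizations agree
```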
Binomial regression
Recall the binomial distribution for a binary process with \(N_n\) trials per example. Model the probability of success as a function of the linear predictor, \(\mu_n = \mu_n({\bf w}^\intercal{\bf x}_n)\). \[ \begin{align*} p(y_n | {\bf x}_n, N_n, {\bf w}) &= \text{Bin}(y_n | \mu_n({\bf w}^\intercal {\bf x}_n), N_n) \\ \log p(y_n | {\bf x}_n, N_n, {\bf w}) &= y_n \log \mu_n + (N_n - y_n)\log(1-\mu_n) + \log {N_n \choose y_n} \\ &= y_n \log \left( \frac{\mu_n}{1-\mu_n} \right) + N_n \log(1 - \mu_n) + \log {N_n \choose y_n} \\ &= y_n \eta_n - A(\eta_n) + h(y_n) \end{align*} \] where \(\eta_n = {\bf w}^\intercal{\bf x}_n\), \(A(\eta_n) = -N_n \log(1 - \mu_n)\), and \(h(y_n) = \log {N_n \choose y_n}\). This shows that binomial regression can be written in GLM form.
We further have the moments \[ \begin{align*} E[y_n | {\bf x}_n, N_n, {\bf w}] &= \frac{\text{d}A}{\text{d}\eta_n} = \frac{\text{d}}{\text{d}\eta_n} N_n \log(1 + \text{e}^{\eta_n}) = N_n \mu_n \\ \text{Var}[y_n | {\bf x}_n, N_n, {\bf w}] &= \frac{\text{d}^2A}{\text{d}\eta_n^2} = N_n \mu_n(1-\mu_n) \end{align*} \]
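These moments can be checked with finite differences of \(A(\eta) = N_n\log(1+\text{e}^{\eta})\); the step size and numerical values below are arbitrary:

```python
import numpy as np

N_n, eta, h = 10, 0.3, 1e-4
A = lambda e: N_n * np.log1p(np.exp(e))          # log partition function
mu = 1.0 / (1.0 + np.exp(-eta))                  # mean function (sigmoid)

# First derivative vs N_n * mu
print((A(eta + h) - A(eta - h)) / (2 * h), N_n * mu)
# Second derivative vs N_n * mu * (1 - mu)
print((A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2, N_n * mu * (1 - mu))
```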
Negative log-likelihood
Consider the negative log-likelihood for the GLM \[ \text{NLL}({\bf w}) = - \sum_{n=1}^N \log p(y_n | {\bf x}_n, {\bf w}, \sigma^2) = - \frac 1{\sigma^2} \sum_{n=1}^N \left(\eta_n y_n - A(\eta_n)\right) + \text{const} \] where \(\eta_n = {\bf w^\intercal x}_n\).
\[ \begin{align*} \frac{\partial}{\partial {\bf w}} \text{NLL}({\bf w}) &= - \frac 1{\sigma^2} \sum_{n=1}^N \frac{\partial}{\partial \eta_n} (\eta_n y_n - A(\eta_n)) \frac{\partial \eta_n}{\partial {\bf w}} \\ &= -\frac 1{\sigma^2} \sum_{n=1}^N (y_n - A'(\eta_n)) {\bf x}_n \end{align*} \]
\[ {\bf H} = \frac{\partial^2}{\partial{\bf w}\,\partial{\bf w}^\intercal}\text{NLL}({\bf w}) = \frac 1{\sigma^2}\sum_{n=1}^N A''(\eta_n){\bf x}_n {\bf x}_n^\intercal \] Since \(A''(\eta_n) > 0\), \({\bf H}\) is positive definite (for a full-rank design matrix), so the MLE for a GLM is unique!
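A minimal sketch of fitting a GLM by gradient descent using the gradient derived above. In practice Newton's method (IRLS) is the standard choice; the Bernoulli example below, with \(A'(\eta) = \text{sigmoid}(\eta)\), and all names and hyperparameters are illustrative.

```python
import numpy as np

def fit_glm(X, y, A_prime, lr=1.0, n_iters=2000, sigma2=1.0):
    """Gradient descent on the GLM NLL using
    grad = -(1/sigma2) * sum_n (y_n - A'(eta_n)) x_n  (average gradient per step)."""
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_iters):
        eta = X @ w
        grad = -(X.T @ (y - A_prime(eta))) / sigma2
        w -= lr * grad / N
    return w

# Example: Bernoulli GLM, A(eta) = log(1 + exp(eta)), so A'(eta) = sigmoid(eta)
rng = np.random.default_rng(3)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
w_true = np.array([-0.5, 2.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ w_true))))
sigmoid = lambda eta: 1.0 / (1.0 + np.exp(-eta))
print(fit_glm(X, y, sigmoid))   # should approach w_true
```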