Computational Statistics & Data Analysis (MVComp2)

Lecture 7: Linear regression

Tristan Bereau

Institute for Theoretical Physics, Heidelberg University

Introduction

Literature

  • Chapters 11 and 12 of Murphy (2022)

Recap from last time

Parameter estimation
Distribution of the sample mean, sample variance
PDF of common statistics
Expected distribution for mean, variance, standardized distribution, ratio of variances
Confidence regions
Put bounds on statistic
Hypothesis testing
Try to reject the null hypothesis up to desired confidence
Non-parametric tests
Pearson \(\chi^2\) for binned data, bootstrap

Least-squares

Introduction: Linear regression

Linear regression
Model of the following form \[ p(y | {\bf x}, {\bf \theta}) = \mathcal{N}(y | w_0 + {\bf w}^\intercal {\bf x}, \sigma^2) \] where \({\bf \theta} = (w_0, {\bf w}, \sigma^2)\) are the model parameters.

Why?

Expected value of the output is assumed to be a linear function of the input \[ E[y|{\bf x}] = w_0 + {\bf w}^\intercal {\bf x} \]

Introduction: Linear regression

Linear regression beyond straight lines
Apply a nonlinear transformation to the input features, \({\bf \phi}({\bf x})\), such that the model remains linear in the parameters \[ p(y | {\bf x}, {\bf \theta}) = \mathcal{N}(y | {\bf w}^\intercal {\bf \phi}({\bf x}), \sigma^2) \]
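Only the features change, not the estimator. A minimal NumPy sketch (the toy data and the degree-5 polynomial map are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1D data: a nonlinear function of x plus Gaussian noise.
x = np.linspace(-1.0, 1.0, 50)
y = np.sin(2.0 * np.pi * x) + 0.2 * rng.standard_normal(x.size)

# Polynomial feature map phi(x) = (1, x, x^2, ..., x^d):
# the model is nonlinear in x but still linear in the weights w.
degree = 5
Phi = np.vander(x, N=degree + 1, increasing=True)

# Least-squares fit of the weights (derived on the following slides).
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_hat = Phi @ w
```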

Least-squares estimation

Model fit
Minimize the negative log-likelihood on the training set \[ \begin{align*} \text{NLL} &= - \sum_{n=1}^N \log \left[ \left(\frac 1{2\pi\sigma^2} \right)^{\frac 12} \exp\left(-\frac 1{2\sigma^2}(y_n - {\bf w}^\intercal {\bf x}_n)^2\right) \right] \\ &= \frac 1{2\sigma^2} \sum_{n=1}^N (y_n-{\bf w}^\intercal {\bf x}_n)^2 + \frac N2 \log(2\pi\sigma^2) \end{align*} \]

Least-squares estimation

Maximum likelihood estimation
The MLE is the point where \(\nabla_{{\bf w}, \sigma}\text{NLL}({\bf w}, \sigma^2) = {\bf 0}\)

To estimate the weights, minimizing the NLL is equivalent to minimizing the residual sum of squares (the \(\sigma^2\)-dependent terms do not change the location of the minimum in \({\bf w}\)) \[ \text{RSS}({\bf w}) = \frac 12 \sum_{n=1}^N (y_n-{\bf w}^\intercal {\bf x}_n)^2 = \frac 12 \Vert{\bf Xw - y}\Vert_2^2 \]

In the following we discuss how to optimize this.

Ordinary least squares

Functions that map vectors to scalars

\[ \frac{\partial({\bf a}^\intercal {\bf x})}{\partial {\bf x}} = {\bf a} \] \[ \frac{\partial({\bf x}^\intercal {\bf A}{\bf x})}{\partial {\bf x}} = ({\bf A} + {\bf A}^\intercal) {\bf x} \]

\[ \text{RSS}({\bf w}) = \frac 12 \Vert{\bf Xw - y}\Vert_2^2 = \frac 12 ({\bf Xw - y})^\intercal ({\bf Xw - y}) \] Let’s compute the gradient by making use of the two identities \[ \begin{align*} \nabla_{{\bf w}}\text{RSS}({\bf w}) &= \frac 12 \nabla_{{\bf w}} \left[ ({\bf Xw})^\intercal ({\bf Xw}) - 2({\bf Xw})^\intercal {\bf y} \right] \\ &= \frac 12 \nabla_{{\bf w}} {\bf w^\intercal( X^\intercal X) w} - \nabla_{\bf w} {\bf ( (X^\intercal)^\intercal w)^\intercal y} \\ &= \frac 12 ({\bf X^\intercal X} + ({\bf X^\intercal X})^\intercal){\bf w} - {\bf X^\intercal y} \\ &= {\bf X^\intercal Xw - X^\intercal y} \end{align*} \] Setting the gradient to zero leads to the normal equations \[ \boxed{ {\bf X^\intercal Xw = X^\intercal y} } \]

Ordinary least squares: Solution

The solution to the normal equations \({\bf X^\intercal Xw = X^\intercal y}\) leads to the ordinary least squares (OLS) solution \[ \hat{\bf w} = \underbrace{{\bf (X^\intercal X)}^{-1}{\bf X^\intercal}}_{{\bf X}^\dagger} {\bf y} \] where \({\bf X}^\dagger\) is the (left) pseudo inverse of the (non-square) matrix \({\bf X}\).
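A minimal NumPy sketch of this solution on made-up data (the synthetic design matrix and true weights below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: y = w0 + w1*x1 + w2*x2 + Gaussian noise.
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])  # column of ones carries w0
w_true = np.array([1.0, 2.0, -0.5])
y = X @ w_true + 0.3 * rng.standard_normal(N)

# OLS via the normal equations, X^T X w = X^T y.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically preferable alternatives: lstsq or the pseudoinverse
# avoid explicitly forming (and inverting) X^T X.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
w_pinv = np.linalg.pinv(X) @ y

print(w_normal, w_lstsq, w_pinv)  # the three estimates agree to numerical precision
```

In practice, lstsq (based on an orthogonal decomposition) is preferred over explicitly inverting \({\bf X^\intercal X}\), which squares the condition number of the problem.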

Ordinary least squares: Uniqueness of solution

Strategy
Check that the Hessian is positive definite

\[ {\bf H(w)} = \frac{\partial^2}{\partial {\bf w}^2}\text{RSS}({\bf w}) = {\bf X^\intercal X} \] If \({\bf X}\) is full rank (i.e., the columns of \({\bf X}\) are linearly independent), then for any \({\bf v} \neq {\bf 0}\) \[ {\bf v}^\intercal ({\bf X^\intercal X}){\bf v} = ({\bf Xv})^\intercal ({\bf Xv}) = \Vert{\bf Xv}\Vert^2 > 0 \] In the full-rank case, \({\bf H}\) is therefore positive definite, so OLS has a unique global minimum.

Ordinary least squares: Uniqueness of solution

Contours of the RSS error surface. Blue cross is the MLE.

Weighted least squares

Sometimes the variance may depend on the input (heteroskedastic noise). In this case we may want to associate a specific weight with each example \[ \begin{align*} p(y_n | {\bf x}_n, {\bf \theta}) &= \mathcal{N}(y_n | {\bf w}^\intercal {\bf x}_n, \sigma^2({\bf x}_n)) \\ &= \frac 1{\sqrt{2\pi\sigma^2({\bf x}_n)}} \exp\left(-\frac 1{2\sigma^2({\bf x}_n)}(y_n - {\bf w}^\intercal {\bf x}_n)^2\right) \end{align*} \] so that the joint likelihood becomes \[ p({\bf y} | {\bf X}, {\bf \theta}) = \mathcal{N}({\bf y} | {\bf Xw}, {\bf \Lambda}^{-1}) \] where \({\bf \Lambda} = \text{diag}(1/\sigma^2({\bf x}_n))\).

The maximum likelihood estimate yields the weighted least-squares estimate \[ \boxed{ \hat{\bf w} = ({\bf X^\intercal \Lambda X})^{-1} {\bf X^\intercal \Lambda y} } \]
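A minimal NumPy sketch of this estimator (function names and the per-example variances sigma2 are illustrative assumptions); the second variant rescales each row by \(1/\sigma({\bf x}_n)\) and runs OLS, which is algebraically equivalent:

```python
import numpy as np

def weighted_least_squares(X, y, sigma2):
    """Solve (X^T Lambda X) w = X^T Lambda y with Lambda = diag(1/sigma2)."""
    Lam = np.diag(1.0 / sigma2)                 # per-example precisions
    return np.linalg.solve(X.T @ Lam @ X, X.T @ Lam @ y)

def weighted_least_squares_scaled(X, y, sigma2):
    """Equivalent formulation: rescale rows by 1/sigma(x_n), then run OLS."""
    s = 1.0 / np.sqrt(sigma2)
    return np.linalg.lstsq(X * s[:, None], y * s, rcond=None)[0]
```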

MLE for \(\sigma^2\)

After estimating \(\hat{\bf w}\) using MLE, we can estimate the noise variance. \[ \text{NLL} = \frac 1{2\sigma^2} \sum_{n=1}^N (y_n-{\bf w}^\intercal {\bf x}_n)^2 + \frac N2 \log(2\pi\sigma^2) \]

\[ \nabla_{\sigma} \text{NLL} \overset{!}{=}0 \]

\[ \begin{align*} -\frac 1{\hat \sigma^3} \sum_{n=1}^N (y_n - {\bf w}^\intercal {\bf x}_n)^2 + \frac N{\hat \sigma} &= 0 \\ \hat \sigma^2 &= \frac 1N \sum_{n=1}^N (y_n - {\bf w}^\intercal {\bf x}_n)^2 \end{align*} \] i.e., \(\hat\sigma^2\) is the MSE of the residuals, an intuitive result.
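A one-line sketch in NumPy, reusing hypothetical arrays X, y and fitted weights w_hat from an earlier fit:

```python
import numpy as np

def sigma2_mle(X, y, w_hat):
    """MLE of the noise variance: the mean squared residual."""
    resid = y - X @ w_hat
    return np.mean(resid ** 2)
```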

Measuring goodness of fit

Residual and parity plots

Residual plot

Residuals \(r_n = y_n - \hat y_n\) vs the input \(x_n\). Should follow \(\mathcal{N}(0, \sigma^2)\).

Parity plot

Predictions \(\hat y_n\) vs the true output \(y_n\). Points should be on the diagonal.
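A possible matplotlib sketch of both diagnostics (the arrays x, y, y_hat are assumed to come from a fitted 1D model):

```python
import matplotlib.pyplot as plt

def diagnostic_plots(x, y, y_hat):
    """Residual plot (left) and parity plot (right) for a fitted model."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3.5))

    # Residuals vs. input: should scatter around zero with no visible structure.
    ax1.scatter(x, y - y_hat, s=10)
    ax1.axhline(0.0, color="k", lw=0.8)
    ax1.set(xlabel="$x_n$", ylabel="$r_n = y_n - \\hat{y}_n$", title="Residual plot")

    # Predictions vs. truth: points should lie on the diagonal.
    ax2.scatter(y, y_hat, s=10)
    lims = [min(y.min(), y_hat.min()), max(y.max(), y_hat.max())]
    ax2.plot(lims, lims, "k--", lw=0.8)
    ax2.set(xlabel="$y_n$", ylabel="$\\hat{y}_n$", title="Parity plot")

    fig.tight_layout()
    return fig
```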

Prediction accuracy and \(R^2\)

Quantify accuracy

Compute the residual sum of squares on the dataset (the smaller the better) \[ \text{RSS}({\bf w}) = \sum_{n=1}^N (y_n - {\bf w^\intercal x}_n)^2 \]

Root mean-squared error (RMSE)

RMSE is a more common measure to quantify accuracy \[ \text{RMSE}({\bf w}) = \sqrt{\frac 1N \text{RSS}({\bf w})} = \sqrt{\frac 1N \sum_{n=1}^N (y_n - {\bf w^\intercal x}_n)^2} \]

Coefficient of determination (\(R^2\))

More interpretable measure \[ R^2 = 1 - \frac{\sum_{n=1}^N (\hat y_n - y_n)^2}{\sum_{n=1}^N (\bar y - y_n)^2} = 1 - \frac{\text{RSS}}{\text{TSS}} \] where \(\bar y = \frac 1N \sum_{n=1}^N y_n\) is the empirical mean, RSS is the residual sum of squares, and TSS is the total sum of squares. \(R^2\) measures the fraction of the variance around the simple constant prediction \(\bar y\) that is explained by the model.

Note

For a least-squares fit with an intercept (evaluated on the training data), \(0 \leq R^2 \leq 1\), and larger values imply a better fit (see figure in previous slide).
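A minimal NumPy sketch of both metrics (function names are illustrative):

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean-squared error of the predictions."""
    return np.sqrt(np.mean((y - y_hat) ** 2))

def r2_score(y, y_hat):
    """Coefficient of determination, R^2 = 1 - RSS/TSS."""
    rss = np.sum((y - y_hat) ** 2)           # residual sum of squares
    tss = np.sum((y - np.mean(y)) ** 2)      # total sum of squares around the mean
    return 1.0 - rss / tss
```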

Robust linear regression

What happens when the data contains outliers?

What happens?
Squared error penalizes deviations quadratically, so points far from the line have more effect than points near the line.
Strategy
Replace Gaussian with a distribution that has heavy tails

Robust linear regression: Multiple strategies

Laplace likelihood

\[ p(y | {\bf x}, {\bf w}, b) = \text{Laplace}(y | {\bf w}^\intercal{\bf x}, b) \propto \exp\left( -\frac 1b |y - {\bf w}^\intercal{\bf x}|\right) \] The MLE can be computed using linear programming.

Student-\(t\) likelihood

\[ p(y | {\bf x}, {\bf w}, \sigma^2, \nu) = \mathcal{T}(y | \mu, \sigma^2, \nu) \propto \left[ 1 + \frac 1\nu \left(\frac{y-\mu}\sigma \right)^2 \right]^{-\frac{\nu+1}2} \] where \(\mu = {\bf w}^\intercal{\bf x}\). The model can be fit using stochastic gradient descent.

Huber loss

Quadratic (\(\ell_2\)) loss for small residuals and linear (\(\ell_1\)) loss for large residuals, so outliers contribute only linearly to the objective.
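As a sketch of one practical option (not the only one), scipy.optimize.least_squares supports a Huber loss on the residuals; sklearn's HuberRegressor is a similar alternative. The threshold delta below is an illustrative choice:

```python
import numpy as np
from scipy.optimize import least_squares

def fit_huber(X, y, delta=1.0):
    """Robust linear fit with a Huber penalty on the residuals.

    delta sets the transition between the quadratic (small residuals)
    and linear (large residuals) regimes.
    """
    w0 = np.zeros(X.shape[1])                        # start from the zero vector
    res = least_squares(lambda w: X @ w - y, w0,     # residual vector r(w) = Xw - y
                        loss="huber", f_scale=delta)
    return res.x
```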

Bayesian linear regression

Recall the multivariate normal likelihood (iid) with known variance \[ p(\mathcal{D} | {\bf w}, \sigma^2) = \prod_{n=1}^N p(y_n | {\bf w}^\intercal{\bf x}_n, \sigma^2) = \mathcal{N}({\bf y} | {\bf Xw}, \sigma^2 {\bf I}_N) \] Compute the posterior over the parameters, \(p({\bf w} | \mathcal{D}, \sigma^2)\). Assume further a Gaussian prior \[ p({\bf w}) = \mathcal{N}({\bf w} | {\bf \breve{w}}, {\bf \breve{\Sigma}}) \] Use Bayes’ rule \[ \begin{align*} p({\bf w} | {\bf X}, {\bf y}, \sigma^2) &\propto \mathcal{N}({\bf y} | {\bf Xw}, \sigma^2 {\bf I}_N) \mathcal{N}({\bf w} | {\bf \breve{w}}, {\bf \breve{\Sigma}}) = \mathcal{N}({\bf w} | \tilde{\bf w}, \tilde{\bf \Sigma}) \\ \tilde{\bf w} &= \tilde{\bf \Sigma} ({\bf \breve{\Sigma}}^{-1}{\bf \breve{w}} + \frac 1{\sigma^2} {\bf X}^\intercal{\bf y}) \\ \tilde{\bf \Sigma} &= \left({\bf \breve{\Sigma}}^{-1} + \frac 1{\sigma^2} {\bf X}^\intercal{\bf X}\right)^{-1} \end{align*} \] where \(\tilde{\bf w}\) and \(\tilde{\bf \Sigma}\) are the posterior mean and covariance, respectively.
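A minimal NumPy sketch of the posterior update above (argument names are illustrative; sigma2 is the known noise variance):

```python
import numpy as np

def posterior_weights(X, y, sigma2, w_prior, Sigma_prior):
    """Gaussian posterior over the weights, given a Gaussian prior and known noise variance."""
    Sigma_prior_inv = np.linalg.inv(Sigma_prior)
    Sigma_post = np.linalg.inv(Sigma_prior_inv + (X.T @ X) / sigma2)        # posterior covariance
    w_post = Sigma_post @ (Sigma_prior_inv @ w_prior + (X.T @ y) / sigma2)  # posterior mean
    return w_post, Sigma_post
```

Calling this repeatedly with the previous posterior as the new prior reproduces the sequential updates on the next slide.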

Sequential Bayesian inference

Figure: posterior over the weights after 0, 1, 2, and 100 data points. The sequence illustrates how the model “learns” about the parameters as it sees more data.

Flavors of linear regression

Note

  • Likelihood: \(p(y | {\bf x}, {\bf w}, \sigma^2)\); Prior: \(p({\bf w})\); Posterior: \(p({\bf w} | \mathcal{D})\)
  • Point: degenerate distribution, \(\delta({\bf w} - \hat{\bf w})\), where \(\hat {\bf w}\) is the MAP estimate
  • MLE is equivalent to using a point posterior and a uniform prior

Generalized linear models

Introduction

Linear regression
Recall \(p(y | {\bf x}, {\bf w}) = \mathcal{N}(y | {\bf w^\intercal x}, \sigma^2)\). Useful property: the mean of the output, \(E[y | {\bf x}, {\bf w}]\), is a linear function of the inputs \({\bf x}\).

Generalized linear model (GLM)

GLM generalizes this linearity property. GLM is a conditional version of an exponential-family distribution. The natural parameters are a linear function of the input \[ p(y_n | {\bf x}_n, {\bf w}, \sigma^2) = \exp\left[ \frac{y_n \eta_n - A(\eta_n)}{\sigma^2} + \log h(y_n, \sigma^2) \right] \] where

  • \(\eta_n = {\bf w^\intercal x}_n\) is the (input dependent) natural parameter
  • \(A(\eta_n)\) is the log partition function
  • \(T(y) = y\) is the sufficient statistic
  • \(\sigma^2\) is the dispersion term

Moments of GLMs

For GLMs, \(T(y) = y\). One can then show \[ \begin{align*} E[y_n | {\bf x}_n, {\bf w}, \sigma^2] &= A'(\eta_n) = \ell^{-1}(\eta_n) \\ \text{Var}[y_n | {\bf x}_n, {\bf w}, \sigma^2] &= A''(\eta_n) \sigma^2 \end{align*} \] where \(\ell^{-1}\) is called the mean function.

Example: linear regression

Linear regression

\[ \begin{align*} p(y_n | {\bf x}_n, {\bf w}, \sigma^2) &= \frac 1{\sqrt{2\pi\sigma^2}} \exp\left(-\frac 1{2\sigma^2}(y_n - {\bf w}^\intercal {\bf x}_n)^2\right) \\ \log p(y_n | {\bf x}_n, {\bf w}, \sigma^2) &= -\frac 1{2\sigma^2}(y_n-\eta_n)^2 - \frac 12 \log(2\pi\sigma^2) \end{align*} \]

where \(\eta_n = {\bf w}^\intercal {\bf x}_n\), can be rewritten in GLM form \[ \log p(y_n | {\bf x}_n, {\bf w}, \sigma^2) = \frac{y_n \eta_n - \frac{\eta_n^2}{2}}{\sigma^2} - \frac 12\left(\frac{y_n^2}{\sigma^2} + \log(2\pi\sigma^2) \right) \] We see that \(A(\eta_n) = \eta_n^2/2\) and thus \[ \begin{align*} E[y_n | {\bf x}_n, {\bf w}, \sigma^2] &= \eta_n = {\bf w^\intercal x}_n \\ \text{Var}[y_n | {\bf x}_n, {\bf w}, \sigma^2] &= \sigma^2 \end{align*} \]

Example: binomial regression

Binomial regression

Recall the binomial distribution for a binary process with \(N_n\) trials per example. Model the probability of success as the logistic (sigmoid) function of a linear predictor, \(\mu_n = \sigma({\bf w}^\intercal{\bf x}_n)\). \[ \begin{align*} p(y_n | {\bf x}_n, N_n, {\bf w}) &= \text{Bin}(y_n | \mu_n({\bf w}^\intercal {\bf x}_n), N_n) \\ \log p(y_n | {\bf x}_n, N_n, {\bf w}) &= y_n \log \mu_n + (N_n - y_n)\log(1-\mu_n) + \log {N_n \choose y_n} \\ &= y_n \log \left( \frac{\mu_n}{1-\mu_n} \right) + N_n \log(1 - \mu_n) + \log {N_n \choose y_n} \\ &= y_n \eta_n - A(\eta_n) + h(y_n) \end{align*} \] where \(\eta_n = \log\left(\frac{\mu_n}{1-\mu_n}\right) = {\bf w}^\intercal{\bf x}_n\), \(A(\eta_n) = -N_n \log(1 - \mu_n) = N_n \log(1 + \text{e}^{\eta_n})\), and \(h(y_n) = \log {N_n \choose y_n}\). This shows that binomial regression can be written in GLM form.

We further have the moments \[ \begin{align*} E[y_n | {\bf x}_n, N_n, {\bf w}] &= \frac{\text{d}A}{\text{d}\eta_n} = \frac{\text{d}}{\text{d}\eta_n} N_n \log(1 + \text{e}^{\eta_n}) = N_n \mu_n \\ \text{Var}[y_n | {\bf x}_n, N_n, {\bf w}] &= \frac{\text{d}^2A}{\text{d}\eta_n^2} = N_n \mu_n(1-\mu_n) \end{align*} \]

Maximum likelihood estimation

Negative log-likelihood

Consider the negative log-likelihood for the GLM (dropping terms that do not depend on \({\bf w}\)) \[ \text{NLL}({\bf w}) = - \sum_{n=1}^N \log p(y_n | {\bf x}_n, {\bf w}, \sigma^2) = - \frac 1{\sigma^2} \sum_{n=1}^N \left(\eta_n y_n - A(\eta_n)\right) + \text{const} \] where \(\eta_n = {\bf w^\intercal x}_n\).

Gradient

\[ \begin{align*} \frac{\partial}{\partial {\bf w}} \text{NLL}({\bf w}) &= - \frac 1{\sigma^2} \sum_{n=1}^N \frac{\partial}{\partial \eta_n} (\eta_n y_n - A(\eta_n)) \frac{\partial \eta_n}{\partial {\bf w}} \\ &= -\frac 1{\sigma^2} \sum_{n=1}^N (y_n - A'(\eta_n)) {\bf x}_n \end{align*} \]

Hessian

\[ {\bf H} = \frac{\partial^2}{\partial{\bf w}\,\partial{\bf w}^\intercal}\text{NLL}({\bf w}) = \frac 1{\sigma^2}\sum_{n=1}^N A''(\eta_n){\bf x}_n {\bf x}_n^\intercal \] is positive definite (since \(A''(\eta_n) > 0\), assuming full-rank \({\bf X}\)), so the NLL is convex and the MLE for a GLM is unique!

Strategy
Fit GLMs using gradient-based solvers to reach the global minimum.
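A minimal sketch of this strategy for the binomial-regression example above, using plain gradient descent with the GLM gradient, the canonical (logistic) link, and unit dispersion; the arrays X, y, n_trials and the step size are illustrative assumptions:

```python
import numpy as np

def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def fit_binomial_glm(X, y, n_trials, lr=0.1, n_steps=5000):
    """Binomial regression fit by gradient descent on the GLM NLL.

    Uses grad NLL = -sum_n (y_n - A'(eta_n)) x_n with
    A'(eta_n) = N_n * sigmoid(eta_n) and dispersion sigma^2 = 1.
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        eta = X @ w                                            # natural parameters eta_n = w^T x_n
        grad = -X.T @ (y - n_trials * sigmoid(eta)) / len(y)   # average NLL gradient
        w -= lr * grad
    return w
```

Since the NLL is convex, this converges to the unique MLE for a small enough step size.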

Summary

Summary

Least-squares
Linear regression is linear in the parameters, not the input features
Ordinary least squares
Linear algebra provides a closed-form solution, also for heteroskedastic (weighted) problems
Goodness of fit
Measure accuracy of fit using RSS or RMSE, coefficient of determination is intuitive
Generalized linear models
Exponential-family likelihoods whose natural parameters depend linearly on the inputs; the MLE is unique

References

Murphy, Kevin P. 2022. Probabilistic Machine Learning: An Introduction. MIT press.