Lecture 7: Linear regression
Institute for Theoretical Physics, Heidelberg University
Why?
Expected value of the output is assumed to be a linear function of the input \[ E[y|{\bf x}] = w_0 + {\bf w}^\intercal {\bf x} \]
To estimate the weights, note that minimizing the NLL (for fixed noise variance) is equivalent to minimizing the residual sum of squares \[ \text{RSS}({\bf w}) = \frac 12 \sum_{n=1}^N (y_n-{\bf w}^\intercal {\bf x}_n)^2 = \frac 12 \Vert{\bf Xw - y}\Vert_2^2 \]
In the following we discuss how to optimize this.
Functions that map vectors to scalars
\[ \frac{\partial({\bf a}^\intercal {\bf x})}{\partial {\bf x}} = {\bf a} \] \[ \frac{\partial({\bf x}^\intercal {\bf A}{\bf x})}{\partial {\bf x}} = ({\bf A} + {\bf A}^\intercal) {\bf x} \]
\[ \text{RSS}({\bf w}) = \frac 12 \Vert{\bf Xw - y}\Vert_2^2 = \frac 12 ({\bf Xw - y})^\intercal ({\bf Xw - y}) \] Let’s compute the gradient by making use of the two identities \[ \begin{align*} \nabla_{{\bf w}}\text{RSS}({\bf w}) &= \frac 12 \nabla_{{\bf w}} \left[ ({\bf Xw})^\intercal ({\bf Xw}) - 2({\bf Xw})^\intercal {\bf y} \right] \\ &= \frac 12 \nabla_{{\bf w}} {\bf w^\intercal( X^\intercal X) w} - \nabla_{\bf w} {\bf ( (X^\intercal)^\intercal w)^\intercal y} \\ &= \frac 12 ({\bf X^\intercal X} + ({\bf X^\intercal X})^\intercal){\bf w} - {\bf X^\intercal y} \\ &= {\bf X^\intercal Xw - X^\intercal y} \end{align*} \] Setting the gradient to zero leads to the normal equations \[ \boxed{ {\bf X^\intercal Xw = X^\intercal y} } \]
The solution to the normal equations \({\bf X^\intercal Xw = X^\intercal y}\) leads to the ordinary least squares (OLS) solution \[ \hat{\bf w} = \underbrace{{\bf (X^\intercal X)}^{-1}{\bf X^\intercal}}_{{\bf X}^\dagger} {\bf y} \] where \({\bf X}^\dagger\) is the (left) pseudo inverse of the (non-square) matrix \({\bf X}\).
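A minimal NumPy sketch of the OLS solution via the normal equations; the synthetic data, noise level, and variable names below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: y = w0 + w1*x + Gaussian noise
N = 100
X = np.column_stack([np.ones(N), rng.uniform(-3, 3, size=N)])  # design matrix with bias column
w_true = np.array([1.0, 2.5])
y = X @ w_true + rng.normal(0.0, 0.5, size=N)

# Solve the normal equations X^T X w = X^T y
# (solving the linear system is preferred over forming the inverse explicitly)
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, numerically more stable routes:
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # SVD-based least squares
w_pinv = np.linalg.pinv(X) @ y                    # explicit pseudo-inverse X^+

print(w_hat, w_lstsq, w_pinv)  # all three estimates should agree
```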
\[ {\bf H(w)} = \frac{\partial^2}{\partial {\bf w}^2}\text{RSS}({\bf w}) = {\bf X^\intercal X} \] If \({\bf X}\) is full rank (i.e., the columns of \({\bf X}\) are linearly independent), then for any \({\bf v} \neq {\bf 0}\) \[ {\bf v}^\intercal ({\bf X^\intercal X}){\bf v} = ({\bf Xv})^\intercal ({\bf Xv}) = \Vert{\bf Xv}\Vert^2 > 0 \] In the full-rank case, \({\bf H}\) is positive definite, so the OLS objective has a unique global minimum.
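A quick numerical check of this statement (the random design matrix is illustrative; strictly positive eigenvalues confirm positive definiteness):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))      # full-rank design matrix (illustrative)
H = X.T @ X                       # Hessian of the RSS
print(np.linalg.eigvalsh(H))      # all eigenvalues are strictly positive
```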
Contours of the RSS error surface. Blue cross is the MLE.
Sometimes the variance may depend on the input. In this case we may want to associate a specific weight with each example \[ \begin{align*} p(y | {\bf x}, {\bf \theta}) &= \mathcal{N}(y | {\bf w}^\intercal {\bf x}, \sigma^2({\bf x})) \\ &= \frac 1{\sqrt{2\pi\sigma^2({\bf x})}} \exp\left(-\frac 1{2\sigma^2({\bf x})}(y - {\bf w}^\intercal {\bf x})^2\right) \end{align*} \] so that, for the whole dataset, \[ p({\bf y} | {\bf X}, {\bf \theta}) = \mathcal{N}({\bf y} | {\bf Xw}, {\bf \Lambda}^{-1}) \] where \({\bf \Lambda} = \text{diag}(1/\sigma^2({\bf x}_n))\).
The maximum likelihood estimate yields the weighted least-squares estimate \[ \boxed{ \hat{\bf w} = ({\bf X^\intercal \Lambda X})^{-1} {\bf X^\intercal \Lambda y} } \]
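A direct transcription of this formula into NumPy (the heteroscedastic data and the helper name `weighted_least_squares` are illustrative):

```python
import numpy as np

def weighted_least_squares(X, y, sigma2):
    """Closed-form WLS estimate w = (X^T Lam X)^{-1} X^T Lam y,
    with Lam = diag(1 / sigma2_n) built from per-example noise variances."""
    Lam = np.diag(1.0 / sigma2)
    return np.linalg.solve(X.T @ Lam @ X, X.T @ Lam @ y)

# Illustrative heteroscedastic data: the noise variance grows with |x|
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=200)
X = np.column_stack([np.ones_like(x), x])
sigma2 = 0.1 + 0.5 * x**2
y = X @ np.array([1.0, 2.5]) + rng.normal(0.0, np.sqrt(sigma2))

print(weighted_least_squares(X, y, sigma2))
```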
After estimating \(\hat{\bf w}\) using MLE, we can estimate the noise variance. \[ \text{NLL} = \frac 1{2\sigma^2} \sum_{n=1}^N (y_n-{\bf w}^\intercal {\bf x}_n)^2 + \frac N2 \log(2\pi\sigma^2) \]
\[ \nabla_{\sigma} \text{NLL} \overset{!}{=}0 \]
\[ \begin{align*} -\frac 1{\hat \sigma^3} \sum_{n=1}^N (y_n - {\bf w}^\intercal {\bf x}_n)^2 + \frac N{\hat \sigma} &= 0 \\ \hat \sigma^2 &= \frac 1N \sum_{n=1}^N (y_n - {\bf w}^\intercal {\bf x}_n)^2 \end{align*} \] The MLE \(\hat\sigma^2\) is the MSE of the residuals, an intuitive result.
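In code, this estimate is just the mean of the squared residuals after an OLS fit (a small self-contained sketch with illustrative data):

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(100), rng.uniform(-3, 3, size=100)])
y = X @ np.array([1.0, 2.5]) + rng.normal(0.0, 0.5, size=100)

w_hat = np.linalg.solve(X.T @ X, X.T @ y)   # OLS estimate
residuals = y - X @ w_hat
sigma2_hat = np.mean(residuals**2)          # MLE of the noise variance = MSE of residuals
print(sigma2_hat)                           # close to 0.25 for noise std 0.5
```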
Residuals \(r_n = y_n - \hat y_n\) vs the input \(x_n\). Should follow \(\mathcal{N}(0, \sigma^2)\).
Predictions \(\hat y_n\) vs the true output \(y_n\). Points should be on the diagonal.
Compute the residual sum of squares on the dataset (the smaller the better) \[ \text{RSS}({\bf w}) = \sum_{n=1}^N (y_n - {\bf w^\intercal x}_n)^2 \]
RMSE is a more common measure to quantify accuracy \[ \text{RMSE}({\bf w}) = \sqrt{\frac 1N \text{RSS}({\bf w})} = \sqrt{\frac 1N \sum_{n=1}^N (y_n - {\bf w^\intercal x}_n)^2} \]
A more interpretable measure is \[ R^2 = 1 - \frac{\sum_{n=1}^N (\hat y_n - y_n)^2}{\sum_{n=1}^N (\bar y - y_n)^2} = 1 - \frac{\text{RSS}}{\text{TSS}} \] where \(\bar y = \frac 1N \sum_{n=1}^N y_n\) is the empirical mean, RSS is the residual sum of squares, and TSS is the total sum of squares. \(R^2\) measures how well the predictions explain the variance in the data relative to the simple constant prediction \(\bar y\).
Note
\(0 \leq R^2 \leq 1\), and larger values imply better fit (see figure in previous slide).
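The three metrics above (RSS, RMSE, \(R^2\)) are straightforward to compute; a minimal helper, with the name `regression_metrics` chosen for illustration:

```python
import numpy as np

def regression_metrics(y, y_hat):
    """RSS, RMSE and R^2 as defined above."""
    rss = np.sum((y - y_hat) ** 2)
    rmse = np.sqrt(rss / len(y))
    tss = np.sum((y - y.mean()) ** 2)   # total sum of squares around the empirical mean
    r2 = 1.0 - rss / tss
    return rss, rmse, r2

# Example usage with predictions from any fitted model:
# rss, rmse, r2 = regression_metrics(y, X @ w_hat)
```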
What happens when the data contains outliers?
\[ p(y | {\bf x}, {\bf w}, b) = \text{Laplace}(y | {\bf w}^\intercal{\bf x}, b) \propto \exp\left( -\frac 1b |y - {\bf w}^\intercal{\bf x}|\right) \] The MLE can be computed using linear programming.
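A sketch of this linear-programming formulation with `scipy.optimize.linprog`: the Laplace MLE is the least-absolute-deviations fit, obtained by introducing slack variables \(t_n \geq |y_n - {\bf w}^\intercal{\bf x}_n|\) and minimizing \(\sum_n t_n\). The function name and the outlier demo are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def lad_regression(X, y):
    """Least-absolute-deviations fit (Laplace-likelihood MLE) as a linear program.
    Decision variables z = [w (D entries), t (N entries)]."""
    N, D = X.shape
    c = np.concatenate([np.zeros(D), np.ones(N)])       # minimize sum of slacks t
    # Encode |y - Xw| <= t as:  Xw - t <= y  and  -Xw - t <= -y
    A_ub = np.block([[X, -np.eye(N)],
                     [-X, -np.eye(N)]])
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * D + [(0, None)] * N        # w free, t >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:D]

# Illustrative data with a few gross outliers
rng = np.random.default_rng(6)
X = np.column_stack([np.ones(50), rng.uniform(-3, 3, 50)])
y = X @ np.array([1.0, 2.5]) + rng.normal(0, 0.3, 50)
y[:3] += 20.0                        # corrupt a few targets
print(lad_regression(X, y))          # stays close to [1.0, 2.5] despite the outliers
```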
\[ p(y | {\bf x}, {\bf w}, \sigma^2, \nu) \propto \left[ 1 + \frac 1\nu \left(\frac{y-\mu}\sigma \right)^2 \right]^{-\frac{\nu+1}2}, \qquad \mu = {\bf w}^\intercal{\bf x} \] The model can be fit using stochastic gradient descent.
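A hedged sketch of such a fit: instead of stochastic gradient descent, a generic quasi-Newton optimizer from SciPy is used here for brevity, minimizing the Student-t NLL with constants independent of \({\bf w}\) dropped. The data and fixed \(\sigma, \nu\) values are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def student_t_nll(w, X, y, sigma=1.0, nu=4.0):
    """Student-t negative log-likelihood, up to constants independent of w."""
    r = (y - X @ w) / sigma
    return 0.5 * (nu + 1.0) * np.sum(np.log1p(r**2 / nu))

rng = np.random.default_rng(5)
X = np.column_stack([np.ones(100), rng.uniform(-3, 3, 100)])
y = X @ np.array([1.0, 2.5]) + rng.standard_t(df=4, size=100)   # heavy-tailed noise

res = minimize(student_t_nll, x0=np.zeros(2), args=(X, y))
print(res.x)   # robust estimate of the weights
```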
A mix of the \(\ell_2\) and \(\ell_1\) losses, applied to small and large residuals, respectively.
Recall the multivariate normal likelihood (iid) with known variance \[ p(\mathcal{D} | {\bf w}, \sigma^2) = \prod_{n=1}^N p(y_n | {\bf w}^\intercal{\bf x}_n, \sigma^2) = \mathcal{N}({\bf y} | {\bf Xw}, \sigma^2 {\bf I}_N) \] Compute the posterior over the parameters, \(p({\bf w} | \mathcal{D}, \sigma^2)\). Assume further a Gaussian prior \[ p({\bf w}) = \mathcal{N}({\bf w} | {\bf \breve{w}}, {\bf \breve{\Sigma}}) \] Use Bayes’ rule \[ \begin{align*} p({\bf w} | {\bf X}, {\bf y}, \sigma^2) &\propto \mathcal{N}({\bf y} | {\bf Xw}, \sigma^2 {\bf I}_N) \mathcal{N}({\bf w} | {\bf \breve{w}}, {\bf \breve{\Sigma}}) = \mathcal{N}({\bf w} | \tilde{\bf w}, \tilde{\bf \Sigma}) \\ \tilde{\bf w} &= \tilde{\bf \Sigma} ({\bf \breve{\Sigma}}^{-1}{\bf \breve{w}} + \frac 1{\sigma^2} {\bf X}^\intercal{\bf y}) \\ \tilde{\bf \Sigma} &= \left({\bf \breve{\Sigma}}^{-1} + \frac 1{\sigma^2} {\bf X}^\intercal{\bf X}\right)^{-1} \end{align*} \] where \(\tilde{\bf w}\) and \(\tilde{\bf \Sigma}\) are the posterior mean and covariance, respectively.
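A direct transcription of these update equations into NumPy (the helper name `posterior_w` and its arguments are illustrative):

```python
import numpy as np

def posterior_w(X, y, sigma2, w0, Sigma0):
    """Posterior N(w | w_tilde, Sigma_tilde) for a Gaussian prior N(w0, Sigma0)
    and known noise variance sigma2, following the formulas above."""
    Sigma0_inv = np.linalg.inv(Sigma0)
    Sigma_tilde = np.linalg.inv(Sigma0_inv + (X.T @ X) / sigma2)
    w_tilde = Sigma_tilde @ (Sigma0_inv @ w0 + (X.T @ y) / sigma2)
    return w_tilde, Sigma_tilde
```

With a zero-mean isotropic prior (`w0 = 0`, `Sigma0 = tau2 * I`), the posterior mean coincides with the ridge-regression estimate.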
The figure illustrates how the model “learns” about the parameters as it sees more data.
Note
Generalized linear model (GLM)
The GLM generalizes this linearity property: a GLM is a conditional version of an exponential-family distribution in which the natural parameters are a linear function of the input \[ p(y_n | {\bf x}_n, {\bf w}, \sigma^2) = \exp\left[ \frac{y_n \eta_n - A(\eta_n)}{\sigma^2} + \log h(y_n, \sigma^2) \right] \] where \(\eta_n = {\bf w}^\intercal{\bf x}_n\) is the natural parameter, \(A(\eta_n)\) is the log partition function, \(\sigma^2\) is the dispersion parameter, and \(h(y_n, \sigma^2)\) is a normalization term.
For GLMs, the sufficient statistic is \(T(y) = y\). One can then show \[ \begin{align*} E[y_n | {\bf x}_n, {\bf w}, \sigma^2] &= A'(\eta_n) = \ell^{-1}(\eta_n) \\ \text{Var}[y_n | {\bf x}_n, {\bf w}, \sigma^2] &= A''(\eta_n) \sigma^2 \end{align*} \] where \(\ell^{-1}\), the inverse of the link function \(\ell\), is called the mean function.
Linear regression
\[ \begin{align*} p(y_n | {\bf x}_n, {\bf w}, \sigma^2) &= \frac 1{\sqrt{2\pi\sigma^2}} \exp\left(-\frac 1{2\sigma^2}(y_n - {\bf w}^\intercal {\bf x}_n)^2\right) \\ \log p(y_n | {\bf x}_n, {\bf w}, \sigma^2) &= -\frac 1{2\sigma^2}(y_n-\eta_n)^2 - \frac 12 \log(2\pi\sigma^2) \end{align*} \]
where \(\eta_n = {\bf w}^\intercal {\bf x}_n\). This can be rewritten in GLM form \[ \log p(y_n | {\bf x}_n, {\bf w}, \sigma^2) = \frac{y_n \eta_n - \frac{\eta_n^2}{2}}{\sigma^2} - \frac 12\left(\frac{y_n^2}{\sigma^2} + \log(2\pi\sigma^2) \right) \] We see that \(A(\eta_n) = \eta_n^2/2\) and thus \[ \begin{align*} E[y_n | {\bf x}_n, {\bf w}, \sigma^2] &= \eta_n = {\bf w^\intercal x}_n \\ \text{Var}[y_n | {\bf x}_n, {\bf w}, \sigma^2] &= \sigma^2 \end{align*} \]
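A quick numerical sanity check that the GLM form above matches the Gaussian log density (the numerical values are arbitrary):

```python
import numpy as np
from scipy.stats import norm

y, eta, sigma2 = 1.3, 0.7, 0.5
lhs = norm.logpdf(y, loc=eta, scale=np.sqrt(sigma2))                     # Gaussian log density
rhs = (y * eta - eta**2 / 2) / sigma2 \
      - 0.5 * (y**2 / sigma2 + np.log(2 * np.pi * sigma2))               # GLM form
print(np.isclose(lhs, rhs))   # True: the two parameterizations agree
```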
Binomial regression
Recall the binomial distribution for a binary process with \(N_n\) trials per example. Model the probability of success as a function of the linear predictor, \(\mu_n = \mu_n({\bf w}^\intercal{\bf x}_n)\). \[ \begin{align*} p(y_n | {\bf x}_n, N_n, {\bf w}) &= \text{Bin}(y_n | \mu_n({\bf w}^\intercal {\bf x}_n), N_n) \\ \log p(y_n | {\bf x}_n, N_n, {\bf w}) &= y_n \log \mu_n + (N_n - y_n)\log(1-\mu_n) + \log {N_n \choose y_n} \\ &= y_n \log \left( \frac{\mu_n}{1-\mu_n} \right) + N_n \log(1 - \mu_n) + \log {N_n \choose y_n} \\ &= y_n \eta_n - A(\eta_n) + h(y_n) \end{align*} \] where \(\eta_n = {\bf w}^\intercal{\bf x}_n\), \(A(\eta_n) = -N_n \log(1 - \mu_n)\), and \(h(y_n) = \log {N_n \choose y_n}\). This shows that binomial regression can be written in GLM form.
We further have the moments \[ \begin{align*} E[y_n | {\bf x}_n, N_n, {\bf w}] &= \frac{\text{d}A}{\text{d}\eta_n} = \frac{\text{d}}{\text{d}\eta_n} N_n \log(1 + \text{e}^{\eta_n}) = N_n \mu_n \\ \text{Var}[y_n | {\bf x}_n, N_n, {\bf w}] &= \frac{\text{d}^2A}{\text{d}\eta_n^2} = N_n \mu_n(1-\mu_n) \end{align*} \]
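These moments can be checked with finite differences of \(A(\eta) = N_n\log(1+\text{e}^{\eta})\); the step size and numerical values below are arbitrary:

```python
import numpy as np

N_n, eta, h = 10, 0.3, 1e-4
A = lambda e: N_n * np.log1p(np.exp(e))          # log partition function
mu = 1.0 / (1.0 + np.exp(-eta))                  # mean function (sigmoid)

# First derivative vs N_n * mu
print((A(eta + h) - A(eta - h)) / (2 * h), N_n * mu)
# Second derivative vs N_n * mu * (1 - mu)
print((A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2, N_n * mu * (1 - mu))
```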
Negative log-likelihood
Consider the negative log-likelihood for the GLM \[ \text{NLL}({\bf w}) = - \sum_{n=1}^N \log p(y_n | {\bf x}_n, {\bf w}, \sigma^2) = - \frac 1{\sigma^2} \sum_{n=1}^N \left(\eta_n y_n - A(\eta_n)\right) + \text{const} \] where \(\eta_n = {\bf w^\intercal x}_n\).
\[ \begin{align*} \frac{\partial}{\partial {\bf w}} \text{NLL}({\bf w}) &= - \frac 1{\sigma^2} \sum_{n=1}^N \frac{\partial}{\partial \eta_n} (\eta_n y_n - A(\eta_n)) \frac{\partial \eta_n}{\partial {\bf w}} \\ &= -\frac 1{\sigma^2} \sum_{n=1}^N (y_n - A'(\eta_n)) {\bf x}_n \end{align*} \]
\[ {\bf H} = \frac{\partial^2}{\partial{\bf w}\,\partial{\bf w}^\intercal}\text{NLL}({\bf w}) = \frac 1{\sigma^2}\sum_{n=1}^N A''(\eta_n){\bf x}_n {\bf x}_n^\intercal \] Since \(A''(\eta_n) > 0\), \({\bf H}\) is positive definite (for a full-rank design matrix), so the MLE for a GLM is unique!
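A minimal sketch of fitting a GLM by gradient descent using the gradient derived above. In practice Newton's method (IRLS) is the standard choice; the Bernoulli example below, with \(A'(\eta) = \text{sigmoid}(\eta)\), and all names and hyperparameters are illustrative.

```python
import numpy as np

def fit_glm(X, y, A_prime, lr=1.0, n_iters=2000, sigma2=1.0):
    """Gradient descent on the GLM NLL using
    grad = -(1/sigma2) * sum_n (y_n - A'(eta_n)) x_n  (average gradient per step)."""
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_iters):
        eta = X @ w
        grad = -(X.T @ (y - A_prime(eta))) / sigma2
        w -= lr * grad / N
    return w

# Example: Bernoulli GLM, A(eta) = log(1 + exp(eta)), so A'(eta) = sigmoid(eta)
rng = np.random.default_rng(3)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
w_true = np.array([-0.5, 2.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ w_true))))
sigmoid = lambda eta: 1.0 / (1.0 + np.exp(-eta))
print(fit_glm(X, y, sigmoid))   # should approach w_true
```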