Computational Statistics & Data Analysis (MVComp2)

Lecture 5: Parameter estimation

Tristan Bereau

Institute for Theoretical Physics, Heidelberg University

Introduction

Literature

  • Chapters 2 and 3 in Amendola (2021)

Recap from last time

Chebyshev’s inequality
loose bound on tail/extreme events when only the first two moments of a distribution are known
Moment-generating function
Series expansion of the exponential function; Easily extract moments about the origin
Multivariate probability distributions
  • Multi-categorical extends Bernoulli to \(k\) categories
  • Multinomial extends Binomial to \(k\) categories
  • Multivariate normal distribution: simple form

Statistical inference

Small samples from large populations

From Durstewitz (2017)

Notation

  • Roman letters (e.g., \(t\)) for a statistic1 obtained from a sample
  • Greek letters (e.g., \(\theta\)) for a population parameter
  • Greek letters with a hat (e.g., \(\hat\theta\)) for empirical estimates of the true population parameter

Inference

Obtain estimates of unknown distribution parameters from observations, \(\hat \theta(x_i)\).

Note

  • Since \(x_i\) are RVs, \(\hat \theta(x_i)\) are also RVs.
  • Example: Assuming the distribution of our datapoints follows a Gaussian, what are the population mean and variance, \(\mu\) and \(\sigma^2\)?

Estimator

Rule to calculate an estimate of a given quantity based on observed data.

Example

\[ \bar x = \hat\mu = \frac 1N \sum_{i=1}^N x_i \] is an estimator of the mean, \(\mu = E[X]\).

What makes an estimator good?

Bias of the estimator

Expected value of \(\hat \theta = \hat \theta(x_1, x_2, \dots)\) is \[ E[\hat\theta] = \int \text{d}x_1\dots\text{d}x_n \hat \theta f(x_1, \dots, x_n; \theta) \]

Bias
The bias of estimator \(\hat \theta\) is \(b = E[\hat \theta] - \theta\)

Note

  • if \(b=0\) for every \(N\), the estimator is unbiased
  • if \(b\to 0\) only for large \(N\), the estimator is asymptotically unbiased
  • The bias is a systematic error that depends on the choice of estimator

Variance of the estimator

Variance of the estimator \[ \text{Var}[\hat\theta] = \int \text{d}x_1\dots\text{d}x_n (\hat \theta - E[\hat\theta])^2 f(x_1, \dots, x_n; \theta) \]

Note

  • The variance of \(\hat \theta\) is a statistical error: it is unavoidable (but can be minimized) because \(\hat \theta\) is a random variable

Bias-variance tradeoff

Overall measure of the accuracy of an estimate accounts for both its (squared) bias and variance, denoted mean-squared error \[ \begin{align} \text{MSE} &= E[(\hat\theta-\theta)^2] \\ &= E[[(\hat \theta - E[\hat \theta]) + (E[\hat \theta] - \theta)]^2] \\ &= E[(\hat \theta - E[\hat\theta])^2] + 2\underbrace{E[\hat\theta - E[\hat\theta]]}_{0}(E[\hat\theta] - \theta) + (E[\hat\theta - \theta])^2 \\ &= \text{Var}[\hat\theta] + b^2 \end{align} \]

Normalization factor in sample variance

Examples

\[ \bar x = \hat\mu = \frac 1N \sum_{i=1}^N x_i \] is an unbiased estimator of the mean, \(\mu = E[X]\).

\[ \hat \sigma^2 = \frac 1{N-1} \sum_{i=1}^N (x_i - \bar x)^2 \] is an unbiased estimator of the variance, \(\sigma^2 = \text{Var}[X]\).1

Statistical parameter estimation

Least-squares error (LSE) estimation

Least-squares error
Set of parameters that yields the smallest squared model errors, i.e., residuals, for the univariate pairs \(\{x_i, y_i\}\) and model \(f = f(x_i; {\bf \beta})\): \[ \text{LSE} = \sum_{i=1}^N (y_i - f(x_i; {\bf \beta}))^2 \]

Estimates

Set gradients to zero \[ \frac{\partial}{\partial \beta_j} \text{LSE} = 0 \]

Least-squares error (LSE): example

Consider model \(f(x_i; {\bf \beta}) = \beta_0 + \beta_1 x_i + \varepsilon_i\), where \({\bf \varepsilon} \sim \mathcal{N}(0, \sigma^2 \mathbb{I})\).

  • \(\frac{\partial}{\partial \beta_0} \text{LSE} = 0\): \[ \sum_{i=0}^N -2(y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0 \Rightarrow \hat \beta_0 = \bar y - \hat\beta_1 \bar x \]
  • \(\frac{\partial}{\partial \beta_1} \text{LSE} = 0\): \[ \sum_{i=0}^N -2x_i(y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0 \Rightarrow \hat \beta_1 = \frac{\text{Cov}[X,Y]}{\text{Var}[X]} \]

Maximum likelihood estimation (MLE)1

Maximize the likelihood of observing datapoints given some parameters, \(p({\bf x}| {\theta})\). In the frequentist view, these parameters are assumed to be (unknown) constants. For iid data we have \[ L_{\bf X}(\theta) := p({\bf X}|\theta) = \prod_i p(x_i | \theta) \]

Note

Practically, we usually seek to maximize the log-likelihood, \(l_{\bf X}(\theta) := \log L_{\bf X}(\theta)\). This converts products into sum, and is especially convenient for exponential distributions.

Example: MLE for Gaussian \(\mu\)1

Estimate the population mean \(\mu\) under the univariate normal model \[ x_i \sim \mathcal{N}(\mu, \sigma^2) = \frac 1{\sqrt{2\pi}\sigma} \text{e}^{-\frac 12 (\frac{x-\mu}{\sigma})^2} \] The log-likelihood is given by \[ \begin{align*} l_{\bf X}(\mu) &= \sum_i \left[ \log\left(\frac 1{\sqrt{2\pi}\sigma}\right) - \frac12 \left(\frac{x_i-\mu}{\sigma}\right)^2 \right] \\ &\frac{\partial l_{\bf X}}{\partial \mu} = 0 \Rightarrow \hat \mu = \frac 1N \sum_i x_i = \bar x. \end{align*} \]

Example: MLE for Gaussian variance

Now consider ML estimation for Gaussian variance \[ \frac{\partial l_{\bf X}}{\partial \sigma} = 0 \Rightarrow \hat \sigma^2 = \frac 1N \sum_i (x_i-\hat\mu)^2 \] which is a biased estimator of the variance. It underestimates \(\sigma^2\) by a factor \((N-1)/N\). Note, though, that it is still asymptotically unbiased (i.e., for \(n \to \infty\)).

Bayesian inference

Instead of determining the likelihood \(p({\bf X}|{\bf \theta})\), we would prefer to determine \(p({\bf \theta} | {\bf X})\). Use Bayes’ theorem \[ p(\theta | {\bf X}) = \frac{p({\bf X}|{\bf \theta})p_\alpha({\bf \theta})}{p({\bf X})} \] which coincides with MLE only if with set a uniform prior.1

Prior distribution
\(p_\alpha(\theta) := p(\theta|\alpha)\) is governed by a set of hyper-parameters \(\alpha\).

Bayesian inference: MAP estimate

\[ p(\theta | {\bf X}) = \frac{p({\bf X}|{\bf \theta})p_\alpha({\bf \theta})}{p({\bf X})} \]

Maximum-a-posteriori (MAP) estimate
Take the largest mode (i.e., global maximum) of the posterior distribution, \({\bf \hat\theta} := \underset{\theta}{\arg\max} \, p(\theta | {\bf X})\)

Bayesian vs Frequentist

  • Bayesian: fix data, vary models
  • Frequentist: fix model, vary data

Sampling the posterior

Sampling the posterior

  • The posterior, \(L\), is a surface in an \(N\)-dimensional parameter space.
  • Strategies:
    • Analytical expression or approximation
    • Sampling

Fisher matrix1

Approximate the likelihood with a (multivariate) Gaussian distribution2 \[ L ({\bf x}; \theta)\approx N \exp\left[ -\frac 12 (\theta_\alpha - \hat\theta_\alpha) F_{\alpha\beta} (\theta_\beta - \hat\theta_\beta) \right] \]

  • The Fisher matrix is the inverse of the correlation matrix among the parameters evaluated at \(\hat \theta\)
  • The likelihood is a Gaussian function of the parameters, not (necessarily) of the data

Fisher matrix

Near the maximum likelihood value \(\hat \theta_\alpha\) of the parameters, the first derivatives vanish, and thus \[ \log L({\bf x}; \theta_\alpha) \approx \log L({\bf x}; \hat\theta_\alpha) + \frac 12 \frac{\partial^2 \log L({\bf x}; \theta_\alpha)}{\partial \theta_\alpha \partial \theta_\beta} (\theta_\alpha - \hat\theta_\alpha)(\theta_\beta-\hat\theta_\beta) \]

The Fisher matrix is the negative expected value of the Hessian of the log-likelihood \[ F_{\alpha\beta} = - E_{\bf x}\left[\frac{\partial^2 \log L({\bf x}; \theta)}{\partial\theta_\alpha \partial\theta_\beta}\right]. \]

It quantifies the amount of information that the data \({\bf x}\) carry about the parameters \(\theta\).

Monte Carlo methods

Alternatively, we can avoid approximating the likelihood, and instead sample from it.

Monte Carlo Markov chain
Sample points by proposing random moves that only depend on the last draw, \(P(x_{i} | x_{i-1})\)
Metropolis-Hastings
Propose a randomly altered \({\bf x}_0\), denoted \({\bf x}_1\). Accept move \({\bf x}_0 \to {\bf x}_1\) if posterior ratio \(\frac{P({\bf x}_1)}{P({\bf x}_0)}\) is either larger than 1 (i.e., moving toward the peak) or move \(a\%\) of the times.

Gradient descent

If we have a differentiable function we can more efficiently find a local minimum. Here: negative log-likelihood, \(-l_{\bf X}(\theta)\).

Gradient descent
First-order iterative optimization algorithm for finding a local minimum \[ \hat\theta_{n+1} = \hat\theta_n + \eta \frac{\partial l_{\bf X}(\theta)}{\partial \hat \theta_n} \] where the parameter \(\eta\) is a learning rate.

Newton–Raphson

Netwon–Raphson
Extends gradient descent by replacing the learning rate by the Hessian matrix \[ \hat\theta_{n+1} = \hat\theta_n + {\bf H}^{-1} \frac{\partial l_{\bf X}(\theta)}{\partial \hat \theta_n} \] where \({\bf H}_{ij} = \frac{\partial^2 l_{\bf X}(\theta)}{\partial \theta_{n,i} \partial \theta_{n,j}}\) adjusts the gradient wrt the size of the local change in slope.

Automatic differentiation (JAX)

Where do these derivatives come from?

Different types of calculating derivatives on the computer:

  • Numeric: Finite-difference approximations
  • Symbolic: Symbolic representation of the derivative
  • Automatic: Uses the chain rule to operate on minimial elements

Modern statistics / machine learning libraries implement automatic differentiation.1

Python example

import jax
import jax.numpy as jnp
from jax import grad, value_and_grad, jit
import optax

# Define a simple quadratic loss function
def loss_fn(params, x, y):
    predictions = jnp.dot(x, params)
    return jnp.mean((predictions - y) ** 2)

# Initialize parameters, here we'll assume a simple linear model with 1 parameter
params = jnp.array([1.0])

# Your data, x and y
x = jnp.array([[1.0], [2.0], [3.0]])
y = jnp.array([2.0, 4.0, 6.0])

# Setup the optimizer, here we use stochastic gradient descent with a learning rate of 0.1
optimizer = optax.sgd(0.1)

# Initialize the optimizer state
opt_state = optimizer.init(params)

# Wrap the update step in a `jit`'d function for performance
@jit
def update(params, x, y, opt_state):
    """Single optimization step."""
    # Compute the gradient
    grads = grad(loss_fn)(params, x, y)
    # Update the parameters and optimizer state
    updates, opt_state = optimizer.update(grads, opt_state)
    params = optax.apply_updates(params, updates)
    return params, opt_state

# Run the optimization loop for a number of steps
for _ in range(100):
    params, opt_state = update(params, x, y, opt_state)

# Print out the optimized parameters
print("Optimized Parameters:", params)
Optimized Parameters: [2.]

Summary

Summary

Statistical inference
Obtain estimates of unknown distribution parameters
Estimator
Bias, variance, mean-squared error
Statistical parameter estimation
Least-squares error, maximum likelihood estimation, Bayesian inference
Sampling the posterior
Fisher matrix, Monte Carlo methods, Gradient descent and Newton–Raphson

References

Amendola, Luca. 2021. “Lecture Notes on Statistical Methods.” https://www.thphys.uni-heidelberg.de/%7Eamendola/teaching/compstat-hd.pdf.
Durstewitz, Daniel. 2017. “Advanced Data Analysis in Neuroscience.” Bernstein Series in Computational Neuroscience. Cham: Springer.