Computational Statistics & Data Analysis (MVComp2)

Lecture 4: Error propagation, Chebyshev inequality, Moment-generating functions, Central limit theorem, and Multivariate distributions

Tristan Bereau

Institute for Theoretical Physics, Heidelberg University

Introduction

Literature

Today’s lecture

  • Ch. 2-6 in Wackerly, Mendenhall, and Scheaffer (2014)
  • Ch. 2 of Amendola (2021)

Recap from last time

Discrete probability distributions
Uniform, Bernoulli, Binomial, Geometric, Hypergeometric, Poisson
Continuous probability distributions
  • Probability density function as the derivative of the CDF
  • Uniform, Gaussian (normal), Beta, Gamma, Exponential
Exponential family distributions
  • Log partition function is cumulant generating function
  • Conjugate prior: prior and posterior in the same family for a given likelihood

Error propagation

Transformation of variables

Given an RV \(X\) and its PDF \(f(x)\), we might be interested in the (monotonic) function \(y(x)\). Conservation of probability leads to \[ f(x)\text{d}x = g(y) \text{d}y \] and therefore the new PDF \(g(y)\) yields \[ g(y) = f(x)\left\vert \frac{\text{d}x}{\text{d}y} \right\vert \]
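A quick numerical check of this formula (a minimal sketch, not part of the lecture; it assumes numpy and scipy are available, and the choice \(X \sim \mathcal{N}(0,1)\) with \(y = \text{e}^x\) is arbitrary): the histogram of transformed samples should match \(g(y) = f(\ln y)/y\).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# X ~ N(0, 1) and the monotonic transformation y(x) = exp(x)
x = rng.normal(size=100_000)
y = np.exp(x)

# Analytic PDF from g(y) = f(x(y)) |dx/dy| with x(y) = log(y), |dx/dy| = 1/y
ygrid = np.linspace(0.05, 6.0, 200)
g_analytic = stats.norm.pdf(np.log(ygrid)) / ygrid

# Empirical density of the transformed samples
hist, edges = np.histogram(y, bins=100, range=(0.05, 6.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

# The two should agree up to sampling noise
print("max abs deviation:", np.max(np.abs(hist - np.interp(centers, ygrid, g_analytic))))
```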

Transformation of variables: multidimensional case

Analogously, the transformation from \(x_1, x_2, \dots\) to \(y_1, y_2, \dots\) relies on the Jacobian of the transformation \[ J_{ij} = \frac{\partial x_i}{\partial y_j} \] such that \[ g(y_1, y_2, \dots) = f(x_1, x_2, \dots)\left\vert J \right\vert \] where \(\vert J \vert\) denotes the absolute value of the determinant of \(J\).

Error propagation

Use the transformation of variables to find the error (standard deviation) associated with a function of an RV \(X\) in the limit of small deviations from the mean.

Suppose \(X\) is distributed as \(f(x)\) with mean \(\mu\) and variance \(\sigma^2\). We are interested in the PDF of the function \(y(x)\). Expand \(y\) around \(\mu\) \[ y(x) \approx y(\mu) + \frac{\text{d}y}{\text{d}x}\Big\vert_\mu (x-\mu) \]

Error propagation

Mean

\[ \begin{align*} E[Y] =& \int \text{d}x\, y(\mu)f(x) + \int \text{d}x \, \frac{\text{d}y}{\text{d}x}\Big\vert_\mu (x-\mu) f(x) \\ =& \int \text{d}x\, y(\mu)f(x) + \frac{\text{d}y}{\text{d}x}\Big\vert_\mu \underbrace{\int \text{d}x \,(x-\mu) f(x)}_{0} \\ =& y(\mu)\underbrace{\int \text{d}x\, f(x)}_1 \\ =& y(\mu) \end{align*} \]

Error propagation

Second moment about the origin

\[ \begin{align*} E[Y^2] =& \int \text{d}x\, \left[ y^2(\mu) + (y'(\mu))^2(x-\mu)^2 + 2y(\mu)y'(\mu)(x-\mu) \right] f(x) \\ =& y^2(\mu) + y'^2 \sigma_X^2 \end{align*} \] where \(y' = \frac{\text{d}y}{\text{d}x}\Big\vert_\mu\).

Variance

\[ \sigma_Y^2 = E[(Y - y(\mu))^2] = E[Y^2] - y^2(\mu) = y'^2\sigma_X^2 \]
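A minimal Monte Carlo sketch of this result (assumes numpy; the function \(y(x) = x^2\) and the parameter values are arbitrary illustrations): for small \(\sigma_X\), the sampled standard deviation of \(Y\) should match \(\vert y'(\mu)\vert\,\sigma_X\).

```python
import numpy as np

rng = np.random.default_rng(1)

mu, sigma = 2.0, 0.05                     # small sigma so the linearization holds
x = rng.normal(mu, sigma, size=1_000_000)

y = x**2                                  # example function y(x) = x^2
dy_dx_at_mu = 2 * mu                      # y'(mu)

sigma_y_mc = y.std()                      # Monte Carlo standard deviation of Y
sigma_y_prop = abs(dy_dx_at_mu) * sigma   # error-propagation prediction

print(sigma_y_mc, sigma_y_prop)           # both ~0.200 for small sigma
```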

Error propagation

Extension to several independent variables

For independent RVs \(\{X_i\}\) with joint PDF \(f(x_1, x_2, \dots) = \prod_i f_i(x_i)\), the function \(y(x_1, x_2, \dots)\) yields \[ \sigma_Y^2 = \sum_i {y'_i}^2\sigma_{X_i}^2 \] where \(y'_i = \frac{\partial y}{\partial x_i}\Big\vert_{\mu}\) is evaluated at the means.

Additivity of the variance

Consider \(y = x_1 + x_2 + \dots + x_n\) with independent RVs \(\{X_i\}\). Error propagation tells us that \[ \sigma_Y^2 = \sum_i \sigma_{X_i}^2 \] The variance of a sum of independent RVs is the sum of the variances.

Variance of the sample mean

Consider the sample mean of a number of datapoints \(x_i\) drawn from the same distribution with variance \(\sigma^2\): \(\bar x = \frac 1N \sum_i x_i\). We’re looking for the variance of the sample mean: \[ \begin{align*} \text{Var}[\bar{X}] &= \text{Var}\left[\frac{X_1 + X_2 + \dots + X_N}N\right] \\ &= \frac 1{N^2} (\text{Var}[X_1]+ \text{Var}[X_2]+ \dots + \text{Var}[X_N]). \end{align*} \] So the variance of the sample mean is \[ \sigma_{\bar x}^2 = \frac 1{N^2} \sum_i \sigma^2 = \frac{\sigma^2}{N} \] The variance of the mean is smaller than the variance of each individual data point by a factor of \(1/N\).
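A short numerical sketch of the \(1/N\) scaling (assumes numpy; the Gaussian sampling distribution, \(N\), and \(\sigma\) are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)

N, n_repeats = 100, 20_000
sigma = 3.0

# Draw many datasets of size N and compute the sample mean of each
samples = rng.normal(0.0, sigma, size=(n_repeats, N))
means = samples.mean(axis=1)

print(means.var())     # Monte Carlo variance of the sample mean
print(sigma**2 / N)    # predicted sigma^2 / N = 0.09
```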

Chebyshev’s inequality

Chebyshev’s inequality: Setting

Given an RV \(X\), what can we say about the probability of an event such as \(X \leq x_0\) or \(X \geq x_0\), i.e., \(P(X \leq x_0)\) or \(P(X \geq x_0)\)?

If we know the density (or mass) function up to that point, it’s easy \[ P(X \leq x_0) = \int_{-\infty}^{\color{red}{x_0}} \text{d}u \, f(u) \]

What if we don’t know \(f(x)\)?

Chebyshev’s inequality: Statement

Let’s assume we don’t know \(f(x)\), but we do know the first two moments, \(\mu\) and \(\sigma^2\).

Chebyshev’s inequality

\[ P\left(|X-\mu| \geq k\sigma\right) \leq \frac 1{k^2} \]


Chebyshev’s inequality: Proof

Decompose the variance \[ \begin{align} \text{Var}[X] =& \int_{\color{red}{-\infty}}^{\color{red}{\mu-k\sigma}} \text{d}s\, (s-\mu)^2 f(s) \\ &+ \int_{\color{red}{\mu-k\sigma}}^{\color{red}{\mu+k\sigma}} \text{d}s\, (s-\mu)^2 f(s) \\ &+ \int_{\color{red}{\mu+k\sigma}}^{\color{red}{\infty}} \text{d}s\, (s-\mu)^2 f(s) \\ \end{align} \]

Chebyshev’s inequality: Proof (cont’d)

Negative tail
We have \(s \leq \mu - k\sigma\), hence \((s-\mu)^2 \geq (k\sigma)^2\), so \[ \int_{-\infty}^{\mu-k\sigma} \text{d}s\, (s-\mu)^2 f(s) \geq \int_{-\infty}^{\mu-k\sigma} \text{d}s\, (k\sigma)^2 f(s) \]
Positive tail
Likewise, for \(s \geq \mu + k\sigma\) we again have \((s-\mu)^2 \geq (k\sigma)^2\), so \[ \int_{\mu+k\sigma}^{\infty} \text{d}s\, (s-\mu)^2 f(s) \geq \int_{\mu+k\sigma}^{\infty} \text{d}s\, (k\sigma)^2 f(s) \]

Chebyshev’s inequality: Proof (cont’d)

Middle of the interval
Unfortunately, there’s not much we can say \[ \int_{\mu-k\sigma}^{\mu+k\sigma} \text{d}s\, (s-\mu)^2 f(s) \geq 0 \]

Chebyshev’s inequality: Proof (cont’d)

Summing the three contributions (the middle one is non-negative and can be dropped): \[ \text{Var}[X] = \sigma^2 \geq (k\sigma)^2 \left( \int_{-\infty}^{\mu-k\sigma} \text{d}s\, f(s) + \int_{\mu+k\sigma}^{\infty} \text{d}s\, f(s) \right) \]

\[ \frac 1{k^2} \geq P(X \leq \mu - k\sigma) + P(X \geq \mu + k\sigma) = P(\vert X-\mu \vert \geq k\sigma) \]

Note

Chebyshev’s inequality is useful to establish loose bounds on the probability of tail events. It is particularly handy when the underlying distribution is unknown or difficult to evaluate (see the sketch below).
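A small numerical illustration (a sketch assuming numpy; the exponential distribution is an arbitrary choice, treated as if only its mean and standard deviation were known): the empirical tail probability stays below \(1/k^2\) for every \(k\).

```python
import numpy as np

rng = np.random.default_rng(3)

# Exponential samples; estimate mu and sigma, pretend the PDF itself is unknown
x = rng.exponential(scale=1.0, size=1_000_000)
mu, sigma = x.mean(), x.std()

for k in (1.5, 2.0, 3.0):
    p_tail = np.mean(np.abs(x - mu) >= k * sigma)   # empirical tail probability
    print(f"k={k}: P(|X-mu| >= k*sigma) = {p_tail:.4f} <= 1/k^2 = {1/k**2:.4f}")
```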

Moment-generating functions

Two types of moments!

Moment about the origin
The \(k\)th moment of an RV \(Y\) taken about the origin is defined as \[E[Y^k] = \mu'_k\]
Centered moment
The \(k\)th moment of an RV \(Y\) taken about its mean is defined as \[E[(Y-\mu)^k] = \mu_k\]

Moments define the distribution

Statement

Given two RVs \(Y\) and \(Z\) with identical sets of moments: \[ \begin{align*} \mu'_{1Y} =& \mu'_{1Z}\\ \mu'_{2Y} =& \mu'_{2Z}\\ \mu'_{3Y} =& \mu'_{3Z}\\ &\vdots \end{align*} \] the distributions are identical.

Moment-generating function

Definition

The moment-generating function \(m(t)\) for an RV \(Y\) is defined as \[ m_Y(t) = E[\text{e}^{tY}] \]

\(m(t)\) is said to exist if there is a positive constant \(b\) such that \(m(t)\) is finite for all \(|t| \leq b\).

Discrete distribution

\[ m_Y(t) = E[\text{e}^{tY}] = \sum_y \text{e}^{ty} p(y) \]

Continuous distribution

\[ m_Y(t) = E[\text{e}^{tY}] = \int \text{d}y \, \text{e}^{ty} f(y) \]

Moment-generating function

Why is it called moment-generating function? Consider the series expansion of \(\text{e}^{tY}\) \[ \text{e}^{ty} = 1 + ty + \frac{(ty)^2}{2!} + \frac{(ty)^3}{3!} + \dots \]

Assuming that \(\mu'_k\) is finite for \(k=1,2,3,\dots\), we have \[ E[\text{e}^{tY}] = \sum_y \text{e}^{ty} p(y) = \sum_y \left[ 1 + ty + \frac{(ty)^2}{2!} + \frac{(ty)^3}{3!} + \dots \right] p(y) \]

Moment-generating function

\[ \begin{align} m_Y(t) &= E[\text{e}^{tY}] = \sum_y \left[ 1 + ty + \frac{(ty)^2}{2!} + \frac{(ty)^3}{3!} + \dots \right] p(y) \\ &= \sum_y p(y) + t\sum_y y\, p(y) + \frac{t^2}{2!} \sum_y y^2 p(y) + \frac{t^3}{3!} \sum_y y^3 p(y) + \dots \\ &= 1 + tE[Y] + \frac{t^2}{2!} E[Y^2] + \frac{t^3}{3!} E[Y^3] + \dots \end{align} \]

\[ \boxed{\left.\frac{\text{d}^l m_Y(t)}{\text{d}t^l}\right|_{t=0} = E[Y^l] = \mu'_l} \]
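A short symbolic sketch of this relation (assumes sympy; the exponential distribution, whose MGF is the standard closed form \(\lambda/(\lambda - t)\) for \(t < \lambda\), is used purely as a convenient example):

```python
import sympy as sp

t, lam = sp.symbols("t lambda", positive=True)

# MGF of an exponential distribution with rate lambda (valid for t < lambda)
m = lam / (lam - t)

# l-th moment about the origin = l-th derivative of the MGF at t = 0
for l in (1, 2, 3):
    print(l, sp.simplify(sp.diff(m, t, l).subs(t, 0)))   # 1/lambda, 2/lambda**2, 6/lambda**3
```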

Moment-generating function for central moments

MGF for central moments

\[ m_{X-\mu}(t) = E[\text{e}^{t(X-\mu)}] = \int \text{d}x \, \text{e}^{t(x-\mu)} f(x) \]

Moment-generating function for sum of independent RVs

MGF for sum of RVs

Given two independent RVs, \(X\) and \(Y\), distributed as \(f(x)\) and \(g(y)\), the MGF of the sum yields \[ \begin{align*} m_{X+Y}(t) =& E[\text{e}^{t(X+Y)}] \\ =& \int \text{d}x\text{d}y \, \text{e}^{t(x+y)} f(x)g(y) \\ =& \int \text{d}x \, \text{e}^{tx} f(x) \int \text{d}y \, \text{e}^{ty} g(y) \\ =& m_X(t)m_Y(t) \end{align*} \]

The MGF of the sum of two independent RVs is the product of the individual MGFs.
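A quick numerical sketch of this product rule using empirical MGFs (assumes numpy; the Gaussian and uniform distributions are arbitrary choices for the two independent RVs):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000

# Two independent RVs (distributions chosen arbitrarily for illustration)
x = rng.normal(0.0, 1.0, size=n)
y = rng.uniform(-1.0, 1.0, size=n)

for t in (0.2, 0.5, 1.0):
    m_sum = np.mean(np.exp(t * (x + y)))                       # empirical MGF of X + Y
    m_prod = np.mean(np.exp(t * x)) * np.mean(np.exp(t * y))   # product of the MGFs
    print(f"t={t}: m_(X+Y) = {m_sum:.4f}, m_X * m_Y = {m_prod:.4f}")
```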

Example: MGF for Poisson distribution

\[ \begin{align} E[\text{e}^{kX}] &= \sum_{x=0}^\infty \left( \frac{\lambda^x}{x!}\text{e}^{-\lambda} \right) \text{e}^{kx} \\ &= \text{e}^{-\lambda} \sum_{x=0}^\infty \frac{(\lambda \text{e}^k)^x}{x!} \\ &= \text{e}^{-\lambda} \text{e}^{\lambda \text{e}^k} \\ &= \text{e}^{\lambda(\text{e}^k-1)} \end{align} \]

Example: MGF for Poisson distribution

The first two moments about the origin follow from derivatives of the MGF with respect to \(k\), evaluated at \(k=0\): \[ \begin{align} \frac{\text{d}}{\text{d}k}\text{e}^{\lambda(\text{e}^k-1)} &= \text{e}^{\lambda(\text{e}^k-1)}(\lambda \text{e}^k) \\ \frac{\text{d}^2}{\text{d}k^2}\text{e}^{\lambda(\text{e}^k-1)} &= \text{e}^{\lambda(\text{e}^k-1)} \left[(\lambda \text{e}^k)^2 + \lambda \text{e}^k\right] \end{align} \]

\[ \begin{align} \mu &= \left.\text{e}^{\lambda(\text{e}^k-1)}(\lambda \text{e}^k)\right|_{k=0} = \color{red}\lambda\\ \mu'_2 &= \left.\text{e}^{\lambda(\text{e}^k-1)} [(\lambda \text{e}^k)^2 + \lambda \text{e}^k]\right|_{k=0} = \lambda^2 + \lambda \\ \sigma^2 &= E[X^2]-\mu^2 = \mu'_2 - \mu^2 = \color{red}\lambda \end{align} \]
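A short numerical check of these results (assumes numpy; the value of \(\lambda\) is arbitrary): the sample mean and variance should both approach \(\lambda\), and the empirical MGF should approach the closed form derived above.

```python
import numpy as np

rng = np.random.default_rng(5)
lam = 3.5
x = rng.poisson(lam, size=1_000_000)

print(x.mean(), lam)    # E[X]   ~ lambda
print(x.var(), lam)     # Var[X] ~ lambda

# Empirical MGF vs the closed form exp(lambda * (e^k - 1))
for k in (0.1, 0.3):
    print(np.mean(np.exp(k * x)), np.exp(lam * (np.exp(k) - 1)))
```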

Central limit theorem

Central limit theorem

Consider \(n\) iid RVs with mean \(\mu\) and variance \(\sigma^2 < \infty\) and unknown probability distribution function.

Central limit theorem

Define the RV \[ Y := \frac{\sum_{i=1}^n x_i - n\mu}{\sigma \sqrt n} = \frac{\bar x - \mu}{\sigma / \sqrt{n}}. \] \(Y\) tends to a standard normal distribution, \(Y\sim\mathcal{N}(0,1)\), for \(n\to\infty\).

Central limit theorem: Proof

Define the standardized variables \(z_i = \frac{x_i - \mu}\sigma\), s.t. \(\langle z_i \rangle = 0\) and \(\langle z_i^2 \rangle = 1\). \[ Y = \frac 1{\sqrt{n}} \sum_i z_i \]

The moment-generating function of \(Y\) is (using that the \(z_i\) are iid) \[ m_Y(t) = \langle \text{e}^{Yt} \rangle = \langle \text{e}^{z_it / \sqrt{n}} \rangle^n \]

Central limit theorem: Proof (cont’d)

\[ \begin{align*} m_Y(t) &= \langle \text{e}^{Yt} \rangle = \langle \text{e}^{z_it / \sqrt{n}} \rangle^n \\ &= \langle 1 + \frac{z_it}{\sqrt{n}} + \frac{z_i^2t^2}{2!n} + \dots \rangle^n \\ &= \left(1 + \frac{\langle z_i\rangle t}{\sqrt{n}} + \frac{\langle z_i^2\rangle t^2}{2!n} + \dots \right)^n \\ &= \left(1 + \frac{t^2}{2n} + \dots \right)^n \xrightarrow{\,n\to\infty\,} \text{e}^{\frac 12 t^2} \end{align*} \] where higher-order terms are suppressed by additional powers of \(1/\sqrt{n}\) and we used \(\lim_{n\to\infty}\left(1 + a/n\right)^n = \text{e}^a\). We obtain the MGF of a standard normal variable.

Central limit theorem: illustration

Figure: MGF for a Gaussian (normal) distribution (thick blue line) compared to the normalized sum of independent uniform RVs (thin red lines) for \(n=1,3,5\), from bottom up (Fig. 2.6 in Amendola (2021)).

Central limit theorem: impact

The CLT guarantees that if the errors in a measurement are the result of many independent contributions from various parts of the experiment, then the total error is expected to be approximately Gaussian distributed.
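A minimal sketch of the CLT in action (assumes numpy and scipy; the uniform distribution and \(n = 30\) are arbitrary choices): quantiles of the standardized sum should approach those of \(\mathcal{N}(0,1)\).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

n, n_repeats = 30, 200_000
mu, sigma = 0.5, np.sqrt(1.0 / 12.0)   # mean and std of Uniform(0, 1)

x = rng.random(size=(n_repeats, n))
y = (x.sum(axis=1) - n * mu) / (sigma * np.sqrt(n))   # standardized sums

# Compare a few quantiles of Y against the standard normal
for q in (0.05, 0.5, 0.95):
    print(q, np.quantile(y, q), stats.norm.ppf(q))
```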

Multivariate probability distributions

Joint probability function

Joint probability function
We consider multiple RVs simultaneously (jointly) \[ \begin{align*} &p(x_1, \dots, x_k) = p(X_1 = x_1, \dots, X_k = x_k) \geq 0 \\ &\sum_{x_1}\dots\sum_{x_k} p(x_1, \dots, x_k) = 1 \end{align*} \]

Joint distribution function

Joint distribution function

\[ F(x_1, x_2) = \int_{-\infty}^{x_1} \text{d}t_1 \int_{-\infty}^{x_2} \text{d}t_2 \, f(t_1, t_2) \]

  1. \(F(-\infty, -\infty) = F(-\infty, x_2) = F(x_1, -\infty) = 0\)
  2. \(F(\infty, \infty) = 1\)
  3. if \(x_1^* \geq x_1\) and \(x_2^* \geq x_2\), then (since the probability of the rectangle \((x_1, x_1^*] \times (x_2, x_2^*]\) is non-negative) \[F(x_1^*, x_2^*) + F(x_1, x_2) \geq F(x_1^*, x_2) + F(x_1, x_2^*)\]

Multi-categorical probability distributions

Multi-categorical distribution
Extends the Bernoulli distribution to more than two distinct outcomes (e.g., customer choices from a lunch menu). Often represented through \(K\)-dimensional binary vectors \({\bf x}\) with components \(x_i \in \{0,1\}\), \(i=1,\dots,K\), \(\sum_i x_i = 1\), and category probabilities \(\pi_i \geq 0\) with \(\sum_i \pi_i = 1\). Probability function: \[ p({\bf x} | {\bf \pi}) = \prod_{i=1}^K \pi_i^{x_i} \]

Multi-categorical probability distributions

We have a vector of expectations and a \(K\times K\) matrix of variances and covariances \[ \begin{align} E[X_i] &= \pi_i \\ \text{Var}[X_i] &= \pi_i ( 1 - \pi_i)\\ \text{Cov}[X_i, X_j] &= -\pi_i \pi_j \end{align} \]

Multinomial probability distributions

Multinomial distribution
Extends the multi-categorical distribution in the same way the Binomial extends the Bernoulli, i.e., \(N\) iid trials with more than two categories, where \(x_i\) counts the outcomes in category \(i\) and \(\sum_i x_i = N\). Probability function \[ p(x_1,\dots,x_K) = \frac{N!}{x_1!\dots x_K!}\prod_{i=1}^K \pi_i^{x_i} \]

Multinomial probability distributions

Expected values, variances, and covariances: \[ \begin{align} E[X_i] &= N\pi_i \\ \text{Var}[X_i] &= N\pi_i(1-\pi_i) \\ \text{Cov}[X_i, X_j] &= -N\pi_i \pi_j, \ \text{for}\ i \neq j \end{align} \]
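A quick numerical check of these expressions (assumes numpy; \(N\) and \({\bf \pi}\) are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(7)

N = 50
pi = np.array([0.2, 0.3, 0.5])
x = rng.multinomial(N, pi, size=200_000)    # samples, shape (200000, 3)

print(x.mean(axis=0))             # ~ N * pi
print(x.var(axis=0))              # ~ N * pi * (1 - pi)
print(np.cov(x, rowvar=False))    # off-diagonal entries ~ -N * pi_i * pi_j
```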

Multivariate normal distribution

Multivariate normal distribution
Consider multiple continuous RVs / features simultaneously, \({\bf x} = (x_1, \dots, x_p)^\intercal\). The distribution with mean vector \({\bf \mu}\) and \(p \times p\) covariance matrix \({\bf \Sigma}\) takes the form \[ f({\bf x}) = \frac 1{(2\pi)^{p/2} \vert {\bf \Sigma}\vert^{1/2}}\text{e}^{-\frac 12 ({\bf x} - {\bf \mu})^\intercal {\bf \Sigma}^{-1} ({\bf x} - {\bf \mu})} \] often abbreviated as \({\bf x}\sim \mathcal{N}({\bf \mu}, {\bf \Sigma})\).

Multivariate normal distribution

Parameters
\({\bf \mu}\) and \({\bf \Sigma}\) are parameters of the distribution, with entries \(\mu_i = E[X_i]\) and \(\Sigma_{ij} = \sigma_{ij}^2 = \text{Cov}[X_i, X_j]\).
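A minimal sampling sketch (assumes numpy; \({\bf \mu}\) and \({\bf \Sigma}\) are arbitrary choices): the sample mean and sample covariance of draws from \(\mathcal{N}({\bf \mu}, {\bf \Sigma})\) should recover the parameters.

```python
import numpy as np

rng = np.random.default_rng(8)

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

x = rng.multivariate_normal(mu, Sigma, size=500_000)   # samples, shape (500000, 2)

print(x.mean(axis=0))             # ~ mu
print(np.cov(x, rowvar=False))    # ~ Sigma
```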

Summary

Summary

Chebyshev’s inequality
loose bound on tail/extreme events when only the first two moments of a distribution are known
Moment-generating function
Series expansion of the exponential function; moments about the origin are easily extracted as derivatives at \(t=0\)
Multivariate probability distributions
  • Multi-categorical extends Bernoulli to \(K\) categories
  • Multinomial extends Binomial to \(K\) categories
  • Multivariate normal distribution: compact form parameterized by a mean vector \({\bf \mu}\) and covariance matrix \({\bf \Sigma}\)

References

Amendola, Luca. 2021. “Lecture Notes on Statistical Methods.” https://www.thphys.uni-heidelberg.de/%7Eamendola/teaching/compstat-hd.pdf.
Wackerly, Dennis, William Mendenhall, and Richard L Scheaffer. 2014. Mathematical Statistics with Applications. Cengage Learning.