Lecture 4: Error propagation, Chebyshev's inequality, moment-generating functions, the central limit theorem, and multivariate distributions
Institute for Theoretical Physics, Heidelberg University
Today’s lecture
Given an RV \(X\) and its PDF \(f(x)\), we might be interested in the (monotonic) function \(y(x)\). Conservation of probability leads to \[ f(x)\text{d}x = g(y) \text{d}y \] and therefore the new PDF \(g(y)\) yields \[ g(y) = f(x)\left\vert \frac{\text{d}x}{\text{d}y} \right\vert \]
Analogously, the transformation from \(x_1, x_2, \dots\) to \(y_1, y_2, \dots\) relies on the Jacobian of the transformation \[ J_{ij} = \frac{\partial x_i}{\partial y_j} \] such that \[ g(y_1, y_2, \dots) = f(x_1, x_2, \dots)\left\vert J \right\vert \] where \(\vert J \vert\) denotes the determinant of the Jacobian.
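As a numerical sanity check of the one-dimensional rule above (a minimal sketch; the exponential distribution and the map \(y = x^2\) are illustrative choices, not from the lecture), one can histogram transformed samples and compare with \(f(x)\left\vert \text{d}x/\text{d}y \right\vert\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choice: X ~ Exp(1), so f(x) = exp(-x) for x > 0, and y(x) = x**2 (monotonic)
x = rng.exponential(scale=1.0, size=500_000)
y = x**2

# Predicted PDF: g(y) = f(x(y)) |dx/dy| with x(y) = sqrt(y) and dx/dy = 1 / (2 sqrt(y))
counts, edges = np.histogram(y, bins=100, range=(0.1, 4.0))
centers = 0.5 * (edges[:-1] + edges[1:])
g_emp = counts / (len(y) * np.diff(edges))          # empirical density of Y
g_pred = np.exp(-np.sqrt(centers)) / (2.0 * np.sqrt(centers))

print(np.max(np.abs(g_emp - g_pred)))               # small, up to MC and binning error
```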
Use the transformation of variables to find the error (standard deviation) associated with a function of an RV \(X\) in the limit of small deviations from the mean.
Suppose \(X\) is distributed as \(f(x)\) with mean \(\mu\) and variance \(\sigma^2\). We are interested in the PDF of the function \(y(x)\). Expand \(y\) around \(\mu\) \[ y(x) \approx y(\mu) + \frac{\text{d}y}{\text{d}x}\Big\vert_\mu (x-\mu) \]
\[ \begin{align*} E[Y] =& \int \text{d}x\, y(\mu)f(x) + \int \text{d}x \, \frac{\text{d}y}{\text{d}x}\Big\vert_\mu (x-\mu) f(x) \\ =& \int \text{d}x\, y(\mu)f(x) + \frac{\text{d}y}{\text{d}x}\Big\vert_\mu \underbrace{\int \text{d}x \,(x-\mu) f(x)}_{0} \\ =& y(\mu)\underbrace{\int \text{d}x\, f(x)}_1 \\ =& y(\mu) \end{align*} \]
\[ \begin{align*} E[Y^2] =& \int \text{d}x\, \left[ y^2(\mu) + (y'(\mu))^2(x-\mu)^2 + 2y(\mu)y'(\mu)(x-\mu) \right] f(x) \\ =& y^2(\mu) + y'^2 \sigma_X^2 \end{align*} \] where \(y' = \frac{\text{d}y}{\text{d}x}\Big\vert_\mu\).
\[ \sigma_Y^2 = E[(Y - y(\mu))^2] = E[Y^2] - y^2(\mu) = y'^2\sigma_X^2 \]
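A minimal Monte Carlo sketch of this linearized error propagation (the setup \(X \sim \mathcal N(\mu, \sigma^2)\) with small \(\sigma\) and \(y(x) = \text{e}^x\) is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative choice: X ~ N(mu, sigma^2) with small sigma, and y(x) = exp(x)
mu, sigma = 2.0, 0.05
x = rng.normal(mu, sigma, size=1_000_000)
y = np.exp(x)

# Linearized error propagation predicts sigma_Y ~ |y'(mu)| * sigma_X = exp(mu) * sigma
print(y.std(), np.exp(mu) * sigma)   # agree at the per-mille level for small sigma
```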
For independent RVs \(\{X_i\}\), the joint PDF factorizes as \(f(x_1, x_2, \dots) = \prod_i f_i(x_i)\). The function \(y(x_1, x_2, \dots)\) then has variance \[ \sigma_Y^2 = \sum_i {y'_i}^2\sigma_{X_i}^2 \] where \(y'_i = \frac{\partial y}{\partial x_i}\) evaluated at the means.
Consider \(y = x_1 + x_2 + \dots + x_n\) with independent RVs \(\{X_i\}\). Error propagation tells us that \[ \sigma_Y^2 = \sum_i \sigma_{X_i}^2 \] The variance of a sum of independent RVs is the sum of the variances.
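A quick numerical check of this additivity (the specific distributions are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative choice: two independent RVs with different distributions
x1 = rng.normal(0.0, 2.0, size=1_000_000)      # Var[X1] = 4
x2 = rng.uniform(-3.0, 3.0, size=1_000_000)    # Var[X2] = 6**2 / 12 = 3

# Variance of the sum vs sum of the variances (both close to 7)
print(np.var(x1 + x2), np.var(x1) + np.var(x2))
```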
Consider the sample mean of a number of data points \(x_i\) from the same distribution with variance \(\sigma^2\): \(\bar x = \frac 1N \sum_i x_i\). We’re looking for the variance of the sample mean: \[ \begin{align*} \text{Var}[\bar{X}] &= \text{Var}[\frac{X_1 + X_2 + \dots + X_N}N] \\ &= \frac 1{N^2} (\text{Var}[X_1]+ \text{Var}[X_2]+ \dots + \text{Var}[X_N]). \end{align*} \] So the variance of the sample mean is \[ \sigma_{\bar x}^2 = \frac 1{N^2} \sum_i \sigma^2 = \frac{\sigma^2}{N} \] The variance of the mean is smaller than the variance of each individual data point by a factor \(1/N\).
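A corresponding Monte Carlo check of the \(1/N\) scaling (the values of \(\sigma\), \(N\), and the number of repetitions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

sigma, N, trials = 1.5, 100, 20_000

# Many independent samples of size N drawn from the same distribution
samples = rng.normal(0.0, sigma, size=(trials, N))
means = samples.mean(axis=1)                    # one sample mean per row

# Variance of the sample mean vs sigma^2 / N (both close to 0.0225)
print(means.var(), sigma**2 / N)
```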
Given an RV \(X\), what can we say about the probability of an event such as \(X \leq x_0\) or \(X \geq x_0\), i.e., about \(P(X \leq x_0)\) or \(P(X \geq x_0)\)?
If we know the density (or mass) function up to that point, it’s easy \[ P(X \leq x_0) = \int_{-\infty}^{\color{red}{x_0}} \text{d}u \, f(u) \]
What if we don’t know \(f(x)\)?
Let’s assume we don’t know \(f(x)\), but we do know the first two moments, \(\mu\) and \(\sigma^2\).
Chebyshev’s inequality
\[ P\left(|X-\mu| \geq k\sigma\right) \leq \frac 1{k^2} \]
Decompose the variance \[ \begin{align} \text{Var}[X] =& \int_{\color{red}{-\infty}}^{\color{red}{\mu-k\sigma}} \text{d}s\, (s-\mu)^2 f(s) \\ &+ \int_{\color{red}{\mu-k\sigma}}^{\color{red}{\mu+k\sigma}} \text{d}s\, (s-\mu)^2 f(s) \\ &+ \int_{\color{red}{\mu+k\sigma}}^{\color{red}{\infty}} \text{d}s\, (s-\mu)^2 f(s) \\ \end{align} \]
The middle contribution is non-negative, and in the two tails \((s-\mu)^2 \geq (k\sigma)^2\). Dropping the middle term and bounding the tail integrands therefore gives \[ \text{Var}[X] = \sigma^2 \geq (k\sigma)^2 \left( \int_{-\infty}^{\mu-k\sigma} \text{d}s\, f(s) + \int_{\mu+k\sigma}^{\infty} \text{d}s\, f(s) \right) \]
Dividing both sides by \((k\sigma)^2\), \[ \frac 1{k^2} \geq P(X \leq \mu - k\sigma) + P(X \geq \mu + k\sigma) = P(\vert X-\mu \vert \geq k\sigma) \]
Note
Chebyshev’s inequality is useful to establish loose bounds on the probability of an event, particularly when the underlying distribution is unknown or difficult to evaluate.
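A minimal numerical illustration (the exponential distribution, for which \(\mu = \sigma = 1\), is an illustrative choice): the empirical tail probability always stays below the Chebyshev bound \(1/k^2\), typically by a wide margin.

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative choice: X ~ Exp(1), for which mu = sigma = 1
x = rng.exponential(1.0, size=1_000_000)
mu, sigma = 1.0, 1.0

for k in (2, 3, 4):
    p_tail = np.mean(np.abs(x - mu) >= k * sigma)   # empirical P(|X - mu| >= k sigma)
    print(k, p_tail, 1.0 / k**2)                    # tail probability vs Chebyshev bound
```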
Statement
Given two RVs \(Y\) and \(Z\) with identical sets of moments: \[ \begin{align*} \mu'_{1Y} =& \mu'_{1Z}\\ \mu'_{2Y} =& \mu'_{2Z}\\ \mu'_{3Y} =& \mu'_{3Z}\\ &\vdots \end{align*} \] the two distributions are identical.
Definition
The moment-generating function \(m(t)\) for an RV \(Y\) is defined as \[ m_Y(t) = E[\text{e}^{tY}] \]
\(m(t)\) is said to exist if it is finite for all \(|t| \leq b\), for some positive constant \(b\).
For a discrete RV: \[ m_Y(t) = E[\text{e}^{tY}] = \sum_y \text{e}^{ty} p(y) \]
For a continuous RV: \[ m_Y(t) = E[\text{e}^{tY}] = \int \text{d}y \, \text{e}^{ty} f(y) \]
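As a sketch of the definition (the standard normal, whose MGF \(\text{e}^{t^2/2}\) is a known closed form, is an illustrative choice), the expectation can be estimated by a simple sample average:

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative choice: Y ~ N(0, 1), whose MGF is known to be exp(t**2 / 2)
y = rng.normal(0.0, 1.0, size=1_000_000)

for t in (0.5, 1.0, 1.5):
    m_est = np.mean(np.exp(t * y))          # sample average of e^{tY}
    print(t, m_est, np.exp(0.5 * t**2))     # estimate vs closed form
```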
Why is it called the moment-generating function? Consider the series expansion of \(\text{e}^{tY}\) \[ \text{e}^{ty} = 1 + ty + \frac{(ty)^2}{2!} + \frac{(ty)^3}{3!} + \dots \]
Assuming that \(\mu'_k\) is finite for \(k=1,2,3,\dots\), we have \[ E[\text{e}^{tY}] = \sum_y \text{e}^{ty} p(y) = \sum_y \left[ 1 + ty + \frac{(ty)^2}{2!} + \frac{(ty)^3}{3!} + \dots \right] p(y) \]
\[ \begin{align} m_Y(t) &= E[\text{e}^{tY}] = \sum_y \left[ 1 + ty + \frac{(ty)^2}{2!} + \frac{(ty)^3}{3!} + \dots \right] p(y) \\ &= \sum_y p(y) + t\sum_y y\, p(y) + \frac{t^2}{2!} \sum_y y^2 p(y) + \frac{t^3}{3!} \sum_y y^3 p(y) + \dots \\ &= 1 + tE[Y] + \frac{t^2}{2!} E[Y^2] + \frac{t^3}{3!} E[Y^3] + \dots \end{align} \]
\[ \boxed{\left.\frac{\text{d}^l m_Y(t)}{\text{d}t^l}\right|_{t=0} = E[Y^l]} \]
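A small symbolic check of this boxed relation (a sketch; the exponential distribution with mean \(\theta\), whose MGF is \(1/(1-\theta t)\), is an illustrative choice not taken from the lecture):

```python
import sympy as sp

t, theta = sp.symbols('t theta', positive=True)

# Illustrative choice: exponential RV with mean theta, whose MGF is 1 / (1 - theta * t)
m = 1 / (1 - theta * t)

# The l-th derivative at t = 0 gives the l-th moment about the origin: l! * theta**l
for l in (1, 2, 3):
    print(l, sp.diff(m, t, l).subs(t, 0))   # theta, 2*theta**2, 6*theta**3
```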
MGF for central moments
\[ m_{X-\mu}(t) = E[\text{e}^{t(X-\mu)}] = \int \text{d}x \, \text{e}^{t(x-\mu)} f(x) \]
MGF for sum of RVs
Given two independent RVs, \(X\) and \(Y\), distributed as \(f(x)\) and \(g(y)\), the MGF of the sum yields \[ \begin{align*} m_{X+Y}(t) =& E[\text{e}^{t(X+Y)}] \\ =& \int \text{d}x\text{d}y \, \text{e}^{t(x+y)} f(x)g(y) \\ =& \int \text{d}x \, \text{e}^{tx} f(x) \int \text{d}y \, \text{e}^{ty} g(y) \\ =& m_X(t)m_Y(t) \end{align*} \]
The MGF of the sum of two independent RVs is the product of the individual MGFs.
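A quick Monte Carlo check of this factorization (the uniform and exponential distributions, and the value of \(t\), are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)

# Illustrative choice: X ~ Uniform(0, 1) and Y ~ Exp(1), independent
x = rng.uniform(0.0, 1.0, size=1_000_000)
y = rng.exponential(1.0, size=1_000_000)

t = 0.3
lhs = np.mean(np.exp(t * (x + y)))                      # m_{X+Y}(t)
rhs = np.mean(np.exp(t * x)) * np.mean(np.exp(t * y))   # m_X(t) * m_Y(t)
print(lhs, rhs)                                         # agree up to MC error
```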
As an example, take a Poisson RV \(X\) with mean \(\lambda\) and mass function \(p(x) = \frac{\lambda^x}{x!}\text{e}^{-\lambda}\). Its MGF (written here with argument \(k\)) is \[ \begin{align} m_X(k) = E[\text{e}^{kX}] &= \sum_{x=0}^\infty \left( \frac{\lambda^x}{x!}\text{e}^{-\lambda} \right) \text{e}^{kx} \\ &= \text{e}^{-\lambda} \sum_{x=0}^\infty \frac{(\lambda \text{e}^k)^x}{x!} \\ &= \text{e}^{-\lambda} \text{e}^{\lambda \text{e}^k} \\ &= \text{e}^{\lambda(\text{e}^k-1)} \end{align} \]
The first two moments about the origin follow from the derivatives \[ \begin{align} \frac{\text{d}}{\text{d}k}\text{e}^{\lambda(\text{e}^k-1)} &= \text{e}^{\lambda(\text{e}^k-1)}(\lambda \text{e}^k) \\ \frac{\text{d}^2}{\text{d}k^2}\text{e}^{\lambda(\text{e}^k-1)} &= \text{e}^{\lambda(\text{e}^k-1)} \left[(\lambda \text{e}^k)^2 + \lambda \text{e}^k\right] \end{align} \] evaluated at \(k=0\):
\[ \begin{align} \mu &= \left.\text{e}^{\lambda(\text{e}^k-1)}(\lambda \text{e}^k)\right|_{k=0} = \color{red}\lambda\\ \mu'_2 &= \left.\text{e}^{\lambda(\text{e}^k-1)} [(\lambda \text{e}^k)^2 + \lambda \text{e}^k]\right|_{k=0} = \lambda^2 + \lambda \\ \sigma^2 &= E[X^2]-\mu^2 = \mu'_2 - \mu^2 = \color{red}\lambda \end{align} \]
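A numerical cross-check of these Poisson results (the values of \(\lambda\) and \(k\) are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

lam = 3.0
x = rng.poisson(lam, size=1_000_000)

# Mean and variance should both be close to lambda
print(x.mean(), x.var())

# Spot-check the MGF against exp(lam * (e^k - 1))
k = 0.4
print(np.mean(np.exp(k * x)), np.exp(lam * (np.exp(k) - 1.0)))
```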
Consider \(n\) iid RVs with mean \(\mu\), variance \(\sigma^2 < \infty\), and an otherwise unknown probability distribution.
Central limit theorem
Define the RV \[ Y := \frac{\sum_{i=1}^n x_i - n\mu}{\sigma \sqrt n} = \frac{\bar x - \mu}{\sigma / \sqrt{n}}. \] \(Y\) tends to a standard normal distribution, \(Y\sim\mathcal{N}(0,1)\), for \(n\to\infty\).
Define the standardized variables \(z_i = \frac{x_i - \mu}\sigma\), s.t. \(\langle z_i \rangle = 0\) and \(\langle z_i^2 \rangle = 1\). \[ Y = \frac 1{\sqrt{n}} \sum_i z_i \]
The moment-generating function of \(Y\) is \[ m_Y(t) = \langle \text{e}^{Yt} \rangle = \langle \text{e}^{z_it / \sqrt{n}} \rangle^n \] where the last equality uses the independence and identical distribution of the \(z_i\).
\[ \begin{align*} m_Y(t) &= \langle \text{e}^{Yt} \rangle = \langle \text{e}^{z_it / \sqrt{n}} \rangle^n \\ &= \langle 1 + \frac{z_it}{\sqrt{n}} + \frac{z_i^2t^2}{2!\,n} + \dots \rangle^n \\ &= \left(1 + \frac{\langle z_i\rangle t}{\sqrt{n}} + \frac{\langle z_i^2\rangle t^2}{2!\,n} + \dots \right)^n \\ &= \left(1 + \frac{t^2}{2n} + \dots \right)^n \xrightarrow{\ n\to\infty\ } \text{e}^{\frac 12 t^2} \end{align*} \] where the neglected higher-order terms are suppressed by additional powers of \(1/\sqrt{n}\) for \(t\) near the origin. We obtain the MGF of a standard normal variable.
Figure: MGF of a Gaussian distribution (thick blue line); normalized sum of independent uniform RVs (thin red lines) for \(n=1,3,5\), from bottom up. (Fig. 2.6 in Amendola (2021))
The CLT guarantees that if the errors in a measurement are the result of many independent errors arising from various parts of the experiment, then they are expected to be Gaussian distributed.
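A minimal numerical illustration of the CLT in the spirit of the figure above (the uniform distribution and the values of \(n\) and the sample size are illustrative choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)

# Illustrative choice: normalized sum of n iid Uniform(0, 1) RVs
n, trials = 5, 200_000
mu, sigma = 0.5, np.sqrt(1.0 / 12.0)       # mean and std of Uniform(0, 1)

x = rng.uniform(0.0, 1.0, size=(trials, n))
y = (x.sum(axis=1) - n * mu) / (sigma * np.sqrt(n))

# Empirical CDF of Y vs the standard normal CDF at a few points
for q in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(q, np.mean(y <= q), stats.norm.cdf(q))
```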
Joint distribution function
\[ F(x_1, x_2) = \int_{-\infty}^{x_1} \text{d}t_1 \int_{-\infty}^{x_2} \text{d}t_2 \, f(t_1, t_2) \]
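For independent components the joint CDF factorizes into the product of the marginal CDFs; a minimal Monte Carlo sketch (independent standard normals are an illustrative choice):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)

# Illustrative choice: X1, X2 independent standard normals, f(t1, t2) = phi(t1) * phi(t2)
x1, x2 = 0.5, 1.0
t = rng.normal(size=(1_000_000, 2))

# Monte Carlo estimate of F(x1, x2) = P(X1 <= x1, X2 <= x2)
mc = np.mean((t[:, 0] <= x1) & (t[:, 1] <= x2))

# For independent components the joint CDF is the product of the marginal CDFs
print(mc, stats.norm.cdf(x1) * stats.norm.cdf(x2))
```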
For a single multinomial trial over \(K\) categories with probabilities \(\pi_i\), we have a vector of expectations and a \(K\times K\) matrix of variances and covariances \[ \begin{align} E[X_i] &= \pi_i \\ \text{Var}[X_i] &= \pi_i ( 1 - \pi_i)\\ \text{Cov}[X_i, X_j] &= -\pi_i \pi_j, \ \text{for}\ i \neq j \end{align} \]
For \(N\) trials, the expected values, variances, and covariances are \[ \begin{align} E[X_i] &= N\pi_i \\ \text{Var}[X_i] &= N\pi_i(1-\pi_i) \\ \text{Cov}[X_i, X_j] &= -N\pi_i \pi_j, \ \text{for}\ i \neq j \end{align} \]
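A numerical check of these multinomial moments (a sketch; the values of \(N\) and \(\pi\) are illustrative):

```python
import numpy as np

rng = np.random.default_rng(10)

# Illustrative choice: N = 10 trials over K = 3 categories
N, pi = 10, np.array([0.2, 0.3, 0.5])
x = rng.multinomial(N, pi, size=200_000)        # shape (200000, 3)

# Sample mean vector vs N * pi_i
print(x.mean(axis=0), N * pi)

# Sample covariance matrix vs N * (diag(pi) - pi pi^T):
# diagonal N*pi_i*(1 - pi_i), off-diagonal -N*pi_i*pi_j
print(np.cov(x, rowvar=False))
print(N * (np.diag(pi) - np.outer(pi, pi)))
```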