Computational Statistics & Data Analysis (MVComp2)

Lecture 3: Discrete & continuous distributions

Tristan Bereau

Institute for Theoretical Physics, Heidelberg University

Introduction

Literature

Today’s lecture
Based on: Ch. 2-6 in Wackerly, Mendenhall, and Scheaffer (2014)

Recap from last time

  • Random variable maps event onto numbers
  • Def. of expected value (population mean) and variance for discrete and continuous probability distributions
  • Linearity of expectations makes combinations easy
  • Conditional expectations
  • Constant shift does not affect the variance
  • Covariance and correlation to relate two/many random variables
  • Sample mean and variance as empirical measures

Discrete probability distributions

Intro: Discrete distributions

Probability distribution
Function that assigns probabilities to all possible outcomes of a random variable
  • For a discrete probability distribution \(P(X)\), RV \(X\) is discrete, i.e., countable outcomes
  • \(P(X=x) = p(x)\) is the probability for the set of elementary outcomes for which \(X=x\)
  • \(p(x) \geq 0\) and \(\sum_{x_i} p(x_i) = 1\)
  • Discrete prob. distribution is called probability mass function (PMF)

Intro: Discrete distributions

Cumulative distribution function (CDF)
\[ F(x_0) := P(X \leq x_0) = \sum_{x \leq x_0} p(x) \]

Uniform distribution

Uniform distribution
Assigns the same probability \(p(x) = 1/N\) to each of \(N\) possible outcomes \(x\).

e.g., fair die.

Bernoulli distribution

For a binary RV \(X \in \{ 0, 1\}\) (e.g., toss a coin or open/close configuration of an ion channel), the Bernoulli distribution assigns the probability \[ P(x) = \pi^x (1-\pi)^{1-x} \]

such that we observe

  • outcome 1 with probability \(P(X=1) := \pi\)
  • outcome 0 with probability \(P(X=0) := 1-\pi\)

\(\pi\) is the parameter of the distribution.

Bernoulli distribution

We have \[ E[X] = \pi \cdot 1 + (1-\pi) \cdot 0 = \pi = P(X=1) \] \[ \begin{align*} \text{Var}[X] &= E\left[(X-\mu)^2\right] \\ & = E[X^2] - E[X]^2 \\ & = \pi\cdot 1^2 + (1-\pi) \cdot 0^2 - \pi^2 \\ &= \pi(1-\pi) \end{align*} \]
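A minimal numerical check in Python (NumPy), with \(\pi = 0.3\) and the sample size chosen purely for illustration:

```python
# Check E[X] = pi and Var[X] = pi*(1 - pi) by simulation; pi = 0.3 is an example value.
import numpy as np

rng = np.random.default_rng(0)
pi = 0.3
x = rng.binomial(n=1, p=pi, size=100_000)  # Bernoulli = binomial with N = 1

print(x.mean(), pi)             # sample mean ~ E[X] = pi
print(x.var(), pi * (1 - pi))   # sample variance ~ Var[X] = pi*(1 - pi)
```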

Binomial distribution

Binomial distribution
Results from \(N\) identical and independent repetitions of a Bernoulli experiment. Yields a sample space of binary \(N\)-tuples.
Independent and identically distributed (iid)
This property plays an important role in statistics.

Probability of a single \(N\)-tuple with \(k\) 1’s and \((N-k)\) 0’s: \[ \pi^k (1-\pi)^{N-k} \]

Binomial distribution

Typically we are interested in the number \(X\) of hits within the \(N\) observations: indistinguishability!

Sum up all the combinations of hits within the observations. Number of \(N\)-tuples with exactly \(X=k\) hits: \(N \choose k\)

Binomial probability distribution
\[ P(X=k) = {N \choose k} \pi^k (1-\pi)^{N-k} \]

Binomial distribution

Notation

For a RV \(X\) drawn from a binomial distribution with given parameters \(\pi\) and \(N\) we also write \[ X \sim B(\pi, N) \] where the tilde stands for “distributed according to.”

Binomial distribution

Expected value

\[ E[X] = N\pi \]

Variance

\[ \text{Var}[X] = N\pi(1-\pi) \]

Proof

Linearity of expectations and iid property of the \(N\) observations. See also T3.7 in Wackerly, Mendenhall, and Scheaffer (2014).

Binomial distribution

Cumulative distribution function

\[ F(K) = \sum_{k=0}^K {N \choose k} \pi^k (1-\pi)^{N-k} \]
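A short sketch of evaluating the binomial PMF and CDF with scipy.stats; the values of \(N\) and \(\pi\) below are arbitrary examples:

```python
# Binomial PMF, CDF, and moments via scipy.stats; N = 10, pi = 0.4 are example parameters.
from scipy.stats import binom

N, pi = 10, 0.4
dist = binom(N, pi)

print(dist.pmf(3))               # P(X = 3) = C(10, 3) * 0.4**3 * 0.6**7
print(dist.cdf(3))               # F(3) = sum_{k=0}^{3} P(X = k)
print(dist.mean(), dist.var())   # N*pi = 4.0 and N*pi*(1 - pi) = 2.4
```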

Geometric distribution

Similar to binomial: defined on strings of binary outcomes. But only sequences that have all 0’s except for the last entry, which is a 1, i.e., \(A_1 = (1), A_2=(0\ 1), A_3 = (0\ 0\ 1), \dots\)

\(A_k\)
has \((k-1)\) 0’s before the first hit

Intuition

The experiment is repeated until the first hit. It can represent the waiting time for an event when time is discretized into fixed intervals.

Geometric distribution

Probability mass function

Follows from a sequence of \(k\) iid Bernoulli experiments \[ P(A_k) = P(X=k) = (1-\pi)^{k-1}\pi \]

Geometric distribution

Cumulative distribution function

\[ F(k) = 1-(1-\pi)^k \]

Proof

\[ F(k) := P(X\leq k) = \sum_{i=1}^k (1-\pi)^{i-1}\pi = \frac{1-(1-\pi)^k}{1-(1-\pi)} \pi = 1-(1-\pi)^k \]

Geometric distribution

Expectation

\[ E[X] = \frac 1\pi \]

Variance

\[ \text{Var}[X] = \frac{1-\pi}{\pi^2} \]
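For reference, scipy.stats.geom uses the same convention as above (support \(k = 1, 2, \dots\)); a minimal sketch with an example value of \(\pi\):

```python
# Geometric PMF, CDF, and moments; pi = 0.25 is an example parameter.
from scipy.stats import geom

pi = 0.25
dist = geom(pi)

print(dist.pmf(4))   # (1 - pi)**3 * pi
print(dist.cdf(4))   # 1 - (1 - pi)**4
print(dist.mean())   # 1/pi = 4.0
print(dist.var())    # (1 - pi)/pi**2 = 12.0
```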

Hypergeometric distribution

  • Binomial: draws with replacement
  • Hypergeometric: number of successes from a sequence of draws without replacement, e.g., colored balls from an urn, adsorption of specific molecules on a surface

Hypergeometric distribution

Select \(n\) elements from a population of \(N\): each sample point has probability \(1/{N \choose n}\). What is the probability of drawing \(y\) red balls?

  • \(y\) red balls means drawing \(n-y\) white balls
  • Total number of red balls in the urn: \(r\)
  • Product rule:
    • select \(y\) red balls: \(r \choose y\)
    • select \(n-y\) white balls: \(N-r \choose n-y\)


\[ P(y) = \frac{{r \choose y}{N-r \choose n-y}}{N \choose n} \]

Hypergeometric distribution

Expected value

\[ E[Y] = \frac{nr}{N} \]

Variance

\[ \text{Var}[Y] = n\frac{r}{N}\frac{N-r}{N}\frac{N-n}{N-1} \]
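A minimal sketch with scipy.stats.hypergeom; note its argument order (population size, number of successes, sample size), and the urn sizes below are example values:

```python
# Urn with N = 20 balls, r = 7 red, draw n = 5 without replacement (example numbers).
from scipy.stats import hypergeom

N_pop, r, n = 20, 7, 5
dist = hypergeom(N_pop, r, n)   # scipy order: hypergeom(M=population, n=successes, N=draws)

print(dist.pmf(2))    # P(Y = 2) = C(7, 2) * C(13, 3) / C(20, 5)
print(dist.mean())    # n*r/N = 1.75
print(dist.var())     # n * (r/N) * ((N - r)/N) * ((N - n)/(N - 1))
```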

Poisson distribution

Poisson distribution
Derived from the Binomial in the limit of \(N\to \infty\) and \(\pi \to 0\). Applications: counting events (like car accidents or particles hitting a surface) in small time intervals \(\Delta t \to 0\).

\[ P(X=k) = \lim_{N\to\infty} {N \choose k} \pi^k (1-\pi)^{N-k} \]

Poisson distribution

Define \(\lambda := N\pi\)

\[ \begin{align} & \lim_{N\to\infty} {N \choose k} \pi^k (1-\pi)^{N-k} \\ &= \lim_{N\to\infty} \frac{N(N-1)\dots(N-k+1)}{k!}\left(\frac{\lambda}{N} \right)^k\left(1-\frac{\lambda}{N} \right)^{N-k} \\ &= \frac{\lambda^k}{k!} \lim_{N\to\infty} \left(1-\frac{\lambda}{N} \right)^N \left(1-\frac{\lambda}{N} \right)^{-k} \left( 1 - \frac{1}{N} \right) \dots \left( 1 - \frac{k-1}{N} \right) \\ &= \frac{\lambda^k}{k!} \text{e}^{-\lambda} \end{align} \]

Poisson distribution

Expectation & variance

\[ E[X] = \text{Var}[X] = \lambda \]
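The limit can also be checked numerically: the binomial PMF with \(\pi = \lambda/N\) approaches the Poisson PMF as \(N\) grows (example values below):

```python
# Binomial(N, lambda/N) PMF converges to Poisson(lambda) PMF; lam = 3, k = 2 are examples.
from scipy.stats import binom, poisson

lam, k = 3.0, 2
for N in (10, 100, 10_000):
    print(N, binom(N, lam / N).pmf(k))
print("Poisson:", poisson(lam).pmf(k))   # lam**k / k! * exp(-lam)
```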

Continuous probability distributions

Intro: Continuous distributions

Continuous probability distribution
Function that assigns probabilities to the possible outcomes of a random variable

For a continuous probability distribution, the RV \(X\) is continuous, i.e., it has uncountably many possible outcomes.

Example application: amount of rainfall in some area.

Cumulative distribution function

The distribution of a continuous RV is defined via its CDF \[ F(x) := P(X\leq x) = \int_{-\infty}^x \text{d}u\,f(u) \]

  • \(F(x)\) continuously differentiable (except at finitely many points)
  • \(\lim_{x\to-\infty}F(x) = 0\), \(\lim_{x\to\infty}F(x) = 1\)
  • \(F(x)\) must be monotonically non-decreasing, i.e., \(x_1 < x_2 \Rightarrow F(x_1) \leq F(x_2)\)

Probability density function

Assign probability to an interval \((x_0, x_1]\) \[ P(x_0 < X \leq x_1) = F(x_1) - F(x_0) = \int_{x_0}^{x_1} \text{d}u\, f(u) \]

Definition

\[f(x) = \frac{\text{d}F(x)}{\text{d}x}\]

  • \(f(x) \geq 0\)
  • \(\int_{-\infty}^\infty \text{d}x\,f(x) = 1\)

Quantiles

Quantile of the distribution

Quantiles
Cut points dividing the range of a probability distribution into continuous intervals with equal probabilities.
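In scipy.stats, quantiles are obtained from the inverse CDF (the `ppf` method); a minimal sketch for the quartiles of a standard normal:

```python
# Quantiles as the inverse of the CDF; here the quartiles of N(0, 1).
from scipy.stats import norm

for q in (0.25, 0.5, 0.75):
    print(q, norm.ppf(q))        # ppf = percent point function = inverse CDF
print(norm.cdf(norm.ppf(0.75)))  # recovers 0.75
```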

Conditional distribution function

\[ F(x|y) := P(X \leq x | Y = y) = \int_{-\infty}^x \text{d}u\, f_x(u|y) = \int_{-\infty}^x \text{d}u\, \frac{f_x(u, y)}{f_y(y)} \]

Marginal distribution function

\[ F(x) = \int_{-\infty}^\infty \text{d}u\, F(x|u)f_y(u) \]

Independent random variables

For independent RVs \(X\) and \(Y\) we get \[ F(x,y) = F_x(x) F_y(y) \]

\[ f(x,y) = f_x(x)f_y(y) \]

Other properties

Expectation & (co)variance

Defined analogously to the discrete case, replacing sums by integrals.

Rules of linearity

Also hold for the continuous case.

Uniform distribution

\[ f(x) := \left\{ \begin{align} \frac{1}{\theta_2-\theta_1}, & \quad x\in [\theta_1, \theta_2], \theta_1 < \theta_2\\ 0, & \quad \text{else}\\ \end{align} \right. \]

\[ E[X] = \frac{\theta_1 + \theta_2}{2} \]

\[ \text{Var}[X] = \frac{(\theta_2-\theta_1)^2}{12} \]

The uniform distribution is an important reference for generating random numbers and statistical test scenarios.
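A minimal check of these moments by drawing uniform random numbers with NumPy; the interval bounds are example values:

```python
# Sample mean and variance of Uniform[theta1, theta2]; bounds 2.0 and 5.0 are examples.
import numpy as np

rng = np.random.default_rng(1)
theta1, theta2 = 2.0, 5.0
x = rng.uniform(theta1, theta2, size=200_000)

print(x.mean(), (theta1 + theta2) / 2)        # ~3.5
print(x.var(), (theta2 - theta1) ** 2 / 12)   # ~0.75
```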

Gaussian (normal) distribution

For parameters \(-\infty < \mu < \infty\) and \(\sigma > 0\), the normal distribution reads \[ f(x) = \frac{1}{\sqrt{2\pi}\sigma} \text{e}^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \]

\[ E[X] = \mu \]

\[ \text{Var}[X] = \sigma^2 \]

(Proof: substitute the standardized variable \(z = \frac{x-\mu}{\sigma}\) and compute \(E[X] = \int_{-\infty}^\infty \text{d}z \, (\sigma z + \mu) f(z) = \mu\)…)

Standard normal distribution

Notation

Normal distribution often denoted \(\mathcal{N}(\mu, \sigma^2)\)

Standard normal distribution
The variable \(z\) from the previous slide corresponds to zero mean and unit variance: \(\mathcal{N}(0, 1)\)
Transform from standard normal to any normal \(\mathcal{N}(\mu, \sigma^2)\)
\(x = \sigma z + \mu\)
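A minimal sketch of this transform with NumPy; \(\mu\) and \(\sigma\) are example values:

```python
# Draw z ~ N(0, 1) and transform via x = sigma*z + mu to get x ~ N(mu, sigma^2).
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 1.5, 0.5
z = rng.standard_normal(100_000)
x = sigma * z + mu

print(x.mean(), x.std())   # approximately mu = 1.5 and sigma = 0.5
```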

Normal distribution

The normal distribution is symmetric around its mean, which coincides with its mode (unique maximum). As a result, the mean also equals the median.

Beta distribution

Distribution bounded to the interval \(x \in [0,1]\) (e.g., for the success probability of Bernoulli-type experiments) \[ f(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1} \] where we define the Gamma function \[ \Gamma(\alpha) := \int_0^\infty \text{d}x\, x^{\alpha-1}\text{e}^{-x} \]

Beta distribution

Distribution bounded to the interval \(x \in [0,1]\) (e.g., for the success probability of Bernoulli-type experiments) \[ f(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1} \] Notice the similarity with the binomial distribution. The Beta is the conjugate prior for the binomial in a Bayesian approach: when used as a prior distribution and multiplied with a binomial likelihood, it yields a Beta distribution as posterior!

Beta distribution

\[ E[X] = \frac{\alpha}{\alpha+\beta} \]

\[ \text{Var}[X] = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)} \]
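A quick consistency check of these moments with scipy.stats.beta; the shape parameters are example values:

```python
# Beta(alpha, beta) moments; alpha = 2, beta = 5 are example shape parameters.
from scipy.stats import beta

a, b = 2.0, 5.0
dist = beta(a, b)

print(dist.mean(), a / (a + b))                          # alpha / (alpha + beta)
print(dist.var(), a * b / ((a + b) ** 2 * (a + b + 1)))  # matches the formula above
```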

Gamma distribution

An important distribution that produces skewed densities. \[ f(x) = \left\{ \begin{align} \frac{x^{\alpha-1}\text{e}^{-\frac x\beta}}{\beta^\alpha \Gamma(\alpha)}, & \quad 0 \leq x < \infty\\ 0, &\quad \text{else} \\ \end{align} \right. \] where \(\alpha\) affects the shape of the Gamma distribution (shape parameter) and \(\beta\) affects the scale (scale parameter).

Gamma distribution

  • \(\alpha\): shape parameter
  • \(\beta\): scale parameter

Gamma distribution

\[ E[X] = \alpha\beta \]

\[ \text{Var}[X] = \alpha\beta^2 \]

Proof

\[ \begin{align} E[X] &= \int_0^\infty \text{d}x\, x\frac{x^{\alpha-1}\text{e}^{-\frac x\beta}}{\beta^\alpha \Gamma(\alpha)} = \frac 1{\beta^\alpha \Gamma(\alpha)} \int_0^\infty \text{d}x\, x^\alpha \text{e}^{-\frac x\beta} \\ &= \frac 1{\beta^\alpha \Gamma(\alpha)} \beta^{\alpha+1} \Gamma(\alpha+1) = \frac{\beta \alpha\Gamma(\alpha)}{\Gamma(\alpha)} = \alpha\beta \end{align} \] where we use the identity \(\Gamma(\alpha+1) = \alpha\Gamma(\alpha)\).
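The same moments can be checked with scipy.stats.gamma, which also uses a shape and a scale parameter (example values below):

```python
# Gamma(shape=alpha, scale=beta) moments; alpha = 3, beta = 2 are example parameters.
from scipy.stats import gamma

alpha, beta_ = 3.0, 2.0
dist = gamma(alpha, scale=beta_)

print(dist.mean(), alpha * beta_)       # E[X]   = alpha * beta = 6.0
print(dist.var(), alpha * beta_ ** 2)   # Var[X] = alpha * beta**2 = 12.0
```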

Gamma distribution

The Gamma can reduce to other distributions: specific choices of \(\alpha\) and \(\beta\) connect it to sums of Poisson probabilities, the \(\chi^2\) distribution, or the exponential distribution.

Exponential distribution

Start from the Gamma distribution and set \(\alpha=1\): \[ f(x) = \left\{ \begin{align} \frac 1\beta \text{e}^{-\frac x\beta}, & \quad 0 \leq x < \infty\\ 0, &\quad \text{else} \\ \end{align} \right. \] where from above we infer \(E[X]=\sqrt{\text{Var}[X]} = \beta\).

Exponential distribution

  • Applications: decay processes, like the lifetime of electrical components.
  • For events generated according to a Poisson process, the inter-event intervals follow an exponential distribution.
  • Exponential (and geometric) distributions are memoryless: the distribution does not depend on how long we have already been waiting for an event! (See the numerical check after the equation below.)

\[ P(X > t_0 + t_1 | X > t_0) = P(X > t_1) \]
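A minimal simulation of the memoryless property; \(\beta\), \(t_0\), and \(t_1\) are example values:

```python
# Check P(X > t0 + t1 | X > t0) = P(X > t1) for exponential samples (beta = 2 is an example).
import numpy as np

rng = np.random.default_rng(3)
beta_, t0, t1 = 2.0, 1.0, 0.5
x = rng.exponential(scale=beta_, size=1_000_000)

lhs = np.mean(x[x > t0] > t0 + t1)     # conditional survival probability
rhs = np.mean(x > t1)                  # unconditional survival probability
print(lhs, rhs, np.exp(-t1 / beta_))   # all approximately equal (~0.779)
```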

Exponential family distributions & conjugate priors

Exponential family distributions

Definition

The exponential family is a broader class of distributions that share certain properties. We can write such a distribution as \[ p(x|\eta) = \frac{h(x)}{Z(\eta)} \exp[\eta^\intercal T(x)] = h(x) \exp[\eta^\intercal T(x) - A(\eta)] \]

  • \(h(x)\) is a scaling constant
  • \(\eta\) are the natural parameters
  • \(T(x)\) are the sufficient statistics
  • \(A(\eta) = \log Z(\eta)\) is the log partition function

Exponential family: Bernoulli

Rewrite the Bernoulli distribution

\[ \begin{align} P(x|\mu) &= \mu^x (1-\mu)^{1-x} \\ &= \exp[x \log(\mu) + (1-x) \log(1-\mu)] \\ &= \exp\left[x \log\left(\frac\mu{1-\mu}\right) + \log(1-\mu)\right] \\ &= \exp[T(x)\eta - A(\eta)] \end{align} \] such that \(T(x)=x\), \(\eta = \log(\frac \mu{1-\mu})\), \(h(x)=1\).
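A small sketch verifying that the exponential-family form reproduces the Bernoulli PMF; \(\mu = 0.7\) is an example value:

```python
# Bernoulli in exponential-family form: eta = log(mu/(1-mu)), A(eta) = log(1 + exp(eta)).
import numpy as np

mu = 0.7
eta = np.log(mu / (1 - mu))   # natural parameter
A = np.log(1 + np.exp(eta))   # log partition function, equals -log(1 - mu)

for x in (0, 1):
    print(mu**x * (1 - mu)**(1 - x), np.exp(x * eta - A))   # identical values
```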

Log partition function is cumulant generating function

Important property

Exponential family: derivatives of the log partition function generate all the cumulants of the sufficient statistics. For the first and second cumulants we obtain \[ \begin{align*} \nabla A(\eta) =& E[T(x)] \\ \nabla^2 A(\eta) =& \text{Cov}[T(x)] \end{align*} \] From the second equation we conclude that the Hessian is positive definite, and hence \(A(\eta)\) is convex in \(\eta\). Consequently, the log likelihood is concave.
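A numerical check for the Bernoulli case, where \(A(\eta) = \log(1 + e^\eta)\); central finite differences stand in for the gradients, and \(\eta = 0.4\) is an example value:

```python
# dA/deta should equal E[T(x)] = mu and d^2A/deta^2 should equal Var[x] = mu*(1 - mu).
import numpy as np

A = lambda eta: np.log(1 + np.exp(eta))   # Bernoulli log partition function
eta, h = 0.4, 1e-4
mu = 1 / (1 + np.exp(-eta))               # sigmoid(eta)

dA  = (A(eta + h) - A(eta - h)) / (2 * h)
d2A = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2

print(dA, mu)               # first cumulant:  E[x]
print(d2A, mu * (1 - mu))   # second cumulant: Var[x]
```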

Conjugate prior

Consider the subset of (prior, likelihood) pairs for which we can compute the posterior in closed form.

Conjugate prior
a prior \(p(\theta) \in \mathcal{F}\) is a conjugate prior for a likelihood function \(p(\mathcal{D}|\theta)\) if the posterior is in the same parameterized family as the prior, i.e., \(p(\theta | \mathcal{D}) \in \mathcal{F}\).

If the family \(\mathcal{F}\) corresponds to the exponential family, then the computations can be performed in closed form.

Examples: Beta-binomial and Gaussian-Gaussian
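A minimal sketch of the Beta-binomial case: a Beta(\(a, b\)) prior combined with \(k\) successes in \(N\) Bernoulli trials gives a Beta(\(a+k,\ b+N-k\)) posterior (all numbers below are illustrative):

```python
# Conjugate update: Beta prior + binomial likelihood -> Beta posterior.
from scipy.stats import beta

a, b = 2.0, 2.0    # prior pseudo-counts
N, k = 10, 7       # observed: 7 successes in 10 trials

posterior = beta(a + k, b + N - k)
print(posterior.mean())   # (a + k) / (a + b + N) = 9/14
```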

Summary

Summary

Discrete probability distributions
Uniform, Bernoulli, Binomial, Geometric, Hypergeometric, Poisson
Continuous probability distributions
  • prob distr. function as derivative of CDF
  • Uniform, Gaussian (normal), Beta, Gamma, Exponential
Exponential family distributions
  • Log partition function is cumulant generating function
  • Conjugate prior: prior and posterior stay in the same family for a given likelihood

References

Murphy, Kevin P. 2022. Probabilistic Machine Learning: An Introduction. MIT press.
Wackerly, Dennis, William Mendenhall, and Richard L Scheaffer. 2014. Mathematical Statistics with Applications. Cengage Learning.