Computational Statistics & Data Analysis (MVComp2)

Lecture 3: Discrete & continuous distributions

Tristan Bereau

Institute for Theoretical Physics, Heidelberg University

Introduction

Literature

Today’s lecture
Based on: Ch. 2-6 in Wackerly, Mendenhall, and Scheaffer (2014)

Recap from last time

  • Random variable maps event onto numbers
  • Def. of expected value (population mean) and variance for discrete and continuous probability distributions
  • Linearity of expectations makes combinations easy
  • Conditional expectations
  • Constant shift does not affect the variance
  • Covariance and correlation to relate two/many random variables
  • Sample mean and variance as empirical measures

Discrete probability distributions

Intro: Discrete distributions

Probability distribution
Function that assigns probabilities to all possible outcomes of a random variable
  • For a discrete probability distribution \(P(X)\), RV \(X\) is discrete, i.e., countable outcomes
  • \(P(X=x) = p(x)\) is the probability for the set of elementary outcomes for which \(X=x\)
  • \(p(x) \geq 0\) and \(\sum_{x_i} p(x_i) = 1\)
  • Discrete prob. distribution is called probability mass function (PMF)

Intro: Discrete distributions

Cumulative distribution function (CDF)
\[ F(x_0) := P(X \leq x_0) = \sum_{x \leq x_0} p(x) \]

Uniform distribution

Uniform distribution
Assigns the same probability \(p(x) = 1/N\) to each of \(N\) possible outcomes \(x\).

e.g., fair die.

Bernoulli distribution

For a binary RV \(X \in \{ 0, 1\}\) (e.g., toss a coin or open/close configuration of an ion channel), the Bernoulli distribution assigns the probability \[ P(x) = \pi^x (1-\pi)^{1-x} \]

such that we observe

  • outcome 1 with probability \(P(X=1) := \pi\)
  • outcome 0 with probability \(P(X=0) := 1-\pi\)

\(\pi\) is the parameter of the distribution.

Bernoulli distribution

We have \[ E[X] = \pi \cdot 1 + (1-\pi) \cdot 0 = \pi = P(X=1) \] \[ \begin{align*} \text{Var}[X] &= E\left[(X-\mu)^2\right] \\ & = E[X^2] - E[X]^2 \\ & = \pi\cdot 1^2 + (1-\pi) \cdot 0^2 - \pi^2 \\ &= \pi(1-\pi) \end{align*} \]
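A minimal numerical check in Python (NumPy), with \(\pi = 0.3\) and the sample size chosen purely for illustration:

```python
# Check E[X] = pi and Var[X] = pi*(1 - pi) by simulation; pi = 0.3 is an example value.
import numpy as np

rng = np.random.default_rng(0)
pi = 0.3
x = rng.binomial(n=1, p=pi, size=100_000)  # Bernoulli = binomial with N = 1

print(x.mean(), pi)             # sample mean ~ E[X] = pi
print(x.var(), pi * (1 - pi))   # sample variance ~ Var[X] = pi*(1 - pi)
```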

Binomial distribution

Binomial distribution
Results from \(N\) identical and independent repetitions of a Bernoulli experiment. Yields a sample space of binary \(N\)-tuples.
Independent and identically distributed (iid)
This property plays an important role in statistics.

Probability of a single \(N\)-tuple with \(k\) 1’s and \((N-k)\) 0’s: \[ \pi^k (1-\pi)^{N-k} \]

Binomial distribution

Typically we are interested in the number \(X\) of hits within the \(N\) observations: indistinguishability!

Sum up all the combinations of hits within the observations. Number of \(N\)-tuples with exactly \(X=k\) hits: \(N \choose k\)

Binomial probability distribution
\[ P(X=k) = {N \choose k} \pi^k (1-\pi)^{N-k} \]

Binomial distribution

Notation

For a RV \(X\) drawn from a binomial distribution with given parameters \(\pi\) and \(N\) we also write \[ X \sim B(\pi, N) \] where the tilde stands for “distributed according to.”

Binomial distribution

Expected value

\[ E[X] = N\pi \]

Variance

\[ \text{Var}[X] = N\pi(1-\pi) \]

Proof

Linearity of expectations and iid property of the \(N\) observations. See also T3.7 in Wackerly, Mendenhall, and Scheaffer (2014).

Binomial distribution

Cumulative distribution function

\[ F(K) = \sum_{k=0}^K {N \choose k} \pi^k (1-\pi)^{N-k} \]
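A short sketch of evaluating the binomial PMF and CDF with scipy.stats; the values of \(N\) and \(\pi\) below are arbitrary examples:

```python
# Binomial PMF, CDF, and moments via scipy.stats; N = 10, pi = 0.4 are example parameters.
from scipy.stats import binom

N, pi = 10, 0.4
dist = binom(N, pi)

print(dist.pmf(3))               # P(X = 3) = C(10, 3) * 0.4**3 * 0.6**7
print(dist.cdf(3))               # F(3) = sum_{k=0}^{3} P(X = k)
print(dist.mean(), dist.var())   # N*pi = 4.0 and N*pi*(1 - pi) = 2.4
```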

Geometric distribution

Similar to binomial: defined on strings of binary outcomes. But only sequences that have all 0’s except for the last entry, which is a 1, i.e., \(A_1 = (1), A_2=(0\ 1), A_3 = (0\ 0\ 1), \dots\)

\(A_k\)
has \((k-1)\) 0’s before the first hit

Intuition

The experiment is repeated until the first hit. It can represent the waiting time for an event when time is discretized into fixed intervals.

Geometric distribution

Probability mass function

Follows from a sequence of \(k\) iid Bernoulli experiments \[ P(A_k) = P(X=k) = (1-\pi)^{k-1}\pi \]

Geometric distribution

Cumulative distribution function

\[ F(k) = 1-(1-\pi)^k \]

Proof

\[ F(k) := P(X\leq k) = \sum_{i=1}^k (1-\pi)^{i-1}\pi = \frac{1-(1-\pi)^k}{1-(1-\pi)} \pi = 1-(1-\pi)^k \]

Geometric distribution

Expectation

\[ E[X] = \frac 1\pi \]

Variance

\[ \text{Var}[X] = \frac{1-\pi}{\pi^2} \]
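For reference, scipy.stats.geom uses the same convention as above (support \(k = 1, 2, \dots\)); a minimal sketch with an example value of \(\pi\):

```python
# Geometric PMF, CDF, and moments; pi = 0.25 is an example parameter.
from scipy.stats import geom

pi = 0.25
dist = geom(pi)

print(dist.pmf(4))   # (1 - pi)**3 * pi
print(dist.cdf(4))   # 1 - (1 - pi)**4
print(dist.mean())   # 1/pi = 4.0
print(dist.var())    # (1 - pi)/pi**2 = 12.0
```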

Hypergeometric distribution

  • Binomial: draws with replacement
  • Hypergeometric: number of successes from a sequence of draws without replacement, e.g., colored balls from an urn, adsorption of specific molecules on a surface

Hypergeometric distribution

Select \(n\) elements from a population of \(N\): each sample point has probability \(1/{N \choose n}\). What is the probability of drawing \(y\) red balls?

  • \(y\) red balls means drawing \(n-y\) white balls
  • Total number of red balls in the urn: \(r\)
  • Product rule:
    • select \(y\) red balls: \(r \choose y\)
    • select \(n-y\) white balls: \(N-r \choose n-y\)


\[ P(y) = \frac{{r \choose y}{N-r \choose n-y}}{N \choose n} \]

Hypergeometric distribution

Expected value

\[ E[Y] = \frac{nr}{N} \]

Variance

\[ \text{Var}[Y] = n\frac{r}{N}\frac{N-r}{N}\frac{N-n}{N-1} \]
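A minimal sketch with scipy.stats.hypergeom; note its argument order (population size, number of successes, sample size), and the urn sizes below are example values:

```python
# Urn with N = 20 balls, r = 7 red, draw n = 5 without replacement (example numbers).
from scipy.stats import hypergeom

N_pop, r, n = 20, 7, 5
dist = hypergeom(N_pop, r, n)   # scipy order: hypergeom(M=population, n=successes, N=draws)

print(dist.pmf(2))    # P(Y = 2) = C(7, 2) * C(13, 3) / C(20, 5)
print(dist.mean())    # n*r/N = 1.75
print(dist.var())     # n * (r/N) * ((N - r)/N) * ((N - n)/(N - 1))
```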

Poisson distribution

Poisson distribution
Derived from the Binomial in the limit of \(N\to \infty\) and \(\pi \to 0\). Applications: counting events (like car accidents or particles hitting a surface) in small time intervals \(\Delta t \to 0\).

\[ P(X=k) = \lim_{N\to\infty} {N \choose k} \pi^k (1-\pi)^{N-k} \]

Poisson distribution

Define \(\lambda := N\pi\)

\[ \begin{align} & \lim_{N\to\infty} {N \choose k} \pi^k (1-\pi)^{N-k} \\ &= \lim_{N\to\infty} \frac{N(N-1)\dots(N-k+1)}{k!}\left(\frac{\lambda}{N} \right)^k\left(1-\frac{\lambda}{N} \right)^{N-k} \\ &= \frac{\lambda^k}{k!} \lim_{N\to\infty} \left(1-\frac{\lambda}{N} \right)^N \left(1-\frac{\lambda}{N} \right)^{-k} \left( 1 - \frac{1}{N} \right) \dots \left( 1 - \frac{k-1}{N} \right) \\ &= \frac{\lambda^k}{k!} \text{e}^{-\lambda} \end{align} \]

Poisson distribution

Expectation & variance

\[ E[X] = \text{Var}[X] = \lambda \]
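The limit can also be checked numerically: the binomial PMF with \(\pi = \lambda/N\) approaches the Poisson PMF as \(N\) grows (example values below):

```python
# Binomial(N, lambda/N) PMF converges to Poisson(lambda) PMF; lam = 3, k = 2 are examples.
from scipy.stats import binom, poisson

lam, k = 3.0, 2
for N in (10, 100, 10_000):
    print(N, binom(N, lam / N).pmf(k))
print("Poisson:", poisson(lam).pmf(k))   # lam**k / k! * exp(-lam)
```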

Continuous probability distributions

Intro: Continuous distributions

Continuous probability distribution
Function that assigns probabilities to the possible outcomes of a random variable

For a continuous probability distribution, the RV \(X\) is continuous, i.e., it has uncountably many possible outcomes.

Example application: amount of rainfall in some area.

Cumulative distribution function

The distribution of a continuous RV is defined via its CDF \[ F(x) := P(X\leq x) = \int_{-\infty}^x \text{d}u\,f(u) \]

  • \(F(x)\) continuously differentiable (except at finitely many points)
  • \(\lim_{x\to-\infty}F(x) = 0\), \(\lim_{x\to\infty}F(x) = 1\)
  • \(F(x)\) must be monotonically non-decreasing, i.e., \(x_1 < x_2 \Rightarrow F(x_1) \leq F(x_2)\)

Probability density function

Assign probability to an interval \((x_0, x_1]\) \[ P(x_0 < X \leq x_1) = F(x_1) - F(x_0) = \int_{x_0}^{x_1} \text{d}u\, f(u) \]

Definition

\[f(x) = \frac{\text{d}F(x)}{\text{d}x}\]

  • \(f(x) \geq 0\)
  • \(\int_{-\infty}^\infty \text{d}x\,f(x) = 1\)

Quantiles

Quantile of the distribution

Quantiles
Cut points dividing the range of a probability distribution into continuous intervals with equal probabilities.
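In scipy.stats, quantiles are obtained from the inverse CDF (the `ppf` method); a minimal sketch for the quartiles of a standard normal:

```python
# Quantiles as the inverse of the CDF; here the quartiles of N(0, 1).
from scipy.stats import norm

for q in (0.25, 0.5, 0.75):
    print(q, norm.ppf(q))        # ppf = percent point function = inverse CDF
print(norm.cdf(norm.ppf(0.75)))  # recovers 0.75
```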

Conditional distribution function

\[ F(x|y) := P(X \leq x | Y = y) = \int_{-\infty}^x \text{d}u\, f_x(u|y) = \int_{-\infty}^x \text{d}u\, \frac{f_x(u, y)}{f_y(y)} \]

Marginal distribution function

\[ F(x) = \int_{-\infty}^\infty \text{d}u\, F(x|u)f_y(u) \]

Independent random variables

For independent RVs \(X\) and \(Y\) we get \[ F(x,y) = F_x(x) F_y(y) \]

\[ f(x,y) = f_x(x)f_y(y) \]

Other properties

Expectation & (co)variance

Defined analogously to the discrete case, replacing sums by integrals.

Rules of linearity

Also hold for the continuous case.

Uniform distribution

\[ f(x) := \left\{ \begin{align} \frac{1}{\theta_2-\theta_1}, & \quad x\in [\theta_1, \theta_2], \theta_1 < \theta_2\\ 0, & \quad \text{else}\\ \end{align} \right. \]

\[ E[X] = \frac{\theta_1 + \theta_2}{2} \]

\[ \text{Var}[X] = \frac{(\theta_2-\theta_1)^2}{12} \]

The uniform distribution is an important reference for generating random numbers and statistical test scenarios.
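A minimal check of these moments by drawing uniform random numbers with NumPy; the interval bounds are example values:

```python
# Sample mean and variance of Uniform[theta1, theta2]; bounds 2.0 and 5.0 are examples.
import numpy as np

rng = np.random.default_rng(1)
theta1, theta2 = 2.0, 5.0
x = rng.uniform(theta1, theta2, size=200_000)

print(x.mean(), (theta1 + theta2) / 2)        # ~3.5
print(x.var(), (theta2 - theta1) ** 2 / 12)   # ~0.75
```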

Gaussian (normal) distribution

For parameters \(-\infty < \mu < \infty\) and \(\sigma > 0\), the normal distribution reads \[ f(x) = \frac{1}{\sqrt{2\pi}\sigma} \text{e}^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \]

\[ E[X] = \mu \]

\[ \text{Var}[X] = \sigma^2 \]

(Proof: substitute the standardized variable \(z = \frac{x-\mu}{\sigma}\) and compute \(E[X] = \int_{-\infty}^\infty \text{d}z \, (\sigma z + \mu) f(z) = \mu\)…)

Standard normal distribution

Notation

Normal distribution often denoted \(\mathcal{N}(\mu, \sigma^2)\)

Standard normal distribution
The variable \(z\) from the previous slide corresponds to zero mean and unit variance: \(\mathcal{N}(0, 1)\)
Transform from standard normal to any normal \(\mathcal{N}(\mu, \sigma^2)\)
\(x = \sigma z + \mu\)
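A minimal sketch of this transform with NumPy; \(\mu\) and \(\sigma\) are example values:

```python
# Draw z ~ N(0, 1) and transform via x = sigma*z + mu to get x ~ N(mu, sigma^2).
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 1.5, 0.5
z = rng.standard_normal(100_000)
x = sigma * z + mu

print(x.mean(), x.std())   # approximately mu = 1.5 and sigma = 0.5
```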

Normal distribution

The normal distribution is symmetric around its mean, which coincides with its mode (unique maximum). As a result, the mean also equals the median.

Beta distribution

Distribution bounded to the interval \(x \in [0,1]\) (e.g., for the success probability of Bernoulli-type experiments) \[ f(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1} \] where we define the Gamma function \[ \Gamma(\alpha) := \int_0^\infty \text{d}x\, x^{\alpha-1}\text{e}^{-x} \]

Beta distribution

Distribution bounded to the interval \(x \in [0,1]\) (e.g., for the success probability of Bernoulli-type experiments) \[ f(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1} \] Notice the similarity with the binomial distribution. The Beta is the conjugate prior for the binomial in a Bayesian approach: when used as a prior distribution and multiplied with a binomial likelihood, it yields a Beta distribution as posterior!

Beta distribution

\[ E[X] = \frac{\alpha}{\alpha+\beta} \]

\[ \text{Var}[X] = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)} \]
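A quick consistency check of these moments with scipy.stats.beta; the shape parameters are example values:

```python
# Beta(alpha, beta) moments; alpha = 2, beta = 5 are example shape parameters.
from scipy.stats import beta

a, b = 2.0, 5.0
dist = beta(a, b)

print(dist.mean(), a / (a + b))                          # alpha / (alpha + beta)
print(dist.var(), a * b / ((a + b) ** 2 * (a + b + 1)))  # matches the formula above
```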

Gamma distribution

An important distribution that produces skewed densities. \[ f(x) = \left\{ \begin{align} \frac{x^{\alpha-1}\text{e}^{-\frac x\beta}}{\beta^\alpha \Gamma(\alpha)}, & \quad 0 \leq x < \infty\\ 0, &\quad \text{else} \\ \end{align} \right. \] where \(\alpha\) affects the shape of the Gamma distribution (shape parameter) and \(\beta\) affects the scale (scale parameter).

Gamma distribution

  • \(\alpha\): shape parameter
  • \(\beta\): scale parameter

Gamma distribution

\[ E[X] = \alpha\beta \]

\[ \text{Var}[X] = \alpha\beta^2 \]

Proof

\[ \begin{align} E[X] &= \int_0^\infty \text{d}x\, x\frac{x^{\alpha-1}\text{e}^{-\frac x\beta}}{\beta^\alpha \Gamma(\alpha)} = \frac 1{\beta^\alpha \Gamma(\alpha)} \int_0^\infty \text{d}x\, x^\alpha \text{e}^{-\frac x\beta} \\ &= \frac 1{\beta^\alpha \Gamma(\alpha)} \beta^{\alpha+1} \Gamma(\alpha+1) = \frac{\beta \alpha\Gamma(\alpha)}{\Gamma(\alpha)} = \alpha\beta \end{align} \] where we use the identity \(\Gamma(\alpha+1) = \alpha\Gamma(\alpha)\).
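The same moments can be checked with scipy.stats.gamma, which also uses a shape and a scale parameter (example values below):

```python
# Gamma(shape=alpha, scale=beta) moments; alpha = 3, beta = 2 are example parameters.
from scipy.stats import gamma

alpha, beta_ = 3.0, 2.0
dist = gamma(alpha, scale=beta_)

print(dist.mean(), alpha * beta_)       # E[X]   = alpha * beta = 6.0
print(dist.var(), alpha * beta_ ** 2)   # Var[X] = alpha * beta**2 = 12.0
```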

Gamma distribution

The Gamma can reduce to other distributions: specific choices of \(\alpha\) and \(\beta\) connect it to sums of Poisson probabilities, the \(\chi^2\) distribution, or the exponential distribution.

Exponential distribution

Start from the Gamma distribution and set \(\alpha=1\): \[ f(x) = \left\{ \begin{align} \frac 1\beta \text{e}^{-\frac x\beta}, & \quad 0 \leq x < \infty\\ 0, &\quad \text{else} \\ \end{align} \right. \] where from above we infer \(E[X]=\sqrt{\text{Var}[X]} = \beta\).

Exponential distribution

  • Applications: decay processes, like the lifetime of electrical components.
  • For events generated according to a Poisson process, the inter-event intervals follow an exponential distribution.
  • Exponential (and geometric) distributions are memoryless: the distribution does not depend on how long we have already been waiting for an event! (See the numerical check after the equation below.)

\[ P(X > t_0 + t_1 | X > t_0) = P(X > t_1) \]
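A minimal simulation of the memoryless property; \(\beta\), \(t_0\), and \(t_1\) are example values:

```python
# Check P(X > t0 + t1 | X > t0) = P(X > t1) for exponential samples (beta = 2 is an example).
import numpy as np

rng = np.random.default_rng(3)
beta_, t0, t1 = 2.0, 1.0, 0.5
x = rng.exponential(scale=beta_, size=1_000_000)

lhs = np.mean(x[x > t0] > t0 + t1)     # conditional survival probability
rhs = np.mean(x > t1)                  # unconditional survival probability
print(lhs, rhs, np.exp(-t1 / beta_))   # all approximately equal (~0.779)
```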

Exponential family distributions & conjugate priors

Exponential family distributions

Definition

The exponential family is a broader class of distributions that share certain properties. We can write such a distribution as \[ p(x|\eta) = \frac{h(x)}{Z(\eta)} \exp[\eta^\intercal T(x)] = h(x) \exp[\eta^\intercal T(x) - A(\eta)] \]

  • \(h(x)\) is a scaling constant
  • \(\eta\) are the natural parameters
  • \(T(x)\) are the sufficient statistics
  • \(A(\eta) = \log Z(\eta)\) is the log partition function

Exponential family: Bernoulli

Rewrite the Bernoulli distribution

\[ \begin{align} P(x|\mu) &= \mu^x (1-\mu)^{1-x} \\ &= \exp[x \log(\mu) + (1-x) \log(1-\mu)] \\ &= \exp\left[x \log\left(\frac\mu{1-\mu}\right) + \log(1-\mu)\right] \\ &= \exp[T(x)\eta - A(\eta)] \end{align} \] such that \(T(x)=x\), \(\eta = \log(\frac \mu{1-\mu})\), \(h(x)=1\).
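A small sketch verifying that the exponential-family form reproduces the Bernoulli PMF; \(\mu = 0.7\) is an example value:

```python
# Bernoulli in exponential-family form: eta = log(mu/(1-mu)), A(eta) = log(1 + exp(eta)).
import numpy as np

mu = 0.7
eta = np.log(mu / (1 - mu))   # natural parameter
A = np.log(1 + np.exp(eta))   # log partition function, equals -log(1 - mu)

for x in (0, 1):
    print(mu**x * (1 - mu)**(1 - x), np.exp(x * eta - A))   # identical values
```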

Log partition function is cumulant generating function

Important property

Exponential family: derivatives of the log partition function generate all the cumulants of the sufficient statistics. For the first and second cumulants we obtain \[ \begin{align*} \nabla A(\eta) =& E[T(x)] \\ \nabla^2 A(\eta) =& \text{Cov}[T(x)] \end{align*} \] From the second equation we conclude that the Hessian is positive definite, and hence \(A(\eta)\) is convex in \(\eta\). Consequently, the log likelihood is concave.
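A numerical check for the Bernoulli case, where \(A(\eta) = \log(1 + e^\eta)\); central finite differences stand in for the gradients, and \(\eta = 0.4\) is an example value:

```python
# dA/deta should equal E[T(x)] = mu and d^2A/deta^2 should equal Var[x] = mu*(1 - mu).
import numpy as np

A = lambda eta: np.log(1 + np.exp(eta))   # Bernoulli log partition function
eta, h = 0.4, 1e-4
mu = 1 / (1 + np.exp(-eta))               # sigmoid(eta)

dA  = (A(eta + h) - A(eta - h)) / (2 * h)
d2A = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2

print(dA, mu)               # first cumulant:  E[x]
print(d2A, mu * (1 - mu))   # second cumulant: Var[x]
```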

Conjugate prior

Consider the subset of (prior, likelihood) pairs for which we can compute the posterior in closed form.

Conjugate prior
a prior \(p(\theta) \in \mathcal{F}\) is a conjugate prior for a likelihood function \(p(\mathcal{D}|\theta)\) if the posterior is in the same parameterized family as the prior, i.e., \(p(\theta | \mathcal{D}) \in \mathcal{F}\).

If the family \(\mathcal{F}\) corresponds to the exponential family, then the computations can be performed in closed form.

Examples: Beta-binomial and Gaussian-Gaussian
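A minimal sketch of the Beta-binomial case: a Beta(\(a, b\)) prior combined with \(k\) successes in \(N\) Bernoulli trials gives a Beta(\(a+k,\ b+N-k\)) posterior (all numbers below are illustrative):

```python
# Conjugate update: Beta prior + binomial likelihood -> Beta posterior.
from scipy.stats import beta

a, b = 2.0, 2.0    # prior pseudo-counts
N, k = 10, 7       # observed: 7 successes in 10 trials

posterior = beta(a + k, b + N - k)
print(posterior.mean())   # (a + k) / (a + b + N) = 9/14
```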

Summary

Summary

Discrete probability distributions
Uniform, Bernoulli, Binomial, Geometric, Hypergeometric, Poisson
Continuous probability distributions
  • prob distr. function as derivative of CDF
  • Uniform, Gaussian (normal), Beta, Gamma, Exponential
Exponential family distributions
  • Log partition function is cumulant generating function
  • Conjugate prior: prior and posterior stay in the same family for a given likelihood

References

Murphy, Kevin P. 2022. Probabilistic Machine Learning: An Introduction. MIT press.
Wackerly, Dennis, William Mendenhall, and Richard L Scheaffer. 2014. Mathematical Statistics with Applications. Cengage Learning.