Lecture 6: Parameter estimation and hypothesis tests
Institute for Theoretical Physics, Heidelberg University
We momentarily focus on the frequentist approach to explore some classical results. Motivation:
We have \(N\) data points \(x_i\) assumed to be iid Gaussian, \(\mathcal{N}\left(\mu, \sigma\right)\). The sample mean statistic \[ \hat x = \frac 1N \sum_i x_i \] is then also a Gaussian variable, \(\mathcal{N}\left(\mu, \frac{\sigma}{\sqrt N}\right)\). This is exact for Gaussian data; for other iid data it holds asymptotically by the CLT.
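A quick numerical check of this sampling distribution (a minimal sketch; the parameter values are arbitrary assumptions):

```python
# Minimal sketch (assumed values): verify that the sample mean of N iid
# Gaussian draws is distributed as N(mu, sigma/sqrt(N)).
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, N, trials = 2.0, 3.0, 50, 100_000

# One sample mean per trial.
means = rng.normal(mu, sigma, size=(trials, N)).mean(axis=1)

print(means.mean())           # close to mu
print(means.std(ddof=1))      # close to sigma / sqrt(N)
print(sigma / np.sqrt(N))
```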
Chi-squared distribution
Consider \(N\) independent Gaussian variables \(x_i \sim \mathcal{N}(\mu_i, \sigma_i)\), and define \[ z = \sum_{i=1}^N \frac{(x_i - \mu_i)^2}{\sigma_i^2}. \] To find the PDF of \(z\) we do a variable transformation to spherical coordinates in \(N\) dimensions: a radius \(r^2 = z\) and \(N-1\) angles \(\theta_i\). Integrating out the angular components leads to a \(\chi^2\) distribution of the form \[ f(z; N) = \frac 1{2^{N/2}\Gamma(N/2)}z^{N/2-1}\text{e}^{-z/2} = \chi_N^2(z). \] Importantly, the first two moments are given by \[ \begin{align*} E[Z] &= N \\ \text{Var}[Z] &= 2N \end{align*} \]
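As a sanity check, the two moments can be verified by direct simulation (a minimal sketch with arbitrary, assumed \(\mu_i\) and \(\sigma_i\)):

```python
# Minimal sketch (assumed values): check E[Z] = N and Var[Z] = 2N for
# z = sum_i (x_i - mu_i)^2 / sigma_i^2 with independent Gaussian x_i.
import numpy as np

rng = np.random.default_rng(1)
N, trials = 10, 100_000
mu = rng.uniform(-1, 1, N)        # arbitrary per-variable means
sigma = rng.uniform(0.5, 2, N)    # arbitrary per-variable widths

x = rng.normal(mu, sigma, size=(trials, N))
z = (((x - mu) / sigma) ** 2).sum(axis=1)

print(z.mean(), N)            # close to N
print(z.var(ddof=1), 2 * N)   # close to 2N
```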
What is the distribution of the sample variance \(S^2 = \frac 1{N-1}\sum_{i=1}^N (x_i - \hat x)^2\)?
We have \[ \begin{align*} (N-1)S^2 &= \sum_{i=1}^N \left[ (x_i - \mu) - (\hat x - \mu) \right]^2 \\ &= \sum_{i=1}^N (x_i - \mu)^2 - N(\hat x - \mu)^2 \\ (N-1)\frac{S^2}{\sigma^2} &= \underbrace{\sum_{i=1}^N \frac{(x_i - \mu)^2}{\sigma^2}}_{\chi^2_N} - \underbrace{\frac{(\hat x - \mu)^2}{\sigma^2 / N}}_{\chi^2_1} \end{align*} \] such that \[ (N-1) \frac{S^2}{\sigma^2} \sim \chi^2_N - \chi^2_1 \sim \chi^2_{N-1} \] (this subtraction of DoFs is legitimate here because \(\hat x\) and \(S^2\) are independent, a consequence of Cochran's theorem).
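This result is easy to verify numerically (a minimal sketch; the values of \(\mu\), \(\sigma\), and \(N\) are assumptions):

```python
# Minimal sketch (assumed values): compare the simulated distribution of
# (N-1) S^2 / sigma^2 with a chi-squared with N-1 degrees of freedom.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
mu, sigma, N, trials = 0.0, 1.5, 8, 100_000

x = rng.normal(mu, sigma, size=(trials, N))
s2 = x.var(axis=1, ddof=1)            # sample variance S^2
stat = (N - 1) * s2 / sigma**2

# Kolmogorov-Smirnov test against chi2(N-1): a large p-value means the
# simulated statistic is consistent with the claimed distribution.
print(stats.kstest(stat, stats.chi2(df=N - 1).cdf))
```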
Design an RV of the form \(\text{mean}/\sqrt{\text{variance}}\): \[ T = \frac Z{\sqrt{X/\nu}} \] where \(Z \sim \mathcal{N}(0,1)\) and \(X \sim \chi^2_\nu\) are independent.
Write out the joint probability density function \[ f_{Z, X}(z, x) = f_Z(z) f_X(x) \] then introduce a change of variables, \(Z \to T = Z / \sqrt{X/\nu}\), while keeping \(X\) intact. Compute the Jacobian. Compute \(f_{T, X}(t,x)\). Integrate out \(X\). Obtain the PDF of the \(t\)-distribution \[ f_T(t) = \frac{\Gamma(\frac{\nu+1}2)}{\sqrt{\nu\pi}\Gamma(\frac{\nu}2)} \left( 1 + \frac{t^2}\nu\right)^{-\frac{\nu+1}2} \] with \(-\infty<t<\infty\) and \(\nu > 0\).
Moments of the \(t\)-distribution \[ \begin{align*} E[T] &= 0 \quad (\nu > 1) \\ \text{Var}[T] &= \frac \nu{\nu-2} \quad (\nu > 2) \end{align*} \]
If we have \(N\) iid RVs \(x_i \sim \mathcal{N}(\mu, \sigma^2)\), we construct their combination \[ Z = \frac{\sum_{i=1}^N x_i - N\mu}{\sigma \sqrt{N}} = \frac{\hat x - \mu}{\sigma / \sqrt N} \sim \mathcal{N}(0,1) \] Remember also the distribution of the sample variance: \[ X = (N-1) \frac{S^2}{\sigma^2} \sim \chi^2_{N-1} \]
It follows that \[ T = \frac{Z}{\sqrt{X / (N-1)}} = \frac{\hat x - \mu}{S / \sqrt N} \sim \text{Student's } t \, (\nu=N-1) \] (note that \(Z\) and \(X\) are indeed independent, since \(\hat x\) and \(S^2\) are).
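A simulation confirms this (a minimal sketch; the parameter values are assumptions):

```python
# Minimal sketch (assumed values): the statistic (xbar - mu) / (S / sqrt(N))
# built from Gaussian samples should follow Student's t with N-1 DoFs.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
mu, sigma, N, trials = 5.0, 2.0, 6, 100_000

x = rng.normal(mu, sigma, size=(trials, N))
t = (x.mean(axis=1) - mu) / (x.std(axis=1, ddof=1) / np.sqrt(N))

# Large KS p-value -> consistent with t(N-1).
print(stats.kstest(t, stats.t(df=N - 1).cdf))
```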
If we have two datasets, we can form a variable that is approximately \(t\)-distributed: \[ \begin{align} T &= \frac{\hat x_1 - \hat x_2 - (\mu_1 - \mu_2)}{S_D} \\ S_D &= \sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}} \end{align} \] The effective number of DoFs follows from the Welch-Satterthwaite approximation; for equal variances one can use the pooled value \(\nu = n_1 + n_2 - 2\).
Given two independent \(\chi^2\) RVs, \(X\) and \(Y\), with \(\nu_1\) and \(\nu_2\) DoFs, the ratio \[ F = \frac{X / \nu_1}{Y / \nu_2} \] is distributed as \[ P(F; \nu_1, \nu_2) = \frac{\Gamma[(\nu_1+\nu_2)/2]}{\Gamma(\nu_1/2)\Gamma(\nu_2/2)}\left(\frac{\nu_1}{\nu_2}\right)^{\frac{\nu_1}2} \frac{F^{\frac{\nu_1-2}2}}{(1+F\nu_1/\nu_2)^{(\nu_1 + \nu_2)/2}} \]
Statistic | Distribution |
---|---|
mean | \(\mathcal{N}\left(\mu, \frac{\sigma}{\sqrt N}\right)\) |
variance | \(\chi_{N-1}^2\) |
\(\text{mean}/\sqrt{\text{variance}}\) | Student's \(t\) |
\(\text{variance}_1 / \text{variance}_2\) | \(F\)-distribution |
The following slides illustrate how some of these expected distributions can be used.
How likely is it that a given region contains the unknown parameters (e.g., \(\mu\) and \(\sigma\))?
\[ P(\theta_{\alpha/2} < \theta < \theta_{1-\alpha/2}) = 1-\alpha \]
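As a concrete instance, the Student's \(t\) result above gives the standard confidence interval for \(\mu\), \(\hat x \pm t_{1-\alpha/2,\,N-1}\, S/\sqrt N\). A minimal sketch with hypothetical data:

```python
# Minimal sketch (hypothetical data): 95% confidence interval for mu based
# on the t-distributed statistic (xbar - mu) / (S / sqrt(N)).
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(10.0, 2.0, size=20)   # hypothetical dataset
alpha = 0.05

xbar, s, n = x.mean(), x.std(ddof=1), len(x)
tcrit = stats.t.ppf(1 - alpha / 2, df=n - 1)   # quantile t_{1-alpha/2}
print(xbar - tcrit * s / np.sqrt(n), xbar + tcrit * s / np.sqrt(n))
```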
We flip a coin 100 times. Outcome is \(\{H: 60; T: 40\}\). Is the coin fair?
Use moments of the binomial distribution \[ \begin{align} \mu &= np = 50 \\ \sigma &= \sqrt{np(1-p)} = 5 \end{align} \] and calculate the \(z\)-scores for heads and tails \[ \begin{align} z &= \frac{k-\mu}\sigma = \frac{60-50}5 = 2 \\ z &= \frac{k-\mu}\sigma = \frac{40-50}5 = -2 \end{align} \]
Look up the CDF of the normal distribution to find that \(p = 2\times 0.0228 = 0.0456\). Reject?
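The table lookup is easy to reproduce directly:

```python
# Two-sided p-value for the coin example via the normal approximation.
from scipy import stats

z = (60 - 50) / 5
p = 2 * stats.norm.sf(abs(z))   # sf(x) = 1 - CDF(x)
print(z, p)                      # 2.0, ~0.0455
```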
Same experiment, now as a \(\chi^2\) test: we flip a coin 100 times and the outcome is \(\{H: 60; T: 40\}\). Is the coin fair?
\[ \chi^2 = \frac{(60-50)^2}{50} + \frac{(40-50)^2}{50} = 4.0 \]
Look up \(\chi^2\) tables to find that \(P(\chi_1^2 > 4.0) = 0.046\). Reject?
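Again, the table value is easy to reproduce; note that it agrees with the \(z\)-score approach, since a \(\chi^2_1\) variable is the square of a standard normal:

```python
# Upper-tail p-value of the chi-squared statistic with 1 degree of freedom.
from scipy import stats

chi2_stat = (60 - 50)**2 / 50 + (40 - 50)**2 / 50
p = stats.chi2.sf(chi2_stat, df=1)   # P(chi2_1 > 4.0)
print(chi2_stat, p)                   # 4.0, ~0.0455
```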
Compare the averages of two groups and determine whether the difference between them can be attributed to random chance. Consider the variable \[ T = \frac{\hat X - \mu}{ S / \sqrt N}, \] which follows the Student's \(t\) distribution with \(\nu=N-1\) DoFs.
Set up a two-tailed (i.e., two-sided difference) \(t\)-test to test \(H_0\) with significance level 0.05.
\[ T = \frac{ \hat{x}_\text{Alice} - \hat{x}_\text{Bob} }{ \sqrt{ \frac{\hat\sigma_\text{Alice}^2}{N_\text{Alice}} + \frac{\hat\sigma_\text{Bob}^2}{N_\text{Bob}}} } = -1.75 \]
Use the Student's \(t\) distribution with \(\nu = 30+30 - 2 = 58\) DoFs.
Looking up the Student's \(t\) table with 58 DoFs and significance level 0.05 gives critical values \(\pm 2.0\). But \(-1.75\) lies within the interval! We cannot reject the null hypothesis.
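The lookup can be reproduced numerically (a minimal sketch using only the quoted statistic and DoFs):

```python
# Critical value and p-value for the two-sample t example, T = -1.75, nu = 58.
from scipy import stats

T, nu, alpha = -1.75, 58, 0.05
tcrit = stats.t.ppf(1 - alpha / 2, df=nu)   # ~2.0
p = 2 * stats.t.sf(abs(T), df=nu)           # two-sided p-value
print(tcrit, p)   # |T| < tcrit and p > 0.05 -> cannot reject H0
```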
Compare the variances of two independent \(\chi^2\) variables, \(X\) and \(Y\), with \(\nu_1\) and \(\nu_2\) DoFs. Then the variable \[ F = \frac{X/\nu_1}{Y/\nu_2} \] follows the \(F\)-distribution with \((\nu_1, \nu_2)\) DoFs.
You decide to use an \(F\)-test to determine whether there is a significant difference in the variability of plant height between two fertilizers.
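A minimal sketch of such a test, with hypothetical plant-height data (all numbers below are assumptions for illustration):

```python
# Minimal sketch (hypothetical data): F-test comparing two sample variances.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
heights_a = rng.normal(50, 4, size=25)   # hypothetical heights, fertilizer A
heights_b = rng.normal(50, 6, size=30)   # hypothetical heights, fertilizer B

F = heights_a.var(ddof=1) / heights_b.var(ddof=1)
nu1, nu2 = len(heights_a) - 1, len(heights_b) - 1

# Two-sided p-value: twice the smaller tail probability.
p = 2 * min(stats.f.cdf(F, nu1, nu2), stats.f.sf(F, nu1, nu2))
print(F, p)
```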
Consider a sample of \(N\) data \(d_i\), divided into \(k\) mutually exclusive bins containing \(n_i\) data points each. Suppose we know from some theory that a fraction \(p_i\) of the data should fall in bin \(i\).
Consider the case \(k=2\). The count \(n_1\) then follows the binomial distribution \[ P(n_1; N, p_1) = {N \choose n_1} p_1^{n_1}(1-p_1)^{N-n_1}, \] which yields the moments \(\langle n_1 \rangle = Np_1\) and \(\sigma^2 = Np_1(1-p_1)\).
Form the standardized variable \[ Y = \frac{n_1 - Np_1}{\sqrt{Np_1(1-p_1)}} \sim \mathcal{N}(0,1) \quad \text{(approximately, for large } N\text{)} \] and therefore \(Y^2 \sim \chi_1^2\). One can show (using \(n_2 = N - n_1\) and \(p_2 = 1 - p_1\)) that \(Y^2\) can be written as \[ Y^2 = \sum_{i=1}^2 \frac{(n_i - Np_i)^2}{Np_i} \]
Generalization to \(k>2\) \[ Y^2 = \sum_{i=1}^k \frac{(n_i - Np_i)^2}{Np_i} \sim \chi^2_{k-1} \]
Reject \(H_0\) at confidence level \(1-\alpha\) if the statistic falls in the upper \(\alpha\) tail of the \(\chi^2_{k-1}\) distribution (upper one-tailed test).
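A minimal sketch of the full test, with hypothetical bin counts and fractions:

```python
# Minimal sketch (hypothetical counts): Pearson chi-squared goodness-of-fit
# test of k observed bin counts against theoretical fractions p_i.
import numpy as np
from scipy import stats

n = np.array([18, 55, 27])        # observed counts per bin (hypothetical)
p = np.array([0.2, 0.5, 0.3])     # theoretical fractions p_i
N = n.sum()

y2 = ((n - N * p) ** 2 / (N * p)).sum()          # the Y^2 statistic
p_value = stats.chi2.sf(y2, df=len(n) - 1)       # upper one-tail probability
print(y2, p_value)

# Equivalently, in one call:
print(stats.chisquare(n, N * p))
```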
Squeeze more information out of a limited dataset