Introduction

Toward a distribution of the variance

Chi-squared distribution

Consider $N$ iid Gaussian variables, we define $z = \sum_{i = 1}^{n} \frac{(x_{i} - μ_{i})^{2}}{σ_{i}^{2}} .$ To find the PDF of $z$ we do a variable transformation to spherical coordinates in $N$ dimensions: a radius $r^{2} = z$ and $N - 1$ angles $θ_{i}$ . Integration over the radial and angular components leads to a $χ^{2}$ distribution of the form¹ $f (z; N) = \frac{1}{2^{N / 2} Γ (N / 2)} z^{N / 2} e^{- z / 2} = χ_{N}^{2} (z) .$ Importantly, the first two moments are given by $\begin{aligned} E [Z] & = N \\ Var [Z] & = 2 N \end{aligned}$

Distribution of the sample variance

What is the distribution of the sample variance $S^{2} = \frac{1}{N - 1} \sum_{i = 1}^{N} (x_{i} - \hat{x})^{2}$ ?

We have $\begin{aligned} (N - 1) S^{2} & = \sum_{i = 1}^{N} {[(x_{i} - μ) - (\hat{x} - μ)]}^{2} \\ = \sum_{i = 1}^{N} (x_{i} - μ)^{2} - N (\hat{x} - μ)^{2} \\ (N - 1) \frac{S^{2}}{σ^{2}} & = \underset{χ_{N}^{2}}{\underset{⏟}{\sum_{i = 1}^{N} \frac{(x_{i} - μ)^{2}}{σ^{2}}}} - \underset{χ_{1}^{2}}{\underset{⏟}{\frac{(\hat{x} - μ)^{2}}{σ^{2} / N}}} \end{aligned}$ such that:¹ ² $(N - 1) \frac{S^{2}}{σ^{2}} \sim χ_{N}^{2} - χ_{1}^{2} \sim χ_{N - 1}^{2}$

Distribution of normalized variable ( $t$ -Student distr.)

Design RV of the form $mean / \sqrt{variance}$ : $T = \frac{Z}{\sqrt{X / ν}}$ where $Z \sim N (0, 1)$ and $X \sim χ_{ν}^{2}$ are independent.

Write out the joint probability density function $f_{Z, X} (z, x) = f_{Z} (z) f_{X} (x)$ then introduce a change of variables, $Z \to T = Z / \sqrt{X / ν}$ while keeping $X$ intact. Compute the Jacobian. Compute $f_{T, X} (t, x)$ . Integrate out $X$ . Obtain the pdf of the $t$ -distribution $f_{T} (t) = \frac{Γ (\frac{ν + 1}{2})}{\sqrt{ν π} Γ (\frac{ν}{2})} {(1 + \frac{t^{2}}{ν})}^{- \frac{ν + 1}{2}}$ with $- \infty < t < \infty$ and $ν > 0$ .

Moments of the $t$ -distribution $\begin{aligned} E [T] & = 0 \\ Var [T] & = \frac{ν}{ν - 2} \end{aligned}$

From sum of iid RVs to $t$ -distribution

If we have $N$ iid RVs $x_{i} \sim N (μ, σ^{2})$ , we construct their combination¹ $Z = \frac{\sum_{i = 1}^{N} x_{i} - n μ}{σ \sqrt{N}} = \frac{\hat{x} - μ}{σ / \sqrt{n}} \sim N (0, 1)$ Remember also from the distribution of the sample variance $X = (N - 1) \frac{S^{2}}{σ^{2}} \sim χ_{N - 1}^{2}$

It follows that $T = \frac{Z}{\sqrt{X / (N - 1)}} = \frac{\hat{x} - μ}{S / \sqrt{N}} \sim t -Student (ν = N - 1)$

If we have two datasets, we can form a variable that is approximately $t$ -distributed: $\begin{aligned} T & = \frac{{\hat{x}}_{1} - {\hat{x}}_{2} - (μ_{1} - μ_{2})}{S_{D}} \\ S_{D} & = \sqrt{\frac{S_{1}^{2}}{n_{1}} + \frac{S_{2}^{2}}{n_{2}}} \end{aligned}$

Statistic	PDF
mean	$N (μ, \frac{σ}{\sqrt{N}})$
variance	$χ_{N - 1}^{2}$
$mean / \sqrt{variance}$	$t$ -Student
${variance}_{1} / {variance}_{2}$	$F$ -distribution

Test construction¹

Enunciate a hypothesis $H_{0}$ concerning a parameter of the distribution (e.g., the mean is zero)
“Testing $H_{0}$ :” check whether it is consistent with the data
$p$ -value: probability of obtaining test results at least as extreme as the observed data, under the assumption that $H_{0}$ is true.
Small $p$ -value: such an extreme event would be unlikely under $H_{0}$ .²
If $p$ is smaller than some threshold, $α$ , reject the hypothesis;³ otherwise, we cannot rule it out

Example: coin toss

We flip a coin 100 times. Outcome is ${H : 60; T : 40}$ . Is the coin fair?

Hypothesis ( $H_{0}$ ): The coin is fair (50/50)

Setting: Categorical variables. Outcome of a coin flip is a Binomial. Use $χ^{2}$ to compare observed from expected proportions.

Use moments of the binomial distribution $\begin{aligned} μ & = n p = 50 \\ σ & = \sqrt{n p (1 - p)} = 5 \end{aligned}$ and calculate the z scores for heads and tails $\begin{aligned} z & = \frac{k - μ}{σ} = \frac{60 - 50}{5} = 2 \\ z & = \frac{k - μ}{σ} = \frac{40 - 50}{5} = - 2 \end{aligned}$

Look up CDF of Normal distribution to find that $p = 2 \times 0.0228 = 0.0456$ . Reject?

Example: coin toss

We flip a coin 100 times. Outcome is ${H : 60; T : 40}$ . Is the coin fair?

Hypothesis ( $H_{0}$ ): The coin is fair (50/50)

Setting: Categorical variables. Outcome of a coin flip is a Binomial. Use $χ^{2}$ to compare observed from expected proportions.

${\bar{χ}}_{1}^{2} = \frac{(60 - 50)^{2}}{50} + \frac{(40 - 50)^{2}}{50} = 4.0$

Look up $χ^{2}$ tables to find that $P (χ_{1}^{2} > 4.0) = 0.046$ . Reject?

Student’s $t$ -test: Comparing normalized variables¹

Compare the averages of two groups and determine if the differences between them are more likely to arise from random chance. Consider the variable $T = \frac{\hat{X} - μ}{σ / \sqrt{N}},$ which follows the $t$ -Student distribution with $ν = N - 1$ DoFs.²

Student’s $t$ -test: Example application

Alice’s class: average score = 75 pts., standard deviation = 10 pts.
Bob’s class: average score = 80 pts., standard deviation = 12 pts.
30 students in each class

Hypothesis Testing: Null Hypothesis ( $H_{0}$ ): The average scores in Alice’s and Bob’s classes are the same (the teachers grade equally).

Setup a two-tailed (i.e., two-sided difference) $t$ -test to test $H_{0}$ with significance level 0.05.

F-test: Example application

Effect of fertilizers on plant growth
Fertilizer A: 20 plants, variance in height = 9 cm^2.
Fertilizer B: 20 plants, variance in height = 4 cm^2.

You decide to use an F-test to determine if there’s a significant difference in the variability of plant height between the two fertilizers.

Null Hypothesis (H0): The variances of plant heights are equal ( $σ_{A}^{2} = σ_{B}^{2}$ ).

Steps: Compute $F$ statistic; look up $P (F; 19, 19)$ table; compare to significance level.

Pearson $χ^{2}$ for binned data¹

Consider the case $k = 2$ . $p_{1}$ is simply given by the binomial $P (n_{1}; N, p) = (\binom{N}{n_{1}}) p_{1}^{n_{1}} (1 - p_{1})^{N - n_{1}}$ which yields the moments $⟨ n_{1} ⟩ = N p_{1}$ and $σ^{2} = N p_{1} (1 - p_{1})$ .

Form the standardized variable $Y = \frac{n_{1} - N p_{1}}{\sqrt{N p_{1} (1 - p_{1})}} \sim N (0, 1)$ and therefore $Y^{2} \sim χ_{1}^{2}$ . One can show that $Y^{2}$ can be written as $Y^{2} = \sum_{i = 1}^{2} \frac{(n_{i} - N p_{i})^{2}}{N p_{i}}$

Generalization to $k > 2$ $Y^{2} = \sum_{i = 1}^{k} \frac{(n_{i} - N p_{i})^{2}}{N p_{i}} \sim χ_{k - 1}^{2}$

Reject $H_{0}$ with $1 - α$ confidence level if $χ_{k - 1}^{2}$ is found to lie in the upper $α %$ of the distribution (upper 1-tail test).

Bootstrap¹

Bootstrap: Estimate the variability/uncertainty of a statistic (e.g., mean or median) by resampling our data multiple times

Bag with 100 marbles. Average weight? Focus on subset $N = 10$ .²
Randomly draw 10 marbles with replacement
Compute statistic
Repeat many times (e.g., 1,000x or 10,000x)

Squeeze more information out of a limited dataset

Computational Statistics & Data Analysis (MVComp2)

Introduction

Literature

Recap from last time

Parameter estimation

Frequentist approach

Distribution of the sample mean

Toward a distribution of the variance

Distribution of the sample variance

Distribution of normalized variable ( $t$ -Student distr.)

From sum of iid RVs to $t$ -distribution

$F$ -Distribution of the ratio of two variances

PDF of some statistics¹

Confidence regions¹

Confidence regions: mean and variance

Hypothesis testing

Test construction¹

Example: coin toss

Example: coin toss

Student’s $t$ -test: Comparing normalized variables¹

Student’s $t$ -test: Example application

Student’s $t$ -test: Example application

F-test: Comparing variances¹

F-test: Example application

Examples

Examples

Non-parametric tests

Pearson $χ^{2}$ for binned data¹

Pearson $χ^{2}$ for binned data¹

Bootstrap¹

Summary

Summary

References