Computational Statistics & Data Analysis (MVComp2)

Lecture 14: Information theory

Tristan Bereau

Institute for Theoretical Physics, Heidelberg University

Commercial break

Course evaluation

Please complete by Jan 31, 2024.

https://uni-heidelberg.evasys.de/evasys/online.php?p=DTHA4

Soft Matter Physics

What: MVSem

When: Block seminar, Summer term 2024

Lecturers: Tristan Bereau, Falko Ziebert

Why is it so hard to get ketchup out of its bottle? How do soap bubbles form? Soft matter is the physics of everyday life!

Soft matter systems display unique physics, including fractality, phase transitions, and self-organization. We will discuss the main theoretical concepts needed to describe soft condensed matter systems like polymers, liquid crystals, membranes, complex fluids and colloids.

Machine learning for the biomolecular world

What: MVSem

When: Summer term 2024

Lecturers: Rebecca Wade, Tristan Bereau

Recent developments in machine learning methods have fueled progress in biomolecular simulations.

In this seminar we will explore the recent literature on these efforts ranging from protein structure and dynamics, to drug design, to small molecules. The encoding of physical inductive bias (e.g., symmetries) in the representation or architecture will be one of the core topics.

Introduction

Literature

Murphy (2022)
Chapter 6 on Information theory

Recap from last time

Dimensionality reduction
The original visible space may be too large; we wish to reduce it. Find a mapping to a low-dimensional latent space.
Principal component analysis
Find linear and orthogonal projection. Eigenvectors oriented along the directions of largest variance of the data. Choose the number of dimensions by looking at the fraction of variance explained.
Factor analysis
Generative model. Equivalent to probabilistic PCA.
Autoencoders
Find a nonlinear mapping that encodes/decodes the data between the original and latent spaces. The variational autoencoder is generative.

Entropy

Intuition

Entropy

Measure of uncertainty, or lack of predictability, associated with a random variable drawn from a given distribution.

Information content
Observe a sequence \(X_n \sim p\) generated from distribution \(p\). If \(p\) has high entropy, it will be hard to predict the value of each observation \(X_n\). Hence the dataset has high information content. On the other hand, a distribution with 0 entropy will always yield the same \(X_n\), so the dataset does not contain much information. Link to data compression.

Entropy for discrete random variables

Consider a discrete random variable \(X\) with distribution \(p\) over \(K\) states. The entropy is defined by \[ \mathbb{H}(X) = -\sum_{k=1}^K p(X=k) \log_2 p(X=k) = - E_X[\log_2 p(X)] \]

Maximum entropy
The discrete distribution with maximum entropy is the uniform distribution (see information content in previous slide). Maximum uncertainty.
Minimum entropy
The distribution with minimum entropy (which is zero) is any delta function, putting all its mass on one state. This distribution has no uncertainty.
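As a quick numerical check (a minimal NumPy sketch, not part of Murphy (2022)), the definition above can be evaluated directly; the uniform distribution indeed gives the maximum entropy \(\log_2 K\) and a delta distribution gives zero:

```python
import numpy as np

def entropy(p, base=2):
    """Entropy of a discrete distribution p (in bits by default)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                              # convention: 0 log 0 = 0
    return -np.sum(p * np.log(p)) / np.log(base)

print(entropy([0.25, 0.25, 0.25, 0.25]))      # uniform, K=4 -> 2.0 bits
print(entropy([1.0, 0.0, 0.0, 0.0]))          # delta function -> 0.0 bits
```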

Entropy for a binary random variable

Consider a binary random variable, \(X \in \{0,1\}\), say for coin tossing. We write \(p(X=1) = \theta\) and \(p(X=0) = 1-\theta\). The entropy yields \[ \mathbb{H}(X) = -\left[ \theta \log_2\theta + (1-\theta) \log_2(1-\theta) \right] \] whose maximum value of 1 bit occurs when the coin is fair, \(\theta = \frac 12\). This is the situation of maximum uncertainty. Any other parameter value will yield lower entropy, i.e., reduced uncertainty.
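A small sketch (assuming NumPy) that scans \(\theta\) on a grid confirms that the binary entropy peaks at 1 bit for \(\theta = \frac 12\):

```python
import numpy as np

theta = np.linspace(0.01, 0.99, 99)
H = -(theta * np.log2(theta) + (1 - theta) * np.log2(1 - theta))
print(theta[np.argmax(H)], H.max())           # 0.5, 1.0 bit
```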

Cross entropy

Cross entropy

The cross entropy between distribution \(p\) and \(q\) is defined by \[ \mathbb{H}_\text{ce}(p, q) = - \sum_{k=1}^K p_k \log q_k \]

Intuition
Expected number of bits needed to compress some data samples drawn from distribution \(p\) using a code based on distribution \(q\). This can be minimized by setting \(q=p\), in which case the expected number of bits of the optimal code is \(\mathbb{H}_\text{ce}(p, p) = \mathbb{H}(p)\).
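A minimal sketch of this definition (the distributions p and q are chosen only for illustration), showing that \(\mathbb{H}_\text{ce}(p,q) \ge \mathbb{H}_\text{ce}(p,p) = \mathbb{H}(p)\):

```python
import numpy as np

def cross_entropy(p, q):
    """Cross entropy H_ce(p, q) in bits."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return -np.sum(p * np.log2(q))

p = np.array([0.5, 0.25, 0.25])
q = np.array([0.25, 0.5, 0.25])
print(cross_entropy(p, q))                    # 1.75 bits, larger than H(p)
print(cross_entropy(p, p))                    # 1.50 bits = H(p)
```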

Joint entropy

Joint entropy

The joint entropy of two random variables \(X\) and \(Y\) is defined as \[ \mathbb{H}(X,Y) = - \sum_{x,y} p(x,y) \log_2 p(x,y) \]

Example
Consider choosing an integer from 1 to 8. Let \(X(n)=1\) if \(n\) is even and \(Y(n)=1\) if \(n\) is prime, i.e.,
\(n\) 1 2 3 4 5 6 7 8
\(X\) 0 1 0 1 0 1 0 1
\(Y\) 0 1 1 0 1 0 1 0

The joint distribution is

\(p(X,Y)\) \(Y=0\) \(Y=1\)
\(X=0\) \(\frac 18\) \(\frac 38\)
\(X=1\) \(\frac 38\) \(\frac 18\)

So the joint entropy yields \(\mathbb{H}(X,Y) = -\left[ \frac 18 \log_2 \frac 18 + \dots\right] = 1.81 \text{ bits}\).

On the other hand, the marginal probabilities are uniform, \(p(X=0) = p(X=1) = p(Y=0) = p(Y=1) = 0.5\), so \(\mathbb{H}(X) = \mathbb{H}(Y) = 1\). As such we have \[ \mathbb{H}(X,Y) < \mathbb{H}(X) + \mathbb{H}(Y) \] where \(\mathbb{H}(X) + \mathbb{H}(Y)\) is an upper bound on the joint entropy, attained when \(X\) and \(Y\) are independent.

Lower bound on \(\mathbb{H}(X,Y)\): If \(Y\) is a deterministic function of \(X\), then \(\mathbb{H}(X,Y) = \mathbb{H}(X)\).
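The example can be reproduced with a few lines of NumPy (a sketch, not from the lecture), confirming both the 1.81 bits and the upper bound \(\mathbb{H}(X) + \mathbb{H}(Y) = 2\) bits:

```python
import numpy as np

# joint distribution p(X, Y) from the even/prime example
p_xy = np.array([[1/8, 3/8],                  # rows: X = 0, 1
                 [3/8, 1/8]])                 # columns: Y = 0, 1

H_xy = -np.sum(p_xy * np.log2(p_xy))          # joint entropy
p_x = p_xy.sum(axis=1)                        # marginal p(X)
p_y = p_xy.sum(axis=0)                        # marginal p(Y)
H_x = -np.sum(p_x * np.log2(p_x))
H_y = -np.sum(p_y * np.log2(p_y))
print(H_xy)                                   # ~1.81 bits
print(H_x + H_y)                              # 2.0 bits, the upper bound
```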

Conditional entropy

Conditional entropy

The conditional entropy of \(Y\) given \(X\) is the uncertainty we have in \(Y\) after seeing \(X\), averaged over possible values for \(X\) \[ \begin{align*} \mathbb{H}(Y|X) &= E_{p(X)}[\mathbb{H}(p(Y|X))] \\ &= \sum_x p(x) \mathbb{H}(Y|X=x) \\ &= -\sum_x p(x) \sum_y p(y|x) \log_2 p(y|x) \\ &= -\sum_{x,y} p(x)p(y|x) \log_2 p(y|x) \\ &= -\sum_{x,y} p(x, y) \log_2 \frac{p(x,y)}{p(x)} \\ &= -\sum_{x,y} p(x, y) \log_2 p(x,y) + \sum_x p(x) \log_2 p(x) \\ &= \mathbb{H}(X,Y) - \mathbb{H}(X) \end{align*} \]

  • If \(Y\) is a deterministic function of \(X\), then knowing \(X\) completely determines \(Y\), so that \(\mathbb{H}(Y|X) = 0\).
  • If \(X\) and \(Y\) are independent, knowing \(X\) tells us nothing about \(Y\) and \(\mathbb{H}(Y|X) = \mathbb{H}(Y)\).
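Using the same even/prime joint distribution as before, a short sketch (assuming NumPy) verifies the chain rule \(\mathbb{H}(Y|X) = \mathbb{H}(X,Y) - \mathbb{H}(X)\) against the direct definition:

```python
import numpy as np

p_xy = np.array([[1/8, 3/8],                  # rows: X = 0, 1; columns: Y = 0, 1
                 [3/8, 1/8]])
p_x = p_xy.sum(axis=1)

H_xy = -np.sum(p_xy * np.log2(p_xy))
H_x = -np.sum(p_x * np.log2(p_x))
H_chain = H_xy - H_x                          # via the chain rule

p_y_given_x = p_xy / p_x[:, None]             # p(y|x)
H_direct = -np.sum(p_xy * np.log2(p_y_given_x))
print(H_chain, H_direct)                      # both ~0.81 bits
```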

Relative entropy

Relative entropy

Definition

Relative entropy

Given two distributions \(p\) and \(q\), we want to measure how “close” or “similar” they are. The measure below is not symmetric, so it is a divergence rather than a distance metric.

Kullback-Leibler (KL) divergence

For discrete distributions \[ D_\text{KL}(p \Vert q) = \sum_{k=1}^K p_k \log \frac {p_k}{q_k} \] and analogously for continuous distributions.

Interpretation of the KL divergence

\[ \begin{align*} D_\text{KL}(p \Vert q) &= \sum_{k=1}^K p_k \log \frac {p_k}{q_k} \\ &= \underbrace{- \sum_{k=1}^K p_k \log q_k}_{\mathbb{H}_\text{ce}(p,q)} + \underbrace{\sum_{k=1}^K p_k \log p_k}_{-\mathbb{H}(p)} \end{align*} \] Interpretation: the extra amount of information, or “surprise”, incurred when using \(q\) to approximate \(p\) instead of using \(p\) itself. If \(p=q\), the two terms cancel out and the divergence is 0 (its minimum value).
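A minimal sketch (illustrative distributions only) checking the decomposition \(D_\text{KL}(p \Vert q) = \mathbb{H}_\text{ce}(p,q) - \mathbb{H}(p)\) and the minimum at \(p=q\):

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) in bits for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                              # terms with p_k = 0 contribute 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = np.array([0.5, 0.25, 0.25])
q = np.array([0.25, 0.5, 0.25])
H_ce = -np.sum(p * np.log2(q))
H_p = -np.sum(p * np.log2(p))
print(kl(p, q), H_ce - H_p)                   # identical: 0.25 bits
print(kl(p, p))                               # 0.0, the minimum
```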

Example: KL divergence between two Gaussians

Scalar case: \[ D_\text{KL}(\mathcal{N}( x| \mu_1, \sigma_1^2) \Vert \mathcal{N}( x| \mu_2, \sigma_2^2)) = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma^2_1 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac 12 \]

  • Note that this is indeed not symmetric
  • \(D_\text{KL}(\mathcal{N}( x| \mu_1, \sigma^2_1) \Vert \mathcal{N}( x| \mu_1, \sigma^2_1)) = 0\), as expected
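The closed form is easy to evaluate numerically; a small sketch (the helper kl_gauss is just for illustration) makes the asymmetry explicit:

```python
import numpy as np

def kl_gauss(mu1, s1, mu2, s2):
    """Closed-form D_KL(N(mu1, s1^2) || N(mu2, s2^2)) in nats."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

print(kl_gauss(0.0, 1.0, 1.0, 2.0))           # ~0.44
print(kl_gauss(1.0, 2.0, 0.0, 1.0))           # ~1.31: not symmetric
print(kl_gauss(0.0, 1.0, 0.0, 1.0))           # 0.0 for identical Gaussians
```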

KL divergence and MLE

Objective: Find the distribution \(q\) that is as close as possible to \(p\), as measured by the KL divergence: \[ \begin{align*} q^* &= \arg \min_q D_\text{KL}(p \Vert q) \\ &= \arg \min_q \int \text{d}x\, p(x) \log p(x) - \int \text{d}x\, p(x) \log q(x) \end{align*} \]

Suppose that \(p\) is the empirical distribution, i.e., place probability mass only on the training data and zero mass everywhere else \[ p_\mathcal{D}(x) = \frac 1N \sum_{n=1}^N \delta(x-x_n) \]

Replace \(p(x)\) by \(p_\mathcal{D}(x)\) \[ \begin{align*} D_\text{KL}(p_\mathcal{D} \Vert q) &= - \int \text{d}x\, p(x) \log q(x) + C \\ &= - \int \text{d}x\, \left[ \frac 1N \sum_n \delta(x-x_n) \right] \log q(x) + C \\ &= - \frac 1N \sum_n \log q(x_n) + C \end{align*} \] where \(C = \int \text{d}x\, p(x) \log p(x)\) is a constant independent of \(q\). This is called the cross-entropy objective, and is equal to the average negative log likelihood of \(q\) on the training set.

Minimizing KL divergence to the empirical distribution is equivalent to maximizing likelihood.
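To make the equivalence concrete, here is a sketch (synthetic data and SciPy's generic optimizer, not the lecture's code): minimizing the average negative log likelihood of a Gaussian model recovers the sample mean and standard deviation, i.e., the MLE.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=1000)

def avg_nll(params):
    """Average negative log likelihood of a Gaussian q(x | mu, sigma)."""
    mu, log_sigma = params
    sigma = np.exp(log_sigma)                 # parametrize sigma > 0
    return np.mean(0.5 * ((data - mu) / sigma)**2
                   + np.log(sigma) + 0.5 * np.log(2 * np.pi))

res = minimize(avg_nll, x0=[0.0, 0.0])
print(res.x[0], np.exp(res.x[1]))             # fitted mu, sigma
print(data.mean(), data.std())                # MLE: sample mean and std
```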

KL divergence and MLE

\[ D_\text{KL}(p_\mathcal{D} \Vert q) = - \frac 1N \sum_n \log q(x_n) + C \] So what’s the problem?

Likelihood-based training puts too much weight on the training set. Will lead to generalization issues.

Approaches: smooth the empirical distribution, data augmentation, etc.

Forward and reverse KL

Definitions

Forward KL

\[ D_\text{KL}(p \Vert q) = \int\text{d}x\, p(x) \log \frac {p(x)}{q(x)} \] Minimizing wrt \(q\) is known as a moment projection. If \(p(x)>0\) but \(q(x)=0\), the \(\log\) term will be infinite. This forces \(q\) to include areas where \(p\) is non-zero: \(q\) will be mode-covering and will typically over-estimate the support of \(p\). See panel (a).

Reverse KL

\[ D_\text{KL}(q \Vert p) = \int\text{d}x\, q(x) \log \frac {q(x)}{p(x)} \] Minimizing wrt \(q\) is known as an information projection. In any region where \(p(x)=0\) but \(q(x)>0\), the \(\log\) term will be infinite. This forces \(q\) to exclude all areas where \(p\) has zero probability, placing probability mass in very few parts of space, which is called mode-seeking behavior. \(q\) will typically under-estimate the support of \(p\). See panels (b) and (c).

Figure: blue shows \(p(x)\), red shows \(q(x)\). (a) Minimizing forward KL causes \(q\) to cover \(p\). (b, c) Minimizing reverse KL causes \(q\) to lock onto one mode.
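The mode-covering vs. mode-seeking behavior in the figure can be reproduced numerically. A sketch (grid-based integration, unit-variance Gaussian \(q\), a bimodal mixture \(p\); all choices are illustrative): minimizing the forward KL over the mean of \(q\) places it between the two modes, while the reverse KL locks onto one mode.

```python
import numpy as np
from scipy.stats import norm

# bimodal target p(x): mixture of two well-separated Gaussians
x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]
p = 0.5 * norm.pdf(x, -3, 1) + 0.5 * norm.pdf(x, 3, 1)

def kl(a, b):
    """Grid-based estimate of D_KL(a || b) in nats."""
    mask = a > 1e-300
    return np.sum(a[mask] * np.log(a[mask] / b[mask])) * dx

# single-Gaussian approximations q(x) with unit variance and varying mean
means = np.linspace(-5, 5, 201)
forward = [kl(p, norm.pdf(x, m, 1)) for m in means]   # D_KL(p || q)
reverse = [kl(norm.pdf(x, m, 1), p) for m in means]   # D_KL(q || p)

print(means[np.argmin(forward)])   # ~0: q covers both modes
print(means[np.argmin(reverse)])   # ~-3 (or +3 by symmetry): q picks one mode
```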

Mutual information

Mutual information

Measures the dependence of two random variables, \(X\) and \(Y\), by comparing their joint distribution to the product of their marginals.

Definition

For discrete random variables \[ \mathbb{I}(X;Y) = D_\text{KL}(p(x,y)\Vert p(x)p(y)) = \sum_{y \in Y} \sum_{x \in X} p(x,y) \log \frac {p(x,y)}{p(x)p(y)} \] We achieve the lower bound of 0 iff \(p(x,y) = p(x)p(y)\).

Interpretation

Rewrite as a function of entropies \[ \begin{aligned} \mathbb{I}(X;Y) &= D_\text{KL}(p(x,y)\Vert p(x)p(y)) \\ &= \sum_{y \in Y} \sum_{x \in X} p(x,y) \log \frac {p(x,y)}{p(x)p(y)} \\ &= \sum_{x,y} p(x,y) \log p(x, y) - \sum_x p(x) \log p(x) - \sum_y p(y) \log p(y) \\ &= \mathbb{H}(Y) - \mathbb{H}(Y|X) = \mathbb{H}(X) - \mathbb{H}(X|Y) \end{aligned} \] One can also show the following relation \[ \mathbb{I}(X;Y) = \mathbb{H}(X) + \mathbb{H}(Y) - \mathbb{H}(X, Y) \]
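Returning to the even/prime example from the joint-entropy slide, a short sketch (assuming NumPy) confirms both expressions for \(\mathbb{I}(X;Y)\):

```python
import numpy as np

p_xy = np.array([[1/8, 3/8],
                 [3/8, 1/8]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# I(X;Y) = sum_{x,y} p(x,y) log2[ p(x,y) / (p(x) p(y)) ]
mi = np.sum(p_xy * np.log2(p_xy / np.outer(p_x, p_y)))

H_x = -np.sum(p_x * np.log2(p_x))
H_y = -np.sum(p_y * np.log2(p_y))
H_xy = -np.sum(p_xy * np.log2(p_xy))
print(mi, H_x + H_y - H_xy)                   # both ~0.19 bits
```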

Summary

Summary

Entropy
Measure of uncertainty, or lack of predictability, associated with RV
Information content
Entropy scales with uncertainty: maximal for uniform distribution, minimal for delta function.
Relative entropy
Kullback-Leibler divergence as measure to quantify how far \(p\) is from \(q\). Information gain between two distributions. Minimizing KL corresponds to MLE.
Mutual information
Dependence of two RVs, measured by comparing their joint distribution to the product of their marginals.

References

Murphy, Kevin P. 2022. Probabilistic Machine Learning: An Introduction. MIT press.