Computational Statistics & Data Analysis (MVComp2)

Lecture 14: Information theory

Tristan Bereau

Institute for Theoretical Physics, Heidelberg University

Commercial break

Course evaluation

Please complete by Jan 31, 2024.

https://uni-heidelberg.evasys.de/evasys/online.php?p=DTHA4

Soft Matter Physics

What: MVSem

When: Block seminar, Summer term 2024

Lecturers: Tristan Bereau, Falko Ziebert

Why is it so hard to get ketchup out of its bottle? How do soap bubbles form? Soft matter is the physics of everyday life!

Soft matter systems display unique physics, including fractality, phase transitions, and self-organization. We will discuss the main theoretical concepts needed to describe soft condensed matter systems like polymers, liquid crystals, membranes, complex fluids and colloids.

Machine learning for the biomolecular world

What: MVSem

When: Summer term 2024

Lecturers: Rebecca Wade, Tristan Bereau

Recent developments in machine learning methods have fueled progress in biomolecular simulations.

In this seminar we will explore the recent literature on these efforts ranging from protein structure and dynamics, to drug design, to small molecules. The encoding of physical inductive bias (e.g., symmetries) in the representation or architecture will be one of the core topics.

Introduction

Literature

Murphy (2022)
Chapter 6 on Information theory

Recap from last time

Dimensionality reduction
The original visible space may be too large; we wish to reduce it. Find a mapping to a low-dimensional latent space.
Principal component analysis
Find linear and orthogonal projection. Eigenvectors oriented along the directions of largest variance of the data. Choose the number of dimensions by looking at the fraction of variance explained.
Factor analysis
Generative model. Equivalent to probabilistic PCA.
Autoencoders
Find a nonlinear mapping that encodes/decodes the data between the original and latent spaces. The variational autoencoder is generative.

Entropy

Intuition

Entropy

Measure of uncertainty, or lack of predictability, associated with a random variable drawn from a given distribution.

Information content
Observe a sequence \(X_n \sim p\) generated from distribution \(p\). If \(p\) has high entropy, it will be hard to predict the value of each observation \(X_n\). Hence the dataset has high information content. On the other hand, a distribution with 0 entropy will always yield the same \(X_n\), so the dataset does not contain much information. Link to data compression.

Entropy for discrete random variables

Consider a discrete random variable \(X\) with distribution \(p\) over \(K\) states. The entropy is defined by \[ \mathbb{H}(X) = -\sum_{k=1}^K p(X=k) \log_2 p(X=k) = - E_X[\log_2 p(X)] \]

Maximum entropy
The discrete distribution with maximum entropy is the uniform distribution (see information content in previous slide). Maximum uncertainty.
Minimum entropy
The distribution with minimum entropy (which is zero) is any delta function, putting all its mass on one state. This distribution has no uncertainty.
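As a quick numerical check (a minimal NumPy sketch, not part of Murphy (2022)), the definition above can be evaluated directly; the uniform distribution indeed gives the maximum entropy \(\log_2 K\) and a delta distribution gives zero:

```python
import numpy as np

def entropy(p, base=2):
    """Entropy of a discrete distribution p (in bits by default)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                              # convention: 0 log 0 = 0
    return -np.sum(p * np.log(p)) / np.log(base)

print(entropy([0.25, 0.25, 0.25, 0.25]))      # uniform, K=4 -> 2.0 bits
print(entropy([1.0, 0.0, 0.0, 0.0]))          # delta function -> 0.0 bits
```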

Entropy for a binary random variable

Consider a binary random variable, \(X \in \{0,1\}\), say for coin tossing. We write \(p(X=1) = \theta\) and \(p(X=0) = 1-\theta\). The entropy yields \[ \mathbb{H}(X) = -\left[ \theta \log_2\theta + (1-\theta) \log_2(1-\theta) \right] \] whose maximum value of 1 bit occurs when the coin is fair, \(\theta = \frac 12\). This is the situation of maximum uncertainty. Any other parameter value will yield lower entropy, i.e., reduced uncertainty.
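A small sketch (assuming NumPy) that scans \(\theta\) on a grid confirms that the binary entropy peaks at 1 bit for \(\theta = \frac 12\):

```python
import numpy as np

theta = np.linspace(0.01, 0.99, 99)
H = -(theta * np.log2(theta) + (1 - theta) * np.log2(1 - theta))
print(theta[np.argmax(H)], H.max())           # 0.5, 1.0 bit
```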

Cross entropy

Cross entropy

The cross entropy between distribution \(p\) and \(q\) is defined by \[ \mathbb{H}_\text{ce}(p, q) = - \sum_{k=1}^K p_k \log q_k \]

Intuition
Expected number of bits needed to compress some data samples drawn from distribution \(p\) using a code based on distribution \(q\). This can be minimized by setting \(q=p\), in which case the expected number of bits of the optimal code is \(\mathbb{H}_\text{ce}(p, p) = \mathbb{H}(p)\).
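A minimal sketch of this definition (the distributions p and q are chosen only for illustration), showing that \(\mathbb{H}_\text{ce}(p,q) \ge \mathbb{H}_\text{ce}(p,p) = \mathbb{H}(p)\):

```python
import numpy as np

def cross_entropy(p, q):
    """Cross entropy H_ce(p, q) in bits."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return -np.sum(p * np.log2(q))

p = np.array([0.5, 0.25, 0.25])
q = np.array([0.25, 0.5, 0.25])
print(cross_entropy(p, q))                    # 1.75 bits, larger than H(p)
print(cross_entropy(p, p))                    # 1.50 bits = H(p)
```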

Joint entropy

Joint entropy

The joint entropy of two random variables \(X\) and \(Y\) is defined as \[ \mathbb{H}(X,Y) = - \sum_{x,y} p(x,y) \log_2 p(x,y) \]

Example
Consider choosing an integer from 1 to 8. Let \(X(n)=1\) if \(n\) is even and \(Y(n)=1\) if \(n\) is prime, i.e.,
\(n\) 1 2 3 4 5 6 7 8
\(X\) 0 1 0 1 0 1 0 1
\(Y\) 0 1 1 0 1 0 1 0

The joint distribution is

\(p(X,Y)\) \(Y=0\) \(Y=1\)
\(X=0\) \(\frac 18\) \(\frac 38\)
\(X=1\) \(\frac 38\) \(\frac 18\)

So the joint entropy yields \(\mathbb{H}(X,Y) = -\left[ \frac 18 \log_2 \frac 18 + \dots\right] = 1.81 \text{ bits}\).

On the other hand, the marginal probabilities are uniform, \(p(X=0) = p(X=1) = p(Y=0) = p(Y=1) = 0.5\), so \(\mathbb{H}(X) = \mathbb{H}(Y) = 1\). As such we have \[ \mathbb{H}(X,Y) < \mathbb{H}(X) + \mathbb{H}(Y) \] where \(\mathbb{H}(X) + \mathbb{H}(Y)\) is an upper bound on the joint entropy, attained when \(X\) and \(Y\) are independent.

Lower bound on \(\mathbb{H}(X,Y)\): If \(Y\) is a deterministic function of \(X\), then \(\mathbb{H}(X,Y) = \mathbb{H}(X)\).
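The example can be reproduced with a few lines of NumPy (a sketch, not from the lecture), confirming both the 1.81 bits and the upper bound \(\mathbb{H}(X) + \mathbb{H}(Y) = 2\) bits:

```python
import numpy as np

# joint distribution p(X, Y) from the even/prime example
p_xy = np.array([[1/8, 3/8],                  # rows: X = 0, 1
                 [3/8, 1/8]])                 # columns: Y = 0, 1

H_xy = -np.sum(p_xy * np.log2(p_xy))          # joint entropy
p_x = p_xy.sum(axis=1)                        # marginal p(X)
p_y = p_xy.sum(axis=0)                        # marginal p(Y)
H_x = -np.sum(p_x * np.log2(p_x))
H_y = -np.sum(p_y * np.log2(p_y))
print(H_xy)                                   # ~1.81 bits
print(H_x + H_y)                              # 2.0 bits, the upper bound
```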

Conditional entropy

Conditional entropy

The conditional entropy of \(Y\) given \(X\) is the uncertainty we have in \(Y\) after seeing \(X\), averaged over possible values for \(X\) \[ \begin{align*} \mathbb{H}(Y|X) &= E_{p(X)}[\mathbb{H}(p(Y|X))] \\ &= \sum_x p(x) \mathbb{H}(Y|X=x) \\ &= -\sum_x p(x) \sum_y p(y|x) \log_2 p(y|x) \\ &= -\sum_{x,y} p(x)p(y|x) \log_2 p(y|x) \\ &= -\sum_{x,y} p(x, y) \log_2 \frac{p(x,y)}{p(x)} \\ &= -\sum_{x,y} p(x, y) \log_2 p(x,y) + \sum_x p(x) \log_2 p(x) \\ &= \mathbb{H}(X,Y) - \mathbb{H}(X) \end{align*} \]

  • If \(Y\) is a deterministic function of \(X\), then knowing \(X\) completely determines \(Y\), so that \(\mathbb{H}(Y|X) = 0\).
  • If \(X\) and \(Y\) are independent, knowing \(X\) tells us nothing about \(Y\) and \(\mathbb{H}(Y|X) = \mathbb{H}(Y)\).
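Using the same even/prime joint distribution as before, a short sketch (assuming NumPy) verifies the chain rule \(\mathbb{H}(Y|X) = \mathbb{H}(X,Y) - \mathbb{H}(X)\) against the direct definition:

```python
import numpy as np

p_xy = np.array([[1/8, 3/8],                  # rows: X = 0, 1; columns: Y = 0, 1
                 [3/8, 1/8]])
p_x = p_xy.sum(axis=1)

H_xy = -np.sum(p_xy * np.log2(p_xy))
H_x = -np.sum(p_x * np.log2(p_x))
H_chain = H_xy - H_x                          # via the chain rule

p_y_given_x = p_xy / p_x[:, None]             # p(y|x)
H_direct = -np.sum(p_xy * np.log2(p_y_given_x))
print(H_chain, H_direct)                      # both ~0.81 bits
```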

Relative entropy

Relative entropy

Definition

Relative entropy

Given two distributions \(p\) and \(q\), we want to measure how “close” or “similar” they are. The measure below is not symmetric, so it is a divergence rather than a distance metric.

Kullback-Leibler (KL) divergence

For discrete distributions \[ D_\text{KL}(p \Vert q) = \sum_{k=1}^K p_k \log \frac {p_k}{q_k} \] and analogously for continuous distributions.

Interpretation of the KL divergence

\[ \begin{align*} D_\text{KL}(p \Vert q) &= \sum_{k=1}^K p_k \log \frac {p_k}{q_k} \\ &= \underbrace{- \sum_{k=1}^K p_k \log q_k}_{\mathbb{H}_\text{ce}(p,q)} + \underbrace{\sum_{k=1}^K p_k \log p_k}_{-\mathbb{H}(p)} \end{align*} \] Interpretation: the extra amount of information, or “surprise”, incurred when using \(q\) to approximate \(p\) instead of using \(p\) itself. If \(p=q\), the two terms cancel out and the divergence is 0 (its minimum value).
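A minimal sketch (illustrative distributions only) checking the decomposition \(D_\text{KL}(p \Vert q) = \mathbb{H}_\text{ce}(p,q) - \mathbb{H}(p)\) and the minimum at \(p=q\):

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) in bits for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                              # terms with p_k = 0 contribute 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = np.array([0.5, 0.25, 0.25])
q = np.array([0.25, 0.5, 0.25])
H_ce = -np.sum(p * np.log2(q))
H_p = -np.sum(p * np.log2(p))
print(kl(p, q), H_ce - H_p)                   # identical: 0.25 bits
print(kl(p, p))                               # 0.0, the minimum
```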

Example: KL divergence between two Gaussians

Scalar case: \[ D_\text{KL}(\mathcal{N}( x| \mu_1, \sigma_1^2) \Vert \mathcal{N}( x| \mu_2, \sigma_2^2)) = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma^2_1 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac 12 \]

  • Note that this is indeed not symmetric
  • \(D_\text{KL}(\mathcal{N}( x| \mu_1, \sigma^2_1) \Vert \mathcal{N}( x| \mu_1, \sigma^2_1)) = 0\), as expected
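The closed form is easy to evaluate numerically; a small sketch (the helper kl_gauss is just for illustration) makes the asymmetry explicit:

```python
import numpy as np

def kl_gauss(mu1, s1, mu2, s2):
    """Closed-form D_KL(N(mu1, s1^2) || N(mu2, s2^2)) in nats."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

print(kl_gauss(0.0, 1.0, 1.0, 2.0))           # ~0.44
print(kl_gauss(1.0, 2.0, 0.0, 1.0))           # ~1.31: not symmetric
print(kl_gauss(0.0, 1.0, 0.0, 1.0))           # 0.0 for identical Gaussians
```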

KL divergence and MLE

Objective: Find the distribution \(q\) that is as close as possible to \(p\), as measured by the KL divergence: \[ \begin{align*} q^* &= \arg \min_q D_\text{KL}(p \Vert q) \\ &= \arg \min_q \int \text{d}x\, p(x) \log p(x) - \int \text{d}x\, p(x) \log q(x) \end{align*} \]

Suppose that \(p\) is the empirical distribution, i.e., place probability mass only on the training data and zero mass everywhere else \[ p_\mathcal{D}(x) = \frac 1N \sum_{n=1}^N \delta(x-x_n) \]

Replace \(p(x)\) by \(p_\mathcal{D}(x)\) \[ \begin{align*} D_\text{KL}(p_\mathcal{D} \Vert q) &= - \int \text{d}x\, p(x) \log q(x) + C \\ &= - \int \text{d}x\, \left[ \frac 1N \sum_n \delta(x-x_n) \right] \log q(x) + C \\ &= - \frac 1N \sum_n \log q(x_n) + C \end{align*} \] where \(C = \int \text{d}x\, p(x) \log p(x)\) is a constant independent of \(q\). This is called the cross-entropy objective, and is equal to the average negative log likelihood of \(q\) on the training set.

Minimizing KL divergence to the empirical distribution is equivalent to maximizing likelihood.
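To make the equivalence concrete, here is a sketch (synthetic data and SciPy's generic optimizer, not the lecture's code): minimizing the average negative log likelihood of a Gaussian model recovers the sample mean and standard deviation, i.e., the MLE.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=1000)

def avg_nll(params):
    """Average negative log likelihood of a Gaussian q(x | mu, sigma)."""
    mu, log_sigma = params
    sigma = np.exp(log_sigma)                 # parametrize sigma > 0
    return np.mean(0.5 * ((data - mu) / sigma)**2
                   + np.log(sigma) + 0.5 * np.log(2 * np.pi))

res = minimize(avg_nll, x0=[0.0, 0.0])
print(res.x[0], np.exp(res.x[1]))             # fitted mu, sigma
print(data.mean(), data.std())                # MLE: sample mean and std
```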

KL divergence and MLE

\[ D_\text{KL}(p_\mathcal{D} \Vert q) = - \frac 1N \sum_n \log q(x_n) + C \] So what’s the problem?

Likelihood-based training puts too much weight on the training set. Will lead to generalization issues.

Approaches: smooth the empirical distribution, data augmentation, etc.

Forward and reverse KL

Definitions

Forward KL

\[ D_\text{KL}(p \Vert q) = \int\text{d}x\, p(x) \log \frac {p(x)}{q(x)} \] Minimizing wrt \(q\) is known as a moment projection. If \(p(x)>0\) but \(q(x)=0\), the \(\log\) term will be infinite. This forces \(q\) to include areas where \(p\) is non-zero: \(q\) will be mode-covering and will typically over-estimate the support of \(p\). See panel (a).

Reverse KL

\[ D_\text{KL}(q \Vert p) = \int\text{d}x\, q(x) \log \frac {q(x)}{p(x)} \] Minimizing wrt \(q\) is known as an information projection. In any region where \(p(x)=0\) but \(q(x)>0\), the \(\log\) term will be infinite. This forces \(q\) to exclude all areas where \(p\) has zero probability, placing probability mass in very few parts of space, which is called mode-seeking behavior. \(q\) will typically under-estimate the support of \(p\). See panels (b) and (c).

Figure: blue shows \(p(x)\), red shows \(q(x)\). (a) Minimizing forward KL causes \(q\) to cover \(p\). (b, c) Minimizing reverse KL causes \(q\) to lock onto one mode.
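The mode-covering vs. mode-seeking behavior in the figure can be reproduced numerically. A sketch (grid-based integration, unit-variance Gaussian \(q\), a bimodal mixture \(p\); all choices are illustrative): minimizing the forward KL over the mean of \(q\) places it between the two modes, while the reverse KL locks onto one mode.

```python
import numpy as np
from scipy.stats import norm

# bimodal target p(x): mixture of two well-separated Gaussians
x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]
p = 0.5 * norm.pdf(x, -3, 1) + 0.5 * norm.pdf(x, 3, 1)

def kl(a, b):
    """Grid-based estimate of D_KL(a || b) in nats."""
    mask = a > 1e-300
    return np.sum(a[mask] * np.log(a[mask] / b[mask])) * dx

# single-Gaussian approximations q(x) with unit variance and varying mean
means = np.linspace(-5, 5, 201)
forward = [kl(p, norm.pdf(x, m, 1)) for m in means]   # D_KL(p || q)
reverse = [kl(norm.pdf(x, m, 1), p) for m in means]   # D_KL(q || p)

print(means[np.argmin(forward)])   # ~0: q covers both modes
print(means[np.argmin(reverse)])   # ~-3 (or +3 by symmetry): q picks one mode
```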

Mutual information

Mutual information

Measures the dependence of two random variables, \(X\) and \(Y\), by comparing their joint distribution to the product of their marginals.

Definition

For discrete random variables \[ \mathbb{I}(X;Y) = D_\text{KL}(p(x,y)\Vert p(x)p(y)) = \sum_{y \in Y} \sum_{x \in X} p(x,y) \log \frac {p(x,y)}{p(x)p(y)} \] We achieve the lower bound of 0 iff \(p(x,y) = p(x)p(y)\).

Interpretation

Rewrite as a function of entropies \[ \begin{aligned} \mathbb{I}(X;Y) &= D_\text{KL}(p(x,y)\Vert p(x)p(y)) \\ &= \sum_{y \in Y} \sum_{x \in X} p(x,y) \log \frac {p(x,y)}{p(x)p(y)} \\ &= \sum_{x,y} p(x,y) \log p(x, y) - \sum_x p(x) \log p(x) - \sum_y p(y) \log p(y) \\ &= \mathbb{H}(Y) - \mathbb{H}(Y|X) = \mathbb{H}(X) - \mathbb{H}(X|Y) \end{aligned} \] One can also show the following relation \[ \mathbb{I}(X;Y) = \mathbb{H}(X) + \mathbb{H}(Y) - \mathbb{H}(X, Y) \]
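Returning to the even/prime example from the joint-entropy slide, a short sketch (assuming NumPy) confirms both expressions for \(\mathbb{I}(X;Y)\):

```python
import numpy as np

p_xy = np.array([[1/8, 3/8],
                 [3/8, 1/8]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# I(X;Y) = sum_{x,y} p(x,y) log2[ p(x,y) / (p(x) p(y)) ]
mi = np.sum(p_xy * np.log2(p_xy / np.outer(p_x, p_y)))

H_x = -np.sum(p_x * np.log2(p_x))
H_y = -np.sum(p_y * np.log2(p_y))
H_xy = -np.sum(p_xy * np.log2(p_xy))
print(mi, H_x + H_y - H_xy)                   # both ~0.19 bits
```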

Summary

Summary

Entropy
Measure of uncertainty, or lack of predictability, associated with RV
Information content
Entropy scales with uncertainty: maximal for uniform distribution, minimal for delta function.
Relative entropy
Kullback-Leibler divergence as measure to quantify how far \(p\) is from \(q\). Information gain between two distributions. Minimizing KL corresponds to MLE.
Mutual information
Dependence of two RVs, measured by comparing their joint distribution to the product of their marginals.

References

Murphy, Kevin P. 2022. Probabilistic Machine Learning: An Introduction. MIT press.