Lecture 14: Information theory
Institute for Theoretical Physics, Heidelberg University
Entropy
Measure of uncertainty, or lack of predictability, associated with a random variable drawn from a given distribution.
Consider a discrete random variable \(X\) with distribution \(p\) over \(K\) states. The entropy is defined by \[ \mathbb{H}(X) = -\sum_{k=1}^K p(X=k) \log_2 p(X=k) = - E_X[\log_2 p(X)] \]
Consider a binary random variable, \(X \in \{0,1\}\), say for coin tossing. We write \(p(X=1) = \theta\) and \(p(X=0) = 1-\theta\). The entropy yields \[ \mathbb{H}(X) = -\left[ \theta \log_2\theta + (1-\theta) \log_2(1-\theta) \right] \] whose maximum value of 1 bit occurs when the coin is fair, \(\theta = \frac 12\): the situation of maximum uncertainty. Any other parameter value yields lower entropy, i.e., reduced uncertainty.
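A minimal numerical sketch (assuming NumPy is available; the helper `binary_entropy` is our own, not from the lecture) confirming that the binary entropy peaks at 1 bit for a fair coin:

```python
import numpy as np

def binary_entropy(theta):
    """Entropy (in bits) of a Bernoulli(theta) variable; 0 log 0 := 0."""
    p = np.array([theta, 1.0 - theta])
    p = p[p > 0]                      # drop zero-probability states
    return -np.sum(p * np.log2(p))

thetas = np.linspace(0.0, 1.0, 101)
H = np.array([binary_entropy(t) for t in thetas])
print(thetas[np.argmax(H)], H.max())   # ~0.5, 1.0: a fair coin maximizes uncertainty
print(binary_entropy(0.9))             # ~0.47 bits: a biased coin is more predictable
```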
Cross entropy
The cross entropy between distributions \(p\) and \(q\) is defined by \[ \mathbb{H}_\text{ce}(p, q) = - \sum_{k=1}^K p_k \log q_k \]
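A small sketch (NumPy, with made-up toy distributions) showing that the cross entropy of \(p\) with itself reduces to \(\mathbb{H}(p)\), while approximating \(p\) with any other \(q\) costs more:

```python
import numpy as np

def cross_entropy(p, q):
    """H_ce(p, q) = -sum_k p_k log q_k (in nats here; use log2 for bits)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return -np.sum(p * np.log(q))

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])
print(cross_entropy(p, p))   # ~0.693 = H(p): cross entropy of p with itself is its entropy
print(cross_entropy(p, q))   # ~1.204 > H(p): using the wrong distribution costs extra
```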
Joint entropy
The joint entropy of two random variables \(X\) and \(Y\) is defined as \[ \mathbb{H}(X,Y) = - \sum_{x,y} p(x,y) \log_2 p(x,y) \]
As an example, consider the following \(N=8\) paired observations of \(X\) and \(Y\):
\(n\) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|---|
\(X\) | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
\(Y\) | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 |
The joint distribution is
\(p(X,Y)\) | \(Y=0\) | \(Y=1\) |
---|---|---|
\(X=0\) | \(\frac 18\) | \(\frac 38\) |
\(X=1\) | \(\frac 38\) | \(\frac 18\) |
So the joint entropy yields \(\mathbb{H}(X,Y) = -\left[ \frac 18 \log_2 \frac 18 + \dots\right] = 1.81 \text{ bits}\).
On the other hand, the marginal probabilities are uniform, \(p(X=0) = p(X=1) = p(Y=0) = p(Y=1) = 0.5\), so \(\mathbb{H}(X) = \mathbb{H}(Y) = 1\) bit. As such we have \[ \mathbb{H}(X,Y) < \mathbb{H}(X) + \mathbb{H}(Y), \] consistent with the general upper bound \(\mathbb{H}(X,Y) \le \mathbb{H}(X) + \mathbb{H}(Y)\), with equality when \(X\) and \(Y\) are independent.
Lower bound on \(\mathbb{H}(X,Y)\): If \(Y\) is a deterministic function of \(X\), then \(\mathbb{H}(X,Y) = \mathbb{H}(X)\).
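A quick numerical check (NumPy) of the example above, including the upper bound \(\mathbb{H}(X,Y) \le \mathbb{H}(X) + \mathbb{H}(Y)\):

```python
import numpy as np

# Joint distribution from the table above: rows are X=0,1; columns are Y=0,1.
p_xy = np.array([[1/8, 3/8],
                 [3/8, 1/8]])

H_xy = -np.sum(p_xy * np.log2(p_xy))   # joint entropy
p_x = p_xy.sum(axis=1)                 # marginal over Y -> p(X)
p_y = p_xy.sum(axis=0)                 # marginal over X -> p(Y)
H_x = -np.sum(p_x * np.log2(p_x))
H_y = -np.sum(p_y * np.log2(p_y))

print(round(H_xy, 2))        # 1.81 bits
print(H_x, H_y)              # 1.0 1.0 (uniform marginals)
print(H_xy <= H_x + H_y)     # True: the joint entropy never exceeds the sum of marginals
```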
Conditional entropy
The conditional entropy of \(Y\) given \(X\) is the uncertainty we have in \(Y\) after seeing \(X\), averaged over possible values for \(X\) \[ \begin{align*} \mathbb{H}(Y|X) &= E_{p(X)}[\mathbb{H}(p(Y|X))] \\ &= \sum_x p(x) \mathbb{H}(Y|X=x) \\ &= -\sum_x p(x) \sum_y p(y|x) \log_2 p(y|x) \\ &= -\sum_{x,y} p(x)p(y|x) \log_2 p(y|x) \\ &= -\sum_{x,y} p(x, y) \log_2 \frac{p(x,y)}{p(x)} \\ &= -\sum_{x,y} p(x, y) \log_2 p(x,y) + \sum_x p(x) \log_2 p(x) \\ &= \mathbb{H}(X,Y) - \mathbb{H}(X) \end{align*} \]
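A short check (NumPy, reusing the joint distribution from the example above) that the direct definition of \(\mathbb{H}(Y|X)\) agrees with \(\mathbb{H}(X,Y) - \mathbb{H}(X)\):

```python
import numpy as np

# Same joint distribution as in the joint-entropy example above.
p_xy = np.array([[1/8, 3/8],
                 [3/8, 1/8]])
p_x = p_xy.sum(axis=1)

# H(Y|X) computed directly from its definition ...
p_y_given_x = p_xy / p_x[:, None]
H_y_given_x = -np.sum(p_xy * np.log2(p_y_given_x))

# ... and via the identity H(Y|X) = H(X,Y) - H(X).
H_xy = -np.sum(p_xy * np.log2(p_xy))
H_x = -np.sum(p_x * np.log2(p_x))
print(np.isclose(H_y_given_x, H_xy - H_x))   # True
print(round(H_y_given_x, 2))                 # ~0.81 bits left in Y after observing X
```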
Kullback–Leibler (KL) divergence
Definition
Given two distributions \(p\) and \(q\), we want to measure how “close” or “similar” they are. We use the notion of a divergence rather than a distance metric, since the measure is not symmetric.
For discrete distributions \[ D_\text{KL}(p \Vert q) = \sum_{k=1}^K p_k \log \frac {p_k}{q_k} \] and analogously for continuous distributions.
\[ \begin{align*} D_\text{KL}(p \Vert q) &= \sum_{k=1}^K p_k \log \frac {p_k}{q_k} \\ &= \underbrace{- \sum_{k=1}^K p_k \log q_k}_{\mathbb{H}_\text{ce}(p,q)} + \underbrace{\sum_{k=1}^K p_k \log p_k}_{-\mathbb{H}(p)} \end{align*} \] Interpretation: the extra amount of information, or “surprise”, incurred when using \(q\) to approximate \(p\) instead of using \(p\)’s true distribution. If \(p=q\), the two terms cancel and the divergence attains its minimum value of 0.
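A minimal numerical check (NumPy, toy distributions chosen arbitrarily) of the decomposition \(D_\text{KL}(p \Vert q) = \mathbb{H}_\text{ce}(p,q) - \mathbb{H}(p)\):

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) for discrete distributions (in nats)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log(p / q))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

cross_entropy = -np.sum(p * np.log(q))
entropy_p = -np.sum(p * np.log(p))
print(np.isclose(kl(p, q), cross_entropy - entropy_p))   # True: D_KL = H_ce(p,q) - H(p)
print(kl(p, p))                                          # 0.0: the divergence vanishes iff q = p
```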
Scalar Gaussian case: \[ D_\text{KL}(\mathcal{N}( x| \mu_1, \sigma_1^2) \Vert \mathcal{N}( x| \mu_2, \sigma_2^2)) = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma^2_1 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac 12 \]
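A sketch (NumPy; the parameter values are arbitrary) comparing this closed form against a Monte Carlo estimate of \(E_{p}[\log p(x) - \log q(x)]\):

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0

# Closed form for two scalar Gaussians.
kl_exact = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

# Monte Carlo estimate: D_KL = E_{x~p}[log p(x) - log q(x)].
x = rng.normal(mu1, s1, size=1_000_000)
log_p = -0.5 * ((x - mu1) / s1) ** 2 - np.log(s1 * np.sqrt(2 * np.pi))
log_q = -0.5 * ((x - mu2) / s2) ** 2 - np.log(s2 * np.sqrt(2 * np.pi))
print(kl_exact, np.mean(log_p - log_q))   # the two values agree up to Monte Carlo error (~1e-3)
```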
Objective: Find the distribution \(q\) that is as close as possible to \(p\), as measured by the KL divergence: \[ \begin{align*} q^* &= \arg \min_q D_\text{KL}(p \Vert q) \\ &= \arg \min_q \int \text{d}x\, p(x) \log p(x) - \int \text{d}x\, p(x) \log q(x) \end{align*} \]
Suppose that \(p\) is the empirical distribution, i.e., place probability mass only on the training data and zero mass everywhere else \[ p_\mathcal{D}(x) = \frac 1N \sum_{n=1}^N \delta(x-x_n) \]
Replace \(p(x)\) by \(p_\mathcal{D}(x)\) \[ \begin{align*} D_\text{KL}(p_\mathcal{D} \Vert q) &= - \int \text{d}x\, p_\mathcal{D}(x) \log q(x) + C \\ &= - \int \text{d}x\, \left[ \frac 1N \sum_n \delta(x-x_n) \right] \log q(x) + C \\ &= - \frac 1N \sum_n \log q(x_n) + C \end{align*} \] where \(C = \int \text{d}x\, p_\mathcal{D}(x) \log p_\mathcal{D}(x)\) is a constant independent of \(q\). This is called the cross-entropy objective, and is equal to the average negative log likelihood of \(q\) on the training set.
Minimizing KL divergence to the empirical distribution is equivalent to maximizing likelihood.
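A small illustration (NumPy; the Gaussian model and the synthetic data set are our own choices, not part of the lecture): for a Gaussian \(q\), the maximum-likelihood parameters, i.e., the sample mean and standard deviation, minimize the average negative log likelihood:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=0.5, size=1000)   # synthetic "training set"

def avg_nll(mu, sigma, x):
    """Average negative log likelihood of q = N(mu, sigma^2) on the data x,
    i.e. the cross-entropy objective D_KL(p_D || q) up to the constant C."""
    return np.mean(0.5 * ((x - mu) / sigma) ** 2 + np.log(sigma * np.sqrt(2 * np.pi)))

# The maximum-likelihood parameters are the sample mean and standard deviation ...
mu_mle, sigma_mle = data.mean(), data.std()
# ... and no other parameter choice achieves a lower average NLL.
print(avg_nll(mu_mle, sigma_mle, data))
print(avg_nll(mu_mle + 0.1, sigma_mle, data))    # larger
print(avg_nll(mu_mle, 1.5 * sigma_mle, data))    # larger
```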
\[ D_\text{KL}(p_\mathcal{D} \Vert q) = - \frac 1N \sum_n \log q(x_n) + C \] So what’s the problem?
Likelihood-based training puts all of its weight on the training set, which leads to generalization issues.
Possible remedies: smooth the empirical distribution, data augmentation, etc.
Forward and reverse KL
Definitions
\[ D_\text{KL}(p \Vert q) = \int\text{d}x\, p(x) \log \frac {p(x)}{q(x)} \] Minimizing with respect to \(q\) is known as a moment projection. If \(p(x)>0\) but \(q(x)=0\), the \(\log\) term will be infinite. This forces \(q\) to include all areas where \(p\) is non-zero: \(q\) will be mode-covering and will typically over-estimate the support of \(p\). See panel (a).
\[ D_\text{KL}(q \Vert p) = \int\text{d}x\, q(x) \log \frac {q(x)}{p(x)} \] Minimizing with respect to \(q\) is known as an information projection. In any region where \(p(x)=0\) but \(q(x)>0\), the \(\log\) term will be infinite. This forces \(q\) to exclude all areas where \(p\) has zero probability, so \(q\) places its probability mass in very few parts of space, a behavior called mode-seeking. It will typically under-estimate the support of \(p\). See panels (b) and (c).
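The following sketch (NumPy/SciPy; the bimodal target, grid integration, and optimizer settings are illustrative assumptions) fits a single Gaussian \(q\) to a two-mode \(p\) by numerically minimizing each direction of the KL divergence, reproducing the mode-covering vs. mode-seeking behavior described above:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Target p: a bimodal mixture of two well-separated Gaussians (hypothetical example).
x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]
p = 0.5 * norm.pdf(x, -3, 0.8) + 0.5 * norm.pdf(x, 3, 0.8)
eps = 1e-300  # guard against log(0) in the numerical integrals

def q_pdf(params):
    mu, log_sigma = params
    return norm.pdf(x, mu, np.exp(log_sigma))

def forward_kl(params):   # D_KL(p || q): moment projection, mode-covering
    q = q_pdf(params)
    return np.sum(p * (np.log(p + eps) - np.log(q + eps))) * dx

def reverse_kl(params):   # D_KL(q || p): information projection, mode-seeking
    q = q_pdf(params)
    return np.sum(q * (np.log(q + eps) - np.log(p + eps))) * dx

# Start the optimizer near one of the modes.
res_f = minimize(forward_kl, x0=[2.0, 0.0], method="Nelder-Mead")
res_r = minimize(reverse_kl, x0=[2.0, 0.0], method="Nelder-Mead")
print("forward KL fit:", res_f.x[0], np.exp(res_f.x[1]))  # broad q centered near 0 (covers both modes)
print("reverse KL fit:", res_r.x[0], np.exp(res_r.x[1]))  # narrow q locked onto the mode near +3
```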
Mutual information
The mutual information measures the dependence of two random variables, \(X\) and \(Y\), through how different their joint distribution is from the product of their marginals.
Definition
For discrete random variables \[ \mathbb{I}(X;Y) = D_\text{KL}(p(x,y)\Vert p(x)p(y)) = \sum_{y \in Y} \sum_{x \in X} p(x,y) \log \frac {p(x,y)}{p(x)p(y)} \] We achieve the lower bound of 0 iff \(p(x,y) = p(x)p(y)\).
Rewrite as a function of entropies \[ \begin{aligned} \mathbb{I}(X;Y) &= D_\text{KL}(p(x,y)\Vert p(x)p(y)) \\ &= \sum_{y \in Y} \sum_{x \in X} p(x,y) \log \frac {p(x,y)}{p(x)p(y)} \\ &= \sum_{x,y} p(x,y) \log p(x, y) - \sum_x p(x) \log p(x) - \sum_y p(y) \log p(y) \\ &= \mathbb{H}(X) + \mathbb{H}(Y) - \mathbb{H}(X, Y) \\ &= \mathbb{H}(Y) - \mathbb{H}(Y|X) = \mathbb{H}(X) - \mathbb{H}(X|Y), \end{aligned} \] where the last line uses \(\mathbb{H}(X,Y) = \mathbb{H}(X) + \mathbb{H}(Y|X) = \mathbb{H}(Y) + \mathbb{H}(X|Y)\).
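A final numerical check (NumPy, reusing the joint distribution from the joint-entropy example) of both expressions for the mutual information:

```python
import numpy as np

# Joint distribution from the joint-entropy example earlier.
p_xy = np.array([[1/8, 3/8],
                 [3/8, 1/8]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# Mutual information as D_KL(p(x,y) || p(x)p(y)), in bits.
I = np.sum(p_xy * np.log2(p_xy / np.outer(p_x, p_y)))

# Entropy-based identity I(X;Y) = H(X) + H(Y) - H(X,Y).
H = lambda p: -np.sum(p * np.log2(p))
print(round(I, 3))                                 # ~0.189 bits of shared information
print(np.isclose(I, H(p_x) + H(p_y) - H(p_xy)))    # True
```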