Lecture 13: Dimensionality reduction
Institute for Theoretical Physics, Heidelberg University
Learn a mapping from the high-dimensional visible space, \(\pmb x \in \mathbb{R}^D\), to a low-dimensional latent space, \(\pmb z \in \mathbb{R}^L\).
The mapping can be deterministic (e.g. PCA, below) or probabilistic (e.g. factor analysis).
PCA
PCA effectively finds a rotation of the data and ranks the resulting directions (eigenvectors) by how much of the data's variance they explain (eigenvalues); recall the empirical covariance matrix from the last slide.
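A minimal numpy sketch of this view on made-up toy data (the variable names below, e.g. `Sigma_hat`, are ours, not notation from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: N points in D dimensions with clearly different variances per axis.
N, D = 500, 3
X = rng.normal(size=(N, D)) * np.array([3.0, 1.0, 0.3])

# Empirical covariance matrix of the centered data (cf. the previous slide).
Xc = X - X.mean(axis=0)
Sigma_hat = Xc.T @ Xc / N

# Eigendecomposition: eigenvectors = principal directions,
# eigenvalues = variance of the data along those directions.
eigvals, eigvecs = np.linalg.eigh(Sigma_hat)   # eigh, since Sigma_hat is symmetric
order = np.argsort(eigvals)[::-1]              # rank directions by explained variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print("variance explained per direction:", eigvals)
```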
Estimate the best 1D solution, \(\pmb w_1 \in \mathbb{R}^D\).
Coefficients for the first basis vector: \(\pmb{\tilde{z}}_1 = [z_{11}, \dots, z_{N1}] \in \mathbb{R}^N\). Reconstruction error: \[ \begin{align*} \mathcal{L}(\pmb w_1, \pmb{\tilde{z}}_1) &= \frac 1N \sum_{n=1}^N (\pmb x_n - z_{n1}\pmb w_1)^\intercal(\pmb x_n - z_{n1} \pmb w_1)\\ &= \frac 1N \sum_{n=1}^N \left[ \pmb x_n^\intercal \pmb x_n -2z_{n1}\pmb w_1^\intercal \pmb x_n + z_{n1}^2 \right] \end{align*} \] since \(\pmb w_1^\intercal \pmb w_1 = 1\) (orthonormality assumption).
Optimization wrt \(z_{n1}\) yields \[ \frac{\partial}{\partial z_{n1}}\mathcal{L}(\pmb w_1, \pmb{\tilde{z}}_1) = \frac 1N [-2 \pmb w_1^\intercal \pmb x_n + 2z_{n1}] = 0 \Rightarrow z_{n1} = \pmb w_1^\intercal \pmb x_n \] So the optimal embedding orthogonally projects the data onto \(\pmb w_1\).
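A quick numerical sanity check of the last two steps, on centered toy data with a randomly drawn unit direction \(\pmb w_1\) (nothing here is from the lecture, it is only a sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 200, 5
X = rng.normal(size=(N, D))
X -= X.mean(axis=0)                       # centered toy data

w1 = rng.normal(size=D)
w1 /= np.linalg.norm(w1)                  # unit norm, so w1^T w1 = 1

def recon_loss(z):
    """(1/N) sum_n ||x_n - z_n w1||^2, the reconstruction error defined above."""
    residual = X - np.outer(z, w1)
    return np.mean(np.sum(residual**2, axis=1))

z_opt = X @ w1                            # z_{n1} = w1^T x_n (orthogonal projection)
z_other = z_opt + 0.1 * rng.normal(size=N)

assert recon_loss(z_opt) < recon_loss(z_other)   # projection coefficients are optimal
print(recon_loss(z_opt), recon_loss(z_other))
```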
Plugging this back into the loss gives \[ \begin{align*} \mathcal{L}(\pmb w_1) &= \frac 1N \sum_{n=1}^N [ \pmb x_n^\intercal \pmb x_n - z_{n1}^2 ] \\ &= \text{const} - \frac 1N \sum_{n=1}^N z_{n1}^2 \\ &= \text{const} -\frac 1N \sum_{n=1}^N \pmb w_1^\intercal \pmb x_n \pmb x_n^\intercal \pmb w_1 \\ &= \text{const} - \pmb w_1^\intercal \pmb{\hat\Sigma}\pmb w_1 \end{align*} \] The last equality holds only if the data are centered (zero mean), so that \(\pmb{\hat\Sigma} = \frac 1N \sum_{n=1}^N \pmb x_n \pmb x_n^\intercal\).
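A short numpy check of the centering caveat, on made-up data with a non-zero mean:

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 300, 4
X = rng.normal(size=(N, D)) + 5.0          # toy data with a non-zero mean
Xc = X - X.mean(axis=0)                    # centered version

w1 = rng.normal(size=D)
w1 /= np.linalg.norm(w1)

Sigma_hat = Xc.T @ Xc / N                  # empirical covariance (centered data)

mean_sq_centered = np.mean((Xc @ w1)**2)   # (1/N) sum_n z_{n1}^2 with centered x_n
mean_sq_raw      = np.mean((X  @ w1)**2)   # same quantity with uncentered x_n

print(mean_sq_centered, w1 @ Sigma_hat @ w1)   # equal
print(mean_sq_raw,      w1 @ Sigma_hat @ w1)   # not equal: data must be centered
```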
Without a constraint, \(\Vert \pmb w_1 \Vert \to \infty\) trivially minimizes the loss, so we impose \(\Vert \pmb w_1 \Vert = 1\) and instead maximize the Lagrangian \[ \tilde{\mathcal{L}}(\pmb w_1) = \pmb w_1^\intercal \pmb{\hat\Sigma}\pmb w_1 - \lambda_1(\pmb w_1^\intercal\pmb w_1 - 1) \] Setting the gradient to zero yields \[ \frac{\partial}{\partial \pmb w_1}\tilde{\mathcal{L}}(\pmb w_1) = 2\pmb{\hat\Sigma}\pmb w_1 - 2 \lambda_1 \pmb w_1 = 0 \] \[ \boxed{ \pmb{\hat\Sigma}\pmb w_1 = \lambda_1 \pmb w_1 } \] Hence the optimal direction onto which to project the data is an eigenvector of the covariance matrix; since the captured variance is \(\pmb w_1^\intercal \pmb{\hat\Sigma}\pmb w_1 = \lambda_1\), it is the eigenvector with the largest eigenvalue.
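A numpy sketch illustrating the boxed result on toy data: the leading eigenvector of \(\pmb{\hat\Sigma}\) captures at least as much variance as any randomly drawn unit direction (the data and the number of random trials are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 400, 6
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))   # correlated toy data
X -= X.mean(axis=0)
Sigma_hat = X.T @ X / N

eigvals, eigvecs = np.linalg.eigh(Sigma_hat)
w1 = eigvecs[:, -1]                        # eigenvector with the largest eigenvalue

def captured_variance(w):
    w = w / np.linalg.norm(w)              # enforce the unit-norm constraint
    return w @ Sigma_hat @ w               # w^T Sigma_hat w

# No random unit direction should capture more variance than the top eigenvector.
best_random = max(captured_variance(rng.normal(size=D)) for _ in range(10_000))
assert captured_variance(w1) >= best_random
print(captured_variance(w1), eigvals[-1], best_random)   # first two coincide
```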
Factor analysis
Factor analysis is the probabilistic counterpart of PCA: a linear-Gaussian latent-variable model whose special case with isotropic noise is known as probabilistic PCA.
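A minimal sketch of this relation, assuming the standard linear-Gaussian formulation (all sizes and parameter values below are invented for illustration): factor analysis allows a diagonal noise covariance, while probabilistic PCA restricts it to be isotropic.

```python
import numpy as np

rng = np.random.default_rng(4)
D, L, N = 5, 2, 100_000

# Linear-Gaussian latent-variable model:  x = W z + mu + eps,  z ~ N(0, I_L).
# Factor analysis:    eps ~ N(0, Psi) with Psi diagonal.
# Probabilistic PCA:  the special case Psi = sigma^2 * I.
W = rng.normal(size=(D, L))
mu = rng.normal(size=D)
psi_diag = rng.uniform(0.1, 0.5, size=D)   # per-dimension noise variances (FA)
sigma2 = 0.2                               # single shared noise variance (PPCA)

Z = rng.normal(size=(N, L))
X_fa   = Z @ W.T + mu + rng.normal(size=(N, D)) * np.sqrt(psi_diag)
X_ppca = Z @ W.T + mu + rng.normal(size=(N, D)) * np.sqrt(sigma2)

# The marginal covariance of x is W W^T + Psi (FA) or W W^T + sigma^2 I (PPCA);
# the sample covariances match these up to Monte-Carlo error.
print(np.abs(np.cov(X_fa.T,   bias=True) - (W @ W.T + np.diag(psi_diag))).max())
print(np.abs(np.cov(X_ppca.T, bias=True) - (W @ W.T + sigma2 * np.eye(D))).max())
```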