Computational Statistics & Data Analysis (MVComp2)

Lecture 13: Dimensionality reduction

Tristan Bereau

Institute for Theoretical Physics, Heidelberg University

Course evaluation

Please complete by Jan 31, 2024.

https://uni-heidelberg.evasys.de/evasys/online.php?p=DTHA4

Introduction

Literature

Murphy (2022)
Chapter 20 on Dimensionality Reduction

Recap from last time

Decision theory
How do we choose between multiple candidate models?
Bayes factor
Compare models according to their marginal likelihoods (Bayes factor)
Occam’s razor
When unsure, choose the simpler model (fewer parameters, fewer assumptions, etc.)
Bayesian Information Criterion (BIC)
Tradeoff between maximizing likelihood and minimizing model complexity

Idea

Idea

Learn a mapping from the high-dimensional visible space, \(\pmb x \in \mathbb{R}^D\), to a low-dimensional latent space, \(\pmb z \in \mathbb{R}^L\).

Mapping can be

  • a parametric model, \(\pmb z = f(\pmb x, \pmb \theta)\)
  • nonparametric, where we compute an embedding \(\pmb z_n\) for each input \(\pmb x_n\) in the data set, but not for any other point.

Principal component analysis

Principal component analysis

PCA

PCA
Simplest and most widely used form of dimensionality reduction.
Idea
Find a linear and orthogonal projection of the high-dimensional data. We project or encode \(\pmb x\) to get \(\pmb z = {\bf W}^\intercal \pmb x\), and unproject or decode \(\pmb z\) to get \(\pmb{\hat x} = {\bf W}\pmb z\).
Optimization
Intuitively we want \(\pmb{\hat x}\) to be close to the original \(\pmb x\) in \(\ell_2\) distance. Define the reconstruction error \[ \mathcal{L}({\bf W}) = \frac 1N \sum_{n=1}^N \Vert \pmb x_n - \text{decode}(\text{encode}(\pmb x_n; {\bf W}); {\bf W}) \Vert_2^2 \]
Linear problem
This turns out to be equivalent to diagonalizing the empirical covariance matrix \[ \hat\Sigma = \frac 1N \sum_{n=1}^N (\pmb x_n - \pmb{\bar x})(\pmb x_n - \pmb{\bar x})^\intercal = \frac 1N {\bf X}^\intercal_\text{c}{\bf X}_\text{c} \] where \({\bf X}_\text{c}\) is a centered version of the \(N\times D\) design matrix.
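A minimal NumPy sketch of this recipe, assuming toy Gaussian data and an illustrative choice of \(L=2\): center the design matrix, diagonalize the empirical covariance, and use the top eigenvectors as \({\bf W}\) to encode and decode.

```python
import numpy as np

# Minimal PCA sketch on toy data; sizes and names are illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))   # N x D design matrix

# Center the data and form the empirical covariance matrix.
Xc = X - X.mean(axis=0)
Sigma_hat = Xc.T @ Xc / Xc.shape[0]

# Eigendecomposition; eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(Sigma_hat)
order = np.argsort(eigvals)[::-1]
L = 2                                    # number of latent dimensions
W = eigvecs[:, order[:L]]                # D x L projection matrix

# Encode (project) and decode (reconstruct).
Z = Xc @ W                               # N x L latent codes
X_hat = Z @ W.T                          # N x D reconstruction
recon_error = np.mean(np.sum((Xc - X_hat) ** 2, axis=1))

# The reconstruction error equals the sum of the discarded eigenvalues
# (up to numerical error).
print(recon_error, eigvals[order[L:]].sum())
```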

PCA examples

PCA effectively finds a rotation of the data: the eigenvectors of the empirical covariance matrix (previous slide) give the directions, and the corresponding eigenvalues rank them by how much of the data's variance they capture.

Derivation

Setting
Consider an unlabeled dataset \(\mathcal{D} = \{ \pmb x_n : n=1: N\}\), where \(\pmb x_n \in \mathbb{R}^D\). Represent this as an \(N \times D\) matrix \({\bf X}\). Assume that the data is centered, i.e., \(\pmb{\bar x} = \frac 1N \sum_{n=1}^N \pmb x_n = 0\).
Optimization
Minimize distance between \({\bf X}\) and reconstructed data \[ \mathcal{L}({\bf W}, {\bf Z}) = \frac 1N \Vert {\bf X} - {\bf ZW}^\intercal \Vert^2 = \frac 1N \sum_{n=1}^N \Vert \pmb x_n - {\bf W}\pmb z_n\Vert^2 \]

Derivation (1D case)

Estimate the best 1D solution, \(\pmb w_1 \in \mathbb{R}^D\).

Coefficients for the first basis vector: \(\pmb{\tilde{z}}_1 = [z_{11}, \dots, z_{N1}] \in \mathbb{R}^N\). Reconstruction error: \[ \begin{align*} \mathcal{L}(\pmb w_1, \pmb{\tilde{z}}_1) &= \frac 1N \sum_{n=1}^N (\pmb x_n - z_{n1}\pmb w_1)^\intercal(\pmb x_n - z_{n1} \pmb w_1)\\ &= \frac 1N \sum_{n=1}^N \left[ \pmb x_n^\intercal \pmb x_n -2z_{n1}\pmb w_1^\intercal \pmb x_n + z_{n1}^2 \right] \end{align*} \] since \(\pmb w_1^\intercal \pmb w_1 = 1\) (orthonormality assumption).

Optimization wrt \(z_{n1}\) yields \[ \frac{\partial}{\partial z_{n1}}\mathcal{L}(\pmb w_1, \pmb{\tilde{z}}_1) = \frac 1N [-2 \pmb w_1^\intercal \pmb x_n + 2z_{n1}] = 0 \Rightarrow z_{n1} = \pmb w_1^\intercal \pmb x_n \] So the optimal embedding orthogonally projects the data onto \(\pmb w_1\).

Plugging this back into the loss: \[ \begin{align*} \mathcal{L}(\pmb w_1) &= \frac 1N \sum_{n=1}^N [ \pmb x_n^\intercal \pmb x_n - z_{n1}^2 ] \\ &= \text{const} - \frac 1N \sum_{n=1}^N z_{n1}^2 \\ &= \text{const} -\frac 1N \sum_{n=1}^N \pmb w_1^\intercal \pmb x_n \pmb x_n^\intercal \pmb w_1 \\ &= \text{const} - \pmb w_1^\intercal \pmb{\hat\Sigma}\pmb w_1 \end{align*} \] where the last equality uses \(z_{n1} = \pmb w_1^\intercal \pmb x_n\) and holds because the data is centered.

Since \(\Vert \pmb w_1 \Vert \to \infty\) makes the loss arbitrarily negative, we impose the constraint \(\Vert \pmb w_1 \Vert = 1\) and instead optimize \[ \tilde{\mathcal{L}}(\pmb w_1) = \pmb w_1^\intercal \pmb{\hat\Sigma}\pmb w_1 - \lambda_1(\pmb w_1^\intercal\pmb w_1 - 1) \] which yields \[ \frac{\partial}{\partial \pmb w_1}\tilde{\mathcal{L}}(\pmb w_1) = 2\pmb{\hat\Sigma}\pmb w_1 - 2 \lambda_1 \pmb w_1 = 0 \] \[ \boxed{ \pmb{\hat\Sigma}\pmb w_1 = \lambda_1 \pmb w_1 } \] Hence the optimal direction to project the data is an eigenvector of the covariance matrix. At the optimum, \(\pmb w_1^\intercal \pmb{\hat\Sigma}\pmb w_1 = \lambda_1\), so to minimize the loss we pick the eigenvector with the largest eigenvalue: the first principal component.
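A quick numerical sanity check of this result, assuming synthetic data with anisotropic variance (all names illustrative): the top eigenvector of \(\pmb{\hat\Sigma}\) attains a larger projected variance \(\pmb w^\intercal \pmb{\hat\Sigma}\pmb w\) than any random unit vector, and therefore the smallest reconstruction loss.

```python
import numpy as np

# Check that the top eigenvector of the empirical covariance maximizes
# w^T Sigma_hat w over unit vectors. Toy anisotropic data, illustrative only.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 4)) @ np.diag([3.0, 2.0, 1.0, 0.5])
Xc = X - X.mean(axis=0)
Sigma_hat = Xc.T @ Xc / Xc.shape[0]

eigvals, eigvecs = np.linalg.eigh(Sigma_hat)
w1 = eigvecs[:, -1]                       # eigenvector with the largest eigenvalue

def projected_variance(w):
    return w @ Sigma_hat @ w              # w^T Sigma_hat w for a unit vector w

# Random unit vectors never beat the top eigenvector (up to numerics).
trials = rng.normal(size=(1000, 4))
trials /= np.linalg.norm(trials, axis=1, keepdims=True)
best_random = max(projected_variance(w) for w in trials)
print(projected_variance(w1), best_random)   # first value is the larger one
```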

PCA variance projection

Choosing the number of latent dimensions

Scree plot
Plot of the eigenvalues \(\lambda_j\) vs \(j\) in order of decreasing magnitude
Fraction of variance explained
Same information as the scree plot, but cumulative: plot \(\sum_{j=1}^{L}\lambda_j / \sum_{j'=1}^{D}\lambda_{j'}\) vs \(L\)
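A sketch of both diagnostics with NumPy and Matplotlib, assuming a toy data set; the eigenvalues come from the centered empirical covariance as before.

```python
import numpy as np
import matplotlib.pyplot as plt

# Scree plot and cumulative fraction of variance explained on toy data.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 10)) @ rng.normal(size=(10, 10))
Xc = X - X.mean(axis=0)
eigvals = np.linalg.eigvalsh(Xc.T @ Xc / Xc.shape[0])[::-1]   # descending order

frac = eigvals / eigvals.sum()            # fraction of variance per component
cumfrac = np.cumsum(frac)                 # cumulative fraction explained

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(np.arange(1, len(eigvals) + 1), eigvals, "o-")
ax1.set(xlabel="component $j$", ylabel=r"$\lambda_j$", title="Scree plot")
ax2.plot(np.arange(1, len(eigvals) + 1), cumfrac, "o-")
ax2.set(xlabel="number of components $L$", ylabel="fraction of variance explained")
plt.tight_layout()
plt.show()
```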

Factor analysis

Generative model

Factor analysis

Definition
Generalization of PCA. Defined by the following linear-Gaussian latent variable generative probabilistic model \[ \begin{align*} p(\pmb z) &= \mathcal{N}(\pmb z | \pmb \mu_0, \pmb \Sigma_0) \\ p(\pmb x | \pmb z, \pmb \theta) &= \mathcal{N}(\pmb x | {\bf W}\pmb z + \pmb \mu, \pmb \Psi) \end{align*} \] where the \(D \times L\) matrix \({\bf W}\) is the factor loading matrix, and the \(D \times D\) matrix \(\pmb \Psi\) is a diagonal covariance matrix (so that correlations between the visible dimensions must be explained by the latent factors).
Interpretation
The marginal distribution simplifies to \[ p(\pmb x) = \mathcal{N}(\pmb x | \pmb \mu, \textbf{WW}^\intercal + \pmb \Psi) \] (taking \(\pmb \mu_0 = \pmb 0\) and \(\pmb \Sigma_0 = \mathbb{I}\), which is possible without loss of generality). FA can thus be thought of as a low-rank parameterization of a Gaussian distribution.
Covariance
FA approximates the covariance matrix of the visible vector using a low-rank decomposition \[ {\bf C} = \text{Cov}[\pmb x] = {\bf WW}^\intercal + \pmb \Psi \]
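A short sketch of the FA generative process and the resulting low-rank covariance, using synthetic data and scikit-learn's FactorAnalysis; dimensions and parameter values are illustrative choices.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Sample from the FA generative model, then recover W W^T + Psi from a fit.
rng = np.random.default_rng(3)
D, L, N = 6, 2, 2000
W_true = rng.normal(size=(D, L))                  # factor loading matrix
psi_true = rng.uniform(0.1, 0.5, size=D)          # diagonal of Psi

# Generative process: z ~ N(0, I), x | z ~ N(W z, Psi)  (mu = 0 here)
Z = rng.normal(size=(N, L))
X = Z @ W_true.T + rng.normal(size=(N, D)) * np.sqrt(psi_true)

fa = FactorAnalysis(n_components=L).fit(X)
C_model = fa.components_.T @ fa.components_ + np.diag(fa.noise_variance_)
C_empirical = np.cov(X, rowvar=False)
print(np.max(np.abs(C_model - C_empirical)))      # the two should be close
```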

Illustration: FA generative process

Factor analysis with isotropic noise, \(\pmb \Psi = \sigma^2 \mathbb{I}\), corresponds to so-called probabilistic PCA

Autoencoders

Encoder/decoder can be made nonlinear

PCA and factor analysis
Both learn a linear mapping from \(\pmb x\) to \(\pmb z\), called the encoder, \(f_\text{e}\), and another linear mapping from \(\pmb z\) to \(\pmb x\), called the decoder, \(f_\text{d}\). The overall reconstruction function has the form \(r(\pmb x) = f_\text{d}(f_\text{e}(\pmb x))\).
Nonlinear mappings
Learn nonlinear mappings via a neural network: this is called an autoencoder. Use a multilayer perceptron (MLP) and enforce an information bottleneck layer, i.e., a hidden layer with fewer units than the input.
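A minimal PyTorch sketch of such an MLP autoencoder with a bottleneck of size \(L\); layer widths, learning rate, and the random stand-in batch are placeholder choices.

```python
import torch
import torch.nn as nn

# Minimal MLP autoencoder with an information bottleneck of size L.
D, L = 784, 2          # e.g. flattened 28x28 images, 2-D latent space

encoder = nn.Sequential(nn.Linear(D, 256), nn.ReLU(), nn.Linear(256, L))
decoder = nn.Sequential(nn.Linear(L, 256), nn.ReLU(), nn.Linear(256, D))
params = list(encoder.parameters()) + list(decoder.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

def train_step(x):                     # x: batch of shape (B, D)
    z = encoder(x)                     # encode to the bottleneck
    x_hat = decoder(z)                 # decode back to visible space
    loss = ((x - x_hat) ** 2).mean()   # reconstruction error
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# One illustrative step on random data standing in for a real batch.
print(train_step(torch.randn(32, D)))
```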

Illustration

Variational autoencoder

Variational autoencoder (VAE)
Probabilistic version of an autoencoder. Advantage: VAE is a generative model that can create new samples. Two key ideas:
1. VAE extends factor analysis
we replace \(p(\pmb x| \pmb z) = \mathcal{N}(\pmb x | {\bf W}\pmb z, \sigma^2 \mathbb{I})\) with \[ p_\theta (\pmb x | \pmb z) = \mathcal{N}(\pmb x | f_d(\pmb z; \pmb \theta), \sigma^2 \mathbb{I}) \]
2. Parametrize an inference network
Assume the posterior is Gaussian with diagonal covariance \[ q_\phi(\pmb z | \pmb x) = \mathcal{N}(\pmb z | f_{\text{e},\mu}(\pmb x; \phi), \text{diag}(f_{\text{e},\sigma}(\pmb x; \phi))) \]
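A minimal PyTorch sketch combining the two ideas, assuming a standard-normal prior \(p(\pmb z) = \mathcal{N}(\pmb 0, \mathbb{I})\) and a fixed observation variance \(\sigma^2\); training minimizes the negative ELBO (reconstruction term plus KL to the prior). Layer sizes are placeholders.

```python
import torch
import torch.nn as nn

# Minimal VAE: Gaussian decoder p_theta(x|z), diagonal-Gaussian encoder q_phi(z|x).
D, L, sigma2 = 784, 2, 1.0

enc = nn.Sequential(nn.Linear(D, 256), nn.ReLU())
enc_mu, enc_logvar = nn.Linear(256, L), nn.Linear(256, L)  # f_{e,mu}, f_{e,sigma}
dec = nn.Sequential(nn.Linear(L, 256), nn.ReLU(), nn.Linear(256, D))  # f_d

def neg_elbo(x):
    h = enc(x)
    mu, logvar = enc_mu(h), enc_logvar(h)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization
    x_hat = dec(z)
    # -E_q[log p(x|z)] up to constants, for a Gaussian likelihood with variance sigma2
    recon = ((x - x_hat) ** 2).sum(dim=1) / (2 * sigma2)
    # KL( q(z|x) || N(0, I) ) in closed form for diagonal Gaussians
    kl = 0.5 * (torch.exp(logvar) + mu**2 - 1.0 - logvar).sum(dim=1)
    return (recon + kl).mean()

params = [*enc.parameters(), *enc_mu.parameters(),
          *enc_logvar.parameters(), *dec.parameters()]
opt = torch.optim.Adam(params, lr=1e-3)

# One illustrative optimization step on random data standing in for a batch.
opt.zero_grad()
loss = neg_elbo(torch.randn(32, D))
loss.backward()
opt.step()
print(loss.item())
```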

Example: MNIST

  • Left: VAE
  • Right: AE

Summary

Summary

Dimensionality reduction
The original visible space may be too high-dimensional; we wish to reduce it by finding a mapping to a low-dimensional latent space.
Principal component analysis
Find linear and orthogonal projection. Eigenvectors oriented along the directions of largest variance of the data. Choose the number of dimensions by looking at the fraction of variance explained.
Factor analysis
Generative model. Reduces to probabilistic PCA for isotropic noise
Autoencoders
Find nonlinear mappings that encode/decode the data between the original and latent spaces. The variational autoencoder is a probabilistic, generative version.

References

Murphy, Kevin P. 2022. Probabilistic Machine Learning: An Introduction. MIT press.