Lecture 13: Dimensionality reduction
Institute for Theoretical Physics, Heidelberg University
Learn a mapping from the high-dimensional visible space, \(\pmb x \in \mathbb{R}^D\), to a low-dimensional latent space, \(\pmb z \in \mathbb{R}^L\).
The mapping can be deterministic (e.g. PCA, below) or probabilistic (e.g. factor analysis).
PCA
PCA effectively finds a rotation of the data and ranks the resulting directions (eigenvectors) by how much of the data's variance they explain (eigenvalues); recall the empirical covariance matrix from the last slide.
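A minimal numpy sketch of this view on made-up toy data (the variable names below, e.g. `Sigma_hat`, are ours, not notation from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: N points in D dimensions with clearly different variances per axis.
N, D = 500, 3
X = rng.normal(size=(N, D)) * np.array([3.0, 1.0, 0.3])

# Empirical covariance matrix of the centered data (cf. the previous slide).
Xc = X - X.mean(axis=0)
Sigma_hat = Xc.T @ Xc / N

# Eigendecomposition: eigenvectors = principal directions,
# eigenvalues = variance of the data along those directions.
eigvals, eigvecs = np.linalg.eigh(Sigma_hat)   # eigh, since Sigma_hat is symmetric
order = np.argsort(eigvals)[::-1]              # rank directions by explained variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print("variance explained per direction:", eigvals)
```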
Estimate the best 1D solution, \(\pmb w_1 \in \mathbb{R}^D\).
Coefficients for the first basis vector: \(\pmb{\tilde{z}}_1 = [z_{11}, \dots, z_{N1}] \in \mathbb{R}^N\). Reconstruction error: \[ \begin{align*} \mathcal{L}(\pmb w_1, \pmb{\tilde{z}}_1) &= \frac 1N \sum_{n=1}^N (\pmb x_n - z_{n1}\pmb w_1)^\intercal(\pmb x_n - z_{n1} \pmb w_1)\\ &= \frac 1N \sum_{n=1}^N \left[ \pmb x_n^\intercal \pmb x_n -2z_{n1}\pmb w_1^\intercal \pmb x_n + z_{n1}^2 \right] \end{align*} \] since \(\pmb w_1^\intercal \pmb w_1 = 1\) (orthonormality assumption).
Optimization wrt \(z_{n1}\) yields \[ \frac{\partial}{\partial z_{n1}}\mathcal{L}(\pmb w_1, \pmb{\tilde{z}}_1) = \frac 1N [-2 \pmb w_1^\intercal \pmb x_n + 2z_{n1}] = 0 \Rightarrow z_{n1} = \pmb w_1^\intercal \pmb x_n \] So the optimal embedding orthogonally projects the data onto \(\pmb w_1\).
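A quick numerical sanity check of the last two steps, on centered toy data with a randomly drawn unit direction \(\pmb w_1\) (nothing here is from the lecture, it is only a sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 200, 5
X = rng.normal(size=(N, D))
X -= X.mean(axis=0)                       # centered toy data

w1 = rng.normal(size=D)
w1 /= np.linalg.norm(w1)                  # unit norm, so w1^T w1 = 1

def recon_loss(z):
    """(1/N) sum_n ||x_n - z_n w1||^2, the reconstruction error defined above."""
    residual = X - np.outer(z, w1)
    return np.mean(np.sum(residual**2, axis=1))

z_opt = X @ w1                            # z_{n1} = w1^T x_n (orthogonal projection)
z_other = z_opt + 0.1 * rng.normal(size=N)

assert recon_loss(z_opt) < recon_loss(z_other)   # projection coefficients are optimal
print(recon_loss(z_opt), recon_loss(z_other))
```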
Plugging this back into the loss gives \[ \begin{align*} \mathcal{L}(\pmb w_1) &= \frac 1N \sum_{n=1}^N [ \pmb x_n^\intercal \pmb x_n - z_{n1}^2 ] \\ &= \text{const} - \frac 1N \sum_{n=1}^N z_{n1}^2 \\ &= \text{const} -\frac 1N \sum_{n=1}^N \pmb w_1^\intercal \pmb x_n \pmb x_n^\intercal \pmb w_1 \\ &= \text{const} - \pmb w_1^\intercal \pmb{\hat\Sigma}\pmb w_1 \end{align*} \] The last equality holds only if the data are centered (zero mean), so that \(\pmb{\hat\Sigma} = \frac 1N \sum_{n=1}^N \pmb x_n \pmb x_n^\intercal\).
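A short numpy check of the centering caveat, on made-up data with a non-zero mean:

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 300, 4
X = rng.normal(size=(N, D)) + 5.0          # toy data with a non-zero mean
Xc = X - X.mean(axis=0)                    # centered version

w1 = rng.normal(size=D)
w1 /= np.linalg.norm(w1)

Sigma_hat = Xc.T @ Xc / N                  # empirical covariance (centered data)

mean_sq_centered = np.mean((Xc @ w1)**2)   # (1/N) sum_n z_{n1}^2 with centered x_n
mean_sq_raw      = np.mean((X  @ w1)**2)   # same quantity with uncentered x_n

print(mean_sq_centered, w1 @ Sigma_hat @ w1)   # equal
print(mean_sq_raw,      w1 @ Sigma_hat @ w1)   # not equal: data must be centered
```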
Without a constraint, \(\Vert \pmb w_1 \Vert \to \infty\) trivially minimizes the loss, so we impose \(\Vert \pmb w_1 \Vert = 1\) and instead maximize the Lagrangian \[ \tilde{\mathcal{L}}(\pmb w_1) = \pmb w_1^\intercal \pmb{\hat\Sigma}\pmb w_1 - \lambda_1(\pmb w_1^\intercal\pmb w_1 - 1) \] Setting the gradient to zero yields \[ \frac{\partial}{\partial \pmb w_1}\tilde{\mathcal{L}}(\pmb w_1) = 2\pmb{\hat\Sigma}\pmb w_1 - 2 \lambda_1 \pmb w_1 = 0 \] \[ \boxed{ \pmb{\hat\Sigma}\pmb w_1 = \lambda_1 \pmb w_1 } \] Hence the optimal direction onto which to project the data is an eigenvector of the covariance matrix; since the captured variance is \(\pmb w_1^\intercal \pmb{\hat\Sigma}\pmb w_1 = \lambda_1\), it is the eigenvector with the largest eigenvalue.
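A numpy sketch illustrating the boxed result on toy data: the leading eigenvector of \(\pmb{\hat\Sigma}\) captures at least as much variance as any randomly drawn unit direction (the data and the number of random trials are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 400, 6
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))   # correlated toy data
X -= X.mean(axis=0)
Sigma_hat = X.T @ X / N

eigvals, eigvecs = np.linalg.eigh(Sigma_hat)
w1 = eigvecs[:, -1]                        # eigenvector with the largest eigenvalue

def captured_variance(w):
    w = w / np.linalg.norm(w)              # enforce the unit-norm constraint
    return w @ Sigma_hat @ w               # w^T Sigma_hat w

# No random unit direction should capture more variance than the top eigenvector.
best_random = max(captured_variance(rng.normal(size=D)) for _ in range(10_000))
assert captured_variance(w1) >= best_random
print(captured_variance(w1), eigvals[-1], best_random)   # first two coincide
```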
Factor analysis
Factor analysis is the probabilistic counterpart of PCA: a linear-Gaussian latent-variable model whose special case with isotropic noise is known as probabilistic PCA.
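A minimal sketch of this relation, assuming the standard linear-Gaussian formulation (all sizes and parameter values below are invented for illustration): factor analysis allows a diagonal noise covariance, while probabilistic PCA restricts it to be isotropic.

```python
import numpy as np

rng = np.random.default_rng(4)
D, L, N = 5, 2, 100_000

# Linear-Gaussian latent-variable model:  x = W z + mu + eps,  z ~ N(0, I_L).
# Factor analysis:    eps ~ N(0, Psi) with Psi diagonal.
# Probabilistic PCA:  the special case Psi = sigma^2 * I.
W = rng.normal(size=(D, L))
mu = rng.normal(size=D)
psi_diag = rng.uniform(0.1, 0.5, size=D)   # per-dimension noise variances (FA)
sigma2 = 0.2                               # single shared noise variance (PPCA)

Z = rng.normal(size=(N, L))
X_fa   = Z @ W.T + mu + rng.normal(size=(N, D)) * np.sqrt(psi_diag)
X_ppca = Z @ W.T + mu + rng.normal(size=(N, D)) * np.sqrt(sigma2)

# The marginal covariance of x is W W^T + Psi (FA) or W W^T + sigma^2 I (PPCA);
# the sample covariances match these up to Monte-Carlo error.
print(np.abs(np.cov(X_fa.T,   bias=True) - (W @ W.T + np.diag(psi_diag))).max())
print(np.abs(np.cov(X_ppca.T, bias=True) - (W @ W.T + sigma2 * np.eye(D))).max())
```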