Lecture 11: Kernel methods
Institute for Theoretical Physics, Heidelberg University
Please complete by Jan 31, 2024.
Murphy (2022)
Nonparametric methods
So far, we have considered parametric methods for regression and classification.
Kernel function (Mercer kernel)
Gram matrix
If \({\bf K}\) is positive definite, then we can write the eigendecomposition \({\bf K} = {\bf U}^\intercal \pmb \Lambda {\bf U}\), where \(\pmb \Lambda\) is a diagonal matrix of eigenvalues, \(\lambda_i > 0\), and \({\bf U}\) contains the eigenvectors.
Consider element \((i, j)\) of \({\bf K}\): \[ k_{ij} = (\pmb \Lambda^{\frac 12} {\bf U}_{:i})^\intercal (\pmb \Lambda^{\frac 12} {\bf U}_{:j}) \] where \({\bf U}_{:i}\) is the \(i\)’th column of \({\bf U}\).
Now define \(\pmb \phi(\pmb x_i) = \pmb \Lambda^{\frac 12} {\bf U}_{:i}\); we can then write \[ k_{ij} = \pmb \phi(\pmb x_i)^\intercal \pmb \phi(\pmb x_j) = \sum_m \phi_m(\pmb x_i) \phi_m(\pmb x_j) \]
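As a numerical sanity check of this construction, we can build a Gram matrix, eigendecompose it, and verify that the implied features \(\pmb \phi(\pmb x_i) = \pmb \Lambda^{\frac 12} {\bf U}_{:i}\) reproduce the kernel entries. The RBF kernel and random inputs below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))   # a handful of toy 2D inputs (illustrative)

# Gram matrix for an RBF kernel (any positive-definite kernel would do)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq_dists)

# Eigendecomposition; numpy returns K = V @ diag(lam) @ V.T,
# so U = V.T matches the convention K = U^T Lambda U used above.
lam, V = np.linalg.eigh(K)
assert np.all(lam > 0)        # K is positive definite
U = V.T

# Row i of Phi is phi(x_i) = Lambda^{1/2} U_{:i}
Phi = (np.sqrt(lam)[:, None] * U).T

# k_ij = phi(x_i)^T phi(x_j) recovers the Gram matrix
assert np.allclose(Phi @ Phi.T, K)
```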
Mercer’s theorem
The entries in the kernel matrix can be computed by performing an inner product of some feature vectors that are implicitly defined by the eigenvectors of the kernel matrix.
Kernel trick
We don’t actually need to define the feature representation \(\pmb \phi(\pmb x)\) explicitly; the algorithm only needs the kernel, which supplies the inner product \(\pmb \phi(\pmb x)^\intercal \pmb \phi(\pmb x')\) directly.
In the following: linear regression in an (implicit!) high-dimensional Hilbert space.
Quadratic kernel: \(\mathcal{K}(\pmb x, \pmb x') = \langle \pmb x, \pmb x' \rangle^2\)
In 2D, we have \[ \begin{align*} \mathcal{K}(\pmb x, \pmb x') &= (x_1 x_1' + x_2 x_2')^2 \\ &= x_1^2x_1'^2 + 2x_1 x_2 x_1' x_2' + x_2^2 x_2'^2 \end{align*} \]
We can write this as \(\mathcal{K}(\pmb x, \pmb x') = \pmb \phi(\pmb x)^\intercal \pmb \phi(\pmb x')\) if we define \(\pmb \phi(x_1, x_2) = [x_1^2, \sqrt 2 x_1 x_2, x_2^2] \in \mathbb{R}^3\).
We embed the 2D inputs \(\pmb x\) into a 3D feature space \(\pmb \phi(\pmb x)\).
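A quick numerical check of this feature map (the test points below are arbitrary):

```python
import numpy as np

def quad_kernel(x, xp):
    """Quadratic kernel K(x, x') = <x, x'>^2."""
    return np.dot(x, xp) ** 2

def phi(x):
    """Explicit 3D feature map for the 2D quadratic kernel."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(quad_kernel(x, xp), phi(x) @ phi(xp))
```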
Recall MAP estimate for ridge regression from Lecture 8: \[ \pmb{\hat w}_\text{map} = ({\bf X^\intercal X} + \lambda \mathbb{I}_D)^{-1} {\bf X^\intercal} \pmb y \]
We can rewrite this using the matrix inversion lemma: \[ \pmb{w} = {\bf X^\intercal} \underbrace{({\bf XX^\intercal} + \lambda \mathbb{I}_N)^{-1}{\bf y}}_{\pmb \alpha} \] where \(\pmb \alpha\) is a vector of size \(N\). The solution vector is thus simply a linear combination of the \(N\) training vectors: \[ \pmb{w} = {\bf X^\intercal}\pmb{\alpha} = \sum_{n=1}^N \alpha_n \pmb{x}_n. \] At test time, we use this to compute the predictive mean \[ f(\pmb{x}; \pmb{w}) = \pmb{w}^\intercal \pmb{x} = \sum_{n=1}^N \alpha_n \pmb{x}_n^\intercal \pmb{x} \]
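The primal and dual forms of the solution agree numerically; a short sketch with random data and an arbitrary \(\lambda\):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 20, 3
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
lam = 0.1                      # arbitrary ridge strength for the check

# Primal solution: (X^T X + lam I_D)^{-1} X^T y
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# Dual solution: alpha = (X X^T + lam I_N)^{-1} y, then w = X^T alpha
alpha = np.linalg.solve(X @ X.T + lam * np.eye(N), y)
w_dual = X.T @ alpha

assert np.allclose(w_primal, w_dual)
```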
Using the kernel trick, we can rewrite the predictive output as \[ \boxed{ f(\pmb x; \pmb w) = \sum_{n=1}^N \alpha_n \mathcal{K}(\pmb x_n, \pmb x) } \] where \(\pmb \alpha = ({\bf K} + \lambda \mathbb{I}_N)^{-1}\pmb y\). Note that the solution vector \(\pmb \alpha\) is not sparse, i.e., predictions at test time take \(O(N)\) time.
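Putting this together, a minimal kernel ridge regressor; the RBF kernel, lengthscale, \(\lambda\), and toy 1D data below are illustrative assumptions:

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=0.5):
    """K[i, j] = exp(-||a_i - b_j||^2 / (2 * lengthscale^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale**2)

rng = np.random.default_rng(2)
X_train = rng.uniform(-3, 3, size=(30, 1))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=30)

# Fit: alpha = (K + lam I_N)^{-1} y
lam = 0.1
K = rbf_kernel(X_train, X_train)
alpha = np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)

# Predict: f(x) = sum_n alpha_n K(x_n, x) -- O(N) kernel evaluations per test point
X_test = np.linspace(-3, 3, 100)[:, None]
f_test = rbf_kernel(X_test, X_train) @ alpha
```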
[Figure: functions sampled from the model given 0, 1, 2, and 4 training points, illustrating the model's error bars.]
Kernel-ridge regression
Recall the predictive output \[ f(\pmb x; \pmb w) = \sum_{n=1}^N \alpha_n \mathcal{K}(\pmb x_n, \pmb x) \] where \(\pmb \alpha = ({\bf K} + \lambda \mathbb{I}_N)^{-1}\pmb y\).
What about classification?