Computational Statistics & Data Analysis (MVComp2)

Lecture 11: Kernel methods

Tristan Bereau

Institute for Theoretical Physics, Heidelberg University

Course evaluation

Please complete by Jan 31, 2024.

https://uni-heidelberg.evasys.de/evasys/online.php?p=DTHA4

Introduction

Literature

Murphy (2022)

  • Chapter 17: Kernel methods (in the part on nonparametric models)

Recap from last time

\(K\)-nearest neighbors (KNN)
One of the simplest and most intuitive classifiers; works well in low dimensions.
Curse of dimensionality
In high dimensions, all points are far apart, so methods that rely on local densities become difficult to use. Instead, we can use parametric models.
Logistic regression
Binary case: Bernoulli distribution with the sigmoid (logistic) function; the decision boundary is a linear hyperplane that separates the two classes.
Linear discriminant analysis
Generative classifier (can generate features \(\pmb x\) for each class), with posterior that is a linear function of \(\pmb x\)

Feature transformation

Example: Concentric circles

  • Data in 2D is not linearly separable
  • But with the feature map \[ \psi\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right) = \begin{bmatrix} x_1 \\ x_2 \\ \sqrt{x_1^2 + x_2^2} \end{bmatrix} \] the problem becomes linearly separable (see the sketch below).
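A minimal numerical sketch of this idea (NumPy, scikit-learn, and the `LogisticRegression` classifier are my own choices, not part of the slides): lifting concentric-circle data with \(\psi\) lets a linear classifier separate the two classes.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression

# Two concentric circles: not linearly separable in the original 2D space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Feature map psi: append the radius sqrt(x1^2 + x2^2) as a third feature
radius = np.sqrt((X ** 2).sum(axis=1, keepdims=True))
X_lifted = np.hstack([X, radius])

print(LogisticRegression().fit(X, y).score(X, y))                # close to chance in 2D
print(LogisticRegression().fit(X_lifted, y).score(X_lifted, y))  # ~1.0: separable in 3D
```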

Kernel methods

Nonparametric methods

Nonparametric methods

So far, we have considered parametric methods for regression and classification.

Nonparametric methods
Do not assume a fixed parametric form; instead, estimate the prediction function itself (rather than its parameters).
Key idea
Observe the function value at a fixed set of \(N\) points, \(y_n = f(\pmb{x}_n)\), where \(f\) is the unknown function. Prediction for a new point: \(f(\pmb{x}^*)\) is a weighted combination of the \(\{ f(\pmb x_n)\}\).
Intuition
Method based on comparing “how similar” \(\pmb x^*\) is to each one of the \(N\) training points, \(\{ \pmb x_n\}\).
Disadvantage
We need to keep the entire training set \(\mathcal{D} = \{ (\pmb x_n, y_n)\}\) at inference time (i.e., we cannot compress \(\mathcal{D}\)).

Kernel ridge regression

Kernel function

Kernel function (Mercer kernel)

Encode similarity of two input vectors
If we know that \(\pmb x_i\) is similar to \(\pmb x_j\), then we can encourage the model to make the predicted outputs at the two locations (i.e., \(f(\pmb x_i)\) and \(f(\pmb x_j)\)) similar.
Mercer kernel (positive definite kernel)
Any symmetric function \(\mathcal{K}: \mathcal{X} \times \mathcal{X} \to \mathbb{R}^+\) such that \[ \sum_{i=1}^N\sum_{j=1}^N \mathcal{K}(\pmb x_i, \pmb x_j) c_i c_j \geq 0 \] for any set of \(N\) (unique) points \(\pmb x_i \in \mathcal{X}\) and any choice of numbers \(c_i \in \mathbb{R}\).

Gram matrix

Gram matrix

Equivalent condition (to Mercer kernel)
Given \(N\) datapoints, the Gram matrix is the following \(N \times N\) similarity matrix \[ {\bf K} = \begin{bmatrix} \mathcal{K}(\pmb x_1, \pmb x_1) & \dots & \mathcal{K}(\pmb x_1, \pmb x_N) \\ \vdots & \ddots & \vdots \\ \mathcal{K}(\pmb x_N, \pmb x_1) & \dots & \mathcal{K}(\pmb x_N, \pmb x_N) \end{bmatrix} \] \(\mathcal{K}\) is a Mercer kernel iff the Gram matrix is positive definite for any set of distinct inputs.
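As a quick numerical illustration (a NumPy sketch, not prescribed by the slides; the linear kernel \(\mathcal{K}(\pmb x, \pmb x') = \pmb x^\intercal \pmb x'\) is chosen only for simplicity), we can build the Gram matrix for a random set of points and check that no eigenvalue is negative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))   # 20 random points in 3D

# Gram matrix of the linear kernel K(x, x') = x^T x'
K = X @ X.T

# Mercer condition: sum_ij K_ij c_i c_j >= 0 for all c, i.e. no negative eigenvalues
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)   # True (up to numerical round-off)
```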

Mercer’s theorem

Eigendecomposition

If \({\bf K}\) is positive definite, then we can write the eigendecomposition \({\bf K} = {\bf U}^\intercal \pmb \Lambda {\bf U}\), where \(\pmb \Lambda\) is a diagonal matrix of eigenvalues, \(\lambda_i > 0\), and \({\bf U}\) contains the eigenvectors.

Matrix element

Consider element \((i, j)\) of \({\bf K}\): \[ k_{ij} = (\pmb \Lambda^{\frac 12} {\bf U}_{:i})^\intercal (\pmb \Lambda^{\frac 12} {\bf U}_{:j}) \] where \({\bf U}_{:i}\) is the \(i\)’th column of \({\bf U}\).

Inner product

If we now define \(\pmb \phi(\pmb x_i) = \pmb \Lambda^{\frac 12} {\bf U}_{:i}\), we can write \[ k_{ij} = \pmb \phi(\pmb x_i)^\intercal \pmb \phi(\pmb x_j) = \sum_m \phi_m(\pmb x_i) \phi_m(\pmb x_j) \]

Mercer’s theorem

The entries in the kernel matrix can be computed by performing an inner product of some feature vectors that are implicitly defined by the eigenvectors of the kernel matrix.
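This construction can be checked numerically. The following sketch (NumPy-based; the kernel choice and variable names are mine) eigendecomposes a Gram matrix, builds the implicit feature vectors \(\pmb \phi(\pmb x_i) = \pmb \Lambda^{\frac 12} {\bf U}_{:i}\), and verifies that their inner products reproduce the kernel entries.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))

# Gram matrix of a Gaussian kernel (any Mercer kernel would do)
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)

# NumPy convention: K = V @ diag(lam) @ V.T, so U = V.T and U_{:i} = V[i, :]
lam, V = np.linalg.eigh(K)
lam = np.clip(lam, 0.0, None)       # guard against tiny negative round-off
Phi = V * np.sqrt(lam)              # row i is phi(x_i) = Lambda^{1/2} U_{:i}

print(np.allclose(Phi @ Phi.T, K))  # True: k_ij = phi(x_i)^T phi(x_j)
```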

Kernel trick

Kernel trick

We never need to compute the feature representation \(\pmb \phi(\pmb x)\) explicitly: the kernel only requires the inner product \(\pmb \phi(\pmb x)^\intercal \pmb \phi(\pmb x')\).

In the following: linear regression in an (implicit!) high-dimensional Hilbert space.

Quadratic kernel

Quadratic kernel: \(\mathcal{K}(\pmb x, \pmb x') = \langle \pmb x, \pmb x' \rangle^2\)

In 2D, we have \[ \begin{align*} \mathcal{K}(\pmb x, \pmb x') &= (x_1 x_1' + x_2 x_2')^2 \\ &= x_1^2x_1'^2 + 2x_1 x_2 x_1' x_2' + x_2^2 x_2'^2 \end{align*} \]

We can write this as \(\mathcal{K}(\pmb x, \pmb x') = \pmb \phi(\pmb x)^\intercal \pmb \phi(\pmb x')\) if we define \(\pmb \phi(x_1, x_2) = [x_1^2, \sqrt 2 x_1 x_2, x_2^2] \in \mathbb{R}^3\).

We embed the 2D inputs \(\pmb x\) into a 3D feature space \(\pmb \phi(\pmb x)\).
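A quick numerical check of this identity (a sketch; NumPy is my own choice here):

```python
import numpy as np

def phi(x):
    """Explicit feature map for the 2D quadratic kernel."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

rng = np.random.default_rng(1)
x, xp = rng.normal(size=2), rng.normal(size=2)

# <x, x'>^2 equals the inner product of the explicit features
print(np.isclose(np.dot(x, xp) ** 2, phi(x) @ phi(xp)))  # True
```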

Gaussian (RBF) kernel

Gaussian kernel (or RBF kernel)
Widely used kernel for real-valued inputs \[ \mathcal{K}(\pmb x, \pmb x') = \exp\left(- \frac{\Vert \pmb x - \pmb x'\Vert^2}{2\ell^2} \right) \] where \(\ell\) is the length scale (or bandwidth) of the kernel, i.e., the distance over which we expect differences to matter.
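In code, the Gaussian kernel amounts to a few lines (a NumPy sketch; the helper name `rbf_kernel` is mine and is reused in the regression sketches below):

```python
import numpy as np

def rbf_kernel(XA, XB, ell=1.0):
    """Gram matrix K[i, j] = exp(-||XA[i] - XB[j]||^2 / (2 ell^2))."""
    sq_dists = np.sum((XA[:, None, :] - XB[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * ell ** 2))
```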

Kernel-ridge regression

Recall MAP estimate for ridge regression from Lecture 8: \[ \pmb{\hat w}_\text{map} = ({\bf X^\intercal X} + \lambda \mathbb{I}_D)^{-1} {\bf X^\intercal} \pmb y \]

Rewrite the equation using the matrix inversion lemma: \[ \pmb{w} = {\bf X^\intercal} \underbrace{({\bf XX^\intercal} + \lambda \mathbb{I}_N)^{-1}{\bf y}}_{\pmb \alpha} \] where \(\pmb \alpha\) is a vector of size \(N\). So the solution vector is simply a linear combination of the \(N\) training vectors: \[ \pmb{w} = {\bf X^\intercal}\pmb{\alpha} = \sum_{n=1}^N \alpha_n \pmb{x}_n. \] At test time, use this to compute the predictive mean \[ f(\pmb{x}; \pmb{w}) = \pmb{w}^\intercal \pmb{x} = \sum_{n=1}^N \alpha_n \pmb{x}_n^\intercal \pmb{x} \]

Use the kernel trick to rewrite the predictive output as \[ \boxed{ f(\pmb x; \pmb w) = \sum_{n=1}^N \alpha_n \mathcal{K}(\pmb x_n, \pmb x) } \] where \(\pmb \alpha = ({\bf K} + \lambda \mathbb{I}_N)^{-1}\pmb y\). Note that the solution vector \(\pmb \alpha\) is not sparse, i.e., predictions at test time take \(O(N)\) time.
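Putting the boxed formula to work, here is a minimal kernel ridge regression sketch (NumPy-based; the toy sine data, the length scale, and \(\lambda\) are arbitrary choices of mine, not from the slides):

```python
import numpy as np

def rbf_kernel(XA, XB, ell=0.3):  # same helper as above
    sq_dists = np.sum((XA[:, None, :] - XB[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * ell ** 2))

# Toy 1D regression data
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 1, size=(30, 1))
y_train = np.sin(2 * np.pi * X_train[:, 0]) + 0.1 * rng.normal(size=30)

# Fit: alpha = (K + lambda I)^{-1} y
lam = 0.1
K = rbf_kernel(X_train, X_train)
alpha = np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)

# Predict: f(x) = sum_n alpha_n K(x_n, x)  -- an O(N) sum over all training points
X_test = np.linspace(0, 1, 100)[:, None]
f_test = rbf_kernel(X_test, X_train) @ alpha
```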

Gaussian processes

Intuition

Sampled functions given 0, 1, 2, and 4 training points.

Model error bars!

Akin to KRR

Noisy observations
We observe a noisy version of the underlying function \[ y_n = f(\pmb x_n) + \epsilon_n, \qquad \epsilon_n \sim \mathcal{N}(0, \sigma_y^2) \]
Covariance
Covariance of the observed noisy responses is \[ \text{Cov}[y_i, y_j] = \text{Cov}[f_i, f_j] + \text{Cov}[\epsilon_i, \epsilon_j] = \mathcal{K}(\pmb x_i, \pmb x_j) + \sigma^2_y \delta_{ij} \] such that \[ \text{Cov}[\pmb y | {\bf X}] = {\bf K}_{X, X} + \sigma^2_y \mathbb{I}_N = {\bf K}_\sigma \]
Joint density
The joint density of the observed data and the noise-free function values at the test points is \[ \begin{pmatrix} \pmb y \\ \pmb f_* \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \pmb \mu_X \\ \pmb \mu_* \end{pmatrix}, \begin{pmatrix} {\bf K}_\sigma & {\bf K}_{X,*} \\ {\bf K}_{X,*}^\intercal & {\bf K}_{*,*} \end{pmatrix} \right) \] Conditioning yields the posterior predictive density at a set of test points \({\bf X}_*\): \[ \begin{align*} p(\pmb f_* | \mathcal{D}, {\bf X}_*) &= \mathcal{N}(\pmb f_* | \pmb \mu_{* | X}, \pmb \Sigma_{*|X}) \\ \pmb \mu_{*|X} &= \pmb \mu_* + \textbf{K}^\intercal_{X,*} \textbf{K}^{-1}_{\sigma}(\pmb y - \pmb \mu_X) \\ \pmb \Sigma_{*|X} &= \textbf{K}_{*,*} - \textbf{K}^\intercal_{X,*} \textbf{K}^{-1}_\sigma \textbf{K}_{X,*} \end{align*} \] Quite generally, the posterior mean at a single test point can be written as \[ \mu_{*|X} = \sum_{n=1}^N \alpha_n \mathcal{K}(\pmb x_*, \pmb x_n), \] which is identical to the prediction from kernel ridge regression.
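A minimal NumPy sketch of these formulas (a zero prior mean \(\pmb \mu_X = \pmb \mu_* = 0\), the toy sine data, and the hyperparameter values are my own assumptions):

```python
import numpy as np

def rbf_kernel(XA, XB, ell=0.3):  # same helper as above
    sq_dists = np.sum((XA[:, None, :] - XB[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * ell ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(20, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=20)
X_star = np.linspace(0, 1, 100)[:, None]
sigma_y = 0.1

K_sigma = rbf_kernel(X, X) + sigma_y ** 2 * np.eye(len(X))  # K_{X,X} + sigma_y^2 I
K_Xs = rbf_kernel(X, X_star)                                # K_{X,*}
K_ss = rbf_kernel(X_star, X_star)                           # K_{*,*}

mu_post = K_Xs.T @ np.linalg.solve(K_sigma, y)               # posterior mean (zero prior mean)
Sigma_post = K_ss - K_Xs.T @ np.linalg.solve(K_sigma, K_Xs)  # posterior covariance
std_post = np.sqrt(np.diag(Sigma_post) + sigma_y ** 2)       # predictive error bars
```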

Incorporation of error bars

Support vector machines

Motivation

Kernel-ridge regression

Recall the predictive output \[ f(\pmb x; \pmb w) = \sum_{n=1}^N \alpha_n \mathcal{K}(\pmb x_n, \pmb x) \] where \(\pmb \alpha = ({\bf K} + \lambda \mathbb{I}_N)^{-1}\pmb y\).

What about classification?

Large margin principle

Left
a separating hyperplane with large margin
Right
a separating hyperplane with small margin

Large margin classifiers

Example: Binary classifier
Consider a binary classifier (labels -1 and +1) of the form \(h(\pmb x) = \text{sign}(f(\pmb x))\) where the decision boundary is given by the linear function \[ f(\pmb x) = \pmb w^\intercal \pmb x + w_0 \] Find the line (or hyperplane) that yields maximum margin.
Distance of a point to the decision boundary
Compute the orthogonal projection \(\pmb x_\perp\) of the point \(\pmb x\) onto the decision boundary, \[ \pmb x = \pmb x_\perp + r \frac {\pmb w}{\Vert \pmb w \Vert}, \] where \(r\) is the signed distance of \(\pmb x\) from the boundary. Since \(f(\pmb x_\perp) = 0\), it follows that \(r = f(\pmb x) / \Vert \pmb w \Vert\).

Large margin classifiers

Strategy
Maximize the distance \(r\) between each point and its projection onto the decision boundary. This yields the objective \[ \underset{\pmb w, w_0}{\max} \frac 1{\Vert \pmb w \Vert} \overset{N}{\underset{n=1}{\min}} \left[ \tilde y_n (\pmb w^\intercal \pmb x_n + w_0) \right], \] where \(\tilde y_n \in \{-1, 1\}\). The expression maximizes the distance of the closest point to the boundary.

Classification example: Moon data

Hyperparameters
\(\gamma\) controls the RBF bandwidth; \(C\) controls the soft-margin penalty (an inverse regularization strength)
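A minimal sketch of such a classifier (scikit-learn and the particular values of \(\gamma\) and \(C\) are my own choices, not prescribed by the slides):

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two-moons toy data
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# RBF-kernel SVM; gamma sets the bandwidth, C the soft-margin penalty
clf = SVC(kernel="rbf", gamma=2.0, C=1.0).fit(X, y)

print(clf.score(X, y))                # training accuracy
print(clf.support_vectors_.shape[0])  # only these points enter the prediction
```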

Summary

Summary

Kernel methods
Nonparametric methods do not assume a fixed parametric form; instead, they rely on similarity to the training points
Kernel trick
Input features are implicitly compared in a higher-dimensional (Hilbert) feature space
Gaussian process regression
Yields a full posterior distribution; extends kernel ridge regression with predictive error bars
Support vector machines
Classification following the large-margin principle (and also regression); predictions depend on only a subset of the data points (the support vectors)

References

Murphy, Kevin P. 2022. Probabilistic Machine Learning: An Introduction. MIT press.