Computational Statistics & Data Analysis (MVComp2)

Lecture 11: Kernel methods

Tristan Bereau

Institute for Theoretical Physics, Heidelberg University

Course evaluation

Please complete by Jan 31, 2024.

https://uni-heidelberg.evasys.de/evasys/online.php?p=DTHA4

Introduction

Literature

Murphy (2022)

  • Chapter 17: Kernel methods (in the part on nonparametric models)

Recap from last time

\(K\)-nearest neighbors (KNN)
One of the simplest and most intuitive classifiers; works well in low dimensions.
Curse of dimensionality
In high dimensions, all points are far apart, so methods that rely on local densities become difficult to use. Instead, we can use parametric models.
Logistic regression
Binary case: Bernoulli distribution with the sigmoid (logistic) function; the decision boundary is a linear hyperplane that separates the two classes.
Linear discriminant analysis
Generative classifier (can generate features \(\pmb x\) for each class), with posterior that is a linear function of \(\pmb x\)

Feature transformation

Example: Concentric circles

  • Data in 2D is not linearly separable
  • But with the feature map \[ \psi\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right) = \begin{bmatrix} x_1 \\ x_2 \\ \sqrt{x_1^2 + x_2^2} \end{bmatrix} \] the problem becomes linearly separable (see the sketch below).
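A minimal numerical sketch of this idea (NumPy, scikit-learn, and the `LogisticRegression` classifier are my own choices, not part of the slides): lifting concentric-circle data with \(\psi\) lets a linear classifier separate the two classes.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression

# Two concentric circles: not linearly separable in the original 2D space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Feature map psi: append the radius sqrt(x1^2 + x2^2) as a third feature
radius = np.sqrt((X ** 2).sum(axis=1, keepdims=True))
X_lifted = np.hstack([X, radius])

print(LogisticRegression().fit(X, y).score(X, y))                # close to chance in 2D
print(LogisticRegression().fit(X_lifted, y).score(X_lifted, y))  # ~1.0: separable in 3D
```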

Kernel methods

Nonparametric methods

Nonparametric methods

So far, we have considered parametric methods for regression and classification.

Nonparametric methods
Do not assume a fixed parametric form; instead, estimate the prediction function itself (rather than its parameters).
Key idea
Observe the function value at a fixed set of \(N\) points, \(y_n = f(\pmb{x}_n)\), where \(f\) is the unknown function. Prediction for a new point: \(f(\pmb{x}^*)\) is a weighted combination of the \(\{ f(\pmb x_n)\}\).
Intuition
Method based on comparing “how similar” \(\pmb x^*\) is to each one of the \(N\) training points, \(\{ \pmb x_n\}\).
Disadvantage
We need to keep the entire training set \(\mathcal{D} = \{ (\pmb x_n, y_n)\}\) at inference time (i.e., we cannot compress \(\mathcal{D}\)).

Kernel ridge regression

Kernel function

Kernel function (Mercer kernel)

Encode similarity of two input vectors
If we know that \(\pmb x_i\) is similar to \(\pmb x_j\), then we can encourage the model to make the predicted outputs at the two locations (i.e., \(f(\pmb x_i)\) and \(f(\pmb x_j)\)) similar.
Mercer kernel (positive definite kernel)
Any symmetric function \(\mathcal{K}: \mathcal{X} \times \mathcal{X} \to \mathbb{R}^+\) such that \[ \sum_{i=1}^N\sum_{j=1}^N \mathcal{K}(\pmb x_i, \pmb x_j) c_i c_j \geq 0 \] for any set of \(N\) (unique) points \(\pmb x_i \in \mathcal{X}\) and any choice of numbers \(c_i \in \mathbb{R}\).

Gram matrix

Gram matrix

Equivalent condition (to Mercer kernel)
Given \(N\) datapoints, the Gram matrix is the following \(N \times N\) similarity matrix \[ {\bf K} = \begin{bmatrix} \mathcal{K}(\pmb x_1, \pmb x_1) & \dots & \mathcal{K}(\pmb x_1, \pmb x_N) \\ \vdots & \ddots & \vdots \\ \mathcal{K}(\pmb x_N, \pmb x_1) & \dots & \mathcal{K}(\pmb x_N, \pmb x_N) \end{bmatrix} \] \(\mathcal{K}\) is a Mercer kernel iff the Gram matrix is positive definite for any set of distinct inputs.
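As a quick numerical illustration (a NumPy sketch, not prescribed by the slides; the linear kernel \(\mathcal{K}(\pmb x, \pmb x') = \pmb x^\intercal \pmb x'\) is chosen only for simplicity), we can build the Gram matrix for a random set of points and check that no eigenvalue is negative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))   # 20 random points in 3D

# Gram matrix of the linear kernel K(x, x') = x^T x'
K = X @ X.T

# Mercer condition: sum_ij K_ij c_i c_j >= 0 for all c, i.e. no negative eigenvalues
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)   # True (up to numerical round-off)
```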

Mercer’s theorem

Eigendecomposition

If \({\bf K}\) is positive definite, then we can write the eigendecomposition \({\bf K} = {\bf U}^\intercal \pmb \Lambda {\bf U}\), where \(\pmb \Lambda\) is a diagonal matrix of eigenvalues, \(\lambda_i > 0\), and \({\bf U}\) contains the eigenvectors.

Matrix element

Consider element \((i, j)\) of \({\bf K}\): \[ k_{ij} = (\pmb \Lambda^{\frac 12} {\bf U}_{:i})^\intercal (\pmb \Lambda^{\frac 12} {\bf U}_{:j}) \] where \({\bf U}_{:i}\) is the \(i\)’th column of \({\bf U}\).

Inner product

If we now define \(\pmb \phi(\pmb x_i) = \pmb \Lambda^{\frac 12} {\bf U}_{:i}\), we can write \[ k_{ij} = \pmb \phi(\pmb x_i)^\intercal \pmb \phi(\pmb x_j) = \sum_m \phi_m(\pmb x_i) \phi_m(\pmb x_j) \]

Mercer’s theorem

The entries in the kernel matrix can be computed by performing an inner product of some feature vectors that are implicitly defined by the eigenvectors of the kernel matrix.
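This construction can be checked numerically. The following sketch (NumPy-based; the kernel choice and variable names are mine) eigendecomposes a Gram matrix, builds the implicit feature vectors \(\pmb \phi(\pmb x_i) = \pmb \Lambda^{\frac 12} {\bf U}_{:i}\), and verifies that their inner products reproduce the kernel entries.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))

# Gram matrix of a Gaussian kernel (any Mercer kernel would do)
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)

# NumPy convention: K = V @ diag(lam) @ V.T, so U = V.T and U_{:i} = V[i, :]
lam, V = np.linalg.eigh(K)
lam = np.clip(lam, 0.0, None)       # guard against tiny negative round-off
Phi = V * np.sqrt(lam)              # row i is phi(x_i) = Lambda^{1/2} U_{:i}

print(np.allclose(Phi @ Phi.T, K))  # True: k_ij = phi(x_i)^T phi(x_j)
```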

Kernel trick

Kernel trick

We never need to compute the feature representation \(\pmb \phi(\pmb x)\) explicitly: the kernel only requires the inner product \(\pmb \phi(\pmb x)^\intercal \pmb \phi(\pmb x')\).

In the following: linear regression in an (implicit!) high-dimensional Hilbert space.

Quadratic kernel

Quadratic kernel: \(\mathcal{K}(\pmb x, \pmb x') = \langle \pmb x, \pmb x' \rangle^2\)

In 2D, we have \[ \begin{align*} \mathcal{K}(\pmb x, \pmb x') &= (x_1 x_1' + x_2 x_2')^2 \\ &= x_1^2x_1'^2 + 2x_1 x_2 x_1' x_2' + x_2^2 x_2'^2 \end{align*} \]

We can write this as \(\mathcal{K}(\pmb x, \pmb x') = \pmb \phi(\pmb x)^\intercal \pmb \phi(\pmb x')\) if we define \(\pmb \phi(x_1, x_2) = [x_1^2, \sqrt 2 x_1 x_2, x_2^2] \in \mathbb{R}^3\).

We embed the 2D inputs \(\pmb x\) into a 3D feature space \(\pmb \phi(\pmb x)\).
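A quick numerical check of this identity (a sketch; NumPy is my own choice here):

```python
import numpy as np

def phi(x):
    """Explicit feature map for the 2D quadratic kernel."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

rng = np.random.default_rng(1)
x, xp = rng.normal(size=2), rng.normal(size=2)

# <x, x'>^2 equals the inner product of the explicit features
print(np.isclose(np.dot(x, xp) ** 2, phi(x) @ phi(xp)))  # True
```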

Gaussian (RBF) kernel

Gaussian kernel (or RBF kernel)
Widely used kernel for real-valued inputs \[ \mathcal{K}(\pmb x, \pmb x') = \exp\left(- \frac{\Vert \pmb x - \pmb x'\Vert^2}{2\ell^2} \right) \] where \(\ell\) is the length scale (or bandwidth) of the kernel, i.e., the distance over which we expect differences to matter.
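In code, the Gaussian kernel amounts to a few lines (a NumPy sketch; the helper name `rbf_kernel` is mine and is reused in the regression sketches below):

```python
import numpy as np

def rbf_kernel(XA, XB, ell=1.0):
    """Gram matrix K[i, j] = exp(-||XA[i] - XB[j]||^2 / (2 ell^2))."""
    sq_dists = np.sum((XA[:, None, :] - XB[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * ell ** 2))
```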

Kernel-ridge regression

Recall MAP estimate for ridge regression from Lecture 8: \[ \pmb{\hat w}_\text{map} = ({\bf X^\intercal X} + \lambda \mathbb{I}_D)^{-1} {\bf X^\intercal} \pmb y \]

Rewrite the equation using the matrix inversion lemma: \[ \pmb{w} = {\bf X^\intercal} \underbrace{({\bf XX^\intercal} + \lambda \mathbb{I}_N)^{-1}{\bf y}}_{\pmb \alpha} \] where \(\pmb \alpha\) is a vector of size \(N\). So the solution vector is simply a linear combination of the \(N\) training vectors: \[ \pmb{w} = {\bf X^\intercal}\pmb{\alpha} = \sum_{n=1}^N \alpha_n \pmb{x}_n. \] At test time, use this to compute the predictive mean \[ f(\pmb{x}; \pmb{w}) = \pmb{w}^\intercal \pmb{x} = \sum_{n=1}^N \alpha_n \pmb{x}_n^\intercal \pmb{x} \]

Use the kernel trick to rewrite the predictive output as \[ \boxed{ f(\pmb x; \pmb w) = \sum_{n=1}^N \alpha_n \mathcal{K}(\pmb x_n, \pmb x) } \] where \(\pmb \alpha = ({\bf K} + \lambda \mathbb{I}_N)^{-1}\pmb y\). Note that the solution vector \(\pmb \alpha\) is not sparse, i.e., predictions at test time take \(O(N)\) time.
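Putting the boxed formula to work, here is a minimal kernel ridge regression sketch (NumPy-based; the toy sine data, the length scale, and \(\lambda\) are arbitrary choices of mine, not from the slides):

```python
import numpy as np

def rbf_kernel(XA, XB, ell=0.3):  # same helper as above
    sq_dists = np.sum((XA[:, None, :] - XB[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * ell ** 2))

# Toy 1D regression data
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 1, size=(30, 1))
y_train = np.sin(2 * np.pi * X_train[:, 0]) + 0.1 * rng.normal(size=30)

# Fit: alpha = (K + lambda I)^{-1} y
lam = 0.1
K = rbf_kernel(X_train, X_train)
alpha = np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)

# Predict: f(x) = sum_n alpha_n K(x_n, x)  -- an O(N) sum over all training points
X_test = np.linspace(0, 1, 100)[:, None]
f_test = rbf_kernel(X_test, X_train) @ alpha
```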

Gaussian processes

Intuition

Sampled functions given 0, 1, 2, and 4 training points.

Model error bars!

Akin to KRR

Noisy observations
We observe a noisy version of the underlying function \[ y_n = f(\pmb x_n) + \epsilon_n, \qquad \epsilon_n \sim \mathcal{N}(0, \sigma_y^2) \]
Covariance
Covariance of the observed noisy responses is \[ \text{Cov}[y_i, y_j] = \text{Cov}[f_i, f_j] + \text{Cov}[\epsilon_i, \epsilon_j] = \mathcal{K}(\pmb x_i, \pmb x_j) + \sigma^2_y \delta_{ij} \] such that \[ \text{Cov}[\pmb y | {\bf X}] = {\bf K}_{X, X} + \sigma^2_y \mathbb{I}_N = {\bf K}_\sigma \]
Joint density
The joint density of the observed data and the noise-free function values at the test points is \[ \begin{pmatrix} \pmb y \\ \pmb f_* \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \pmb \mu_X \\ \pmb \mu_* \end{pmatrix}, \begin{pmatrix} {\bf K}_\sigma & {\bf K}_{X,*} \\ {\bf K}_{X,*}^\intercal & {\bf K}_{*,*} \end{pmatrix} \right) \] Conditioning yields the posterior predictive density at a set of test points \({\bf X}_*\): \[ \begin{align*} p(\pmb f_* | \mathcal{D}, {\bf X}_*) &= \mathcal{N}(\pmb f_* | \pmb \mu_{* | X}, \pmb \Sigma_{*|X}) \\ \pmb \mu_{*|X} &= \pmb \mu_* + \textbf{K}^\intercal_{X,*} \textbf{K}^{-1}_{\sigma}(\pmb y - \pmb \mu_X) \\ \pmb \Sigma_{*|X} &= \textbf{K}_{*,*} - \textbf{K}^\intercal_{X,*} \textbf{K}^{-1}_\sigma \textbf{K}_{X,*} \end{align*} \] Quite generally, the posterior mean at a single test point can be written as \[ \mu_{*|X} = \sum_{n=1}^N \alpha_n \mathcal{K}(\pmb x_*, \pmb x_n), \] which is identical to the prediction from kernel ridge regression.
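A minimal NumPy sketch of these formulas (a zero prior mean \(\pmb \mu_X = \pmb \mu_* = 0\), the toy sine data, and the hyperparameter values are my own assumptions):

```python
import numpy as np

def rbf_kernel(XA, XB, ell=0.3):  # same helper as above
    sq_dists = np.sum((XA[:, None, :] - XB[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * ell ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(20, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=20)
X_star = np.linspace(0, 1, 100)[:, None]
sigma_y = 0.1

K_sigma = rbf_kernel(X, X) + sigma_y ** 2 * np.eye(len(X))  # K_{X,X} + sigma_y^2 I
K_Xs = rbf_kernel(X, X_star)                                # K_{X,*}
K_ss = rbf_kernel(X_star, X_star)                           # K_{*,*}

mu_post = K_Xs.T @ np.linalg.solve(K_sigma, y)               # posterior mean (zero prior mean)
Sigma_post = K_ss - K_Xs.T @ np.linalg.solve(K_sigma, K_Xs)  # posterior covariance
std_post = np.sqrt(np.diag(Sigma_post) + sigma_y ** 2)       # predictive error bars
```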

Incorporation of error bars

Support vector machines

Motivation

Kernel-ridge regression

Recall the predictive output \[ f(\pmb x; \pmb w) = \sum_{n=1}^N \alpha_n \mathcal{K}(\pmb x_n, \pmb x) \] where \(\pmb \alpha = ({\bf K} + \lambda \mathbb{I}_N)^{-1}\pmb y\).

What about classification?

Large margin principle

Left
a separating hyperplane with large margin
Right
a separating hyperplane with small margin

Large margin classifiers

Example: Binary classifier
Consider a binary classifier (labels -1 and +1) of the form \(h(\pmb x) = \text{sign}(f(\pmb x))\) where the decision boundary is given by the linear function \[ f(\pmb x) = \pmb w^\intercal \pmb x + w_0 \] Find the line (or hyperplane) that yields maximum margin.
Distance of a point to the decision boundary
Compute the orthogonal projection \(\pmb x_\perp\) of the point \(\pmb x\) onto the decision boundary, \[ \pmb x = \pmb x_\perp + r \frac {\pmb w}{\Vert \pmb w \Vert}, \] where \(r\) is the signed distance of \(\pmb x\) from the boundary. Since \(f(\pmb x_\perp) = 0\), it follows that \(r = f(\pmb x) / \Vert \pmb w \Vert\).

Large margin classifiers

Strategy
Maximize the distance \(r\) between each point and its projection onto the decision boundary. This yields the objective \[ \underset{\pmb w, w_0}{\max} \frac 1{\Vert \pmb w \Vert} \overset{N}{\underset{n=1}{\min}} \left[ \tilde y_n (\pmb w^\intercal \pmb x_n + w_0) \right], \] where \(\tilde y_n \in \{-1, 1\}\). The expression maximizes the distance of the closest point to the boundary.

Classification example: Moon data

Hyperparameters
\(\gamma\) controls the RBF bandwidth; \(C\) controls the soft-margin penalty (an inverse regularization strength)
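A minimal sketch of such a classifier (scikit-learn and the particular values of \(\gamma\) and \(C\) are my own choices, not prescribed by the slides):

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two-moons toy data
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# RBF-kernel SVM; gamma sets the bandwidth, C the soft-margin penalty
clf = SVC(kernel="rbf", gamma=2.0, C=1.0).fit(X, y)

print(clf.score(X, y))                # training accuracy
print(clf.support_vectors_.shape[0])  # only these points enter the prediction
```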

Summary

Summary

Kernel methods
Nonparametric methods do not assume a fixed parametric form; instead, they rely on similarity to the training points
Kernel trick
Input features are implicitly compared in a higher-dimensional (Hilbert) feature space
Gaussian process regression
Yields a full posterior distribution; extends kernel ridge regression with predictive error bars
Support vector machines
Classification following the large-margin principle (and also regression); predictions depend on only a subset of the data points (the support vectors)

References

Murphy, Kevin P. 2022. Probabilistic Machine Learning: An Introduction. MIT press.