Computational Statistics & Data Analysis (MVComp2)

Lecture 10: Classification

Tristan Bereau

Institute for Theoretical Physics, Heidelberg University

Introduction

Literature

  • Murphy (2022)
    • Chapter 16: KNN classification
    • Chapter 10: binary/multinomial logistic regression, linear/nonlinear classification
    • Chapter 9: discriminative/generative classifiers, Gaussian discriminant analysis, LDA vs logistic regression

Recap from last time

Regression splines
provide a local approximation to the function. The basis functions enter as a linear combination of adjustable parameters, so optimization remains convex
B-spline basis functions
Piecewise polynomials of degree \(D\), with continuous derivatives up to order \(D-1\) at the knots. Cubic (\(D=3\)) often used.
Neural networks
composition of linear transformations interleaved with differentiable (nonlinear) activation functions
Differentiability
key property to ensure efficient learning. Enabled/optimized by means of automatic differentiation (chain rule).

Classification

Visual illustration

Goal
Identify which class an object belongs to.
Figure
\(H_1\) and \(H_2\) both correctly separate the black and white dots. \(H_2\) is considered better because its boundary lies further from both groups. \(H_3\) fails to separate them.

\(K\)-nearest neighbors (KNN) classification

KNN classification

\(K\)-nearest neighbors

  • One of the simplest classifiers!
  • For a new input \(\pmb{x}\):
    • Find the \(K\) closest examples to \(\pmb{x}\) in the training set, \(N_K(\pmb{x}, \mathcal{D})\), and look at their labels
    • Derive a distribution over the outputs for the local region around \(\pmb{x}\): \[ p(y = c | \pmb{x}, \mathcal{D}) = \frac 1K \sum_{n \in N_K(\pmb{x}, \mathcal{D})} \mathbb{I}(y_n = c) \]

Two main parameters:

  1. Size of the neighborhood, \(K\)
  2. Distance metric, \(d(\pmb{x}, \pmb{x}')\)
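
As a concrete illustration, here is a minimal NumPy sketch of the KNN rule above, using a Euclidean distance metric (the function name and toy data are illustrative, not from the lecture):

```python
import numpy as np

def knn_predict_proba(X_train, y_train, x_new, K=3, n_classes=2):
    """Estimate p(y = c | x_new, D) by counting labels among the K nearest training points."""
    # Euclidean distance d(x, x') between the query and every training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the K closest examples, N_K(x, D)
    neighbors = np.argsort(dists)[:K]
    # Empirical class distribution: (1/K) * sum_{n in N_K} I(y_n = c)
    counts = np.bincount(y_train[neighbors], minlength=n_classes)
    return counts / K

# Toy data: two classes in 2D
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict_proba(X_train, y_train, np.array([0.9, 1.0]), K=3))  # -> [0.333..., 0.666...]
```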

KNN classification

Should the green input dot be classified as blue or red?

Small radius (solid circle)
green dot surrounded by 2 red triangles and 1 blue square. Classify as red triangle
Large radius (dotted circle)
green dot surrounded by 2 red triangles and 3 blue squares. Classify as blue square

Curse of dimensionality

Note

KNN classifiers do not work well with high-dimensional inputs, due to the curse of dimensionality.

Volume of space grows exponentially fast with dimension, so nearest neighbor might be quite far away.

2 solutions:

  1. make assumptions about the form of the classifier function (i.e., parametric model)
  2. use a metric that only cares about a subset of the dimensions
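
A quick numerical illustration of the distance-concentration effect mentioned above (a sketch with arbitrary sample sizes): as the dimension grows, the nearest neighbor of a random query is barely closer than an average point.

```python
import numpy as np

rng = np.random.default_rng(0)
for D in [2, 10, 100, 1000]:
    X = rng.uniform(size=(1000, D))        # 1000 random points in the unit hypercube
    x = rng.uniform(size=D)                # a random query point
    d = np.linalg.norm(X - x, axis=1)      # distances from the query to every point
    # As D grows, the nearest point is barely closer than the average point:
    print(f"D={D:5d}  min/mean distance = {d.min() / d.mean():.3f}")
```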

Logistic regression

Logistic regression

Logistic regression

Discriminative classification model \(p(y | \pmb{x}; \pmb{\theta})\), where

  • \(\pmb{x} \in \mathbb{R}^D\) is a fixed-dimensional input vector
  • \(y \in \{ 1, \dots, C \}\) is the class label
    • \(C=2\) for binary logistic regression
    • \(C>2\) for multiclass logistic regression
  • \(\pmb{\theta}\) are the model parameters

Logistic regression

Binary logistic regression

Binary logistic regression

\[ p(y | \pmb{x}, \pmb{\theta}) = \text{Ber}(y | \sigma(\pmb{w}^\intercal \pmb{x} + b)) \] where \(\sigma\) is the sigmoid (logistic) function defined in lecture 9 (Neural networks). In other words, \[ p(y=1 | \pmb{x}, \pmb{\theta}) = \sigma(a) = \frac 1{1 + {\rm e}^{-a}}, \] where \[ a = \pmb{w}^\intercal \pmb{x} + b = \log \left( \frac p{1-p} \right), \quad p \equiv p(y=1 | \pmb{x}, \pmb{\theta}). \] \(a\) is called the logit or pre-activation.
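
A minimal sketch of this model (the weights, bias, and input below are illustrative), checking numerically that the pre-activation \(a\) equals the log-odds \(\log(p/(1-p))\):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.array([2.0, -1.0])       # illustrative weights
b = 0.5                         # illustrative bias
x = np.array([1.0, 3.0])        # illustrative input

a = w @ x + b                   # logit (pre-activation)
p = sigmoid(a)                  # p(y = 1 | x, theta)
print(a, np.log(p / (1 - p)))   # both print -0.5: the logit is the log-odds
```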

Linear classifier

  • Sigmoid: probability that the class label is \(y=1\).
  • Optimal decision: predict \(y=1\) iff class 1 is more likely than class 0

\[ \begin{align*} f(\pmb{x}) &= \mathbb{I}\left[p(y=1| \pmb{x}) > p(y=0 | \pmb{x})\right] \\ &= \mathbb{I} \left[ \log \frac{p(y=1 | \pmb{x})}{p(y=0 | \pmb{x})} > 0 \right] \\ &= \mathbb{I}(a > 0) \end{align*} \]

but remember that \(a = \pmb{w}^\intercal \pmb{x} + b\). We can thus write the decision rule in terms of the linear function \[ f(\pmb{x}; \pmb{\theta}) = b + \pmb{w}^\intercal \pmb{x} = b + \sum_{d=1}^D w_d x_d, \] where \(\pmb{w}^\intercal \pmb{x} = \langle \pmb w, \pmb x \rangle\) is the inner product between the weight and feature vectors. The decision boundary \(f(\pmb{x}; \pmb{\theta}) = 0\) defines a linear hyperplane, with normal vector \(\pmb w \in \mathbb{R}^D\) and an offset \(b \in \mathbb{R}\) from the origin.
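
A tiny sketch of this decision rule for an illustrative weight vector and bias (thresholding \(a\) at 0 is equivalent to thresholding \(\sigma(a)\) at \(1/2\)):

```python
import numpy as np

w = np.array([1.0, 1.0])   # normal vector of the hyperplane (illustrative)
b = -1.0                   # offset from the origin (illustrative)

def predict(x):
    # Predict y = 1 iff a = w^T x + b > 0, which is equivalent to sigma(a) > 1/2
    return int(w @ x + b > 0)

print(predict(np.array([0.2, 0.3])))   # a = -0.5 < 0 -> class 0
print(predict(np.array([0.8, 0.9])))   # a = +0.7 > 0 -> class 1
```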

Maximum likelihood estimation

Absorb the bias term \(b\) into the weight vector \(\pmb{w}\) \[ \begin{align*} \text{NLL}(\pmb{w}) &= - \frac{1}{N} \log p(\mathcal{D} | \pmb{w}) = -\frac{1}{N} \log \prod_{n=1}^N \text{Ber}(y_n | \mu_n) \\ &= -\frac 1N \sum_{n=1}^N [y_n \log \mu_n + (1-y_n)\log(1-\mu_n)] \\ &= \frac 1N \sum_{n=1}^N \mathbb{H}(y_n, \mu_n) \end{align*} \] where \(\mu_n = \sigma(\pmb{w}^\intercal\pmb{x}_n)\) and \(\mathbb{H}\) is the binary cross entropy.

MLE: Optimizing the objective

Finding the MLE corresponds to solving \[ \nabla_{\pmb{w}} \text{NLL}(\pmb{w}) = 0. \] To start, note that for \(\mu_n = \sigma(a_n)\) and \(a_n = \pmb{w}^\intercal\pmb{x}_n\), \[ \frac{\text{d}\mu_n}{\text{d}a_n} = \frac{\text{d}\sigma(a_n)}{\text{d}a_n} = \frac{\text{d}}{\text{d}a_n}\left( \frac 1{1 + \text{e}^{-a_n}}\right) = \sigma(a_n)\left(1-\sigma(a_n)\right). \]
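
A quick finite-difference check of the sigmoid-derivative identity (purely illustrative values):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a, eps = 0.7, 1e-6
numeric = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)   # central finite difference
analytic = sigmoid(a) * (1 - sigmoid(a))                      # sigma(a) (1 - sigma(a))
print(numeric, analytic)                                      # agree to ~1e-10
```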

Thus, by the chain rule \[ \frac{\partial}{\partial w_d} \mu_n = \frac{\partial}{\partial w_d} \sigma(\pmb{w}^\intercal\pmb{x}_n) = \frac{\partial}{\partial a_n} \sigma(a_n) \frac{\partial a_n}{\partial w_d} = \mu_n(1 - \mu_n) x_{nd}. \]

Since the bias is absorbed into \(\pmb{w}\), we end up with \[ \begin{align*} \nabla_{\pmb{w}} \log(\mu_n) &= \frac 1{\mu_n} \nabla_{\pmb{w}} \mu_n = (1-\mu_n)\pmb{x}_n \\ \nabla_{\pmb{w}} \log(1-\mu_n) &= \frac{-\mu_n (1-\mu_n)\pmb{x}_n}{1 - \mu_n} = -\mu_n \pmb{x}_n. \end{align*} \]

Combining all the relevant terms we get \[ \begin{align*} \nabla_{\pmb{w}} \text{NLL}(\pmb{w}) &= -\frac 1N \sum_{n=1}^N [y_n \nabla_{\pmb{w}}\log \mu_n + (1-y_n)\nabla_{\pmb{w}}\log(1-\mu_n)] \\ &= -\frac 1N \sum_{n=1}^N [y_n (1 - \mu_n)\pmb{x}_n - (1-y_n) \mu_n \pmb{x}_n] \\ &= -\frac 1N \sum_{n=1}^N [y_n\pmb{x}_n - y_n\pmb{x}_n\mu_n - \pmb{x}_n\mu_n + y_n\pmb{x}_n\mu_n] \\ &= \frac 1N \sum_{n=1}^N (\mu_n - y_n)\pmb{x}_n, \end{align*} \] where we can interpret \(e_n = \mu_n - y_n\) as an error signal. The gradient weights each input \(\pmb{x}_n\) by its error, and then averages the result.
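
A minimal sketch of this gradient and a few steps of plain gradient descent on the NLL (toy data, illustrative names; the bias is absorbed into \(\pmb w\) via a constant feature column):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nll_and_grad(w, X, y):
    """NLL and its gradient for binary logistic regression (bias absorbed into w)."""
    mu = sigmoid(X @ w)                              # mu_n = sigma(w^T x_n)
    nll = -np.mean(y * np.log(mu) + (1 - y) * np.log(1 - mu))
    grad = X.T @ (mu - y) / len(y)                   # (1/N) sum_n (mu_n - y_n) x_n
    return nll, grad

# Toy data: the first column of ones plays the role of the absorbed bias
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
w_true = np.array([0.5, 2.0, -1.0])
y = (rng.uniform(size=200) < sigmoid(X @ w_true)).astype(float)

w = np.zeros(3)
for _ in range(1000):                  # plain gradient descent on the NLL
    _, grad = nll_and_grad(w, X, y)
    w -= 0.5 * grad
print(w)                               # roughly recovers w_true
```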

MLE: Hessian

We have all we need to compute the Hessian \[ \pmb{H}(\pmb{w}) = \nabla_{\pmb{w}}\nabla_{\pmb{w}}^\intercal \text{NLL}(\pmb{w}) = \frac 1N \sum_{n=1}^N \mu_n (1 - \mu_n) \pmb{x}_n\pmb{x}_n^\intercal = \frac 1N \pmb{X}^\intercal\pmb{S}\pmb{X}, \] where \(\pmb{S} = \text{diag}\left(\mu_1(1-\mu_1), \dots, \mu_N(1-\mu_N)\right)\). The Hessian is positive definite, because \[ \pmb{v}^\intercal\pmb{X}^\intercal\pmb{S}\pmb{X}\pmb{v} = \Vert \pmb{S}^{1/2}\pmb{X}\pmb{v} \Vert_2^2 > 0 \text{ for any } \pmb{v} \neq \pmb{0} \] (assuming \(\pmb{X}\) has full column rank).
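
Because the gradient and Hessian are both available in closed form, the MLE can also be found with Newton's method (iteratively reweighted least squares). A minimal sketch, assuming the bias is absorbed into \(\pmb X\) and the classes are not perfectly separable:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logreg_newton(X, y, n_iter=10):
    """Newton's method for the logistic-regression NLL (bias absorbed into X)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = sigmoid(X @ w)
        grad = X.T @ (mu - y) / len(y)        # (1/N) X^T (mu - y)
        S = np.diag(mu * (1 - mu))            # S_nn = mu_n (1 - mu_n)
        H = X.T @ S @ X / len(y)              # (1/N) X^T S X
        w -= np.linalg.solve(H, grad)         # Newton update
    return w
```

On toy data such as in the previous sketch, a handful of Newton iterations typically reach essentially the same solution as many hundreds of gradient-descent steps.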

Nonlinear classifier

Same trick as for nonlinear regression
preprocess the inputs in a suitable way, i.e., transform the input feature vector using \(\pmb \phi(\pmb x)\)

Example

  • Use \(\pmb \phi(x_1, x_2) = [1, x_1^2, x_2^2]\) and let \(\pmb w = [-R^2, 1, 1]\).
  • Then \(\pmb{w}^\intercal \pmb \phi(\pmb{x}) = x_1^2 + x_2^2 - R^2\), so the decision boundary defines a circle with radius \(R\) (checked in the sketch below).
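
A minimal sketch of this example (the test points are illustrative), confirming the circular decision boundary:

```python
import numpy as np

R = 2.0
w = np.array([-R**2, 1.0, 1.0])            # w = [-R^2, 1, 1] from the example

def phi(x1, x2):
    return np.array([1.0, x1**2, x2**2])   # phi(x1, x2) = [1, x1^2, x2^2]

def predict(x1, x2):
    # w^T phi(x) = x1^2 + x2^2 - R^2: positive outside the circle of radius R
    return int(w @ phi(x1, x2) > 0)

print(predict(1.0, 1.0))   # inside the circle  -> 0
print(predict(3.0, 0.0))   # outside the circle -> 1
```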

Even more expressive
learn the parameters of the feature extractor, \(\pmb \phi(\pmb x)\)

Multinomial logistic regression

Straightforward generalization of binary logistic regression to \(C>2\) categories: \[ p(y=c | \pmb x, \pmb \theta) = \frac {{\rm e}^{a_c}}{\sum_{c'=1}^C {\rm e}^{a_{c'}}} \] where \(\pmb a = {\bf W}\pmb x\) is the \(C\)-dimensional vector of logits (the softmax function). Since the probabilities must sum to one, one weight vector is redundant, so we can set \(\pmb w_C = 0\).
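
A minimal sketch of the softmax posterior (the weight matrix and input are illustrative):

```python
import numpy as np

def softmax(a):
    a = a - a.max()            # shift for numerical stability (does not change the result)
    e = np.exp(a)
    return e / e.sum()

W = np.array([[1.0, -0.5],     # one row of weights per class (C = 3, D = 2)
              [0.2,  0.8],
              [0.0,  0.0]])    # last class pinned to w_C = 0
x = np.array([1.5, -1.0])

a = W @ x                      # C-dimensional vector of logits
print(softmax(a))              # p(y = c | x, theta); the entries sum to 1
```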

Linear discriminant analysis

Discriminative vs generative classifier

Discriminative classifier
Directly model the class posterior \(p(y | \pmb x, \pmb \theta)\)
Generative classifier
Classification model that specifies a way to generate the features \(\pmb x\) for each class \(c\): \[ p(y = c | \pmb x, \pmb \theta) = \frac {p(\pmb x | y = c, \pmb \theta) p(y = c| \pmb \theta)} {\sum_{c'} p(\pmb x | y = c', \pmb \theta) p(y = c' | \pmb \theta)} \]

Linear discriminant analysis (LDA)

\[ p(y = c | \pmb x, \pmb \theta) = \frac {p(\pmb x | y = c, \pmb \theta) p(y = c| \pmb \theta)} {\sum_{c'} p(\pmb x | y = c', \pmb \theta) p(y = c' | \pmb \theta)} \] Special case of a generative classifier where the log posterior over classes is a linear function of \(\pmb x\), up to a normalization term shared by all classes: \[ \log p(y = c | \pmb x, \pmb \theta) = \pmb w_c^\intercal \pmb x + \text{const} \]

Gaussian discriminant analysis

Consider multivariate Gaussians as class conditional densities \[ p(\pmb x | y = c, \pmb \theta) = \mathcal{N}(\pmb x | \pmb \mu_c, {\bf \Sigma}_c) \] which leads to the class posterior \[ p(y = c | \pmb x, \pmb \theta) \propto \pi_c \mathcal{N}(\pmb x | \pmb \mu_c, {\bf \Sigma}_c) \] where \(\pi_c = p(y = c | \pmb \theta)\) is the prior probability of label \(c\).

Quadratic decision boundaries

The log posterior over class labels \[ \begin{align*} \log p(y=c | \pmb{x}, \pmb{\theta}) =& \log \pi_c - \frac 12 \log \vert 2\pi {\bf \Sigma}_c \vert \\ &- \frac 12 (\pmb{x} - \pmb{\mu}_c)^\intercal {\bf \Sigma}_c^{-1}(\pmb{x}-\pmb{\mu}_c) + \text{const} \end{align*} \] is called the discriminant function. The decision boundary between any two classes is quadratic in \(\pmb{x}\). This is known as quadratic discriminant analysis (QDA).
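
A minimal sketch of the resulting class posterior for two illustrative class-conditional Gaussians and priors (all numerical values are made up for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative class-conditional Gaussians and priors for C = 2 classes
mus    = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigmas = [np.eye(2), np.array([[2.0, 0.3], [0.3, 0.5]])]
pis    = [0.6, 0.4]

def posterior(x):
    # p(y = c | x, theta) is proportional to pi_c * N(x | mu_c, Sigma_c)
    joint = np.array([pi * multivariate_normal.pdf(x, mu, Sig)
                      for pi, mu, Sig in zip(pis, mus, Sigmas)])
    return joint / joint.sum()

print(posterior(np.array([1.0, 1.0])))   # posterior probabilities of the two classes
```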

Linear decision boundaries

Start from the log posterior of a Gaussian discriminant analysis model \[ \begin{align*} \log p(y = c | \pmb x, \pmb \theta) =& \log \pi_c - \frac 12 \log \vert 2\pi {\bf \Sigma}_c \vert \\ &- \frac 12 (\pmb x - \pmb \mu_c)^\intercal {\bf \Sigma}_c^{-1} (\pmb x - \pmb \mu_c) + \text{const} \end{align*} \] Now assume that the covariance matrices are tied or shared across classes, \({\bf \Sigma}_c = {\bf \Sigma}\): \[ \begin{align*} \log p(y = c | \pmb x, \pmb \theta) =& \underbrace{\log \pi_c - \frac 12 \pmb \mu_c^\intercal{\bf \Sigma}^{-1} \pmb \mu_c}_{\gamma_c} + \pmb x^\intercal \underbrace{{\bf \Sigma}^{-1} \pmb \mu_c}_{\beta_c} \\ & + \underbrace{\text{const} - \frac 12 \log \vert 2\pi {\bf \Sigma} \vert - \frac 12 \pmb x^\intercal {\bf \Sigma}^{-1} \pmb x}_\kappa \\ =& \gamma_c + \pmb x^\intercal \beta_c + \kappa \end{align*} \] Importantly, the \(\kappa\) term is independent of \(c\), and thus an irrelevant additive constant that can be dropped. So the discriminant function is a linear function of \(\pmb x\). Hence the name LDA.
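
A minimal sketch of the linear discriminant \(\gamma_c + \pmb x^\intercal \beta_c\) for a shared covariance (the class parameters below are illustrative):

```python
import numpy as np

# Shared covariance and illustrative class parameters (C = 2)
Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])
mus   = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
pis   = [0.5, 0.5]
Sigma_inv = np.linalg.inv(Sigma)

# Linear discriminant: gamma_c + x^T beta_c
betas  = [Sigma_inv @ mu for mu in mus]
gammas = [np.log(pi) - 0.5 * mu @ Sigma_inv @ mu for pi, mu in zip(pis, mus)]

def predict(x):
    scores = [g + x @ b for g, b in zip(gammas, betas)]
    return int(np.argmax(scores))

print(predict(np.array([0.2, 0.1])))   # closer to class 0 -> 0
print(predict(np.array([1.8, 0.9])))   # closer to class 1 -> 1
```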

Naive Bayes

Naive Bayes assumption

  • Features are conditionally independent given the class label.
  • Simple generative approach, with \(O(CD)\) parameters, for \(C\) classes and \(D\) features.

Use a class-conditional density of the form \[ p(\pmb{x} | y=c, \pmb{\theta}) = \prod_{d=1}^D p(x_d | y=c, \pmb{\theta}_{dc}) \] where \(\pmb{\theta}_{dc}\) are the parameters of the class-conditional density for class \(c\) and feature \(d\). The posterior over class labels yields \[ p(y=c| \pmb{x}, \pmb{\theta}) = \frac {p(y = c| \pmb \pi) \prod_{d=1}^D p(x_d | y = c, \pmb \theta_{dc})} {\sum_{c'} p(y = c' | \pmb \pi) \prod_{d=1}^D p(x_d | y = c', \pmb \theta_{dc'})} \] where \(\pi_c\) is the prior probability of class \(c\) and \(\pmb{\theta} = (\pmb \pi, \{ \pmb{\theta}_{dc}\})\) are all the parameters. This is a naive Bayes classifier.

Examples of Naive Bayes classifiers

  • For binary features, \(x_d \in \{0, 1\}\), use the Bernoulli distribution \(p(\pmb{x} | y = c, \pmb \theta) = \prod_{d=1}^D \text{Ber}(x_d | \theta_{dc})\), where \(\theta_{dc}\) is the probability that \(x_d =1\) in class \(c\). Called multivariate Bernoulli naive Bayes. Does surprisingly well for the MNIST dataset.
  • For real-valued features, \(x_d \in \mathbb{R}\), we can use the univariate Gaussian distribution, \(p(\pmb{x} | y = c, \pmb \theta) = \prod_{d=1}^D \mathcal{N}(x_d | \mu_{dc}, \sigma_{dc}^2)\), where \(\mu_{dc}\) and \(\sigma_{dc}^2\) are the mean and variance of feature \(d\) when the class label is \(c\). This is equivalent to Gaussian discriminant analysis using diagonal covariance matrices (see the sketch below).
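
A minimal sketch of the Gaussian naive Bayes case, fitting per-class priors and per-class/per-feature means and variances on made-up data (all names and values are illustrative):

```python
import numpy as np

def fit_gaussian_nb(X, y, n_classes):
    """Per-class priors pi_c, and per-class/per-feature means mu_dc and variances sigma_dc^2."""
    pis   = np.array([np.mean(y == c) for c in range(n_classes)])
    mus   = np.array([X[y == c].mean(axis=0) for c in range(n_classes)])
    vars_ = np.array([X[y == c].var(axis=0) + 1e-9 for c in range(n_classes)])
    return pis, mus, vars_

def predict_proba(x, pis, mus, vars_):
    # log pi_c + sum_d log N(x_d | mu_dc, sigma_dc^2), then normalize
    log_joint = np.log(pis) - 0.5 * np.sum(
        np.log(2 * np.pi * vars_) + (x - mus) ** 2 / vars_, axis=1)
    log_joint -= log_joint.max()          # for numerical stability
    p = np.exp(log_joint)
    return p / p.sum()

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(3, 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)
pis, mus, vars_ = fit_gaussian_nb(X, y, n_classes=2)
print(predict_proba(np.array([2.5, 2.5]), pis, mus, vars_))
```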

Summary

Summary

\(K\)-nearest neighbors (KNN)
One of the simplest and most intuitive classifiers. Great in low dimension.
Curse of dimensionality
In high dimensions, even the nearest neighbors tend to be far away, making it difficult to use methods that rely on local densities. Instead: can use parametric models
Logistic regression
Binary case: Bernoulli distribution with sigmoid (logistic) function. Function defines a linear hyperplane that separates the two classes.
Linear discriminant analysis
Generative classifier (can generate features \(\pmb x\) for each class), with posterior that is a linear function of \(\pmb x\)

References

Murphy, Kevin P. 2022. Probabilistic Machine Learning: An Introduction. MIT Press.