Computational Statistics & Data Analysis (MVComp2)

Lecture 10: Classification

Tristan Bereau

Institute for Theoretical Physics, Heidelberg University

Introduction

Literature

  • Murphy (2022)
    • Chapter 16: KNN classification
    • Chapter 10: binary/multinomial logistic regression, linear/nonlinear classification
    • Chapter 9: discriminative/generative classifiers, Gaussian discriminant analysis, LDA vs logistic regression

Recap from last time

Regression splines
provide a local approximation to the function. The basis functions enter as a linear combination of adjustable parameters, so optimization remains convex
B-spline basis functions
Piecewise polynomials of degree \(D\), with continuous derivatives up to order \(D-1\) at the knots. Cubic (\(D=3\)) often used.
Neural networks
composition of linear transformations interleaved with differentiable (nonlinear) activation functions
Differentiability
key property to ensure efficient learning. Enabled/optimized by means of automatic differentiation (chain rule).

Classification

Visual illustration

Goal
Identify which class an object belongs to.
Figure
\(H_1\) and \(H_2\) both correctly separate the black and white dots. \(H_2\) is considered better because its boundary lies further from both groups. \(H_3\) fails to separate them.

\(K\)-nearest neighbors (KNN) classification

KNN classification

\(K\)-nearest neighbors

  • One of the simplest classifiers!
  • For a new input \(\pmb{x}\):
    • Find the \(K\) closest examples to \(\pmb{x}\) in the training set, \(N_K(\pmb{x}, \mathcal{D})\), and look at their labels
    • Derive a distribution over the outputs for the local region around \(\pmb{x}\): \[ p(y = c | \pmb{x}, \mathcal{D}) = \frac 1K \sum_{n \in N_K(\pmb{x}, \mathcal{D})} \mathbb{I}(y_n = c) \]

Two main parameters:

  1. Size of the neighborhood, \(K\)
  2. Distance metric, \(d(\pmb{x}, \pmb{x}')\)
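
As a concrete illustration, here is a minimal NumPy sketch of the KNN rule above, using a Euclidean distance metric (the function name and toy data are illustrative, not from the lecture):

```python
import numpy as np

def knn_predict_proba(X_train, y_train, x_new, K=3, n_classes=2):
    """Estimate p(y = c | x_new, D) by counting labels among the K nearest training points."""
    # Euclidean distance d(x, x') between the query and every training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the K closest examples, N_K(x, D)
    neighbors = np.argsort(dists)[:K]
    # Empirical class distribution: (1/K) * sum_{n in N_K} I(y_n = c)
    counts = np.bincount(y_train[neighbors], minlength=n_classes)
    return counts / K

# Toy data: two classes in 2D
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict_proba(X_train, y_train, np.array([0.9, 1.0]), K=3))  # -> [0.333..., 0.666...]
```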

KNN classification

Should the green input dot be classified as blue or red?

Small radius (solid circle)
green dot surrounded by 2 red triangles and 1 blue square. Classify as red triangle
Large radius (dotted circle)
green dot surrounded by 2 red triangles and 3 blue squares. Classify as blue square

Curse of dimensionality

Note

KNN classifiers do not work well with high-dimensional inputs, due to the curse of dimensionality.

Volume of space grows exponentially fast with dimension, so nearest neighbor might be quite far away.

2 solutions:

  1. make assumptions about the form of the classifier function (i.e., parametric model)
  2. use a metric that only cares about a subset of the dimensions
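
A quick numerical illustration of the distance-concentration effect mentioned above (a sketch with arbitrary sample sizes): as the dimension grows, the nearest neighbor of a random query is barely closer than an average point.

```python
import numpy as np

rng = np.random.default_rng(0)
for D in [2, 10, 100, 1000]:
    X = rng.uniform(size=(1000, D))        # 1000 random points in the unit hypercube
    x = rng.uniform(size=D)                # a random query point
    d = np.linalg.norm(X - x, axis=1)      # distances from the query to every point
    # As D grows, the nearest point is barely closer than the average point:
    print(f"D={D:5d}  min/mean distance = {d.min() / d.mean():.3f}")
```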

Logistic regression

Logistic regression

Logistic regression

Discriminative classification model \(p(y | \pmb{x}; \pmb{\theta})\), where

  • \(\pmb{x} \in \mathbb{R}^D\) is a fixed-dimensional input vector
  • \(y \in \{ 1, \dots, C \}\) is the class label
    • \(C=2\) for binary logistic regression
    • \(C>2\) for multiclass logistic regression
  • \(\pmb{\theta}\) are the model parameters

Logistic regression

Binary logistic regression

Binary logistic regression

\[ p(y | \pmb{x}, \pmb{\theta}) = \text{Ber}(y | \sigma(\pmb{w}^\intercal \pmb{x} + b)) \] where \(\sigma\) is the sigmoid (logistic) function defined in lecture 9 (Neural networks). In other words, \[ p(y=1 | \pmb{x}, \pmb{\theta}) = \sigma(a) = \frac 1{1 + {\rm e}^{-a}}, \] where \[ a = \pmb{w}^\intercal \pmb{x} + b = \log \left( \frac p{1-p} \right), \quad p \equiv p(y=1 | \pmb{x}, \pmb{\theta}). \] \(a\) is called the logit or pre-activation.
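
A minimal sketch of this model (the weights, bias, and input below are illustrative), checking numerically that the pre-activation \(a\) equals the log-odds \(\log(p/(1-p))\):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.array([2.0, -1.0])       # illustrative weights
b = 0.5                         # illustrative bias
x = np.array([1.0, 3.0])        # illustrative input

a = w @ x + b                   # logit (pre-activation)
p = sigmoid(a)                  # p(y = 1 | x, theta)
print(a, np.log(p / (1 - p)))   # both print -0.5: the logit is the log-odds
```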

Linear classifier

  • Sigmoid: probability that the class label is \(y=1\).
  • Optimal decision: predict \(y=1\) iff class 1 is more likely than class 0

\[ \begin{align*} f(\pmb{x}) &= \mathbb{I}\left[p(y=1| \pmb{x}) > p(y=0 | \pmb{x})\right] \\ &= \mathbb{I} \left[ \log \frac{p(y=1 | \pmb{x})}{p(y=0 | \pmb{x})} > 0 \right] \\ &= \mathbb{I}(a > 0) \end{align*} \]

but remember that \(a = \pmb{w}^\intercal \pmb{x} + b\). We can thus write the decision rule in terms of the linear function \[ f(\pmb{x}; \pmb{\theta}) = b + \pmb{w}^\intercal \pmb{x} = b + \sum_{d=1}^D w_d x_d, \] where \(\pmb{w}^\intercal \pmb{x} = \langle \pmb w, \pmb x \rangle\) is the inner product between the weight and feature vectors. The decision boundary \(f(\pmb{x}; \pmb{\theta}) = 0\) defines a linear hyperplane, with normal vector \(\pmb w \in \mathbb{R}^D\) and an offset \(b \in \mathbb{R}\) from the origin.
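
A tiny sketch of this decision rule for an illustrative weight vector and bias (thresholding \(a\) at 0 is equivalent to thresholding \(\sigma(a)\) at \(1/2\)):

```python
import numpy as np

w = np.array([1.0, 1.0])   # normal vector of the hyperplane (illustrative)
b = -1.0                   # offset from the origin (illustrative)

def predict(x):
    # Predict y = 1 iff a = w^T x + b > 0, which is equivalent to sigma(a) > 1/2
    return int(w @ x + b > 0)

print(predict(np.array([0.2, 0.3])))   # a = -0.5 < 0 -> class 0
print(predict(np.array([0.8, 0.9])))   # a = +0.7 > 0 -> class 1
```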

Maximum likelihood estimation

Absorb the bias term \(b\) into the weight vector \(\pmb{w}\) \[ \begin{align*} \text{NLL}(\pmb{w}) &= - \frac{1}{N} \log p(\mathcal{D} | \pmb{w}) = -\frac{1}{N} \log \prod_{n=1}^N \text{Ber}(y_n | \mu_n) \\ &= -\frac 1N \sum_{n=1}^N [y_n \log \mu_n + (1-y_n)\log(1-\mu_n)] \\ &= \frac 1N \sum_{n=1}^N \mathbb{H}(y_n, \mu_n) \end{align*} \] where \(\mu_n = \sigma(\pmb{w}^\intercal\pmb{x}_n)\) and \(\mathbb{H}\) is the binary cross entropy.

MLE: Optimizing the objective

Finding the MLE corresponds to solving \[ \nabla_{\pmb{w}} \text{NLL}(\pmb{w}) = 0. \] To start, note that for \(\mu_n = \sigma(a_n)\) and \(a_n = \pmb{w}^\intercal\pmb{x}_n\), \[ \frac{\text{d}\mu_n}{\text{d}a_n} = \frac{\text{d}\sigma(a_n)}{\text{d}a_n} = \frac{\text{d}}{\text{d}a_n}\left( \frac 1{1 + \text{e}^{-a_n}}\right) = \sigma(a_n)\left(1-\sigma(a_n)\right). \]
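
A quick finite-difference check of the sigmoid-derivative identity (purely illustrative values):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a, eps = 0.7, 1e-6
numeric = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)   # central finite difference
analytic = sigmoid(a) * (1 - sigmoid(a))                      # sigma(a) (1 - sigma(a))
print(numeric, analytic)                                      # agree to ~1e-10
```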

Thus, by the chain rule \[ \frac{\partial}{\partial w_d} \mu_n = \frac{\partial}{\partial w_d} \sigma(\pmb{w}^\intercal\pmb{x}_n) = \frac{\partial}{\partial a_n} \sigma(a_n) \frac{\partial a_n}{\partial w_d} = \mu_n(1 - \mu_n) x_{nd}. \]

Since the bias is absorbed into \(\pmb{w}\), we end up with \[ \begin{align*} \nabla_{\pmb{w}} \log(\mu_n) &= \frac 1{\mu_n} \nabla_{\pmb{w}} \mu_n = (1-\mu_n)\pmb{x}_n \\ \nabla_{\pmb{w}} \log(1-\mu_n) &= \frac{-\mu_n (1-\mu_n)\pmb{x}_n}{1 - \mu_n} = -\mu_n \pmb{x}_n. \end{align*} \]

Combining all the relevant terms we get \[ \begin{align*} \nabla_{\pmb{w}} \text{NLL}(\pmb{w}) &= -\frac 1N \sum_{n=1}^N [y_n \nabla_{\pmb{w}}\log \mu_n + (1-y_n)\nabla_{\pmb{w}}\log(1-\mu_n)] \\ &= -\frac 1N \sum_{n=1}^N [y_n (1 - \mu_n)\pmb{x}_n - (1-y_n) \mu_n \pmb{x}_n] \\ &= -\frac 1N \sum_{n=1}^N [y_n\pmb{x}_n - y_n\pmb{x}_n\mu_n - \pmb{x}_n\mu_n + y_n\pmb{x}_n\mu_n] \\ &= \frac 1N \sum_{n=1}^N (\mu_n - y_n)\pmb{x}_n, \end{align*} \] where we can interpret \(e_n = \mu_n - y_n\) as an error signal. The gradient weights each input \(\pmb{x}_n\) by its error, and then averages the result.
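
A minimal sketch of this gradient and a few steps of plain gradient descent on the NLL (toy data, illustrative names; the bias is absorbed into \(\pmb w\) via a constant feature column):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nll_and_grad(w, X, y):
    """NLL and its gradient for binary logistic regression (bias absorbed into w)."""
    mu = sigmoid(X @ w)                              # mu_n = sigma(w^T x_n)
    nll = -np.mean(y * np.log(mu) + (1 - y) * np.log(1 - mu))
    grad = X.T @ (mu - y) / len(y)                   # (1/N) sum_n (mu_n - y_n) x_n
    return nll, grad

# Toy data: the first column of ones plays the role of the absorbed bias
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
w_true = np.array([0.5, 2.0, -1.0])
y = (rng.uniform(size=200) < sigmoid(X @ w_true)).astype(float)

w = np.zeros(3)
for _ in range(1000):                  # plain gradient descent on the NLL
    _, grad = nll_and_grad(w, X, y)
    w -= 0.5 * grad
print(w)                               # roughly recovers w_true
```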

MLE: Hessian

We have all we need to compute the Hessian \[ \pmb{H}(\pmb{w}) = \nabla_{\pmb{w}}\nabla_{\pmb{w}}^\intercal \text{NLL}(\pmb{w}) = \frac 1N \sum_{n=1}^N \mu_n (1 - \mu_n) \pmb{x}_n\pmb{x}_n^\intercal = \frac 1N \pmb{X}^\intercal\pmb{S}\pmb{X}, \] where \(\pmb{S} = \text{diag}\left(\mu_1(1-\mu_1), \dots, \mu_N(1-\mu_N)\right)\). The Hessian is positive definite, because \[ \pmb{v}^\intercal\pmb{X}^\intercal\pmb{S}\pmb{X}\pmb{v} = \Vert \pmb{S}^{1/2}\pmb{X}\pmb{v} \Vert_2^2 > 0 \text{ for any } \pmb{v} \neq \pmb{0} \] (assuming \(\pmb{X}\) has full column rank).
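
Because the gradient and Hessian are both available in closed form, the MLE can also be found with Newton's method (iteratively reweighted least squares). A minimal sketch, assuming the bias is absorbed into \(\pmb X\) and the classes are not perfectly separable:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logreg_newton(X, y, n_iter=10):
    """Newton's method for the logistic-regression NLL (bias absorbed into X)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = sigmoid(X @ w)
        grad = X.T @ (mu - y) / len(y)        # (1/N) X^T (mu - y)
        S = np.diag(mu * (1 - mu))            # S_nn = mu_n (1 - mu_n)
        H = X.T @ S @ X / len(y)              # (1/N) X^T S X
        w -= np.linalg.solve(H, grad)         # Newton update
    return w
```

On toy data such as in the previous sketch, a handful of Newton iterations typically reach essentially the same solution as many hundreds of gradient-descent steps.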

Nonlinear classifier

Same trick as for nonlinear regression
preprocess the inputs in a suitable way, i.e., transform the input feature vector using \(\pmb \phi(\pmb x)\)

Example

  • Use \(\pmb \phi(x_1, x_2) = [1, x_1^2, x_2^2]\) and let \(\pmb w = [-R^2, 1, 1]\).
  • Then \(\pmb{w}^\intercal \pmb \phi(\pmb{x}) = x_1^2 + x_2^2 - R^2\), so the decision boundary defines a circle with radius \(R\) (checked in the sketch below).
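
A minimal sketch of this example (the test points are illustrative), confirming the circular decision boundary:

```python
import numpy as np

R = 2.0
w = np.array([-R**2, 1.0, 1.0])            # w = [-R^2, 1, 1] from the example

def phi(x1, x2):
    return np.array([1.0, x1**2, x2**2])   # phi(x1, x2) = [1, x1^2, x2^2]

def predict(x1, x2):
    # w^T phi(x) = x1^2 + x2^2 - R^2: positive outside the circle of radius R
    return int(w @ phi(x1, x2) > 0)

print(predict(1.0, 1.0))   # inside the circle  -> 0
print(predict(3.0, 0.0))   # outside the circle -> 1
```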

Even more expressive
learn the parameters of the feature extractor, \(\pmb \phi(\pmb x)\)

Multinomial logistic regression

Straightforward generalization of binary logistic regression to \(C>2\) categories: \[ p(y=c | \pmb x, \pmb \theta) = \frac {{\rm e}^{a_c}}{\sum_{c'=1}^C {\rm e}^{a_{c'}}} \] where \(\pmb a = {\bf W}\pmb x\) is the \(C\)-dimensional vector of logits (the softmax function). Since the probabilities must sum to one, one weight vector is redundant, so we can set \(\pmb w_C = 0\).
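
A minimal sketch of the softmax posterior (the weight matrix and input are illustrative):

```python
import numpy as np

def softmax(a):
    a = a - a.max()            # shift for numerical stability (does not change the result)
    e = np.exp(a)
    return e / e.sum()

W = np.array([[1.0, -0.5],     # one row of weights per class (C = 3, D = 2)
              [0.2,  0.8],
              [0.0,  0.0]])    # last class pinned to w_C = 0
x = np.array([1.5, -1.0])

a = W @ x                      # C-dimensional vector of logits
print(softmax(a))              # p(y = c | x, theta); the entries sum to 1
```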

Linear discriminant analysis

Discriminative vs generative classifier

Discriminative classifier
Directly model the class posterior \(p(y | \pmb x, \pmb \theta)\)
Generative classifier
Classification model that specifies a way to generate the features \(\pmb x\) for each class \(c\): \[ p(y = c | \pmb x, \pmb \theta) = \frac {p(\pmb x | y = c, \pmb \theta) p(y = c| \pmb \theta)} {\sum_{c'} p(\pmb x | y = c', \pmb \theta) p(y = c' | \pmb \theta)} \]

Linear discriminant analysis (LDA)

\[ p(y = c | \pmb x, \pmb \theta) = \frac {p(\pmb x | y = c, \pmb \theta) p(y = c| \pmb \theta)} {\sum_{c'} p(\pmb x | y = c', \pmb \theta) p(y = c' | \pmb \theta)} \] Special case of a generative classifier where the log posterior over classes is a linear function of \(\pmb x\), up to a normalization term shared by all classes: \[ \log p(y = c | \pmb x, \pmb \theta) = \pmb w_c^\intercal \pmb x + \text{const} \]

Gaussian discriminant analysis

Consider multivariate Gaussians as class conditional densities \[ p(\pmb x | y = c, \pmb \theta) = \mathcal{N}(\pmb x | \pmb \mu_c, {\bf \Sigma}_c) \] which leads to the class posterior \[ p(y = c | \pmb x, \pmb \theta) \propto \pi_c \mathcal{N}(\pmb x | \pmb \mu_c, {\bf \Sigma}_c) \] where \(\pi_c = p(y = c | \pmb \theta)\) is the prior probability of label \(c\).

Quadratic decision boundaries

The log posterior over class labels \[ \begin{align*} \log p(y=c | \pmb{x}, \pmb{\theta}) =& \log \pi_c - \frac 12 \log \vert 2\pi {\bf \Sigma}_c \vert \\ &- \frac 12 (\pmb{x} - \pmb{\mu}_c)^\intercal {\bf \Sigma}_c^{-1}(\pmb{x}-\pmb{\mu}_c) + \text{const} \end{align*} \] is called the discriminant function. The decision boundary between any two classes is quadratic in \(\pmb{x}\). This is known as quadratic discriminant analysis (QDA).
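
A minimal sketch of the resulting class posterior for two illustrative class-conditional Gaussians and priors (all numerical values are made up for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative class-conditional Gaussians and priors for C = 2 classes
mus    = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigmas = [np.eye(2), np.array([[2.0, 0.3], [0.3, 0.5]])]
pis    = [0.6, 0.4]

def posterior(x):
    # p(y = c | x, theta) is proportional to pi_c * N(x | mu_c, Sigma_c)
    joint = np.array([pi * multivariate_normal.pdf(x, mu, Sig)
                      for pi, mu, Sig in zip(pis, mus, Sigmas)])
    return joint / joint.sum()

print(posterior(np.array([1.0, 1.0])))   # posterior probabilities of the two classes
```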

Linear decision boundaries

Start from the log posterior of a Gaussian discriminant analysis model \[ \begin{align*} \log p(y = c | \pmb x, \pmb \theta) =& \log \pi_c - \frac 12 \log \vert 2\pi {\bf \Sigma}_c \vert \\ &- \frac 12 (\pmb x - \pmb \mu_c)^\intercal {\bf \Sigma}_c^{-1} (\pmb x - \pmb \mu_c) + \text{const} \end{align*} \] Now assume that the covariance matrices are tied or shared across classes, \({\bf \Sigma}_c = {\bf \Sigma}\): \[ \begin{align*} \log p(y = c | \pmb x, \pmb \theta) =& \underbrace{\log \pi_c - \frac 12 \pmb \mu_c^\intercal{\bf \Sigma}^{-1} \pmb \mu_c}_{\gamma_c} + \pmb x^\intercal \underbrace{{\bf \Sigma}^{-1} \pmb \mu_c}_{\beta_c} \\ & + \underbrace{\text{const} - \frac 12 \log \vert 2\pi {\bf \Sigma} \vert - \frac 12 \pmb x^\intercal {\bf \Sigma}^{-1} \pmb x}_\kappa \\ =& \gamma_c + \pmb x^\intercal \beta_c + \kappa \end{align*} \] Importantly, the \(\kappa\) term is independent of \(c\), and thus an irrelevant additive constant that can be dropped. So the discriminant function is a linear function of \(\pmb x\). Hence the name LDA.
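
A minimal sketch of the linear discriminant \(\gamma_c + \pmb x^\intercal \beta_c\) for a shared covariance (the class parameters below are illustrative):

```python
import numpy as np

# Shared covariance and illustrative class parameters (C = 2)
Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])
mus   = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
pis   = [0.5, 0.5]
Sigma_inv = np.linalg.inv(Sigma)

# Linear discriminant: gamma_c + x^T beta_c
betas  = [Sigma_inv @ mu for mu in mus]
gammas = [np.log(pi) - 0.5 * mu @ Sigma_inv @ mu for pi, mu in zip(pis, mus)]

def predict(x):
    scores = [g + x @ b for g, b in zip(gammas, betas)]
    return int(np.argmax(scores))

print(predict(np.array([0.2, 0.1])))   # closer to class 0 -> 0
print(predict(np.array([1.8, 0.9])))   # closer to class 1 -> 1
```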

Naive Bayes

Naive Bayes assumption

  • Features are conditionally independent given the class label.
  • Simple generative approach, with \(O(CD)\) parameters, for \(C\) classes and \(D\) features.

Use a class-conditional density of the form \[ p(\pmb{x} | y=c, \pmb{\theta}) = \prod_{d=1}^D p(x_d | y=c, \pmb{\theta}_{dc}) \] where \(\pmb{\theta}_{dc}\) are the parameters of the class-conditional density for class \(c\) and feature \(d\). The posterior over class labels yields \[ p(y=c| \pmb{x}, \pmb{\theta}) = \frac {p(y = c| \pmb \pi) \prod_{d=1}^D p(x_d | y = c, \pmb \theta_{dc})} {\sum_{c'} p(y = c' | \pmb \pi) \prod_{d=1}^D p(x_d | y = c', \pmb \theta_{dc'})} \] where \(\pi_c\) is the prior probability of class \(c\) and \(\pmb{\theta} = (\pmb \pi, \{ \pmb{\theta}_{dc}\})\) are all the parameters. This is a naive Bayes classifier.

Examples of Naive Bayes classifiers

  • For binary features, \(x_d \in \{0, 1\}\), use the Bernoulli distribution \(p(\pmb{x} | y = c, \pmb \theta) = \prod_{d=1}^D \text{Ber}(x_d | \theta_{dc})\), where \(\theta_{dc}\) is the probability that \(x_d =1\) in class \(c\). Called multivariate Bernoulli naive Bayes. Does surprisingly well for the MNIST dataset.
  • For real-valued features, \(x_d \in \mathbb{R}\), we can use the univariate Gaussian distribution, \(p(\pmb{x} | y = c, \pmb \theta) = \prod_{d=1}^D \mathcal{N}(x_d | \mu_{dc}, \sigma_{dc}^2)\), where \(\mu_{dc}\) and \(\sigma_{dc}^2\) are the mean and variance of feature \(d\) when the class label is \(c\). This is equivalent to Gaussian discriminant analysis using diagonal covariance matrices (see the sketch below).
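
A minimal sketch of the Gaussian naive Bayes case, fitting per-class priors and per-class/per-feature means and variances on made-up data (all names and values are illustrative):

```python
import numpy as np

def fit_gaussian_nb(X, y, n_classes):
    """Per-class priors pi_c, and per-class/per-feature means mu_dc and variances sigma_dc^2."""
    pis   = np.array([np.mean(y == c) for c in range(n_classes)])
    mus   = np.array([X[y == c].mean(axis=0) for c in range(n_classes)])
    vars_ = np.array([X[y == c].var(axis=0) + 1e-9 for c in range(n_classes)])
    return pis, mus, vars_

def predict_proba(x, pis, mus, vars_):
    # log pi_c + sum_d log N(x_d | mu_dc, sigma_dc^2), then normalize
    log_joint = np.log(pis) - 0.5 * np.sum(
        np.log(2 * np.pi * vars_) + (x - mus) ** 2 / vars_, axis=1)
    log_joint -= log_joint.max()          # for numerical stability
    p = np.exp(log_joint)
    return p / p.sum()

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(3, 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)
pis, mus, vars_ = fit_gaussian_nb(X, y, n_classes=2)
print(predict_proba(np.array([2.5, 2.5]), pis, mus, vars_))
```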

Summary

Summary

\(K\)-nearest neighbors (KNN)
One of the simplest and most intuitive classifiers. Great in low dimension.
Curse of dimensionality
In high dimensions, even the nearest neighbors tend to be far away, making it difficult to use methods that rely on local densities. Instead: can use parametric models
Logistic regression
Binary case: Bernoulli distribution with sigmoid (logistic) function. Function defines a linear hyperplane that separates the two classes.
Linear discriminant analysis
Generative classifier (can generate features \(\pmb x\) for each class), with posterior that is a linear function of \(\pmb x\)

References

Murphy, Kevin P. 2022. Probabilistic Machine Learning: An Introduction. MIT Press.