Lecture 10: Classification
Institute for Theoretical Physics, Heidelberg University
\(K\)-nearest neighbors
Two main parameters: the number of neighbors \(K\) and the distance metric used to define "nearest".
Example: should the green input point be classified as blue or red?
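As an illustration (not part of the lecture material), a minimal \(K\)-nearest-neighbors classifier in Python/NumPy might look as follows; `knn_predict` and the toy data are hypothetical:

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=5):
    """Classify x_query by majority vote among its k nearest training points."""
    # Squared Euclidean distances from the query to every training point
    dists = np.sum((X_train - x_query) ** 2, axis=1)
    # Indices of the k closest neighbors
    nearest = np.argsort(dists)[:k]
    # Majority vote over the neighbors' labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy data: two Gaussian clusters ("red" = 0, "blue" = 1) and a query point
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(3.0, 1.0, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
print(knn_predict(X, y, np.array([2.5, 2.5]), k=5))
```

The two main parameters enter through `k` and the choice of distance in `dists`.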
Note
KNN classifiers do not work well with high-dimensional inputs, due to the curse of dimensionality.
The volume of the input space grows exponentially with the dimension, so the nearest neighbor may be quite far away.
Two solutions:
Logistic regression
Discriminative classification model \(p(y | \pmb{x}; \pmb{\theta})\), where \(y\) is the class label, \(\pmb{x}\) is the input feature vector, and \(\pmb{\theta}\) are the model parameters.
Binary logistic regression
\[ p(y | \pmb{x}, \pmb{\theta}) = \text{Ber}(y | \sigma(\pmb{w}^\intercal \pmb{x} + b)) \] where \(\sigma\) is the sigmoid (logistic) function defined in lecture 9 (Neural networks). In other words, \[ p(y=1 | \pmb{x}, \pmb{\theta}) = \sigma(a) = \frac 1{1 + {\rm e}^{-a}}, \] where \[ a = \pmb{w}^\intercal \pmb{x} + b = \log \left( \frac p{1-p} \right), \qquad p \equiv p(y=1 | \pmb{x}, \pmb{\theta}). \] The quantity \(a\) is called the logit or pre-activation.
The most probable label is returned by the decision rule \[ \begin{align*} f(\pmb{x}) &= \mathbb{I}\left[p(y=1| \pmb{x}) > p(y=0 | \pmb{x})\right] \\ &= \mathbb{I} \left[ \log \frac{p(y=1 | \pmb{x})}{p(y=0 | \pmb{x})} > 0 \right] \\ &= \mathbb{I}(a > 0), \end{align*} \]
but remember that \(a = \pmb{w}^\intercal \pmb{x} + b\). We can thus write the decision function as \[ f(\pmb{x}; \pmb{\theta}) = b + \pmb{w}^\intercal \pmb{x} = b + \sum_{d=1}^D w_d x_d, \] where \(\pmb{w}^\intercal \pmb{x} = \langle \pmb w, \pmb x \rangle\) is the inner product between the weight and feature vectors. The decision boundary \(f(\pmb{x}; \pmb{\theta}) = 0\) defines a linear hyperplane, with normal vector \(\pmb w \in \mathbb{R}^D\) and an offset \(b \in \mathbb{R}\) from the origin.
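To make this concrete, here is a minimal sketch (not from the lecture; all names are made up) of the probability and decision functions:

```python
import numpy as np

def sigmoid(a):
    # Logistic function sigma(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

def predict_proba(X, w, b):
    # p(y=1 | x) = sigma(w^T x + b), evaluated row-wise for a design matrix X
    return sigmoid(X @ w + b)

def predict(X, w, b):
    # Decision rule f(x) = I(a > 0), i.e. threshold the logit at zero
    return (X @ w + b > 0).astype(int)

# Hypothetical weights for a 2D input
w = np.array([1.0, -2.0])
b = 0.5
X = np.array([[0.0, 0.0], [2.0, 1.0], [-1.0, 1.0]])
print(predict_proba(X, w, b), predict(X, w, b))
```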
Absorb the bias term \(b\) into the weight vector \(\pmb{w}\) by appending a constant \(1\) to each input \(\pmb{x}\). With \(\mu_n = \sigma(\pmb{w}^\intercal \pmb{x}_n)\), the negative log-likelihood is \[ \begin{align*} \text{NLL}(\pmb{w}) &= - \frac{1}{N} \log p(\mathcal{D} | \pmb{w}) = -\frac{1}{N} \log \prod_{n=1}^N \text{Ber}(y_n | \mu_n) \\ &= -\frac 1N \sum_{n=1}^N [y_n \log \mu_n + (1-y_n)\log(1-\mu_n)] \\ &= \frac 1N \sum_{n=1}^N \mathbb{H}(y_n, \mu_n), \end{align*} \] where \(\mathbb{H}(y, \mu) = -[y \log \mu + (1-y)\log(1-\mu)]\) is the binary cross entropy.
Finding the MLE corresponds to solving \[ \nabla_{\pmb{w}} \text{NLL}(\pmb{w}) = 0. \] To start, note that for \(\mu_n = \sigma(a_n)\) and \(a_n = \pmb{w}^\intercal\pmb{x}_n\), \[ \frac{\text{d}\mu_n}{\text{d}a_n} = \frac{\text{d}\sigma(a_n)}{\text{d}a_n} = \frac{\text{d}}{\text{d}a_n}\left( \frac 1{1 + \text{e}^{-a_n}}\right) = \sigma(a_n)\left(1-\sigma(a_n)\right). \]
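Spelling out the intermediate algebra of this derivative (a step added here for completeness): \[ \frac{\text{d}}{\text{d}a_n}\left(\frac 1{1+\text{e}^{-a_n}}\right) = \frac{\text{e}^{-a_n}}{\left(1+\text{e}^{-a_n}\right)^2} = \frac 1{1+\text{e}^{-a_n}} \left(1 - \frac 1{1+\text{e}^{-a_n}}\right) = \sigma(a_n)\left(1-\sigma(a_n)\right). \]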
Thus, by the chain rule \[ \frac{\partial}{\partial w_d} \mu_n = \frac{\partial}{\partial w_d} \sigma(\pmb{w}^\intercal\pmb{x}_n) = \frac{\partial}{\partial a_n} \sigma(a_n) \frac{\partial a_n}{\partial w_d} = \mu_n(1 - \mu_n) x_{nd}. \]
Using this result, we end up with \[ \begin{align*} \nabla_{\pmb{w}} \log(\mu_n) &= \frac 1{\mu_n} \nabla_{\pmb{w}} \mu_n = \frac{\mu_n(1-\mu_n)\pmb{x}_n}{\mu_n} = (1-\mu_n)\pmb{x}_n \\ \nabla_{\pmb{w}} \log(1-\mu_n) &= -\frac 1{1-\mu_n} \nabla_{\pmb{w}} \mu_n = \frac{-\mu_n (1-\mu_n)\pmb{x}_n}{1 - \mu_n} = -\mu_n \pmb{x}_n. \end{align*} \]
Combining all the relevant terms, we get \[ \begin{align*} \nabla_{\pmb{w}} \text{NLL}(\pmb{w}) &= -\frac 1N \sum_{n=1}^N [y_n \nabla_{\pmb{w}}\log \mu_n + (1-y_n)\nabla_{\pmb{w}}\log(1-\mu_n)] \\ &= -\frac 1N \sum_{n=1}^N [y_n (1 - \mu_n)\pmb{x}_n - (1-y_n) \mu_n \pmb{x}_n] \\ &= -\frac 1N \sum_{n=1}^N [y_n\pmb{x}_n - y_n\mu_n\pmb{x}_n - \mu_n\pmb{x}_n + y_n\mu_n\pmb{x}_n] \\ &= \frac 1N \sum_{n=1}^N (\mu_n - y_n)\pmb{x}_n, \end{align*} \] where we can interpret \(e_n = \mu_n - y_n\) as an error signal. The gradient weights each input \(\pmb{x}_n\) by its error and then averages the result.
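A possible NumPy implementation of the NLL and its gradient (a sketch; `nll_and_grad` and the toy data are hypothetical, and the bias is assumed to be absorbed as a column of ones):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nll_and_grad(w, X, y):
    """Negative log-likelihood of binary logistic regression and its gradient.

    X: (N, D) design matrix (bias absorbed as a column of ones),
    y: (N,) labels in {0, 1}, w: (D,) weights.
    """
    mu = sigmoid(X @ w)                      # mu_n = sigma(w^T x_n)
    eps = 1e-12                              # avoid log(0)
    nll = -np.mean(y * np.log(mu + eps) + (1 - y) * np.log(1 - mu + eps))
    grad = X.T @ (mu - y) / len(y)           # (1/N) sum_n (mu_n - y_n) x_n
    return nll, grad

# One gradient-descent step on a toy dataset (all names hypothetical)
rng = np.random.default_rng(1)
X = np.hstack([rng.normal(size=(100, 2)), np.ones((100, 1))])  # bias column
y = (X[:, 0] + X[:, 1] > 0).astype(float)
w = np.zeros(3)
nll, grad = nll_and_grad(w, X, y)
w -= 0.1 * grad
```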
We have all we need to compute the Hessian: \[ \pmb{H}(\pmb{w}) = \nabla_{\pmb{w}}\nabla_{\pmb{w}}^\intercal \text{NLL}(\pmb{w}) = \frac 1N \sum_{n=1}^N \mu_n (1 - \mu_n)\, \pmb{x}_n\pmb{x}_n^\intercal = \frac 1N \pmb{X}^\intercal\pmb{S}\pmb{X}, \] where \(\pmb{S} = \mathrm{diag}\left(\mu_1(1-\mu_1), \dots, \mu_N(1-\mu_N)\right)\). The Hessian is positive definite, because \[ \pmb{v}^\intercal\pmb{X}^\intercal\pmb{S}\pmb{X}\pmb{v} = \Vert \pmb{S}^{1/2}\pmb{X}\pmb{v} \Vert_2^2 > 0 \quad \text{for any } \pmb{v} \neq \pmb{0} \] (assuming \(\pmb{X}\) has full column rank). The NLL is therefore convex and has a unique global minimum.
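Since both the gradient and the Hessian are available in closed form, one could, for example, take Newton steps (an illustration going beyond the derivation above; `newton_step` is hypothetical and assumes \(\pmb X\) has full column rank):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def newton_step(w, X, y):
    """One Newton update w <- w - H^{-1} grad for binary logistic regression."""
    mu = sigmoid(X @ w)
    grad = X.T @ (mu - y) / len(y)        # (1/N) X^T (mu - y)
    S = np.diag(mu * (1 - mu))            # S = diag(mu_n (1 - mu_n))
    H = X.T @ S @ X / len(y)              # (1/N) X^T S X
    return w - np.linalg.solve(H, grad)
```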
Multinomial logistic regression
A straightforward generalization of binary logistic regression to \(C>2\) categories is \[ p(y=c | \pmb x, \pmb \theta) = \frac {{\rm e}^{a_c}}{\sum_{c'=1}^C {\rm e}^{a_{c'}}}, \] where \(\pmb a = {\bf W}\pmb x\) is the \(C\)-dimensional vector of logits. The normalization condition on the total probability allows us to set \(\pmb w_C = \pmb 0\) without loss of generality.
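A numerically stable implementation of these class probabilities might look like this (a sketch; the weights are hypothetical, with the last row of `W` fixed to zero as discussed above):

```python
import numpy as np

def softmax(a):
    # Numerically stable softmax: shift by the max before exponentiating
    a = a - np.max(a, axis=-1, keepdims=True)
    e = np.exp(a)
    return e / np.sum(e, axis=-1, keepdims=True)

def predict_proba(X, W):
    # a = W x for each row of X; returns p(y = c | x) for all C classes
    return softmax(X @ W.T)

# Hypothetical weights for C = 3 classes and D = 2 features
W = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])   # last row fixed to zero
X = np.array([[2.0, 1.0], [-1.0, 0.5]])
print(predict_proba(X, W))
```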
Generative classifiers
A generative classifier models the class-conditional densities and the class prior, and obtains the posterior via Bayes' rule: \[ p(y = c | \pmb x, \pmb \theta) = \frac {p(\pmb x | y = c, \pmb \theta) p(y = c| \pmb \theta)} {\sum_{c'} p(\pmb x | y = c', \pmb \theta) p(y = c' | \pmb \theta)}. \] An important special case is a generative classifier where the log posterior over classes is a linear function of \(\pmb x\), \[ \log p(y = c | \pmb x, \pmb \theta) = \pmb w^\intercal \pmb x + \text{const}. \]
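As a generic sketch of this Bayes-rule computation (names hypothetical; valid for any class-conditional model that provides log-likelihoods):

```python
import numpy as np

def class_posterior(log_lik, log_prior):
    """Posterior p(y = c | x) from per-class log-likelihoods and log-priors.

    log_lik: (C,) array of log p(x | y = c); log_prior: (C,) array of log p(y = c).
    """
    log_joint = log_lik + log_prior
    log_joint = log_joint - np.max(log_joint)   # stabilize before exponentiating
    joint = np.exp(log_joint)
    return joint / joint.sum()                  # normalize over classes
```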
Consider multivariate Gaussians as class conditional densities \[ p(\pmb x | y = c, \pmb \theta) = \mathcal{N}(\pmb x | \pmb \mu_c, {\bf \Sigma}_c) \] which leads to the class posterior \[ p(y = c | \pmb x, \pmb \theta) \propto \pi_c \mathcal{N}(\pmb x | \pmb \mu_c, {\bf \Sigma}_c) \] where \(\pi_c = p(y = c | \pmb \theta)\) is the prior probability of label \(c\).
The log posterior over class labels, \[ \begin{align*} \log p(y=c | \pmb{x}, \pmb{\theta}) =& \log \pi_c - \frac 12 \log \vert 2\pi {\bf \Sigma}_c \vert \\ &- \frac 12 (\pmb{x} - \pmb{\mu}_c)^\intercal {\bf \Sigma}_c^{-1}(\pmb{x}-\pmb{\mu}_c) + \text{const}, \end{align*} \] is called the discriminant function. The decision boundary between any two classes is quadratic in \(\pmb{x}\), so this model is known as quadratic discriminant analysis (QDA).
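A possible implementation of the quadratic discriminant scores using SciPy's Gaussian log-density (a sketch; `qda_log_posterior` and its arguments are hypothetical):

```python
import numpy as np
from scipy.stats import multivariate_normal

def qda_log_posterior(x, priors, means, covs):
    """Unnormalized log p(y = c | x) for Gaussian class-conditional densities.

    priors: list of pi_c, means: list of mu_c, covs: list of Sigma_c.
    Constants shared by all classes are irrelevant for the argmax.
    """
    scores = [np.log(pi) + multivariate_normal.logpdf(x, mean=m, cov=S)
              for pi, m, S in zip(priors, means, covs)]
    return np.array(scores)

# Predicted class = argmax over the discriminant scores, e.g.
# y_hat = np.argmax(qda_log_posterior(x, priors, means, covs))
```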
Start from the log posterior of a Gaussian discriminant analysis model, \[ \begin{align*} \log p(y = c | \pmb x, \pmb \theta) =& \log \pi_c - \frac 12 \log \vert 2\pi {\bf \Sigma}_c \vert \\ &- \frac 12 (\pmb x - \pmb \mu_c)^\intercal {\bf \Sigma}_c^{-1} (\pmb x - \pmb \mu_c), \end{align*} \] and now assume that the covariance matrices are tied or shared across classes, \({\bf \Sigma}_c = {\bf \Sigma}\). Expanding the quadratic form gives \[ \begin{align*} \log p(y = c | \pmb x, \pmb \theta) =& \underbrace{\log \pi_c - \frac 12 \pmb \mu_c^\intercal{\bf \Sigma}^{-1} \pmb \mu_c}_{\gamma_c} + \pmb x^\intercal \underbrace{{\bf \Sigma}^{-1} \pmb \mu_c}_{\beta_c} \\ & + \underbrace{- \frac 12 \log \vert 2\pi {\bf \Sigma} \vert - \frac 12 \pmb x^\intercal {\bf \Sigma}^{-1} \pmb x}_\kappa \\ =& \gamma_c + \pmb x^\intercal \beta_c + \kappa. \end{align*} \] Importantly, the \(\kappa\) term is independent of \(c\), and thus an irrelevant additive constant that can be dropped. The discriminant function is therefore a linear function of \(\pmb x\); this model is called linear discriminant analysis (LDA).
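The linear discriminant \(\gamma_c + \pmb x^\intercal \beta_c\) translates directly into code; this sketch (names hypothetical) drops the class-independent \(\kappa\) term:

```python
import numpy as np

def lda_scores(x, priors, means, Sigma):
    """Linear discriminant gamma_c + x^T beta_c with shared covariance Sigma.

    beta_c = Sigma^{-1} mu_c,  gamma_c = log pi_c - 0.5 mu_c^T Sigma^{-1} mu_c.
    """
    Sigma_inv = np.linalg.inv(Sigma)
    scores = []
    for pi_c, mu_c in zip(priors, means):
        beta_c = Sigma_inv @ mu_c
        gamma_c = np.log(pi_c) - 0.5 * mu_c @ Sigma_inv @ mu_c
        scores.append(gamma_c + x @ beta_c)
    return np.array(scores)   # predict with np.argmax(lda_scores(...))
```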
Naive Bayes assumption
Use a class conditional density of the form \[ p(\pmb{x} | y=c, \pmb{\theta}) = \prod_{d=1}^D p(x_d | y=c, \pmb{\theta}_{dc}), \] where \(\pmb{\theta}_{dc}\) are the parameters of the class conditional density for class \(c\) and feature \(d\). The posterior over class labels is then \[ p(y=c| \pmb{x}, \pmb{\theta}) = \frac {p(y = c| \pmb \pi) \prod_{d=1}^D p(x_d | y = c, \pmb \theta_{dc})} {\sum_{c'} p(y = c' | \pmb \pi) \prod_{d=1}^D p(x_d | y = c', \pmb \theta_{dc})}, \] where \(\pi_c\) is the prior probability of class \(c\) and \(\pmb{\theta} = (\pmb{\pi}, \{ \pmb{\theta}_{dc}\})\) are all the parameters. This is the naive Bayes classifier.
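As a concrete (hypothetical) instance, a Gaussian naive Bayes classifier, where each \(p(x_d | y=c, \pmb\theta_{dc})\) is a univariate normal, could be sketched as:

```python
import numpy as np
from scipy.stats import norm

def fit_gaussian_nb(X, y):
    """Estimate per-class priors and per-feature Gaussian parameters.

    Here theta_dc = (mean, std) of feature d within class c.
    """
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    stds = np.array([X[y == c].std(axis=0) + 1e-9 for c in classes])
    return classes, priors, means, stds

def predict_nb(x, classes, priors, means, stds):
    # log p(y=c) + sum_d log p(x_d | y=c); features treated as independent
    log_post = np.log(priors) + norm.logpdf(x, means, stds).sum(axis=1)
    return classes[np.argmax(log_post)]
```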