Computational Statistics & Data Analysis (MVComp2)

Lecture 9: Nonlinear regression

Tristan Bereau

Institute for Theoretical Physics, Heidelberg University

Introduction

Literature

  • Chapter 11 from Murphy (2022): Basis expansion
  • Chapter 13 from Murphy (2022): Neural networks

Recap from last time

Generalization
Dangers of overfitting and underfitting are always present
Maximum-a-posteriori (MAP) estimate
Regularizes the negative log-likelihood by the log of the prior
Cross-validation
Common technique to choose regularization parameter
Regularization in linear regression
Ridge (\(\ell_2\)) and Lasso (\(\ell_1\)) constrain the magnitude and the number of nonzero weights, respectively

Basis expansions

Example: Polynomial fitting

\[ f(x; {\bf w}) = \sum_{d=0}^D w_d x^d = {\bf w}^\intercal[1, x, x^2, \dots, x^D] \]
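A minimal sketch of this basis expansion (assuming numpy and a hypothetical noisy dataset): the nonlinear features \([1, x, \dots, x^D]\) enter a model that is still linear in \(\pmb{w}\), so ordinary least squares gives the fit.

import numpy as np

# Hypothetical noisy 1D dataset
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(x.size)

D = 5                                        # polynomial degree
X = np.vander(x, D + 1, increasing=True)     # design matrix [1, x, x^2, ..., x^D]

# Linear in w: ordinary least squares
w, *_ = np.linalg.lstsq(X, y, rcond=None)
y_fit = X @ w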

Regression splines

  • Using polynomial basis functions creates a nonlinear mapping from input to output, even though the model remains linear in the parameters.
  • Problem with polynomials: they form a global approximation to the function.
  • More flexibility with a series of local approximations (i.e., local support)

Consider the 1D case with a set of basis functions: \[ f(x; \pmb{\theta}) = \sum_{i=1}^m w_i B_i(x) \]

B-spline basis functions

B-spline basis functions: Properties

  • Piecewise polynomial of degree \(D\), where the locations of the pieces are defined by knots \(t_1<\dots<t_m\).
  • The function is continuous and has continuous derivatives of orders \(1, \dots, D-1\).
  • Common to use cubic splines, i.e., \(D=3\): will have continuous first and second derivatives.

Fitting with splines
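A minimal sketch of spline fitting (assuming scipy is available and using a hypothetical noisy dataset): assemble the cubic B-spline basis functions \(B_i(x)\) into a design matrix and solve a linear least-squares problem for the weights \(w_i\). Scipy's make_lsq_spline solves the same linear problem in a single call.

import numpy as np
from scipy.interpolate import BSpline

# Hypothetical noisy 1D dataset on [0, 1)
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 200))
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)

k = 3                                                   # cubic B-splines (D = 3)
t = np.r_[[0.0] * k, np.linspace(0, 1, 8), [1.0] * k]   # clamped knot vector
n_basis = len(t) - k - 1

# Design matrix: column i holds B_i(x); each basis function vanishes outside its support
B = np.column_stack([
    BSpline.basis_element(t[i:i + k + 2], extrapolate=False)(x)
    for i in range(n_basis)
])
B = np.nan_to_num(B)                                    # out-of-support values come back as NaN

# The model is linear in the weights w_i, so ordinary least squares applies
w, *_ = np.linalg.lstsq(B, y, rcond=None)
y_fit = B @ w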

Neural networks

Introduction

Combine a nonlinear feature transformation with linearity in the parameters: \[ f(\pmb{x}; \pmb{\theta}) = {\bf W}\phi(\pmb{x}) + \pmb{b} \]

Endow the feature extractor with its own parameters, \(\pmb{\theta}_2\): \[ f(\pmb{x}; \pmb{\theta}) = {\bf W}\phi(\pmb{x}; \pmb{\theta}_2) + \pmb{b} \] where \(\pmb{\theta} = (\pmb{\theta}_1, \pmb{\theta}_2)\) and \(\pmb{\theta}_1 = ({\bf W}, \pmb{b})\).

Repeat this process recursively to create more and more complex functions. For \(L\) function compositions, we get \[ f(\pmb{x}; \pmb{\theta}) = f_L(f_{L-1}(\dots(f_1(\pmb{x}))\dots)) \] where \(f_l(\pmb{x}) = f(\pmb{x}; \pmb{\theta}_l)\) is the function at layer \(l\). This is a deep neural network (DNN).

  • DNNs map input to output by means of a directed acyclic graph.
  • Perceptron: a model for an artificial neuron; the simplest neural network, the multilayer perceptron, stacks several of them.

Perceptron

Deterministic binary classifier \[ f(\pmb{x}_n ; \pmb{\theta}) = H(\pmb{w}^\intercal\pmb{x}_n + b) \] relies on the Heaviside step function, \(H\), which is not differentiable, so we cannot use gradient-based optimization.

Perceptron learning algorithm

Rosenblatt (1958) proposed a learning algorithm that updates the weights only when the model makes a prediction mistake \[ \pmb{w}_{t+1} = \pmb{w}_t - \eta_t (\hat y_n - y_n)\pmb{x}_n \]
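A minimal sketch of this update rule (assuming numpy, a fixed learning rate, and a hypothetical linearly separable dataset; the bias is updated analogously by treating it as a weight on a constant input):

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linearly separable data with labels y in {0, 1}
X = rng.normal(size=(100, 2))
y = (X @ np.array([1.0, -2.0]) + 0.5 > 0).astype(float)

w = np.zeros(2)
b = 0.0
eta = 0.1                                    # learning rate

for epoch in range(20):
    for x_n, y_n in zip(X, y):
        y_hat = float(w @ x_n + b > 0)       # Heaviside prediction
        # Weights change only when the prediction is wrong (y_hat != y_n)
        w -= eta * (y_hat - y_n) * x_n
        b -= eta * (y_hat - y_n)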

Multilayer perceptron (MLP): The XOR problem

Learn a function that computes the exclusive OR of its two binary inputs. The XOR problem is not linearly separable.

\(x_1\) \(x_2\) \(y\)
0 0 0
0 1 1
1 0 1
1 1 0

Stacking multiple perceptrons yields a multilayer perceptron (MLP).

First hidden unit

The first hidden unit computes \(h_1 = x_1 \wedge x_2\) (“AND”) with appropriate weights and bias. \(h_1\) will fire iff \(x_1\) and \(x_2\) are both on, since \[ \pmb{w}_1^\intercal \pmb{x} - b_1 = \begin{bmatrix} 1.0 & 1.0 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix} - 1.5 = 0.5 > 0 \]

Second hidden unit

Similarly, the second unit computes \(h_2 = x_1 \vee x_2\) (“OR”): with bias \(-0.5\), i.e., a pre-activation of \(x_1 + x_2 - 0.5\), it fires as soon as at least one input is on.

Output

The output computes \(y = \overline{h_1} \wedge h_2\) (“AND” of the negated first unit with the second), i.e., it fires iff \(h_2\) is on and \(h_1\) is off. This is equivalent to \[ y = f(x_1, x_2) = \overline{(x_1 \wedge x_2)} \wedge (x_1 \vee x_2), \] which is exactly XOR.
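A small sketch of this construction with threshold units (assuming numpy; the AND and OR units use the weights and biases quoted above, while the output unit's weights \([-1, 1]\) and bias \(0.5\) are one choice that implements \(\overline{h_1} \wedge h_2\)):

import numpy as np

def heaviside(a):
    # Threshold unit: fires when the pre-activation is positive
    return (a > 0).astype(float)

def xor_mlp(x):
    h1 = heaviside(x @ np.array([1.0, 1.0]) - 1.5)     # AND unit: x1 + x2 - 1.5
    h2 = heaviside(x @ np.array([1.0, 1.0]) - 0.5)     # OR unit:  x1 + x2 - 0.5
    h = np.stack([h1, h2], axis=-1)
    # Output unit: fires when h2 is on and h1 is off, i.e. NOT(h1) AND h2
    return heaviside(h @ np.array([-1.0, 1.0]) - 0.5)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
print(xor_mlp(X))    # [0. 1. 1. 0.]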

Differentiability

The non-differentiable Heaviside function makes the original MLP difficult to train.

Let’s replace the Heaviside function with a differentiable activation function, \(\varphi : \mathbb{R} \to \mathbb{R}\): \[ \pmb{z}_l = f_l(\pmb{z}_{l-1}) = \varphi_l({\bf W}_l\pmb{z}_{l-1} + \pmb{b}_l) \] or, in scalar form, \[ z_{kl} = \varphi_l \left( \sum_{j} w_{lkj} z_{j,l-1} + b_{kl} \right) \]

Backpropagation
We can now compose \(L\) of these functions together, and compute the gradient of the output wrt the parameters in each layer using the chain rule.

Activation functions

We are free to use any kind of differentiable activation function at each layer. However, with a linear activation function, \(\varphi_l(a) = c_l a\), the whole network reduces to a regular linear model! It is therefore important to use nonlinear activation functions.

Early days: sigmoid (logistic) function

\[ \sigma(a) = \frac 1{1+ {\rm e}^{-a}} \] Its gradient is close to zero in the saturated regimes (large negative and large positive inputs), so gradient signals from higher layers cannot propagate back to earlier layers. This leads to the vanishing gradient problem.

Rectified linear unit (ReLU)

Training very deep models requires non-saturating activation functions \[ \text{ReLU}(a) = \max(a, 0) = a H(a) \] ReLU “turns off” negative inputs and passes positive inputs unchanged.
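A small sketch (using PyTorch autograd) contrasting the two activations: the sigmoid's gradient vanishes for large \(|a|\), while ReLU passes a unit gradient for any positive input.

import torch

a = torch.tensor([-10.0, -1.0, 1.0, 10.0], requires_grad=True)

torch.sigmoid(a).sum().backward()
print(a.grad)        # ~[0.0000, 0.1966, 0.1966, 0.0000]: saturates at large |a|

a.grad = None        # reset the accumulated gradient before the next backward pass
torch.relu(a).sum().backward()
print(a.grad)        # [0., 0., 1., 1.]: unit gradient for every positive input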

Example: classification of 2d data into 2 categories

https://playground.tensorflow.org/

The decision is binary: use a Bernoulli distribution with a learnable parameter \[ \begin{aligned} p(y | \pmb{x}; \pmb{\theta}) &= \text{Ber}(y | \sigma(a_3)) \\ a_3 &= \pmb{w}_3^\intercal \pmb{z}_2 + b_3 \\ \pmb{z}_2 &= \varphi(\pmb{W}_2 \pmb{z}_1 + \pmb{b}_2) \\ \pmb{z}_1 &= \varphi(\pmb{W}_1 \pmb{x} + \pmb{b}_1) \end{aligned} \] which uses the sigmoid function for the activations.
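A minimal PyTorch sketch of this model (the hidden-layer widths are an assumption; the slides do not fix them). BCEWithLogitsLoss applies the sigmoid \(\sigma(a_3)\) internally, so minimizing it corresponds to the Bernoulli negative log-likelihood above.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(2, 8),    # z1 = sigmoid(W1 x + b1)
    nn.Sigmoid(),
    nn.Linear(8, 8),    # z2 = sigmoid(W2 z1 + b2)
    nn.Sigmoid(),
    nn.Linear(8, 1),    # a3 = w3^T z2 + b3 (logit of the Bernoulli parameter)
)
loss_fn = nn.BCEWithLogitsLoss()            # -log Ber(y | sigmoid(a3))

# Hypothetical batch of 2d inputs with binary labels
x = torch.randn(32, 2)
y = (x[:, :1] * x[:, 1:] > 0).float()       # toy XOR-like labels
loss = loss_fn(model(x), y)
loss.backward()                             # gradients for all parameters via backprop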

The importance of depth

Universal function approximator
Possible to show that an MLP with one hidden layer is a universal function approximator, i.e., it can model any suitably smooth function, given enough hidden units, to any desired level of accuracy.
Intuition
Each hidden unit can specify a half plane. With enough of these you can “carve up” any region of space.
In practice
Deep networks do perform better than shallow ones: they can leverage the features learned by earlier layers (see the XOR example).

Connections with biology

  • Output axon of the left neuron makes a synaptic connection with the dendrites of the cell on the right
  • Communication: electrical charges in the form of ion flow
  • “Neuron firing”: activation exceeds threshold

Connections with biology

Artificial neural networks as model of the brain?

  • Brains do not use backpropagation (there is no way to send information backwards along an axon). Instead, they use local update rules to adjust synaptic strength
  • Most ANNs are strictly feedforward, but real brains have many feedback connections. Feedback is believed to act like a prior. Combine with likelihoods from the sensory system to compute a posterior over hidden states of the world, used for optimal decision making
  • Biological neurons have complex dendritic tree structures, with complex spatio-temporal dynamics
  • ANNs specialize to one task; a brain is composed of multiple specialized interacting modules.
  • Power efficiency: the brain runs on roughly 20 W, far less than the hardware used to train and run large ANNs.

Backpropagation

Consider a mapping, \(\pmb{o} = \pmb{f}(\pmb{x})\), where \(\pmb{x} \in \mathbb{R}^n\) and \(\pmb{o} \in \mathbb{R}^m\). Assume \(\pmb{f}\) is defined as a composition of functions \[ \pmb{f} = \pmb{f}_4 \circ \pmb{f}_3 \circ \pmb{f}_2 \circ \pmb{f}_1. \]

Let’s compute the Jacobian \({\bf J}_{\pmb{f}}(\pmb{x}) = \frac{\partial \pmb{o}}{\partial \pmb{x}} \in \mathbb{R}^{m\times n}\) using the chain rule: \[ \frac{\partial \pmb{o}}{\partial \pmb{x}} = \frac{\partial \pmb{f}_4(\pmb{x}_4)}{\partial \pmb{x}_4} \frac{\partial \pmb{f}_3(\pmb{x}_3)}{\partial \pmb{x}_3} \frac{\partial \pmb{f}_2(\pmb{x}_2)}{\partial \pmb{x}_2} \frac{\partial \pmb{f}_1(\pmb{x})}{\partial \pmb{x}} \] where \(\pmb{x}_l\) denotes the input to \(\pmb{f}_l\), i.e., \(\pmb{x}_1 = \pmb{x}\) and \(\pmb{x}_{l+1} = \pmb{f}_l(\pmb{x}_l)\).

Computation graphs

Express any function as a graph, where each node is a differentiable function of all its inputs. The chain rule takes care of combining the node elements. This approach is often called differentiable programming.
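A short sketch (using torch.autograd.functional.jacobian and two hypothetical maps \(\pmb{f}_1, \pmb{f}_2\)) checking that the Jacobian of the composition equals the product of the per-layer Jacobians:

import torch
from torch.autograd.functional import jacobian

def f1(x):            # R^3 -> R^2
    return torch.stack([x[0] * x[1], x[1] + x[2]])

def f2(x2):           # R^2 -> R^2
    return torch.stack([torch.sin(x2[0]), x2[0] * x2[1]])

x = torch.tensor([1.0, 2.0, 3.0])
x2 = f1(x)

J_composed = jacobian(lambda v: f2(f1(v)), x)     # d o / d x, shape (2, 3)
J_chain = jacobian(f2, x2) @ jacobian(f1, x)      # product of per-layer Jacobians
print(torch.allclose(J_composed, J_chain))        # True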

Simple example in PyTorch

import torch
import torch.nn as nn
import torch.optim as optim

# Ensure reproducibility
torch.manual_seed(0)

# 1. Generate a toy dataset
# Let's generate some random data with shape (100, 1)
inputs = torch.rand(100, 1)
# Let's use a simple linear relation with added noise for targets
targets = 2.5 * inputs + 3 + torch.randn(100, 1) * 0.1

# 2. Define the neural network
class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(1, 10)  # First fully-connected layer
        self.fc2 = nn.Linear(10, 1)  # Second fully-connected layer
        self.relu = nn.ReLU()
        
    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)  # Activation function
        return self.fc2(x)

# Create an instance of the network
model = SimpleNN()

# 3. Define the loss and optimizer
criterion = nn.MSELoss()  # Mean squared error
optimizer = optim.SGD(model.parameters(), lr=0.01)

# 4. Train the network
num_epochs = 500
for epoch in range(num_epochs):
    # Forward pass
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    
    # Zero gradients, backward pass, optimizer step
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    # Print the loss every 50 epochs
    if (epoch + 1) % 100 == 0:
        print(f"Epoch [{epoch + 1}/{num_epochs}], Loss: {loss.item():.4f}")
Epoch [100/500], Loss: 0.0366
Epoch [200/500], Loss: 0.0167
Epoch [300/500], Loss: 0.0105
Epoch [400/500], Loss: 0.0086
Epoch [500/500], Loss: 0.0080

Summary

Regression splines
Provide local approximations to the function. Used with a linear combination of parameters, so that the optimization remains convex.
B-spline basis functions
Piecewise polynomials of degree \(D\) with continuous derivatives up to order \(D-1\). Cubic splines (\(D=3\)) are often used.
Neural networks
Composition of linear transformations connected by nonlinear, differentiable activation functions.
Differentiability
Key property to ensure efficient learning, enabled by automatic differentiation (chain rule).

References

Murphy, Kevin P. 2022. Probabilistic Machine Learning: An Introduction. MIT Press.
Rosenblatt, Frank. 1958. “The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain.” Psychological Review 65 (6): 386.