Computational Statistics & Data Analysis (MVComp2)

Lecture 9: Nonlinear regression

Tristan Bereau

Institute for Theoretical Physics, Heidelberg University

Introduction

Literature

  • Chapter 11 from Murphy (2022): Basis expansion
  • Chapter 13 from Murphy (2022): Neural networks

Recap from last time

Generalization
Dangers of overfitting and underfitting are always present
Maximum-a-posteriori (MAP) estimate
Regularizes the negative log-likelihood by the log of the prior
Cross-validation
Common technique to choose regularization parameter
Regularization in linear regression
Ridge (\(\ell_2\)) and Lasso (\(\ell_1\)) constrain the magnitude and the number of nonzero weights, respectively

Basis expansions

Example: Polynomial fitting

\[ f(x; {\bf w}) = \sum_{d=0}^D w_d x^d = {\bf w}^\intercal[1, x, x^2, \dots, x^D] \]
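A minimal sketch of this basis expansion (assuming numpy and a hypothetical noisy dataset): the nonlinear features \([1, x, \dots, x^D]\) enter a model that is still linear in \(\pmb{w}\), so ordinary least squares gives the fit.

import numpy as np

# Hypothetical noisy 1D dataset
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(x.size)

D = 5                                        # polynomial degree
X = np.vander(x, D + 1, increasing=True)     # design matrix [1, x, x^2, ..., x^D]

# Linear in w: ordinary least squares
w, *_ = np.linalg.lstsq(X, y, rcond=None)
y_fit = X @ w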

Regression splines

  • Using polynomial basis functions creates a nonlinear mapping from input to output, even though the model remains linear in the parameters.
  • Problem with polynomials: they form a global approximation to the function.
  • More flexibility with a series of local approximations (i.e., local support)

Consider the 1D case with a set of basis functions: \[ f(x; \pmb{\theta}) = \sum_{i=1}^m w_i B_i(x) \]

B-spline basis functions

B-spline basis functions: Properties

  • Piecewise polynomial of degree \(D\), where the locations of the pieces are defined by knots \(t_1<\dots<t_m\).
  • The function is continuous and has continuous derivatives of orders \(1, \dots, D-1\).
  • Common to use cubic splines, i.e., \(D=3\): will have continuous first and second derivatives.

Fitting with splines
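A minimal sketch of spline fitting (assuming scipy is available and using a hypothetical noisy dataset): assemble the cubic B-spline basis functions \(B_i(x)\) into a design matrix and solve a linear least-squares problem for the weights \(w_i\). Scipy's make_lsq_spline solves the same linear problem in a single call.

import numpy as np
from scipy.interpolate import BSpline

# Hypothetical noisy 1D dataset on [0, 1)
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 200))
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)

k = 3                                                   # cubic B-splines (D = 3)
t = np.r_[[0.0] * k, np.linspace(0, 1, 8), [1.0] * k]   # clamped knot vector
n_basis = len(t) - k - 1

# Design matrix: column i holds B_i(x); each basis function vanishes outside its support
B = np.column_stack([
    BSpline.basis_element(t[i:i + k + 2], extrapolate=False)(x)
    for i in range(n_basis)
])
B = np.nan_to_num(B)                                    # out-of-support values come back as NaN

# The model is linear in the weights w_i, so ordinary least squares applies
w, *_ = np.linalg.lstsq(B, y, rcond=None)
y_fit = B @ w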

Neural networks

Introduction

Combine a nonlinear feature transformation with linearity in the parameters: \[ f(\pmb{x}; \pmb{\theta}) = {\bf W}\phi(\pmb{x}) + \pmb{b} \]

Endow the feature extractor with its own parameters, \(\pmb{\theta}_2\): \[ f(\pmb{x}; \pmb{\theta}) = {\bf W}\phi(\pmb{x}; \pmb{\theta}_2) + \pmb{b} \] where \(\pmb{\theta} = (\pmb{\theta}_1, \pmb{\theta}_2)\) and \(\pmb{\theta}_1 = ({\bf W}, \pmb{b})\).

Repeat this process recursively to create more and more complex functions. For \(L\) function compositions, we get \[ f(\pmb{x}; \pmb{\theta}) = f_L(f_{L-1}(\dots(f_1(\pmb{x}))\dots)) \] where \(f_l(\pmb{x}) = f(\pmb{x}; \pmb{\theta}_l)\) is the function at layer \(l\). This is a deep neural network (DNN).

  • DNNs map input to output by means of a directed acyclic graph.
  • Perceptron: a model for an artificial neuron; the simplest neural network, the multilayer perceptron, stacks several of them.

Perceptron

Deterministic binary classifier \[ f(\pmb{x}_n ; \pmb{\theta}) = H(\pmb{w}^\intercal\pmb{x}_n + b) \] relies on the Heaviside step function, \(H\), which is not differentiable, so we cannot use gradient-based optimization.

Perceptron learning algorithm

Rosenblatt (1958) proposed a learning algorithm that updates the weights only when the model makes a prediction mistake \[ \pmb{w}_{t+1} = \pmb{w}_t - \eta_t (\hat y_n - y_n)\pmb{x}_n \]
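A minimal sketch of this update rule (assuming numpy, a fixed learning rate, and a hypothetical linearly separable dataset; the bias is updated analogously by treating it as a weight on a constant input):

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linearly separable data with labels y in {0, 1}
X = rng.normal(size=(100, 2))
y = (X @ np.array([1.0, -2.0]) + 0.5 > 0).astype(float)

w = np.zeros(2)
b = 0.0
eta = 0.1                                    # learning rate

for epoch in range(20):
    for x_n, y_n in zip(X, y):
        y_hat = float(w @ x_n + b > 0)       # Heaviside prediction
        # Weights change only when the prediction is wrong (y_hat != y_n)
        w -= eta * (y_hat - y_n) * x_n
        b -= eta * (y_hat - y_n)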

Multilayer perceptron (MLP): The XOR problem

Learn a function that computes the exclusive OR of its two binary inputs. The XOR problem is not linearly separable.

\(x_1\) \(x_2\) \(y\)
0 0 0
0 1 1
1 0 1
1 1 0

Stacking multiple perceptrons yields a multilayer perceptron (MLP).

First hidden unit

The first hidden unit computes \(h_1 = x_1 \wedge x_2\) (“AND”) with appropriate weights and bias. \(h_1\) will fire iff \(x_1\) and \(x_2\) are both on, since \[ \pmb{w}_1^\intercal \pmb{x} - b_1 = \begin{bmatrix} 1.0 & 1.0 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix} - 1.5 = 0.5 > 0 \]

Second hidden unit

Similarly, the second unit computes \(h_2 = x_1 \vee x_2\) (“OR”): with bias \(-0.5\), i.e., a pre-activation of \(x_1 + x_2 - 0.5\), it fires as soon as at least one input is on.

Output

The output computes \(y = \overline{h_1} \wedge h_2\) (“AND” of the negated first unit with the second), i.e., it fires iff \(h_2\) is on and \(h_1\) is off. This is equivalent to \[ y = f(x_1, x_2) = \overline{(x_1 \wedge x_2)} \wedge (x_1 \vee x_2), \] which is exactly XOR.
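A small sketch of this construction with threshold units (assuming numpy; the AND and OR units use the weights and biases quoted above, while the output unit's weights \([-1, 1]\) and bias \(0.5\) are one choice that implements \(\overline{h_1} \wedge h_2\)):

import numpy as np

def heaviside(a):
    # Threshold unit: fires when the pre-activation is positive
    return (a > 0).astype(float)

def xor_mlp(x):
    h1 = heaviside(x @ np.array([1.0, 1.0]) - 1.5)     # AND unit: x1 + x2 - 1.5
    h2 = heaviside(x @ np.array([1.0, 1.0]) - 0.5)     # OR unit:  x1 + x2 - 0.5
    h = np.stack([h1, h2], axis=-1)
    # Output unit: fires when h2 is on and h1 is off, i.e. NOT(h1) AND h2
    return heaviside(h @ np.array([-1.0, 1.0]) - 0.5)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
print(xor_mlp(X))    # [0. 1. 1. 0.]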

Differentiability

The non-differentiable Heaviside function makes the original MLP difficult to train.

Let’s replace the Heaviside function with a differentiable activation function, \(\varphi : \mathbb{R} \to \mathbb{R}\): \[ \pmb{z}_l = f_l(\pmb{z}_{l-1}) = \varphi_l({\bf W}_l\pmb{z}_{l-1} + \pmb{b}_l) \] or, in scalar form, \[ z_{kl} = \varphi_l \left( \sum_{j} w_{lkj} z_{j,l-1} + b_{kl} \right) \]

Backpropagation
We can now compose \(L\) of these functions together, and compute the gradient of the output wrt the parameters in each layer using the chain rule.

Activation functions

We are free to use any kind of differentiable activation function at each layer. However, with a linear activation function, \(\varphi_l(a) = c_l a\), the whole network reduces to a regular linear model! It is therefore important to use nonlinear activation functions.

Early days: sigmoid (logistic) function

\[ \sigma(a) = \frac 1{1+ {\rm e}^{-a}} \] Its gradient is close to zero in the saturated regimes (large negative and large positive inputs), so gradient signals from higher layers cannot propagate back to earlier layers. This leads to the vanishing gradient problem.

Rectified linear unit (ReLU)

Training very deep models requires non-saturating activation functions \[ \text{ReLU}(a) = \max(a, 0) = a H(a) \] ReLU “turns off” negative inputs and passes positive inputs unchanged.
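A small sketch (using PyTorch autograd) contrasting the two activations: the sigmoid's gradient vanishes for large \(|a|\), while ReLU passes a unit gradient for any positive input.

import torch

a = torch.tensor([-10.0, -1.0, 1.0, 10.0], requires_grad=True)

torch.sigmoid(a).sum().backward()
print(a.grad)        # ~[0.0000, 0.1966, 0.1966, 0.0000]: saturates at large |a|

a.grad = None        # reset the accumulated gradient before the next backward pass
torch.relu(a).sum().backward()
print(a.grad)        # [0., 0., 1., 1.]: unit gradient for every positive input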

Example: classification of 2d data into 2 categories

https://playground.tensorflow.org/

The decision is binary: use a Bernoulli distribution with a learnable parameter \[ \begin{aligned} p(y | \pmb{x}; \pmb{\theta}) &= \text{Ber}(y | \sigma(a_3)) \\ a_3 &= \pmb{w}_3^\intercal \pmb{z}_2 + b_3 \\ \pmb{z}_2 &= \varphi(\pmb{W}_2 \pmb{z}_1 + \pmb{b}_2) \\ \pmb{z}_1 &= \varphi(\pmb{W}_1 \pmb{x} + \pmb{b}_1) \end{aligned} \] which uses the sigmoid function for the activations.
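A minimal PyTorch sketch of this model (the hidden-layer widths are an assumption; the slides do not fix them). BCEWithLogitsLoss applies the sigmoid \(\sigma(a_3)\) internally, so minimizing it corresponds to the Bernoulli negative log-likelihood above.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(2, 8),    # z1 = sigmoid(W1 x + b1)
    nn.Sigmoid(),
    nn.Linear(8, 8),    # z2 = sigmoid(W2 z1 + b2)
    nn.Sigmoid(),
    nn.Linear(8, 1),    # a3 = w3^T z2 + b3 (logit of the Bernoulli parameter)
)
loss_fn = nn.BCEWithLogitsLoss()            # -log Ber(y | sigmoid(a3))

# Hypothetical batch of 2d inputs with binary labels
x = torch.randn(32, 2)
y = (x[:, :1] * x[:, 1:] > 0).float()       # toy XOR-like labels
loss = loss_fn(model(x), y)
loss.backward()                             # gradients for all parameters via backprop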

The importance of depth

Universal function approximator
Possible to show that an MLP with one hidden layer is a universal function approximator, i.e., it can model any suitably smooth function, given enough hidden units, to any desired level of accuracy.
Intuition
Each hidden unit can specify a half plane. With enough of these you can “carve up” any region of space.
In practice
Deep networks do perform better than shallow ones: they can leverage the features learned by earlier layers (see the XOR example).

Connections with biology

  • Output axon of the left neuron makes a synaptic connection with the dendrites of the cell on the right
  • Communication: electrical charges in the form of ion flow
  • “Neuron firing”: activation exceeds threshold

Connections with biology

Artificial neural networks as model of the brain?

  • Brains do not use backpropagation (there is no way to send information backwards along an axon). Instead, they use local update rules to adjust synaptic strength
  • Most ANNs are strictly feedforward, but real brains have many feedback connections. Feedback is believed to act like a prior. Combine with likelihoods from the sensory system to compute a posterior over hidden states of the world, used for optimal decision making
  • Biological neurons have complex dendritic tree structures, with complex spatio-temporal dynamics
  • ANNs specialize to one task; a brain is composed of multiple specialized interacting modules.
  • Power efficiency: the brain runs on roughly 20 W, far less than the hardware used to train and run large ANNs.

Backpropagation

Consider a mapping, \(\pmb{o} = \pmb{f}(\pmb{x})\), where \(\pmb{x} \in \mathbb{R}^n\) and \(\pmb{o} \in \mathbb{R}^m\). Assume \(\pmb{f}\) is defined as a composition of functions \[ \pmb{f} = \pmb{f}_4 \circ \pmb{f}_3 \circ \pmb{f}_2 \circ \pmb{f}_1. \]

Let’s compute the Jacobian \({\bf J}_{\pmb{f}}(\pmb{x}) = \frac{\partial \pmb{o}}{\partial \pmb{x}} \in \mathbb{R}^{m\times n}\) using the chain rule: \[ \frac{\partial \pmb{o}}{\partial \pmb{x}} = \frac{\partial \pmb{f}_4(\pmb{x}_4)}{\partial \pmb{x}_4} \frac{\partial \pmb{f}_3(\pmb{x}_3)}{\partial \pmb{x}_3} \frac{\partial \pmb{f}_2(\pmb{x}_2)}{\partial \pmb{x}_2} \frac{\partial \pmb{f}_1(\pmb{x})}{\partial \pmb{x}} \] where \(\pmb{x}_l\) denotes the input to \(\pmb{f}_l\), i.e., \(\pmb{x}_1 = \pmb{x}\) and \(\pmb{x}_{l+1} = \pmb{f}_l(\pmb{x}_l)\).

Computation graphs

Express any function as a graph, where each node is a differentiable function of all its inputs. The chain rule takes care of combining the node elements. This approach is often called differentiable programming.
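A short sketch (using torch.autograd.functional.jacobian and two hypothetical maps \(\pmb{f}_1, \pmb{f}_2\)) checking that the Jacobian of the composition equals the product of the per-layer Jacobians:

import torch
from torch.autograd.functional import jacobian

def f1(x):            # R^3 -> R^2
    return torch.stack([x[0] * x[1], x[1] + x[2]])

def f2(x2):           # R^2 -> R^2
    return torch.stack([torch.sin(x2[0]), x2[0] * x2[1]])

x = torch.tensor([1.0, 2.0, 3.0])
x2 = f1(x)

J_composed = jacobian(lambda v: f2(f1(v)), x)     # d o / d x, shape (2, 3)
J_chain = jacobian(f2, x2) @ jacobian(f1, x)      # product of per-layer Jacobians
print(torch.allclose(J_composed, J_chain))        # True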

Simple example in PyTorch

import torch
import torch.nn as nn
import torch.optim as optim

# Ensure reproducibility
torch.manual_seed(0)

# 1. Generate a toy dataset
# Let's generate some random data with shape (100, 1)
inputs = torch.rand(100, 1)
# Let's use a simple linear relation with added noise for targets
targets = 2.5 * inputs + 3 + torch.randn(100, 1) * 0.1

# 2. Define the neural network
class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(1, 10)  # First fully-connected layer
        self.fc2 = nn.Linear(10, 1)  # Second fully-connected layer
        self.relu = nn.ReLU()
        
    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)  # Activation function
        return self.fc2(x)

# Create an instance of the network
model = SimpleNN()

# 3. Define the loss and optimizer
criterion = nn.MSELoss()  # Mean squared error
optimizer = optim.SGD(model.parameters(), lr=0.01)

# 4. Train the network
num_epochs = 500
for epoch in range(num_epochs):
    # Forward pass
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    
    # Zero gradients, backward pass, optimizer step
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    # Print the loss every 50 epochs
    if (epoch + 1) % 100 == 0:
        print(f"Epoch [{epoch + 1}/{num_epochs}], Loss: {loss.item():.4f}")
Epoch [100/500], Loss: 0.0366
Epoch [200/500], Loss: 0.0167
Epoch [300/500], Loss: 0.0105
Epoch [400/500], Loss: 0.0086
Epoch [500/500], Loss: 0.0080

Summary

Regression splines
Provide local approximations to the function. Used with a linear combination of parameters, so that the optimization remains convex.
B-spline basis functions
Piecewise polynomials of degree \(D\) with continuous derivatives up to order \(D-1\). Cubic splines (\(D=3\)) are often used.
Neural networks
Composition of linear transformations connected by nonlinear, differentiable activation functions.
Differentiability
Key property to ensure efficient learning, enabled by automatic differentiation (chain rule).

References

Murphy, Kevin P. 2022. Probabilistic Machine Learning: An Introduction. MIT Press.
Rosenblatt, Frank. 1958. “The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain.” Psychological Review 65 (6): 386.