Session 7 - Neural Networks

Neural Networks

Christian Cabrera Jojoa

Senior Research Associate and Affiliated Lecturer

Department of Computer Science and Technology

University of Cambridge

chc79@cam.ac.uk


Last Time


ML Definition


The Data Science Process

Data Science Process

Machine Learning Pipeline

Data Assess Pipeline

Probabilistic Interpretation of Linear Regression

Linear regression can be interpreted from a probabilistic perspective, where we model the uncertainty in our predictions using probability distributions.

Probabilistic Interpretation
Probabilistic Interpretation - (Bishop, 2006).

Probabilistic Interpretation of Linear Regression


Instead of assuming a deterministic relationship, we model the regression problem with the likelihood function:

$$p(y \mid \mathbf{x}) = \mathcal{N}\big(y \mid f(\mathbf{x}), \sigma^2\big)$$

The functional relationship between $\mathbf{x}$ and $y$ is given as:

$$y = f(\mathbf{x}) + \epsilon$$

where $\epsilon \sim \mathcal{N}(0, \sigma^2)$ is Gaussian noise with zero mean and variance $\sigma^2$.

We want to find $\hat{f}$ that approximates the unknown function $f$ and generalises well.

Probabilistic Interpretation of Linear Regression


For linear regression, we have:

$$y = \mathbf{x}^\top \boldsymbol{\theta} + \epsilon$$

where $\epsilon \sim \mathcal{N}(0, \sigma^2)$ is Gaussian noise with zero mean and variance $\sigma^2$. We seek the parameters $\boldsymbol{\theta}$.

So, given a training set of i.i.d. input-output pairs $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$, the likelihood factorises according to:

$$p(y_1, \ldots, y_N \mid \mathbf{x}_1, \ldots, \mathbf{x}_N, \boldsymbol{\theta}) = \prod_{n=1}^{N} p(y_n \mid \mathbf{x}_n, \boldsymbol{\theta}) = \prod_{n=1}^{N} \mathcal{N}\big(y_n \mid \mathbf{x}_n^\top \boldsymbol{\theta}, \sigma^2\big)$$

Probabilistic Interpretation of Linear Regression


The likelihood function tells us how likely the observed data are given specific parameter values:

$$p(y_1, \ldots, y_N \mid \mathbf{x}_1, \ldots, \mathbf{x}_N, \boldsymbol{\theta}) = \prod_{n=1}^{N} \mathcal{N}\big(y_n \mid \mathbf{x}_n^\top \boldsymbol{\theta}, \sigma^2\big)$$

We estimate $\boldsymbol{\theta}$ by maximising the likelihood, which for Gaussian noise is equivalent to minimising the sum of squared errors.
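As a quick illustration of this estimation (not part of the original slides; the data and names are invented), the following NumPy sketch computes the maximum-likelihood parameters, which for Gaussian noise reduces to an ordinary least-squares solve:

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data y = x^T theta_true + Gaussian noise (toy example, not course data)
N, D = 100, 3
X = rng.normal(size=(N, D))
theta_true = np.array([1.5, -2.0, 0.5])
sigma = 0.3
y = X @ theta_true + sigma * rng.normal(size=N)

# Maximum-likelihood estimate: for Gaussian noise this is the least-squares solution
theta_ml, *_ = np.linalg.lstsq(X, y, rcond=None)

# Noise variance estimated from the residuals
sigma2_ml = np.mean((y - X @ theta_ml) ** 2)

print("theta_ml:", theta_ml)
print("sigma^2_ml:", sigma2_ml)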


Linear Basis Function Models

Hypothesis Space for Regression
Example functions described using a linear model.
Training Data Set
Training dataset.
Regression Fit
Linear regression fit.
Basis Functions
Basis Functions - (Bishop, 2006).

Linear Basis Function Models

Linear basis function models extend linear regression by applying non-linear transformations to the input features while keeping the model linear in the parameters.


Linear Basis Function Models


The model becomes:

$$y(\mathbf{x}, \mathbf{w}) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) = \sum_{j=0}^{M-1} w_j \phi_j(\mathbf{x})$$

where $\boldsymbol{\phi}(\mathbf{x}) = \big(\phi_0(\mathbf{x}), \ldots, \phi_{M-1}(\mathbf{x})\big)^\top$ is a vector of basis functions, with $\phi_0(\mathbf{x}) = 1$ so that $w_0$ acts as a bias.

Basis Functions
Basis Functions - (Bishop, 2006).

Linear Basis Function Models


Polynomial basis:

$$\phi_j(x) = x^j$$

Gaussian basis:

$$\phi_j(x) = \exp\left(-\frac{(x - \mu_j)^2}{2s^2}\right)$$

Sigmoid basis:

$$\phi_j(x) = \sigma\left(\frac{x - \mu_j}{s}\right), \qquad \sigma(a) = \frac{1}{1 + e^{-a}}$$

Basis Functions
Basis Functions - (Bishop, 2006).

Linear Basis Function Models


If we apply a probabilistic interpretation, we need to maximise the likelihood:

$$p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) = \prod_{n=1}^{N} \mathcal{N}\big(y_n \mid \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n), \sigma^2\big)$$

After a similar process (see Chapter 3 in Bishop, 2006), the loss function becomes the sum-of-squares error:

$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \big(y_n - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n)\big)^2$$

And the normal equation solution:

$$\mathbf{w}_{\mathrm{ML}} = \big(\boldsymbol{\Phi}^\top \boldsymbol{\Phi}\big)^{-1} \boldsymbol{\Phi}^\top \mathbf{y}$$

Linear Basis Function Models



The design matrix $\boldsymbol{\Phi}$ is constructed by evaluating every basis function at every training input:

$$\boldsymbol{\Phi} = \begin{pmatrix} \phi_0(\mathbf{x}_1) & \phi_1(\mathbf{x}_1) & \cdots & \phi_{M-1}(\mathbf{x}_1) \\ \phi_0(\mathbf{x}_2) & \phi_1(\mathbf{x}_2) & \cdots & \phi_{M-1}(\mathbf{x}_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(\mathbf{x}_N) & \phi_1(\mathbf{x}_N) & \cdots & \phi_{M-1}(\mathbf{x}_N) \end{pmatrix}$$

By using basis functions $\phi_j$, we can capture complex, non-linear relationships between the input features and the target variable. The design matrix acts as a bridge, allowing us to apply linear techniques to problems that are inherently non-linear. We can compare different designs using cross-validation; this is essentially feature engineering.
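A minimal sketch of this idea (an invented example, not from the slides): build a polynomial design matrix for 1-D inputs and solve the normal equations with a stable least-squares routine.

import numpy as np

def design_matrix(x, degree):
    """Polynomial design matrix: column j holds x**j (j = 0 gives the bias column)."""
    return np.vander(x, N=degree + 1, increasing=True)

rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 50)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=x.shape)  # noisy toy targets

Phi = design_matrix(x, degree=5)

# Normal-equation solution w = (Phi^T Phi)^{-1} Phi^T y, computed with a stable solver
w_ml, *_ = np.linalg.lstsq(Phi, y, rcond=None)

print("fitted weights:", w_ml)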


The Perceptron

The perceptron is one of the earliest and simplest artificial neural network models for binary classification.


The Perceptron

Rosenblatt
Dr. Frank Rosenblatt and the Mark I - National Museum of the U.S. Navy, Public domain, via Wikimedia Commons.

The perceptron was simulated by Frank Rosenblatt in 1957 on an IBM 704 machine as a model of biological neural networks. The underlying artificial neuron model had been proposed by McCulloch & Pitts in 1943.


In 1962, Rosenblatt published "Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms".


The perceptron machine was named the Mark I: special-purpose hardware that implemented perceptron supervised learning for image recognition.

The Perceptron

Mark I
Mark I Perceptron - John C. Hay, Albert E. Murray, Public domain, via Wikimedia Commons

The Mark I was special-purpose hardware that implemented perceptron supervised learning for image recognition. Its architecture comprised:


  • Input layer: An array of 400 photocells (20x20 grid) named "sensory units" or "input retina".
  • Hidden Layer: 512 perceptrons named "association units" or "A-units".
  • Output Layer: 8 perceptrons named "response units" or "R-units".

The Perceptron

Perceptron Architecture
Perceptron Architecture.

The perceptron is a linear classifier (i.e., a linear discriminant), with hypothesis space defined by all functions of the form:

$$y(\mathbf{x}) = f\big(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})\big)$$

The feature vector $\boldsymbol{\phi}(\mathbf{x})$ is similar to the basis function vector previously defined. $f(\cdot)$ is given by a step function of the form:

$$f(a) = \begin{cases} +1, & a \geq 0 \\ -1, & a < 0 \end{cases}$$

We want to find $\mathbf{w}$ such that the training data are correctly classified.

The Perceptron

Perceptron Step Function
Perceptron Step Function.

The perceptron step function is not differentiable, and an error function based on the number of misclassified patterns is piecewise constant, so its gradient is zero almost everywhere.

We therefore need to derive the perceptron criterion. We are seeking a parameter vector $\mathbf{w}$ such that patterns $\mathbf{x}_n$ in class $\mathcal{C}_1$ satisfy:

$$\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n) > 0$$

And patterns $\mathbf{x}_n$ in class $\mathcal{C}_2$ satisfy:

$$\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n) < 0$$

The Perceptron

Perceptron Loss Function
Perceptron Loss Function.

Using the target coding $t_n \in \{-1, +1\}$, with $t_n = +1$ for class $\mathcal{C}_1$ and $t_n = -1$ for class $\mathcal{C}_2$, all correctly classified patterns satisfy:

$$\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n)\, t_n > 0$$

The loss function (the perceptron criterion) is:

$$E_P(\mathbf{w}) = -\sum_{n \in \mathcal{M}} \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n)\, t_n$$

where $\mathcal{M}$ denotes the set of misclassified patterns.

The Perceptron


Perceptron Loss Function
The loss function is:

$$E_P(\mathbf{w}) = -\sum_{n \in \mathcal{M}} \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n)\, t_n$$

The update rule for a misclassified input $\mathbf{x}_n$ is:

$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \alpha\, \boldsymbol{\phi}(\mathbf{x}_n)\, t_n$$

The Perceptron


The perceptron learning algorithm is similar to stochastic gradient descent.



Initialise weights w randomly
repeat
    for each training example (x, y)
        Compute prediction: y_pred = f(w·Φ(x))
        if y_pred ≠ y then
            Update weights: w = w + αΦ(x)y
until no misclassifications or max iterations
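A runnable version of the pseudocode above (a minimal sketch; the feature map, toy data, and names are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(0)

def phi(x):
    """Feature map: prepend a constant 1 so the bias is absorbed into w."""
    return np.concatenate(([1.0], x))

def train_perceptron(X, t, alpha=1.0, max_iter=100):
    """Perceptron learning. X: (N, D) inputs, t: targets in {-1, +1}."""
    w = 0.01 * rng.normal(size=X.shape[1] + 1)    # initialise weights randomly
    for _ in range(max_iter):
        errors = 0
        for x_n, t_n in zip(X, t):
            if np.sign(w @ phi(x_n)) != t_n:      # misclassified example
                w = w + alpha * phi(x_n) * t_n    # perceptron update rule
                errors += 1
        if errors == 0:                           # no misclassifications: stop
            break
    return w

# Toy linearly separable data (invented for the example)
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
t = np.array([+1, +1, -1, -1])
print("learned weights:", train_perceptron(X, t))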

The Perceptron

Perceptron Overview
Perceptron Overview - Andreas Maier, CC BY 4.0 , via Wikimedia Commons

Neural Networks


Neural Networks

Neural networks extend the perceptron by introducing multiple layers of interconnected neurons (i.e., multi-layer perceptron), enabling the learning of complex, non-linear relationships in data.


Neural Networks

Perceptron Architecture
Perceptron Architecture.

Previous models:


  • Useful analytical and computational properties
  • Limited application because of the curse of dimensionality

Neural Networks

Neural Network
Neural Network Architecture.

Large scale problems require that we adapt the basis functions to the data.


Neural Networks

Neural Network
Neural Network Architecture.

Neural Networks:


  • Fix the number of basis functions in advance
  • Basis functions are adaptive and their parameters can be updated during training

Training is costly but inference is cheap.


Neural Networks

Neural Network
Neural Network Architecture.

Key Components:

  • Input Layer: Receives the input features
  • Hidden Layers: Process information through weighted connections
  • Output Layer: Produces the final prediction
  • Activation Functions: Introduce non-linearity

Neural Networks

Neural Network
Neural Network Architecture.

The Universal Approximation Theorem states that a neural network with a single hidden layer can approximate any continuous function on a compact subset of ℝⁿ to arbitrary accuracy, given sufficiently many hidden neurons. Intuitively, this is possible because neural networks build complex functions and decision boundaries by composing linear transformations with non-linear activation functions.

Neural Networks


Feedforward Neural Network:

First, recall the general form of a linear model with nonlinear basis functions:

$$y(\mathbf{x}, \mathbf{w}) = f\left(\sum_{j=1}^{M} w_j \phi_j(\mathbf{x})\right)$$

where $f(\cdot)$ is a nonlinear activation function (e.g., the identity for regression, the sigmoid for classification), $\phi_j$ are basis functions, and $w_j$ are weights.

In neural networks, the basis functions themselves are parameterised and learned. The network is constructed as a sequence of transformations.

Two-layer Neural Network
Two-layer Neural Network - (Bishop, 2006).

Neural Networks

Feedforward Neural Network:

1. Linear combination of inputs (first layer):

$$a_j = \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}$$

where $w_{ji}^{(1)}$ are the weights from input $i$ to hidden unit $j$, and $w_{j0}^{(1)}$ is the bias for hidden unit $j$.

2. Nonlinear activation (hidden layer):

$$z_j = h(a_j)$$

where $h(\cdot)$ is a nonlinear activation function (e.g., sigmoid, tanh, ReLU).

Two-layer Neural Network
Two-layer Neural Network - (Bishop, 2006).

Neural Networks

Feedforward Neural Network:

3. Linear combination of hidden activations (output layer):

$$a_k = \sum_{j=1}^{M} w_{kj}^{(2)} z_j + w_{k0}^{(2)}$$

where $w_{kj}^{(2)}$ are the weights from hidden unit $j$ to output unit $k$, and $w_{k0}^{(2)}$ is the bias for output unit $k$.

Optionally, the output activations $a_k$ can be further transformed using an appropriate output activation function to produce the final network outputs $y_k$. For example, $y_k = a_k$ for regression problems, or $y_k = \sigma(a_k)$ for binary classification problems.

Two-layer Neural Network
Two-layer Neural Network - (Bishop, 2006).

Neural Networks

Feedforward Neural Network:

The overall network function combines these stages. For sigmoidal output-unit activation functions, it takes the form:

$$y_k(\mathbf{x}, \mathbf{w}) = \sigma\!\left(\sum_{j=1}^{M} w_{kj}^{(2)}\, h\!\left(\sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}\right) + w_{k0}^{(2)}\right)$$

The bias parameters can be absorbed by introducing an extra input $x_0 = 1$ and an extra hidden unit $z_0 = 1$:

$$y_k(\mathbf{x}, \mathbf{w}) = \sigma\!\left(\sum_{j=0}^{M} w_{kj}^{(2)}\, h\!\left(\sum_{i=0}^{D} w_{ji}^{(1)} x_i\right)\right)$$

All of these transformations are continuous functions, so the neural network is differentiable with respect to the parameters $\mathbf{w}$.
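A minimal NumPy sketch of this two-layer forward pass (sizes, names, and random parameters are assumptions for illustration):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, b1, W2, b2, h=np.tanh):
    """Two-layer feedforward pass: input x -> hidden activations z -> outputs y."""
    a_hidden = W1 @ x + b1        # first-layer linear combination
    z = h(a_hidden)               # nonlinear hidden activation
    a_out = W2 @ z + b2           # second-layer linear combination
    y = sigmoid(a_out)            # sigmoidal output activation (binary classification)
    return y, z

# Toy dimensions: D = 3 inputs, M = 4 hidden units, K = 1 output (assumed example)
rng = np.random.default_rng(0)
D, M, K = 3, 4, 1
W1, b1 = rng.normal(size=(M, D)), np.zeros(M)
W2, b2 = rng.normal(size=(K, M)), np.zeros(K)

y, _ = forward(rng.normal(size=D), W1, b1, W2, b2)
print("network output:", y)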

Two-layer Neural Network
Two-layer Neural Network - (Bishop, 2006).

Neural Networks

Activation functions introduce non-linearity into the network, enabling it to learn complex patterns and relationships.


Neural Networks

Sigmoid Function:

$$\sigma(a) = \frac{1}{1 + e^{-a}}$$

Range: $(0, 1)$

Derivative: $\sigma'(a) = \sigma(a)\,\big(1 - \sigma(a)\big)$

Logistic Curve

Logistic Curve - Qef, Public domain, via Wikimedia Commons.

Hyperbolic Tangent:

$$\tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}}$$

Range: $(-1, 1)$

Derivative: $\tanh'(a) = 1 - \tanh^2(a)$

Hyperbolic Tangent

Hyperbolic Tangent - Geek3, CC BY-SA 3.0, via Wikimedia Commons.

ReLU (Rectified Linear Unit):

$$\mathrm{ReLU}(a) = \max(0, a)$$

Range: $[0, \infty)$

Derivative: $\mathrm{ReLU}'(a) = \begin{cases} 1, & a > 0 \\ 0, & a < 0 \end{cases}$

Ramp Function

Ramp Function - Qef, Public domain, via Wikimedia Commons.
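A short NumPy sketch of these activation functions and their derivatives (for illustration only):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def d_sigmoid(a):
    s = sigmoid(a)
    return s * (1.0 - s)

def d_tanh(a):
    return 1.0 - np.tanh(a) ** 2

def relu(a):
    return np.maximum(0.0, a)

def d_relu(a):
    return (a > 0).astype(float)   # undefined at a = 0; conventionally set to 0 here

a = np.linspace(-3, 3, 7)
print(sigmoid(a), d_sigmoid(a))
print(np.tanh(a), d_tanh(a))
print(relu(a), d_relu(a))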

Neural Networks

Training Process:


Given a training set of N example input-output pairs:

$$\{(\mathbf{x}_1, t_1), (\mathbf{x}_2, t_2), \ldots, (\mathbf{x}_N, t_N)\}$$

Each pair was generated by an unknown function $f$:

$$t_n = f(\mathbf{x}_n)$$

We want to find a hypothesis $y(\mathbf{x}, \mathbf{w})$ that minimises the error function:

$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \big\lVert y(\mathbf{x}_n, \mathbf{w}) - t_n \big\rVert^2$$

Two-layer Neural Network
Two-layer Neural Network - (Bishop, 2006).

Neural Networks

Training Process (regression problem):


Giving a probabilistic interpretation to the network outputs:

$$p(t \mid \mathbf{x}, \mathbf{w}) = \mathcal{N}\big(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1}\big)$$

where $\beta$ is the precision (i.e., the inverse variance $1/\sigma^2$).

For an i.i.d. training set, the likelihood function corresponds to:

$$p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} p(t_n \mid \mathbf{x}_n, \mathbf{w}, \beta)$$

Two-layer Neural Network
Two-layer Neural Network - (Bishop, 2006).

Neural Networks

Training Process (regression problem):


For an i.i.d. training set, the likelihood function corresponds to:

$$p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\big(t_n \mid y(\mathbf{x}_n, \mathbf{w}), \beta^{-1}\big)$$

As we saw in previous sessions, maximising the likelihood function is equivalent to minimising the sum-of-squares error function given by:

$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \big(y(\mathbf{x}_n, \mathbf{w}) - t_n\big)^2$$

Two-layer Neural Network
Two-layer Neural Network - (Bishop, 2006).

Neural Networks

Training Process (binary classification):

The network output is:

$$y = \sigma(a) = \frac{1}{1 + e^{-a}}$$

We use a single target variable $t \in \{0, 1\}$ such that $t = 1$ denotes class $\mathcal{C}_1$ and $t = 0$ denotes class $\mathcal{C}_2$. We interpret $y(\mathbf{x}, \mathbf{w})$ as the conditional probability $p(\mathcal{C}_1 \mid \mathbf{x})$, so the conditional distribution of targets given inputs is a Bernoulli distribution:

$$p(t \mid \mathbf{x}, \mathbf{w}) = y(\mathbf{x}, \mathbf{w})^{t}\,\big(1 - y(\mathbf{x}, \mathbf{w})\big)^{1 - t}$$

For an i.i.d. training set, the error function is the (negative log-likelihood) cross-entropy error:

$$E(\mathbf{w}) = -\sum_{n=1}^{N} \big\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big\}, \qquad y_n = y(\mathbf{x}_n, \mathbf{w})$$
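A small NumPy sketch of both error functions, sum-of-squares for regression and cross-entropy for binary classification (the arrays are invented examples):

import numpy as np

def sum_of_squares_error(y, t):
    """E(w) = 1/2 * sum_n (y_n - t_n)^2"""
    return 0.5 * np.sum((y - t) ** 2)

def cross_entropy_error(y, t, eps=1e-12):
    """E(w) = -sum_n [t_n ln y_n + (1 - t_n) ln(1 - y_n)], clipped for numerical stability."""
    y = np.clip(y, eps, 1.0 - eps)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

y_reg, t_reg = np.array([0.9, 2.1, 2.9]), np.array([1.0, 2.0, 3.0])
y_clf, t_clf = np.array([0.8, 0.3, 0.6]), np.array([1, 0, 1])
print(sum_of_squares_error(y_reg, t_reg))
print(cross_entropy_error(y_clf, t_clf))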

Two-layer Neural Network
Two-layer Neural Network - (Bishop, 2006).

Neural Networks

Gradient Descent Algorithm
Gradient Descent Algorithm - Jacopo Bertolotti, CC0, via Wikimedia Commons.

We choose any starting point and compute an estimate of the gradient. We then move a small amount in the steepest downhill direction, repeating until we converge on a point in weight space with (locally) minimal loss.


Gradient Descent Algorithm:
Initialize w
repeat
    for each w[i] in w
        Compute gradient: g = ∇Loss(w[i])
        Update weight:   w[i] = w[i] - α * g
until convergence

The size of the step is given by the parameter α, which regulates the behaviour of the gradient descent algorithm. This is a hyperparameter of the model we are training, usually called the learning rate.
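A tiny runnable illustration of this loop, using gradient descent on a simple quadratic loss (the function and step size are invented for the example):

import numpy as np

def loss(w):
    return np.sum((w - 3.0) ** 2)      # toy quadratic loss, minimum at w = [3, 3]

def grad(w):
    return 2.0 * (w - 3.0)             # its exact gradient

w = np.array([0.0, 10.0])              # arbitrary starting point
alpha = 0.1                            # learning rate (step size)

for step in range(100):
    g = grad(w)
    w = w - alpha * g                  # move downhill
    if np.linalg.norm(g) < 1e-6:       # convergence test
        break

print("final w:", w, "loss:", loss(w))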


Neural Networks

Error backpropagation is an efficient algorithm for computing gradients in neural networks using the chain rule of calculus.


Neural Networks

Two-layer Neural Network
Two-layer Neural Network - (Bishop, 2006).

Error Backpropagation Algorithm:

1. Forward Pass: Apply an input vector $\mathbf{x}_n$ and compute all activations $a_j = \sum_i w_{ji} z_i$ and unit outputs $z_j = h(a_j)$.

2. Error Evaluation: Evaluate the error $\delta_k$ for all output units using:

$$\delta_k = y_k - t_k$$

3. Backward Pass: Backpropagate the errors to obtain $\delta_j$ for each hidden unit using:

$$\delta_j = h'(a_j) \sum_k w_{kj}\, \delta_k$$

4. Derivatives Evaluation: Evaluate the derivatives of the error with respect to each parameter using:

$$\frac{\partial E_n}{\partial w_{ji}} = \delta_j\, z_i$$
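A compact NumPy sketch of these four steps for a two-layer network with tanh hidden units and linear outputs (sizes, names, and data are assumptions for illustration):

import numpy as np

def backprop_single(x, t, W1, W2):
    """One forward/backward pass; biases omitted for brevity. Returns dE/dW1, dE/dW2."""
    # 1. Forward pass
    a1 = W1 @ x                 # hidden activations
    z = np.tanh(a1)             # hidden outputs
    y = W2 @ z                  # linear output units

    # 2. Output errors: delta_k = y_k - t_k
    delta_out = y - t

    # 3. Backward pass: delta_j = h'(a_j) * sum_k w_kj * delta_k
    delta_hidden = (1.0 - np.tanh(a1) ** 2) * (W2.T @ delta_out)

    # 4. Derivatives: dE/dw_ji = delta_j * z_i (outer products)
    grad_W2 = np.outer(delta_out, z)
    grad_W1 = np.outer(delta_hidden, x)
    return grad_W1, grad_W2

rng = np.random.default_rng(0)
D, M, K = 3, 4, 2                           # toy layer sizes
W1, W2 = rng.normal(size=(M, D)), rng.normal(size=(K, M))
g1, g2 = backprop_single(rng.normal(size=D), rng.normal(size=K), W1, W2)
print(g1.shape, g2.shape)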


Neural Networks

Two-layer Neural Network
Two-layer Neural Network - (Bishop, 2006).

Gradient Descent Update Rule:

$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \alpha\, \nabla E\big(\mathbf{w}^{(\tau)}\big)$$

where $\alpha$ is the learning rate.

Deep Learning


Deep Learning

Deep learning extends neural networks by using multiple hidden layers to learn hierarchical representations of data, enabling the automatic discovery of complex features.


Deep Learning

Deep Neural Network
Deep Neural Network with multiple hidden layers - QuantuMechaniX8, CC0, via Wikimedia Commons

Deep Learning


Key Characteristics:


  • Multiple Hidden Layers: 3+ layers for deep architectures
  • Hierarchical Features: Each layer learns increasingly abstract representations
  • Automatic Feature Learning: No manual feature engineering required
  • Representation Learning: Learns useful representations from raw data
Deep Neural Network
Deep Neural Network with multiple hidden layers - QuantuMechaniX8, CC0, via Wikimedia Commons

Deep Learning


Vanishing and Exploding Gradients: In deep networks, gradients can become very small (vanishing) or very large (exploding) during backpropagation. Common mitigations include:


  • Proper Weight Initialization: Xavier/Glorot initialization
  • Batch Normalization: Normalize activations during training
  • Modern Optimizers: Adam, RMSprop with adaptive learning rates
  • Regularisation: Dropout, L2, early stopping, and augmentation techniques
  • Network Architectures: Different architectures for different problems
Deep Neural Network
Deep Neural Network with multiple hidden layers - QuantuMechaniX8, CC0, via Wikimedia Commons

Deep Learning

Weight Initialization:


Xavier/Glorot Initialization:

$$w \sim \mathcal{N}\!\left(0, \frac{2}{n_{\mathrm{in}} + n_{\mathrm{out}}}\right)$$

He Initialization (for ReLU):

$$w \sim \mathcal{N}\!\left(0, \frac{2}{n_{\mathrm{in}}}\right)$$

where $n_{\mathrm{in}}$ and $n_{\mathrm{out}}$ are the number of input and output neurons respectively.
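A small NumPy sketch of both schemes (a minimal illustration; the layer sizes are arbitrary):

import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    """Xavier/Glorot: zero-mean Gaussian with variance 2 / (n_in + n_out)."""
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_out, n_in))

def he_init(n_in, n_out):
    """He: zero-mean Gaussian with variance 2 / n_in, suited to ReLU units."""
    std = np.sqrt(2.0 / n_in)
    return rng.normal(0.0, std, size=(n_out, n_in))

W_xavier = xavier_init(256, 128)
W_he = he_init(256, 128)
print(W_xavier.std(), W_he.std())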

Deep Neural Network
Deep Neural Network with multiple hidden layers - QuantuMechaniX8, CC0, via Wikimedia Commons

Deep Learning

Input normalisation (batch normalisation):

$$\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2$$

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma\, \hat{x}_i + \beta$$

where $\gamma$ (i.e., scale) and $\beta$ (i.e., shift) are learnable parameters, and $\epsilon$ is a small constant for numerical stability.
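A minimal NumPy sketch of this normalisation applied to a mini-batch in training mode (names are assumptions for illustration):

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalise a mini-batch per feature, then apply the learnable scale and shift.
    x: (batch, features); gamma, beta: (features,)."""
    mu = x.mean(axis=0)                    # batch mean
    var = x.var(axis=0)                    # batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalised activations
    return gamma * x_hat + beta            # scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(8, 4))
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0), y.std(axis=0))       # approximately zero mean, unit std per feature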

Deep Neural Network
Deep Neural Network with multiple hidden layers - QuantuMechaniX8, CC0, via Wikimedia Commons

Deep Learning

Modern Optimisers:


Optimisers train models by efficiently navigating the loss landscape to find optimal parameters. They help in accelerating convergence, avoiding local minima, and improving generalisation.

  • Stochastic Gradient Descent (SGD): Updates parameters using a subset of data and reduces computation time.
  • Adam: Adaptively adjusts the learning rate for each parameter.
  • ...
Gradient Descent Algorithm
Gradient Descent Algorithm - Jacopo Bertolotti, CC0, via Wikimedia Commons.

Deep Learning

Regularization Techniques:


Dropout: Randomly deactivate neurons during training


L2 Regularization: Add a weight-decay penalty $\frac{\lambda}{2}\lVert\mathbf{w}\rVert^2$ to the loss function


Early Stopping: Stop training when validation loss increases


Data Augmentation: Create additional training examples through transformations
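A brief NumPy sketch of two of these techniques, inverted dropout on activations and an L2 penalty added to a loss (assumed forms, for illustration only):

import numpy as np

rng = np.random.default_rng(0)

def dropout(z, p_drop=0.5, training=True):
    """Inverted dropout: randomly zero activations and rescale so the expected value is unchanged."""
    if not training:
        return z
    mask = rng.random(z.shape) >= p_drop
    return z * mask / (1.0 - p_drop)

def l2_penalty(weights, lam=1e-3):
    """L2 (weight-decay) term added to the data loss: (lam / 2) * sum of squared weights."""
    return 0.5 * lam * sum(np.sum(W ** 2) for W in weights)

z = rng.normal(size=(4, 8))
print(dropout(z, p_drop=0.5).shape)
print(l2_penalty([rng.normal(size=(8, 4)), rng.normal(size=(4, 1))]))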


Deep Learning

Network Architectures:

The architecture impacts the model's performance. The selection criteria include: data type, task complexity, computational resources, and generalisation needs.

  • Residual Networks (ResNets): Skip connections allow gradients to flow directly through the network, mitigating vanishing gradients.
  • Convolutional Neural Networks (CNNs): Used in image and video recognition tasks.
  • Recurrent Neural Networks (RNNs): Applied in language modeling and sequence prediction tasks.
  • Generative Adversarial Networks (GANs): Used for generating realistic data samples, such as images and audio.

ResNet
ResNets - Xiaozhu0429, CC BY-SA 4.0, via Wikimedia Commons

Convolutional Network
CNNs - Vincent Dumoulin, Francesco Visin, MIT, via Wikimedia Commons

Recurrent Neural Network
RNNs - Zawersh, CC BY-SA 3.0, via Wikimedia Commons

Generative Adversarial Network
GANs - Zhang, Aston and Lipton, Zachary C. and Li, Mu and Smola, Alexander J., CC BY-SA 4.0, via Wikimedia Commons

Conclusions

Deep Neural Network
Deep Neural Network with multiple hidden layers - QuantuMechaniX8, CC0, via Wikimedia Commons

Conclusions

Overview

  • Neural Networks
  • Forward Passing
  • Backward Propagation
  • Deep Learning
  • Weights and Normalisation
  • Optimisers and Regularisation
  • ResNets, CNNs, and RNNs

Next Time

  • Reinforcement Learning (RL)
  • Markov Decision Processes (MDPs)
  • Value Functions and Bellman Equations
  • Policy and Value Iteration
  • RL Algorithms
  • Exploration vs. Exploitation
  • Applications of RL

Many Thanks!

chc79@cam.ac.uk
