Linear regression can be interpreted from a probabilistic perspective, where we model the uncertainty in our predictions using probability distributions.
Instead of assuming a deterministic relationship, we model the regression problem with the likelihood function:

p(y | x) = 𝒩(y; f(x), σ²)

The functional relationship between input x and target y is then:

y = f(x) + ε, with ε ∼ 𝒩(0, σ²)

Where ε is i.i.d. Gaussian observation noise with variance σ².

We want to find the function f that generated the data.

For linear regression, we have:

p(y | x, θ) = 𝒩(y; xᵀθ, σ²), i.e. y = xᵀθ + ε

Where θ is the vector of model parameters.

So, given a training set of N input-output pairs, with inputs X = {x₁, …, x_N} and targets Y = {y₁, …, y_N}:

The likelihood factorises according to:

p(Y | X, θ) = ∏ₙ p(yₙ | xₙ, θ) = ∏ₙ 𝒩(yₙ; xₙᵀθ, σ²)

The likelihood function tells us how likely the observed data is given the specific parameters.

We estimate θ by maximum likelihood: θ_ML = argmax_θ p(Y | X, θ).
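The factorised likelihood can be sketched numerically. The following is a minimal illustration (with hypothetical synthetic data, not from the course material) showing that parameters close to the generating ones score a higher log-likelihood than parameters far from them:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D data: y = 2x + 1 plus Gaussian noise (sigma = 0.5).
X = rng.uniform(-1, 1, size=(20, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0, 0.5, size=20)

def log_likelihood(theta, sigma, X, y):
    """Log of the factorised likelihood prod_n N(y_n; x_n^T theta, sigma^2)."""
    Xb = np.column_stack([X, np.ones(len(X))])   # absorb the bias term
    mean = Xb @ theta
    n = len(y)
    return (-0.5 * n * np.log(2 * np.pi * sigma**2)
            - 0.5 * np.sum((y - mean) ** 2) / sigma**2)

# Parameters close to the generating ones score higher than distant ones.
good = log_likelihood(np.array([2.0, 1.0]), 0.5, X, y)
bad = log_likelihood(np.array([-1.0, 0.0]), 0.5, X, y)
print(good > bad)  # True
```

Maximising this quantity over theta is exactly the maximum-likelihood estimation described above.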
Linear basis function models extend linear regression by applying non-linear transformations to the input features while keeping the model linear in the parameters.
The model becomes:

y = wᵀφ(x) = Σⱼ wⱼ φⱼ(x)

Where the φⱼ are fixed non-linear basis functions and w is the parameter vector. Common choices:

Polynomial basis: φⱼ(x) = xʲ
Gaussian basis: φⱼ(x) = exp(−(x − μⱼ)² / (2s²))
Sigmoid basis: φⱼ(x) = σ((x − μⱼ) / s), with σ the logistic sigmoid

If we apply a probabilistic interpretation, we need to maximise the likelihood of:

p(Y | X, w) = ∏ₙ 𝒩(yₙ; wᵀφ(xₙ), σ²)

After a similar process (see Chapter 3 in Bishop, 2006), the loss function becomes the sum-of-squares error:

E(w) = ½ Σₙ (yₙ − wᵀφ(xₙ))²

And the normal equation solution:

w_ML = (ΦᵀΦ)⁻¹ Φᵀ y

The design matrix Φ has entries Φₙⱼ = φⱼ(xₙ): one row per training example, one column per basis function.

By using basis functions we can model non-linear relationships in the inputs while keeping the model linear in the parameters, so the closed-form solution still applies.
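A short sketch of basis-function regression, assuming hypothetical noisy sine data and Gaussian basis functions; the normal-equation solution is computed with a least-squares solver, which is the numerically stable equivalent of inverting ΦᵀΦ:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical noisy sine data.
x = rng.uniform(0, 1, size=50)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, size=50)

def gaussian_design_matrix(x, centres, s=0.1):
    """Design matrix Phi with Phi[n, j] = phi_j(x_n), plus a bias column."""
    phi = np.exp(-((x[:, None] - centres[None, :]) ** 2) / (2 * s**2))
    return np.column_stack([np.ones_like(x), phi])

centres = np.linspace(0, 1, 9)
Phi = gaussian_design_matrix(x, centres)

# Normal-equation solution w = (Phi^T Phi)^{-1} Phi^T y via least squares.
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# The model is linear in w yet fits the nonlinear target closely.
y_hat = Phi @ w
print(np.mean((y - y_hat) ** 2))  # small residual, on the order of the noise
```

The fit is nonlinear in x but still solved in closed form, which is the point of the basis-function construction.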
The perceptron is one of the earliest and simplest artificial neural network models for binary classification.
The perceptron was first simulated by Frank Rosenblatt in 1957 on an IBM 704 machine, as a model of biological neural networks. The underlying artificial neuron model had been proposed in 1943 by McCulloch & Pitts.
In 1962 Rosenblatt published "Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms".
The perceptron machine, named the Mark I, was special-purpose hardware that implemented perceptron supervised learning for image recognition.
The perceptron is a linear classifier model (i.e., a linear discriminant), with hypothesis space defined by all the functions of the form:

y(x) = f(wᵀφ(x))

The function f is the step function:

f(a) = +1 if a ≥ 0, −1 otherwise

We want to find the parameter vector w that classifies the training data correctly.

We need to derive the perceptron criterion. We know we are seeking a parameter vector w such that patterns in class +1 satisfy wᵀφ(xₙ) > 0.

And for features in class −1: wᵀφ(xₙ) < 0.

Using targets yₙ ∈ {−1, +1}, both conditions combine into yₙ wᵀφ(xₙ) > 0 for every correctly classified example.

The loss function is the perceptron criterion, summed over the set M of misclassified examples:

E_P(w) = −Σ_{n∈M} wᵀφ(xₙ) yₙ

The update rule for a misclassified input is:

w ← w + α φ(xₙ) yₙ
The perceptron learning algorithm is essentially stochastic gradient descent applied to the perceptron criterion.
Initialise weights w randomly
repeat
    for each training example (x, y)
        Compute prediction: y_pred = f(w·Φ(x))
        if y_pred ≠ y then
            Update weights: w = w + αΦ(x)y
until no misclassifications or max iterations
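The pseudocode above can be sketched in Python as follows, assuming a hypothetical linearly separable toy dataset and the identity-plus-bias feature map φ(x) = [1, x]:

```python
import numpy as np

def perceptron_train(X, y, alpha=1.0, max_iter=100):
    """Perceptron learning on targets y in {-1, +1}; phi(x) = [1, x]."""
    Phi = np.column_stack([np.ones(len(X)), X])  # bias feature
    w = np.zeros(Phi.shape[1])
    for _ in range(max_iter):
        mistakes = 0
        for phi_n, y_n in zip(Phi, y):
            if y_n * (w @ phi_n) <= 0:       # misclassified (or on boundary)
                w += alpha * phi_n * y_n     # update: w <- w + alpha*phi*y
                mistakes += 1
        if mistakes == 0:                    # converged: all points correct
            break
    return w

# Hypothetical linearly separable toy data.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [1.0, 3.0]])
y = np.array([-1, -1, 1, 1])
w = perceptron_train(X, y)
preds = np.sign(np.column_stack([np.ones(len(X)), X]) @ w)
print(preds)  # [-1. -1.  1.  1.], matching y
```

On linearly separable data the loop terminates with zero mistakes, as the perceptron convergence theorem guarantees.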
Neural networks extend the perceptron by introducing multiple layers of interconnected neurons (i.e., multi-layer perceptron), enabling the learning of complex, non-linear relationships in data.
Previous models:
Large scale problems require that we adapt the basis functions to the data.
Neural Networks:
Training is costly but inference is cheap.
Key Components:
The Universal Approximation Theorem states that a neural network with one hidden layer can approximate any continuous function on a compact subset of ℝⁿ, given sufficient neurons in the hidden layer. Intuitively, this is possible because neural networks build complex decision boundaries by composing linear transformations with non-linear activation functions.
Feedforward Neural Network:

First, recall the general form of a linear model with nonlinear basis functions:

y(x, w) = f(Σⱼ wⱼ φⱼ(x))

where f(·) is a nonlinear activation function for classification (and the identity for regression), and the φⱼ are fixed basis functions.

In neural networks, the basis functions themselves are parameterised and learned. The network is constructed as a sequence of transformations.

1. Linear combination of inputs (first layer):

aⱼ = Σᵢ wⱼᵢ⁽¹⁾ xᵢ + wⱼ₀⁽¹⁾,  j = 1, …, M

where the superscript (1) indicates the first layer, the wⱼᵢ⁽¹⁾ are weights, the wⱼ₀⁽¹⁾ are biases, and the aⱼ are called activations.

2. Nonlinear activation (hidden layer):

zⱼ = h(aⱼ)

where h(·) is a differentiable, nonlinear activation function (e.g., sigmoid or tanh) and the zⱼ are the hidden units.

3. Linear combination of hidden activations (output layer):

aₖ = Σⱼ wₖⱼ⁽²⁾ zⱼ + wₖ₀⁽²⁾,  k = 1, …, K

where K is the number of outputs and the superscript (2) indicates the second layer.

Optionally, the output activations aₖ are transformed by an output activation function, e.g. yₖ = σ(aₖ) for binary classification.

The overall network function combines these stages. For sigmoidal output unit activation functions, it takes the form:

yₖ(x, w) = σ( Σⱼ wₖⱼ⁽²⁾ h( Σᵢ wⱼᵢ⁽¹⁾ xᵢ + wⱼ₀⁽¹⁾ ) + wₖ₀⁽²⁾ )

The bias parameters can be absorbed by defining extra inputs x₀ = 1 and z₀ = 1:

yₖ(x, w) = σ( Σⱼ₌₀ wₖⱼ⁽²⁾ h( Σᵢ₌₀ wⱼᵢ⁽¹⁾ xᵢ ) )
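The two-layer forward computation with absorbed biases can be sketched directly. The weight shapes and example inputs below are hypothetical, chosen only to make the sketch runnable:

```python
import numpy as np

def forward(x, W1, W2, h=np.tanh, sigma=lambda a: 1 / (1 + np.exp(-a))):
    """Two-layer network y = sigma(W2 h(W1 x)), biases absorbed via x0 = z0 = 1."""
    x = np.concatenate([[1.0], x])   # absorb first-layer bias: x_0 = 1
    z = h(W1 @ x)                    # hidden activations z_j = h(a_j)
    z = np.concatenate([[1.0], z])   # absorb second-layer bias: z_0 = 1
    return sigma(W2 @ z)             # output y_k = sigma(a_k)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 3))   # 3 hidden units, 2 inputs + 1 bias column
W2 = rng.normal(size=(1, 4))   # 1 output, 3 hidden units + 1 bias column
y = forward(np.array([0.5, -0.2]), W1, W2)
print(y)  # a single sigmoid output in (0, 1)
```

The single function composition mirrors the overall network formula: each matrix multiply is one "linear combination" stage, each elementwise function one "activation" stage.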
Activation functions introduce non-linearity into the network, enabling it to learn complex patterns and relationships.
Sigmoid Function: σ(a) = 1 / (1 + e⁻ᵃ)
Range: (0, 1)
Derivative: σ′(a) = σ(a)(1 − σ(a))

Hyperbolic Tangent: tanh(a) = (eᵃ − e⁻ᵃ) / (eᵃ + e⁻ᵃ)
Range: (−1, 1)
Derivative: tanh′(a) = 1 − tanh²(a)

ReLU (Rectified Linear Unit): ReLU(a) = max(0, a)
Range: [0, ∞)
Derivative: 1 if a > 0, 0 if a < 0
Training Process:

Given a training set of N example input-output pairs (x₁, y₁), …, (x_N, y_N):

Each pair was generated by an unknown function y = f(x).

We want to find a hypothesis h that approximates the true function f.

Training Process (regression problem):

Giving a probabilistic interpretation to the network outputs:

p(y | x, w) = 𝒩(y; y(x, w), β⁻¹)

where y(x, w) is the network output and β is the precision (inverse variance) of the Gaussian noise.

For an i.i.d. training set, the likelihood function corresponds to:

p(Y | X, w, β) = ∏ₙ 𝒩(yₙ; y(xₙ, w), β⁻¹)

As we saw in previous sessions, maximising the likelihood function is equivalent to minimising the sum-of-squares error function given by:

E(w) = ½ Σₙ (y(xₙ, w) − yₙ)²

Training Process (binary classification):

The network output is:

y = σ(a), the logistic sigmoid of the output activation

We use a single target variable t ∈ {0, 1} and interpret the output y(x, w) as the conditional probability p(t = 1 | x).

For an i.i.d. training set, the error function is the cross-entropy error:

E(w) = −Σₙ [ tₙ ln y(xₙ, w) + (1 − tₙ) ln(1 − y(xₙ, w)) ]
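The two error functions are short enough to write out directly. The predictions below are hypothetical, chosen to show that the cross-entropy error rewards confident, correct probabilities:

```python
import numpy as np

def sum_of_squares_error(y_pred, y):
    """Regression loss: E(w) = 1/2 sum_n (y(x_n, w) - y_n)^2."""
    return 0.5 * np.sum((y_pred - y) ** 2)

def cross_entropy_error(y_pred, t):
    """Binary cross-entropy: E(w) = -sum_n [t ln y + (1 - t) ln(1 - y)]."""
    return -np.sum(t * np.log(y_pred) + (1 - t) * np.log(1 - y_pred))

# Hypothetical predictions: confident-correct beats confident-wrong.
t = np.array([1.0, 0.0, 1.0])
good = cross_entropy_error(np.array([0.9, 0.1, 0.8]), t)
bad = cross_entropy_error(np.array([0.1, 0.9, 0.2]), t)
print(good < bad)  # True: better predictions give lower error
```

Both losses come from the same recipe: write down a likelihood, take its negative log, and minimise.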
We choose any starting point and then compute an estimate of the gradient. We then move a small amount in the steepest downhill direction, repeating until we converge on a point in weight space with (locally) minimal loss.
Gradient Descent Algorithm:

Initialize w
repeat
    for each w[i] in w
        Compute gradient: g[i] = ∂Loss(w)/∂w[i]
        Update weight: w[i] = w[i] - α * g[i]
until convergence
The size of the step is given by the parameter α, which regulates the behaviour of the gradient descent algorithm. It is a hyperparameter of the model we are training, usually called the learning rate.
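A minimal sketch of the loop, on a toy quadratic loss whose gradient is known in closed form (both the loss and the starting point are hypothetical):

```python
import numpy as np

def gradient_descent(grad, w0, alpha=0.1, tol=1e-8, max_iter=10_000):
    """Repeat w <- w - alpha * grad(w) until the step becomes tiny."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        g = grad(w)
        w = w - alpha * g                      # steepest-descent step
        if np.linalg.norm(alpha * g) < tol:    # convergence: negligible step
            break
    return w

# Toy loss L(w) = (w0 - 3)^2 + (w1 + 1)^2, minimised at (3, -1).
grad = lambda w: np.array([2 * (w[0] - 3), 2 * (w[1] + 1)])
w_star = gradient_descent(grad, w0=[0.0, 0.0])
print(w_star)  # approximately [3, -1]
```

Too large an α makes the iterates overshoot and diverge; too small an α makes convergence slow, which is why the learning rate needs tuning.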
Error backpropagation is an efficient algorithm for computing gradients in neural networks using the chain rule of calculus.
Error Backpropagation Algorithm:

1. Forward Pass: Compute all activations and outputs for an input vector:

aⱼ = Σᵢ wⱼᵢ zᵢ,  zⱼ = h(aⱼ)

2. Error Evaluation: Evaluate the error for all the outputs using:

δₖ = yₖ − tₖ

3. Backward Pass: Backpropagate errors for each hidden unit in the network using:

δⱼ = h′(aⱼ) Σₖ wₖⱼ δₖ

4. Derivatives Evaluation: Evaluate the derivatives for each parameter using:

∂E/∂wⱼᵢ = δⱼ zᵢ

Gradient Descent Update Rule:

w⁽ᵗ⁺¹⁾ = w⁽ᵗ⁾ − α ∇E(w⁽ᵗ⁾)

Where α is the learning rate, t indexes the iteration, and ∇E is assembled from the derivatives computed by backpropagation.
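The four-step procedure can be sketched for a network with one tanh hidden layer and linear outputs under a sum-of-squares error; the shapes and inputs below are hypothetical, and the analytic gradient is spot-checked against a finite difference:

```python
import numpy as np

def backprop(x, t, W1, W2):
    """One forward/backward pass: tanh hidden layer, linear output."""
    # 1. Forward pass: compute all activations.
    a1 = W1 @ x                     # first-layer pre-activations a_j
    z = np.tanh(a1)                 # hidden activations z_j = h(a_j)
    y = W2 @ z                      # linear outputs y_k
    # 2. Error evaluation at the outputs.
    delta_k = y - t                 # delta_k = y_k - t_k
    # 3. Backward pass: propagate errors to the hidden units.
    delta_j = (1 - z**2) * (W2.T @ delta_k)   # h'(a_j) * sum_k w_kj delta_k
    # 4. Derivatives: dE/dw = delta times the input feeding that weight.
    return np.outer(delta_j, x), np.outer(delta_k, z)

rng = np.random.default_rng(0)
x, t = rng.normal(size=3), rng.normal(size=2)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
dW1, dW2 = backprop(x, t, W1, W2)

# Check one entry of dW1 against a finite difference of E = 1/2 ||y - t||^2.
def loss(W1):
    y = W2 @ np.tanh(W1 @ x)
    return 0.5 * np.sum((y - t) ** 2)

eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
W1m = W1.copy(); W1m[0, 0] -= eps
numeric = (loss(W1p) - loss(W1m)) / (2 * eps)
print(abs(dW1[0, 0] - numeric) < 1e-6)  # True: backprop matches
```

The finite-difference check is a standard debugging technique for backpropagation implementations: the two gradients should agree to several decimal places.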
Deep learning extends neural networks by using multiple hidden layers to learn hierarchical representations of data, enabling the automatic discovery of complex features.
Key Characteristics:
Vanishing and Exploding Gradients: In deep networks, gradients can become very small (vanishing) or very large (exploding) as they are propagated backwards through many layers, making the early layers train very slowly or making training unstable.
Weight Initialization:

Xavier/Glorot Initialization: w ∼ 𝒩(0, 2 / (n_in + n_out))

He Initialization (for ReLU): w ∼ 𝒩(0, 2 / n_in)

Where n_in and n_out are the number of units feeding into and out of the layer.

Input normalisation: x′ = (x − μ) / σ

Where μ and σ are the mean and standard deviation of each input feature over the training set.
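Both ideas fit in a few lines. The sketch below assumes hypothetical layer sizes and feature data, and checks that normalisation produces zero-mean, unit-variance features:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    """Xavier/Glorot: variance 2 / (n_in + n_out), keeps signal scale stable."""
    return rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_out, n_in))

def he_init(n_in, n_out):
    """He: variance 2 / n_in, compensating for ReLU zeroing half the units."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

def normalise(X):
    """Standardise each input feature to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = rng.uniform(5, 10, size=(100, 3))   # hypothetical un-centred features
Xn = normalise(X)
print(np.allclose(Xn.mean(axis=0), 0.0, atol=1e-10),
      np.allclose(Xn.std(axis=0), 1.0))  # True True
```

Keeping activations and gradients at a consistent scale across layers is precisely what mitigates the vanishing/exploding-gradient problem described above.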
Modern Optimisers:
Optimisers train models by efficiently navigating the loss landscape to find optimal parameters. They help in accelerating convergence, avoiding local minima, and improving generalisation.
Regularization Techniques:
Dropout: Randomly deactivate neurons during training
L2 Regularization: Add penalty to loss function
Early Stopping: Stop training when validation loss increases
Data Augmentation: Create additional training examples through transformations
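Two of these techniques are simple enough to sketch directly. The following shows an L2 penalty term and inverted dropout (the scaling convention used so that no rescaling is needed at inference); the dropout rate and array sizes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_penalty(weights, lam=1e-3):
    """L2 regularisation: add lam/2 * ||w||^2 to the loss (weight decay)."""
    return 0.5 * lam * sum(np.sum(W**2) for W in weights)

def dropout(z, p=0.5, training=True):
    """Inverted dropout: zero each activation with prob p, rescale the rest."""
    if not training:
        return z                       # at inference, use all neurons
    mask = rng.random(z.shape) >= p    # keep each unit with probability 1 - p
    return z * mask / (1.0 - p)        # rescale so the expected value is unchanged

z = np.ones(10_000)
zd = dropout(z, p=0.5)
print(zd.mean())  # close to 1: the expectation is preserved by rescaling
```

The L2 term is added to the training loss so its gradient shrinks the weights each step; dropout instead injects noise during training, discouraging co-adaptation between neurons.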
Network Architectures:
The architecture impacts the model's performance. The selection criteria include: data type, task complexity, computational resources, and generalisation needs.