Session 6 - The Perceptron

The Perceptron

Christian Cabrera Jojoa

Senior Research Associate and Affiliated Lecturer

Department of Computer Science and Technology

University of Cambridge

chc79@cam.ac.uk

Session 6 - The Perceptron

Last Time

Session 6 - The Perceptron

ML Definition

Session 6 - The Perceptron

The Data Science Process

Data Science Process
Session 6 - The Perceptron

Data Address

Session 6 - The Perceptron

Data Address

After assessing the data (the data assess stage), we use the data to address the problem in question. This stage includes implementing a Machine Learning algorithm that produces a Machine Learning model.

Session 6 - The Perceptron

Data Address

Data Assess Pipeline
Session 6 - The Perceptron

Data Address

A Machine Learning algorithm is a set of instructions that are used to train a machine learning model. It defines how the model learns from data and makes predictions or decisions. Linear regression, decision trees, and neural networks are examples of machine learning algorithms.

ML Algorithm
Session 6 - The Perceptron

Data Address

A Machine Learning model is a program that is trained on a dataset and used to make predictions or decisions. The goal is to create a trained model that can generalise well to new, unseen data. For example, a trained model could predict house prices based on new input features.

ML Model
Session 6 - The Perceptron

Data Address

A Machine Learning algorithm uses a training process that goes from a specific set of observations to a general rule (i.e., induction). This process adjusts the Machine Learning model's internal parameters to minimise prediction errors. For example, in a linear regression model, the algorithm adjusts the slope and intercept to minimise the difference between predicted and actual values.

Training Process
Session 6 - The Perceptron

Data Address

The parameters of Machine Learning models are adjusted according to the training data (i.e., seen data). For example, when training a model to predict house prices, the training data would include features like square footage, number of bedrooms, and location, along with their actual sale prices. The training data consists of vectors of attribute values.

Training Process
Session 6 - The Perceptron

Data Address

In classification problems, the prediction (i.e., the model's output) is one of a finite set of values (e.g., sunny/cloudy/rainy or true/false). In regression problems, the model's output is a number.

ML Model
Session 6 - The Perceptron

Data Address

Predictions can deviate from the expected values. Prediction errors are quantified by a loss function that indicates to the algorithm how far the prediction is from the target value. This difference is used to update the machine learning model's internal parameters to minimise the prediction errors. Common examples include Mean Squared Error for regression problems and Cross-Entropy for classification problems.
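
As a minimal sketch (function names are our own), these two common losses can be computed in NumPy as follows:

import numpy as np

def mean_squared_error(y_true, y_pred):
    # Average squared difference between targets and predictions (regression).
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy(y_true, p_pred, eps=1e-12):
    # Binary cross-entropy; p_pred holds predicted probabilities for class 1.
    p_pred = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))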

Training Process
Session 6 - The Perceptron

Data Address

Three types of feedback can be part of the inputs in our training process:

  • Supervised Learning: The model is trained on labelled data to learn a mapping between inputs and the corresponding labels, so the model can make predictions on unseen data.
  • Unsupervised Learning: The model is trained on unlabelled data and must find patterns in it. The goal is to identify hidden structures or groupings in the data.
  • Reinforcement Learning: The model learns by interacting with an environment and receiving rewards or penalties. The goal is to learn a policy that maximises the reward.
Training Process
Session 6 - The Perceptron

Supervised Learning

The agent observes input-output pairs and learns a function that maps from input to output (label).

Session 6 - The Perceptron

Supervised Learning


Given a training set of N example input-output pairs

$(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$

Each pair was generated by an unknown function $f$, so that

$y = f(x)$

The goal is to discover a function $h$ that approximates the true function $f$.

Session 6 - The Perceptron

Supervised Learning


Loss functions quantify the difference between predicted values and expected values (i.e., the ground truth).

Absolute value of the difference: $L_1(y, \hat{y}) = |y - \hat{y}|$

Squared difference: $L_2(y, \hat{y}) = (y - \hat{y})^2$

Zero-One loss (classification error): $L_{0/1}(y, \hat{y}) = 0$ if $y = \hat{y}$, else $1$

Session 6 - The Perceptron

Supervised Learning


As $P(x, y)$ is unknown, we can only estimate an empirical loss on a set of examples $E$ of size $N$.

Empirical loss for a hypothesis $h$ using loss function $L$:

$\text{EmpLoss}_{L,E}(h) = \frac{1}{N} \sum_{(x,y) \in E} L\big(y, h(x)\big)$

The best hypothesis $h^*$ is the one with the minimum empirical loss:

$h^* = \arg\min_{h \in H} \text{EmpLoss}_{L,E}(h)$
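
A small illustrative sketch (the data and candidate slopes are our own) of choosing the hypothesis with minimum empirical loss from a finite set of candidates:

import numpy as np

# Candidate hypotheses: lines y = w * x with different slopes.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
candidate_slopes = [0.5, 1.0, 2.0, 3.0]

def emp_loss(w):
    # Empirical loss: average squared error over the examples in E.
    return np.mean((y - w * x) ** 2)

best = min(candidate_slopes, key=emp_loss)
print(best)   # the slope with minimum empirical loss (about 2.0 here)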


Session 6 - The Perceptron

Supervised Learning

Underfitted Model
Underfitted Model - AAStein, CC BY-SA 4.0 , via Wikimedia Commons

Underfitting occurs when our hypothesis space $H$ is too simple to capture the true function $f$.

Even the best hypothesis $h^*$ in $H$ will have high error: its loss exceeds that of the true function $f$ by at least some positive number $\epsilon$.

Session 6 - The Perceptron

Supervised Learning

Overfitting
Overfitting - ThirdOrderLogic, CC BY 4.0 , via Wikimedia Commons

Overfitting occurs when our hypothesis space $H$ is too complex, leading to low empirical loss on the training examples but high generalisation loss on unseen data:

$\text{EmpLoss}_{L,E}(h^*) \ll \text{GenLoss}_{L}(h^*)$

The hypothesis $h^*$ memorises the training examples instead of learning the true function $f$.

Session 6 - The Perceptron

Supervised Learning

Overfitting
Overfitting - ThirdOrderLogic, CC BY 4.0 , via Wikimedia Commons

Regularisation transforms the loss function into a cost function that penalises complexity to avoid overfitting.

The choice of complexity function depends on the hypothesis space $H$. A good example for polynomials is a function that returns the sum of the squares of the coefficients.

Session 6 - The Perceptron

Supervised Learning

Given the new cost function:

$\text{Cost}(h) = \text{EmpLoss}_{L,E}(h) + \lambda\,\text{Complexity}(h)$

The best hypothesis $h^*$ is the one that minimises the cost:

$h^* = \arg\min_{h \in H} \text{Cost}(h)$

where $\lambda$ is a hyperparameter that controls the trade-off between fitting the data and model complexity and serves as a conversion rate.

Other ways to mitigate overfitting include feature selection, hyperparameter tuning, and splitting the dataset into training, validation, and test sets (e.g., cross-validation).
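
A minimal sketch (names and the default λ value are our own) of a regularised cost for a polynomial hypothesis, using the sum of squared coefficients as the complexity term:

import numpy as np

def cost(coefficients, x, y, lam=0.1):
    # Cost(h) = EmpLoss(h) + lambda * Complexity(h), with Complexity(h) taken
    # to be the sum of squared polynomial coefficients.
    predictions = np.polyval(coefficients, x)
    emp_loss = np.mean((y - predictions) ** 2)
    complexity = np.sum(np.asarray(coefficients) ** 2)
    return emp_loss + lam * complexity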

Session 6 - The Perceptron

Regression

The regression problem involves predicting a continuous numerical value. Regression models approximate a function f that maps input features to a continuous output.

Session 6 - The Perceptron

Regression Models

Regression Fit
Linear regression fit.

Linear Regression

Training Dataset: $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$

Hypothesis Space: All possible linear functions of continuous-valued inputs and outputs.

Hypothesis: $h_{\mathbf{w}}(x) = w_1 x + w_0$

Loss Function: $\text{Loss}(h_{\mathbf{w}}) = \sum_{j=1}^{N} \big(y_j - h_{\mathbf{w}}(x_j)\big)^2$

Cost Function: $\text{Cost}(h_{\mathbf{w}}) = \text{Loss}(h_{\mathbf{w}}) + \lambda\,\text{Complexity}(h_{\mathbf{w}})$

Session 6 - The Perceptron

Regression Models

Hypothesis Space for Regression
Example functions described using a linear model.
Training Data Set
Training dataset.
Regression Fit
Linear regression fit.
Session 6 - The Perceptron

Regression Models

Convex Function
Weights Space - Convex Loss Function.
Non Convex Function
Non Convex Function - Zachary kaplan, CC BY-SA 4.0 , via Wikimedia Commons.
Gradient Descent Algorithm
Gradient Descent Algorithm - Jacopo Bertolotti, CC0, via Wikimedia Commons.
Session 6 - The Perceptron

Regression Models

Gradient Descent Algorithm
Gradient Descent Algorithm - Jacopo Bertolotti, CC0, via Wikimedia Commons.

Linear Regression

Analytical Solution:

$w_1 = \frac{N\sum_j x_j y_j - \big(\sum_j x_j\big)\big(\sum_j y_j\big)}{N\sum_j x_j^2 - \big(\sum_j x_j\big)^2}, \qquad w_0 = \frac{\sum_j y_j - w_1 \sum_j x_j}{N}$

Gradient Descent Algorithm:


Initialise w randomly
repeat
    for each w[i] in w
        Compute gradient: g = ∂Loss(w)/∂w[i]
        Update weight:   w[i] = w[i] - α * g
until convergence

Hyperparameters: Learning rate, number of epochs, and batch size.
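
A runnable sketch of the algorithm above for a univariate linear model (variable names and defaults are our own):

import numpy as np

def gradient_descent(x, y, alpha=0.01, epochs=1000):
    # Batch gradient descent for h(x) = w1 * x + w0 with squared-error loss.
    w0, w1 = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        error = (w1 * x + w0) - y
        # Gradients of the mean squared error with respect to w0 and w1.
        g0 = (2 / n) * np.sum(error)
        g1 = (2 / n) * np.sum(error * x)
        w0 -= alpha * g0
        w1 -= alpha * g1
    return w0, w1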

Session 6 - The Perceptron

Multivariate Linear Regression

Session 6 - The Perceptron

Multivariate Linear Regression

Multivariate regression extends the simple linear model to handle multiple input features. The model learns a function that maps multiple input variables to a continuous output.

Session 6 - The Perceptron

Multivariate Linear Regression


In these problems, each example is an n-element vector. The hypothesis space $H$ now includes linear functions of multiple continuous-valued inputs and a single continuous output:

$h_{\mathbf{w}}(\mathbf{x}) = w_0 + w_1 x_1 + \dots + w_n x_n$

We want to find the parameter vector $\mathbf{w}$ that best fits the data.


In vector notation:

$h_{\mathbf{w}}(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b$

Where $\mathbf{x}$ is the feature vector, $\mathbf{w}$ is the parameter vector, and $b$ is a bias term.

Session 6 - The Perceptron

Multivariate Linear Regression


The loss function for multivariate regression is:

$\text{Loss}(\mathbf{w}) = \sum_{j=1}^{N} \big(y_j - \mathbf{w}^\top \mathbf{x}_j\big)^2$


In matrix notation:

$\text{Loss}(\mathbf{w}) = (\mathbf{y} - \mathbf{X}\mathbf{w})^\top (\mathbf{y} - \mathbf{X}\mathbf{w})$

Where $\mathbf{y}$ is the target vector, $\mathbf{X}$ is the feature matrix, and $\mathbf{w}$ is the weight vector.

Session 6 - The Perceptron

Multivariate Linear Regression


Given the Loss function:

$\text{Loss}(\mathbf{w}) = (\mathbf{y} - \mathbf{X}\mathbf{w})^\top (\mathbf{y} - \mathbf{X}\mathbf{w})$


We want to find

$\mathbf{w}^* = \arg\min_{\mathbf{w}} \text{Loss}(\mathbf{w})$


Taking the derivative with respect to $\mathbf{w}$ and setting it to zero:

$\nabla_{\mathbf{w}} \text{Loss}(\mathbf{w}) = -2\,\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\mathbf{w}) = 0$


This leads to the normal equation:

$\mathbf{w}^* = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$
Session 6 - The Perceptron

Multivariate Linear Regression

Gradient Descent Algorithm
Gradient Descent Algorithm - Jacopo Bertolotti, CC0, via Wikimedia Commons.

Gradient Descent for Multivariate Regression

The gradient of the loss function is:

$\nabla_{\mathbf{w}} \text{Loss}(\mathbf{w}) = -2\,\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\mathbf{w})$


Update rule for $\mathbf{w}$:

$\mathbf{w} \leftarrow \mathbf{w} + 2\alpha\, \mathbf{X}^\top(\mathbf{y} - \mathbf{X}\mathbf{w})$


Where $\alpha$ is the learning rate.
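
A matrix-form sketch of this update (names and defaults are our own assumptions):

import numpy as np

def gd_multivariate(X, y, alpha=0.001, epochs=1000):
    # Batch gradient descent: w <- w + 2 * alpha * X^T (y - X w).
    X = np.column_stack([np.ones(len(X)), X])   # bias column
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        w += 2 * alpha * X.T @ (y - X @ w)
    return w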

Session 6 - The Perceptron

Probabilistic Interpretation of Linear Regression

Session 6 - The Perceptron


Probabilistic Interpretation of Linear Regression

Linear regression can be interpreted from a probabilistic perspective, where we model the uncertainty in our predictions using probability distributions.

Probabilistic Interpretation
Probabilistic Interpretation - (Bishop, 2006).
Session 6 - The Perceptron

Probabilistic Interpretation of Linear Regression


Instead of assuming deterministic relationships, we model the regression problem with the likelihood function:

$p(y \mid \mathbf{x}) = \mathcal{N}\big(y \mid f(\mathbf{x}), \sigma^2\big)$


The functional relationship between $y$ and $\mathbf{x}$ is given as:

$y = f(\mathbf{x}) + \epsilon$

Where $\epsilon$ is Gaussian noise with zero mean and variance $\sigma^2$.

We want to find a function $h$ that approximates the unknown function $f$ and generalises well.

Session 6 - The Perceptron

Probabilistic Interpretation of Linear Regression


For linear regression, we have:

$y = \mathbf{w}^\top \mathbf{x} + \epsilon$

Where $\epsilon$ is Gaussian noise with zero mean and variance $\sigma^2$. We seek the parameters $\mathbf{w}$.

So, given a training set of i.i.d. input-output pairs:

$(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_N, y_N)$

The likelihood factorises according to:

$p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) = \prod_{n=1}^{N} p(y_n \mid \mathbf{x}_n, \mathbf{w}) = \prod_{n=1}^{N} \mathcal{N}\big(y_n \mid \mathbf{w}^\top \mathbf{x}_n, \sigma^2\big)$

Session 6 - The Perceptron

Probabilistic Interpretation of Linear Regression


The likelihood function tells us how likely the observed data is given the specific parameters:

$p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) = \prod_{n=1}^{N} \mathcal{N}\big(y_n \mid \mathbf{w}^\top \mathbf{x}_n, \sigma^2\big)$


We estimate $\mathbf{w}$ by maximising the likelihood:

$\mathbf{w}_{\text{ML}} = \arg\max_{\mathbf{w}} p(\mathbf{y} \mid \mathbf{X}, \mathbf{w})$

Session 6 - The Perceptron

Probabilistic Interpretation of Linear Regression


As before, a closed-form solution exists, which makes gradient descent unnecessary. We apply the log transformation to the likelihood function and minimise the negative log-likelihood.


We have that:

$-\log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) = \frac{N}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2} \sum_{n=1}^{N} \big(y_n - \mathbf{w}^\top \mathbf{x}_n\big)^2$

Session 6 - The Perceptron

Probabilistic Interpretation of Linear Regression


Ignoring the constant terms:

$-\log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) \propto \frac{1}{2\sigma^2} \sum_{n=1}^{N} \big(y_n - \mathbf{w}^\top \mathbf{x}_n\big)^2$


The loss function is defined as:

$\text{Loss}(\mathbf{w}) = \frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{X}\mathbf{w})^\top (\mathbf{y} - \mathbf{X}\mathbf{w})$

Where $\mathbf{X}$ is the design matrix as the collection of training inputs and $\mathbf{y}$ is a vector of all targets.

Session 6 - The Perceptron

Probabilistic Interpretation of Linear Regression


Minimising this loss function is equivalent to minimising the sum of squared errors:

$\text{Loss}(\mathbf{w}) \propto \sum_{n=1}^{N} \big(y_n - \mathbf{w}^\top \mathbf{x}_n\big)^2$


As we did before, we compute the gradient of the Loss and equate it to zero:

$\nabla_{\mathbf{w}} \text{Loss}(\mathbf{w}) = -\frac{1}{\sigma^2}\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\mathbf{w}) = 0$


Solving the derivatives as we did before:

$\mathbf{w}_{\text{ML}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$
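
An illustrative check (the synthetic data are our own) that the normal-equation solution also minimises the negative log-likelihood:

import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(0, 10, 50)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.5, 50)

def neg_log_likelihood(w, sigma2=0.25):
    residuals = y - X @ w
    return 0.5 * len(y) * np.log(2 * np.pi * sigma2) + residuals @ residuals / (2 * sigma2)

w_ml = np.linalg.solve(X.T @ X, X.T @ y)   # normal equation
print(w_ml, neg_log_likelihood(w_ml))      # w_ml attains the minimum NLL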

Session 6 - The Perceptron

Probabilistic Interpretation of Linear Regression


The regression problem is considered as:

$y = \mathbf{w}^\top \mathbf{x} + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)$


We want to maximise the likelihood function of the training data given the model parameters.


Closed-form solution for linear regression:

$\mathbf{w}_{\text{ML}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$

Where $\mathbf{X}$ is the design matrix as the collection of training inputs and $\mathbf{y}$ is a vector of all targets.

Probabilistic Interpretation
Probabilistic Interpretation - (Bishop, 2006).
Session 6 - The Perceptron

Linear Basis Function Models

Session 6 - The Perceptron

Linear Basis Function Models

Hypothesis Space for Regression
Example functions described using a linear model.
Training Data Set
Training dataset.
Regression Fit
Linear regression fit.
Basis Functions
Basis Functions - (Bishop, 2006).
Session 6 - The Perceptron

Linear Basis Function Models

Linear basis function models extend linear regression by applying non-linear transformations to the input features while keeping the model linear in the parameters.

Session 6 - The Perceptron

Linear Basis Function Models


The model becomes:

$h_{\mathbf{w}}(\mathbf{x}) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})$

Where $\boldsymbol{\phi}(\mathbf{x})$ is a vector of basis functions.

Basis Functions
Basis Functions - (Bishop, 2006).
Session 6 - The Perceptron

Linear Basis Function Models


Polynomial basis: $\phi_j(x) = x^j$

Gaussian basis: $\phi_j(x) = \exp\!\left(-\frac{(x - \mu_j)^2}{2s^2}\right)$

Sigmoid basis: $\phi_j(x) = \sigma\!\left(\frac{x - \mu_j}{s}\right)$, where $\sigma(a) = \frac{1}{1 + e^{-a}}$

Basis Functions
Basis Functions - (Bishop, 2006).
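
Illustrative NumPy versions of these basis functions (centres and widths are our own choices):

import numpy as np

def polynomial_basis(x, degree=3):
    # phi_j(x) = x^j for j = 0, ..., degree
    return np.column_stack([x ** j for j in range(degree + 1)])

def gaussian_basis(x, centres, s=1.0):
    # phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))
    return np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * s ** 2))

def sigmoid_basis(x, centres, s=1.0):
    # phi_j(x) = 1 / (1 + exp(-(x - mu_j) / s))
    return 1.0 / (1.0 + np.exp(-(x[:, None] - centres[None, :]) / s))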
Session 6 - The Perceptron

Linear Basis Function Models


If we apply a probabilistic interpretation, we need to maximise the likelihood of:

$p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) = \prod_{n=1}^{N} \mathcal{N}\big(y_n \mid \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n), \sigma^2\big)$


After a similar process (see Chapter 3 in Bishop, 2006), the loss function becomes:

$\text{Loss}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \big(y_n - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n)\big)^2$


And the normal equation solution:

$\mathbf{w}_{\text{ML}} = (\boldsymbol{\Phi}^\top \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^\top \mathbf{y}$

Session 6 - The Perceptron

Linear Basis Function Models


The design matrix is constructed as:

$\boldsymbol{\Phi} = \begin{pmatrix} \phi_0(\mathbf{x}_1) & \phi_1(\mathbf{x}_1) & \cdots & \phi_{M-1}(\mathbf{x}_1) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(\mathbf{x}_N) & \phi_1(\mathbf{x}_N) & \cdots & \phi_{M-1}(\mathbf{x}_N) \end{pmatrix}$


By using basis functions $\boldsymbol{\phi}$, we can capture complex, non-linear relationships between the input features and the target variable. The design matrix acts as a bridge, allowing us to apply linear techniques to problems that are inherently non-linear. We can propose different designs and evaluate which one works better using cross-validation. This is essentially feature engineering.
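
A short sketch (data and polynomial degree are our own) of building a design matrix and fitting it with the normal equation:

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 30)

Phi = np.column_stack([x ** j for j in range(4)])   # degree-3 polynomial design matrix
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)         # normal equation on transformed features
y_pred = Phi @ w                                    # non-linear in x, still linear in w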

Session 6 - The Perceptron

Cross Validation

Session 6 - The Perceptron

Cross-validation

Cross-validation is a technique for assessing model performance and preventing overfitting by evaluating the model on unseen data.

Session 6 - The Perceptron

Cross-validation

K-Fold
K-fold CV - MBanuelos22, CC BY-SA 4.0 , via Wikimedia Commons.
Session 6 - The Perceptron

Cross-validation


K-Fold Cross-validation divides the dataset into K equal parts:

for (int k = 1; k <= K; k++) {
    trainOnAllFoldsExcept(k);
    evaluateOnFold(k);
    recordPerformanceMetric(k);
}
calculateAveragePerformance(K);

Mathematically, for K-fold CV:

$\text{CV}_{K} = \frac{1}{K} \sum_{k=1}^{K} L_k$

Where $L_k$ is the loss on fold $k$ when training on all other folds.
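
A runnable sketch of K-fold cross-validation (the helper signatures fit and evaluate are our own assumptions):

import numpy as np

def k_fold_cv(X, y, fit, evaluate, K=5):
    # fit(X_train, y_train) returns a model; evaluate(model, X_test, y_test) returns a loss.
    indices = np.random.permutation(len(X))
    folds = np.array_split(indices, K)
    losses = []
    for k in range(K):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit(X[train_idx], y[train_idx])
        losses.append(evaluate(model, X[test_idx], y[test_idx]))
    return np.mean(losses)   # average performance over the K folds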

K-Fold
K-fold CV - MBanuelos22, CC BY-SA 4.0 , via Wikimedia Commons.
Session 6 - The Perceptron

Cross-validation


Leave-One-Out Cross-validation (LOOCV) is a special case where K = N:

$\text{CV}_{\text{LOO}} = \frac{1}{N} \sum_{n=1}^{N} L\big(y_n, h^{(-n)}(\mathbf{x}_n)\big)$

Where $h^{(-n)}$ is trained on all data points except $(\mathbf{x}_n, y_n)$.

LOOCV
LOOCV - MBanuelos22, CC BY-SA 4.0 , via Wikimedia Commons.
Session 6 - The Perceptron

Linear Classifiers

Session 6 - The Perceptron

Linear Classifiers

Linear classifiers are models that separate data into classes using a linear decision boundary. These classifiers are functions that decide whether an input (i.e., a vector of numbers) belongs to a specific class or not.

Session 6 - The Perceptron

Linear Classifiers

SVM Separating Hyperplanes
SVM Separating Hyperplanes - Cyc, Public domain, via Wikimedia Commons.

A decision boundary is a line (or a surface in higher dimensions) that separates data into classes.

Session 6 - The Perceptron

Linear Classifiers

SVM Separating Hyperplanes
SVM Separating Hyperplanes - Cyc, Public domain, via Wikimedia Commons.

A decision boundary is a line (or a surface in higher dimensions) that separates data into classes.


The hypothesis is the result of passing a linear function through a threshold function:

$h_{\mathbf{w}}(\mathbf{x}) = \text{Threshold}(\mathbf{w}^\top \mathbf{x}), \qquad \text{Threshold}(z) = 1 \text{ if } z \geq 0 \text{ and } 0 \text{ otherwise}$


The optimal hypothesis is the one that minimises the loss function:

$h^* = \arg\min_{h \in H} \text{EmpLoss}_{L,E}(h)$

Where $H$ is the hypothesis space.


Partial derivatives and gradient methods do not work here, because the threshold function is not differentiable. Instead, we apply the perceptron learning rule.

Threshold Function.

Session 6 - The Perceptron

Linear Classifiers


Perceptron update rule:

$w_i \leftarrow w_i + \alpha\,\big(y - h_{\mathbf{w}}(\mathbf{x})\big)\, x_i$


The rule is applied one example at a time, choosing examples at random (as in stochastic gradient descent).

Threshold Function.

Session 6 - The Perceptron

Linear Classifiers


An alternative to the discrete threshold function is the logistic function, which is smooth and differentiable:

$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad h_{\mathbf{w}}(\mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x})$

The process of finding the optimal hypothesis is called logistic regression:

$h^* = \arg\min_{h \in H} \text{EmpLoss}_{L,E}(h)$

Where $H$ is the hypothesis space.

Logistic Curve
Logistic Curve - Qef, Public domain, via Wikimedia Commons.

Session 6 - The Perceptron

Linear Classifiers

Following the gradient descent algorithm, the update rule is:

$w_i \leftarrow w_i - \alpha \frac{\partial}{\partial w_i} \text{Loss}(\mathbf{w})$


The loss function is represented as a composition of functions:

$\text{Loss}(\mathbf{w}) = \big(y - \sigma(\mathbf{w}^\top \mathbf{x})\big)^2$


We need to differentiate the loss function using the chain rule to compute the gradients:

$\frac{\partial}{\partial w_i} \text{Loss}(\mathbf{w}) = -2\,\big(y - h_{\mathbf{w}}(\mathbf{x})\big)\, h_{\mathbf{w}}(\mathbf{x})\,\big(1 - h_{\mathbf{w}}(\mathbf{x})\big)\, x_i$


The resulting update rule after solving the partial derivatives:

$w_i \leftarrow w_i + \alpha\,\big(y - h_{\mathbf{w}}(\mathbf{x})\big)\, h_{\mathbf{w}}(\mathbf{x})\,\big(1 - h_{\mathbf{w}}(\mathbf{x})\big)\, x_i$


The gradient descent algorithm is applied using this update rule.
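
A stochastic-gradient sketch of this rule (names and defaults are our own; labels are assumed to be 0 or 1):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_sgd(X, y, alpha=0.1, epochs=100):
    X = np.column_stack([np.ones(len(X)), X])   # bias column
    w = np.zeros(X.shape[1])
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            h = sigmoid(w @ X[i])
            # w <- w + alpha * (y - h) * h * (1 - h) * x
            w += alpha * (y[i] - h) * h * (1 - h) * X[i]
    return w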

Session 6 - The Perceptron

Linear Classifiers

Gradient Descent Algorithm
Gradient Descent Algorithm - Jacopo Bertolotti, CC0, via Wikimedia Commons.

Gradient Descent for Logistic Regression

The gradient of the loss function is:

$\nabla_{\mathbf{w}} \text{Loss}(\mathbf{w}) = -2\,\big(y - h_{\mathbf{w}}(\mathbf{x})\big)\, h_{\mathbf{w}}(\mathbf{x})\,\big(1 - h_{\mathbf{w}}(\mathbf{x})\big)\, \mathbf{x}$


Update rule for $\mathbf{w}$:

$\mathbf{w} \leftarrow \mathbf{w} + \alpha\,\big(y - h_{\mathbf{w}}(\mathbf{x})\big)\, h_{\mathbf{w}}(\mathbf{x})\,\big(1 - h_{\mathbf{w}}(\mathbf{x})\big)\, \mathbf{x}$


Where $\alpha$ is the learning rate.



Session 6 - The Perceptron

The Perceptron

Session 6 - The Perceptron

The Perceptron

The perceptron is one of the earliest and simplest artificial neural network models for binary classification.

Session 6 - The Perceptron

The Perceptron

Timeline of AI milestones, 1940-2030: from the artificial neuron (McCulloch & Pitts, 1943) and Rosenblatt's Perceptrons (1962), through the two AI winters, to deep learning and today's large language models.
Session 6 - The Perceptron

The Perceptron

Rosenblatt
Dr. Frank Rosenblatt and the Mark I - National Museum of the U.S. Navy, Public domain, via Wikimedia Commons.

The perceptron was simulated by Frank Rosenblatt in 1957 on an IBM 704 machine as a model for biological neural networks. The underlying artificial neuron model was introduced in 1943 by McCulloch & Pitts.


In 1962, Rosenblatt published "Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms".


The perceptron machine was named the Mark I: special-purpose hardware that implemented perceptron supervised learning for image recognition.

Session 6 - The Perceptron

The Perceptron

Mark I
Mark I Perceptron - John C. Hay, Albert E. Murray, Public domain, via Wikimedia Commons

The perceptron machine, the Mark I, was special-purpose hardware that implemented perceptron supervised learning for image recognition. Its architecture consisted of:


  • Input layer: An array of 400 photocells (20x20 grid) named "sensory units" or "input retina".
  • Hidden Layer: 512 perceptrons named "association units" or "A-units".
  • Output Layer: 8 perceptrons named "response units" or "R-units".
Session 6 - The Perceptron

The Perceptron

Mark I
Mark I - (Bishop, 2006).
Session 6 - The Perceptron

The Perceptron

Biological Brain vs Perceptron - Rosenblatt, F., Public domain, via Wikimedia Commons
Biological Brain vs Perceptron
Session 6 - The Perceptron

The Perceptron

Perceptron Architecture
Perceptron Architecture.

The perceptron is a linear classifier model (i.e., a linear discriminant), with hypothesis space defined by all functions of the form:

$h_{\mathbf{w}}(\mathbf{x}) = f\big(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})\big)$


The hypothesis is similar to the linear classifier previously defined; $f$ is given by a step function of the form:

$f(a) = +1 \text{ if } a \geq 0, \quad f(a) = -1 \text{ otherwise}$


We want to find $\mathbf{w}^*$:

$\mathbf{w}^* = \arg\min_{\mathbf{w}} \text{Loss}(\mathbf{w})$

Session 6 - The Perceptron

The Perceptron

Perceptron Step Function
Perceptron Step Function.

The perceptron step function is not differentiable and its gradient is zero almost everywhere, so we cannot minimise the classification error directly.


We need to derive the perceptron criterion. We are seeking a parameter vector $\mathbf{w}$ such that features $\boldsymbol{\phi}(\mathbf{x}_n)$ in class $C_1$ ($t_n = +1$) satisfy:

$\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n) > 0$

And features $\boldsymbol{\phi}(\mathbf{x}_n)$ in class $C_2$ ($t_n = -1$) satisfy:

$\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n) < 0$


Using the target coding $t_n \in \{+1, -1\}$, all correctly classified features will satisfy:

$\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n)\, t_n > 0$


The loss function (the perceptron criterion) is:

$\text{Loss}(\mathbf{w}) = -\sum_{n \in \mathcal{M}} \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n)\, t_n$

where $\mathcal{M}$ is the set of misclassified examples.

Perceptron Loss Function
Perceptron Loss Function.

Session 6 - The Perceptron

The Perceptron


The update rule for a misclassified input is:

$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \alpha\, \boldsymbol{\phi}(\mathbf{x}_n)\, t_n$

where $\alpha$ is the learning rate and $\tau$ indexes the update step.

Session 6 - The Perceptron

The Perceptron


The perceptron learning algorithm is similar to stochastic gradient descent.



Initialise weights w randomly
repeat
    for each training example (x, y)
        Compute prediction: y_pred = f(w·Φ(x))
        if y_pred ≠ y then
            Update weights: w = w + αΦ(x)y
until no misclassifications or max iterations
Session 6 - The Perceptron

The Perceptron

import numpy as np

class Perceptron:
    def __init__(self, learning_rate=0.01, max_iterations=1000):
        self.learning_rate = learning_rate
        self.max_iterations = max_iterations
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        # Train on (X, y) with labels in {0, 1} using the perceptron update rule.
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0
        for _ in range(self.max_iterations):
            misclassified = 0
            for i in range(n_samples):
                prediction = self.predict(X[i])
                if prediction != y[i]:
                    # Move the decision boundary towards the misclassified example.
                    self.weights += self.learning_rate * (y[i] - prediction) * X[i]
                    self.bias += self.learning_rate * (y[i] - prediction)
                    misclassified += 1
            if misclassified == 0:
                # Converged: every training example is classified correctly.
                break

    def predict(self, x):
        # Step function applied to the linear combination of inputs.
        return 1 if np.dot(self.weights, x) + self.bias >= 0 else 0

Python Implementation
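
A minimal usage sketch of the class above (the AND dataset is our own illustrative example):

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])   # logical AND is linearly separable

model = Perceptron(learning_rate=0.1, max_iterations=100)
model.fit(X, y)
print([model.predict(x) for x in X])   # expected: [0, 0, 0, 1]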

Session 6 - The Perceptron

The Perceptron

XOR Problem
XOR Problem - Andreas Maier, CC BY 4.0 , via Wikimedia Commons

XOR Problem: $\text{XOR}(0,0)=0$, $\text{XOR}(0,1)=1$, $\text{XOR}(1,0)=1$, $\text{XOR}(1,1)=0$


The perceptron cannot learn the XOR function because it is not linearly separable (Minsky & Papert, 1969). A short demonstration follows the list below.


Other limitations:

  • Only binary classification
  • No probabilistic output
  • The step function is not differentiable
  • May oscillate for non-separable data
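
A short demonstration (our own check, reusing the Perceptron class above) that the perceptron fails on XOR:

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])   # XOR: not linearly separable

model = Perceptron(learning_rate=0.1, max_iterations=1000)
model.fit(X, y)
print([model.predict(x) for x in X])   # never matches [0, 1, 1, 0]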
Session 6 - The Perceptron

The Perceptron

Timeline of AI milestones, 1940-2030, revisited: the perceptron (1962) and the Perceptrons book (Minsky & Papert, 1969) sit at the start of a lineage that leads, through the AI winters, to deep learning and today's large language models.
Session 6 - The Perceptron

The Perceptron

Perceptron Overview
Perceptron Overview - Andreas Maier, CC BY 4.0 , via Wikimedia Commons
Session 6 - The Perceptron

Conclusions

Overview

  • Multivariate Linear Regression
  • Probabilistic Interpretation of Linear Models
  • Linear Basis Function Models
  • Cross Validation
  • Linear Classifiers
  • The Perceptron

Next Time

  • Neural Networks
  • Deep Learning
Session 6 - The Perceptron

Many Thanks!

chc79@cam.ac.uk
