After assessing the data (i.e., the data assessment step), we use the data to address the problem in question. This process includes implementing a Machine Learning algorithm that creates a Machine Learning model.
A Machine Learning algorithm is a set of instructions that are used to train a machine learning model. It defines how the model learns from data and makes predictions or decisions. Linear regression, decision trees, and neural networks are examples of machine learning algorithms.
A Machine Learning model is a program that is trained on a dataset and used to make predictions or decisions. The goal is to create a trained model that can generalise well to new, unseen data. For example, a trained model could predict house prices based on new input features.
A Machine Learning algorithm uses a training process that goes from a specific set of observations to a general rule (i.e., induction). This process adjusts the Machine Learning model's internal parameters to minimise prediction errors. For example, in a linear regression model, the algorithm adjusts the slope and intercept to minimise the difference between predicted and actual values.
The parameters of Machine Learning models are adjusted according to the training data (i.e., seen data). For example, when training a model to predict house prices, the training data would include features like square footage, number of bedrooms, and location, along with their actual sale prices. The training data consists of vectors of attribute values.
In classification problems, the prediction (i.e., the model's output) is one of a finite set of values (e.g., sunny/cloudy/rainy or true/false). In regression problems, the model's output is a number.
Predictions can deviate from the expected values. Prediction errors are quantified by a loss function that indicates to the algorithm how far the prediction is from the target value. This difference is used to update the machine learning model's internal parameters to minimise the prediction errors. Common examples include Mean Squared Error for regression problems and Cross-Entropy for classification problems.
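The two loss functions named above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function names, the binary (two-class) form of cross-entropy, and the sample numbers are assumptions made for the example.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average negative log-probability given to the true class (labels in {0, 1})."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities so that log(0) never occurs.
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)))

# Hypothetical house prices (in thousands): both predictions are off by 10.
mse = mean_squared_error([200.0, 310.0], [210.0, 300.0])

# Hypothetical class probabilities: confident-and-right vs confident-and-wrong.
bce_good = binary_cross_entropy([1, 0], [0.9, 0.1])
bce_bad = binary_cross_entropy([1, 0], [0.1, 0.9])
print(mse, bce_good, bce_bad)
```

The cross-entropy grows quickly when the model is confident and wrong, which is why it is a common choice for classification.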
Three types of feedback can be part of the inputs in our training process:
Supervised learning: the agent observes input-output pairs and learns a function that maps from input to output (label).
Given a training set of N example input-output pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$
Each pair was generated by an unknown function f.
The goal is to discover a function h that approximates the true function f.
Loss functions quantify the difference between predicted values and expected values (i.e., ground truth).
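Absolute value of the difference: $L_1(y, \hat{y}) = |y - \hat{y}|$
Squared difference: $L_2(y, \hat{y}) = (y - \hat{y})^2$
Zero-One loss (classification error): $L_{0/1}(y, \hat{y}) = 0$ if $y = \hat{y}$, else $1$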
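As P(x,y) is unknown, we can only estimate an empirical loss on a set of examples E of size N.
Empirical loss for a hypothesis h using loss function L: $\mathrm{EmpLoss}_{L,E}(h) = \frac{1}{N} \sum_{(x, y) \in E} L(y, h(x))$
The best hypothesis h* is the one with the minimum empirical loss: $h^* = \arg\min_{h \in H} \mathrm{EmpLoss}_{L,E}(h)$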
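As a rough sketch of these definitions, the snippet below computes the empirical loss of a few candidate hypotheses and picks the one with the smallest value. The three-element hypothesis space and the example pairs are hypothetical, chosen only to keep the arithmetic visible.

```python
def absolute_loss(y, y_hat):
    return abs(y - y_hat)

def squared_loss(y, y_hat):
    return (y - y_hat) ** 2

def zero_one_loss(y, y_hat):
    return 0 if y == y_hat else 1

def empirical_loss(h, examples, loss):
    """Average loss of hypothesis h over a list of (x, y) examples."""
    return sum(loss(y, h(x)) for x, y in examples) / len(examples)

# Hypothetical training set generated by f(x) = 2x.
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

# A tiny, hypothetical hypothesis space H.
hypotheses = {
    "h(x) = x": lambda x: x,
    "h(x) = 2x": lambda x: 2 * x,
    "h(x) = 3x": lambda x: 3 * x,
}

losses = {name: empirical_loss(h, examples, squared_loss)
          for name, h in hypotheses.items()}
best = min(losses, key=losses.get)  # hypothesis with minimum empirical loss
print(losses, best)
```

In realistic settings H is infinite, so the minimisation is done analytically or with an iterative method rather than by enumeration.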
Underfitting occurs when our hypothesis space H is too simple to capture the true function f
Even the best hypothesis h* in H will have high error because:
where epsilon is some small positive number.
Overfitting occurs when our hypothesis space H is too complex, leading to:
Low empirical loss but high generalisation loss:
The hypothesis h* memorises the training examples instead of learning the true function f.
Regularisation transforms the loss function into a cost function that penalises complexity to avoid overfitting.
The choice of complexity function depends on the hypothesis space H. For polynomials, a good example is a function that returns the sum of the squares of the coefficients.
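Given the new cost function: $\mathrm{Cost}(h) = \mathrm{EmpLoss}(h) + \lambda \, \mathrm{Complexity}(h)$
The best hypothesis h* is the one that minimises the cost: $h^* = \arg\min_{h \in H} \mathrm{Cost}(h)$
where λ is a hyperparameter that controls the trade-off between fitting the data and model complexity, serving as a conversion rate between loss and complexity.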
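A small sketch of the regularised cost for a polynomial hypothesis follows. It assumes mean squared error as the empirical loss and the sum of squared coefficients as the complexity term; the data and the coefficient values are hypothetical.

```python
import numpy as np

def empirical_loss(coeffs, x, y):
    """Mean squared error of the polynomial with the given coefficients."""
    y_hat = np.polyval(coeffs, x)  # coefficients ordered from highest degree down
    return float(np.mean((y - y_hat) ** 2))

def complexity(coeffs):
    """Sum of the squares of the polynomial coefficients."""
    return float(np.sum(np.asarray(coeffs, dtype=float) ** 2))

def cost(coeffs, x, y, lam):
    """Empirical loss plus lambda times the complexity penalty."""
    return empirical_loss(coeffs, x, y) + lam * complexity(coeffs)

# Hypothetical noisy data, roughly following y = x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.9, 2.1, 2.9])
coeffs = np.array([0.96, 0.06])  # hypothetical line: 0.96 * x + 0.06

for lam in (0.0, 0.1, 1.0):
    print(lam, cost(coeffs, x, y, lam))
```

With lam set to 0 the cost reduces to the plain empirical loss; larger values make large coefficients increasingly expensive.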
Other ways to mitigate overfitting include feature selection, hyperparameter tuning, and splitting the dataset into training, validation, and test sets (e.g., with cross-validation).
The regression problem involves predicting a continuous numerical value. Regression models approximate a function f that maps input features to a continuous output.
Linear Regression
Training Dataset
Hypothesis Space: All possible linear functions of continuous-valued inputs and outputs.
Hypothesis:
Loss Function:
Cost Function:
Analytical Solution:
Gradient Descent Algorithm:
Initialise w randomly
repeat
    for each w[i] in w
        Compute gradient: g = ∂Loss(w)/∂w[i]
        Update weight: w[i] = w[i] - α * g
until convergence
Hyperparameters: Learning rate, number of epochs, and batch size.
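The gradient descent loop above might look as follows for a line with intercept w0 and slope w1. This is a sketch under stated assumptions: mean squared error as the loss, full-batch updates, zero initialisation instead of random, and a fixed number of epochs in place of a convergence test. The data and hyperparameter values are hypothetical.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.05, epochs=2000):
    """Fit y ≈ w0 + w1 * x by full-batch gradient descent on the MSE."""
    w0, w1 = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        error = (w0 + w1 * x) - y              # prediction minus target
        grad_w0 = 2.0 / n * np.sum(error)      # dLoss/dw0
        grad_w1 = 2.0 / n * np.sum(error * x)  # dLoss/dw1
        w0 -= alpha * grad_w0                  # step against the gradient
        w1 -= alpha * grad_w1
    return w0, w1

# Hypothetical noise-free data generated by y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x
w0, w1 = gradient_descent(x, y)
print(w0, w1)
```

A learning rate that is too large makes the updates diverge, while one that is too small needs many more epochs, which is why it is treated as a hyperparameter.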
Multivariate regression extends the simple linear model to handle multiple input features. The model learns a function that maps multiple input variables to a continuous output.
In these problems, each example is an n-element vector. The hypothesis space H now includes linear functions of multiple continuous-valued inputs and a single continuous output.
We want to find the
In vector notation:
Where
The loss function for multivariate regression is:
In matrix notation:
Where
Given the Loss function:
We want to find
Taking the derivative with respect to
This leads to the normal equation: $\mathbf{w}^* = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$
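As a sketch, the normal equation can be solved numerically by treating it as the linear system XᵀX w = Xᵀy. The helper name `fit_normal_equation`, the added column of ones for the intercept, and the data are assumptions made for illustration.

```python
import numpy as np

def fit_normal_equation(X, y):
    """Least-squares weights obtained from the normal equation."""
    # Prepend a column of ones so that w[0] acts as the intercept.
    Xb = np.column_stack([np.ones(len(X)), X])
    # Solve (Xb^T Xb) w = Xb^T y instead of forming the inverse explicitly.
    return np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

# Hypothetical noise-free data generated by y = 1 + 2*x1 - x2.
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0],
              [4.0, 3.0]])
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1]

w = fit_normal_equation(X, y)
print(w)
```

Solving the system is usually preferred over computing the inverse, and it requires XᵀX to be invertible, that is, the feature columns must be linearly independent.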
Gradient Descent for Multivariate Regression
The gradient of the loss function is:
Update rule for
Where
Linear regression can be interpreted from a probabilistic perspective, where we model the uncertainty in our predictions using probability distributions.
Instead of assuming deterministic relationships, we model the regression problem with the likelihood function:
The functional relationship between
Where
We want to find
For linear regression, we have:
Where
So, given a training set of
The likelihood factorises according to:
The likelihood function tells us how likely the observed data is given the specific parameters:
We estimate
As before, a closed-form solution exists, which makes gradient descent unnecessary. We apply the log transformation to the likelihood function and minimise the negative log-likelihood.
We have that:
The loss function is defined as:
Where
Minimising the loss function is equivalent to minimising the sum of squared errors (and therefore the mean squared error, MSE).
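This equivalence can be checked numerically. The sketch below assumes Gaussian noise with a fixed standard deviation sigma, in which case the negative log-likelihood equals a constant plus the sum of squared errors divided by 2σ². The function names and data are hypothetical.

```python
import numpy as np

def sum_squared_errors(w, X, y):
    residuals = y - X @ w
    return float(residuals @ residuals)

def gaussian_nll(w, X, y, sigma=1.0):
    """Negative log-likelihood under y = Xw + noise, noise ~ N(0, sigma^2)."""
    n = len(y)
    constant = 0.5 * n * np.log(2.0 * np.pi * sigma ** 2)
    return float(constant + sum_squared_errors(w, X, y) / (2.0 * sigma ** 2))

# Hypothetical data; the first column of ones plays the role of the intercept.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.9, 3.1, 5.0, 7.2])

w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares solution
w_other = w_ls + np.array([0.1, -0.05])       # any other parameter vector

nll_ls, nll_other = gaussian_nll(w_ls, X, y), gaussian_nll(w_other, X, y)
sse_ls, sse_other = sum_squared_errors(w_ls, X, y), sum_squared_errors(w_other, X, y)
print(nll_other - nll_ls, (sse_other - sse_ls) / 2.0)
```

The two printed differences agree, so ranking parameter vectors by negative log-likelihood is the same as ranking them by squared error.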
As we did before, we compute the gradient of the Loss and equate it to zero:
Solving the derivatives as we did before:
We want to maximise the likelihood function of the training data given the model parameters.
Closed-form solution for linear regression:
Where
Linear basis function models extend linear regression by applying non-linear transformations to the input features while keeping the model linear in the parameters.
The model becomes:
Where
Polynomial basis:
Gaussian basis:
Sigmoid basis:
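The three families of basis functions can be sketched as functions that build a design matrix from a one-dimensional input. The forms used here follow the usual textbook definitions (powers of x, Gaussian bumps around centres, and logistic sigmoids around centres); the centres, the width `s`, and the data are assumptions made for the example.

```python
import numpy as np

def polynomial_basis(x, degree):
    """Columns x^0, x^1, ..., x^degree."""
    return np.column_stack([x ** j for j in range(degree + 1)])

def gaussian_basis(x, centres, s):
    """One column per centre: exp(-(x - mu)^2 / (2 s^2))."""
    return np.exp(-((x[:, None] - centres[None, :]) ** 2) / (2.0 * s ** 2))

def sigmoid_basis(x, centres, s):
    """One column per centre: logistic sigmoid of (x - mu) / s."""
    return 1.0 / (1.0 + np.exp(-(x[:, None] - centres[None, :]) / s))

# Hypothetical quadratic target: non-linear in x, but linear in the weights.
x = np.linspace(-1.0, 1.0, 20)
y = 1.0 - 2.0 * x + 3.0 * x ** 2

Phi = polynomial_basis(x, degree=2)            # design matrix, shape (20, 3)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)    # ordinary least squares on Phi
print(w)
```

Because the model stays linear in the weights, the same least-squares machinery used for plain linear regression applies once the inputs are replaced by their basis expansion.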
If we apply a probabilistic interpretation, we need to maximise the likelihood of:
After a similar process (see Chapter 3 in Bishop, 2006), the loss function becomes:
And the normal equation solution:
The design matrix
By using basis functions
Cross-validation is a technique for assessing model performance and preventing overfitting by evaluating the model on unseen data.
K-Fold Cross-validation divides the dataset into K equal parts:
for (int k = 1; k <= K; k++) {
    trainOnAllFoldsExcept(k);
    evaluateOnFold(k);
    recordPerformanceMetric(k);
}
calculateAveragePerformance(K);
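A runnable version of this loop could look like the sketch below. The function `k_fold_cv`, the `fit` and `predict` callables, the use of mean squared error as the performance metric, and the data are all assumptions made for illustration.

```python
import numpy as np

def k_fold_cv(X, y, k, fit, predict, seed=0):
    """Return the average and per-fold mean squared error over k folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))     # shuffle before splitting
    folds = np.array_split(indices, k)    # k (nearly) equal parts
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])           # train on all folds except i
        errors = predict(model, X[test_idx]) - y[test_idx]
        scores.append(float(np.mean(errors ** 2)))        # evaluate on fold i
    return float(np.mean(scores)), scores

def fit_line(X, y):
    A = np.column_stack([np.ones(len(X)), X])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def predict_line(w, X):
    return np.column_stack([np.ones(len(X)), X]) @ w

# Hypothetical noise-free linear data.
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 3.0 + 0.5 * X[:, 0]

mean_mse, fold_mse = k_fold_cv(X, y, k=5, fit=fit_line, predict=predict_line)
print(mean_mse, fold_mse)
```

Setting k equal to the number of examples turns the same function into leave-one-out cross-validation.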
Mathematically, for K-fold CV:
Where
Leave-One-Out Cross-validation (LOOCV) is a special case where K = N:
Where
Linear classifiers are models that separate data into classes using a linear decision boundary. These classifiers are functions that decide whether an input (i.e., a vector of numbers) belongs to a specific class or not.
A decision boundary is a line (or a surface in higher dimensions) that separates data into classes.
The hypothesis is the result of passing a linear function through a threshold function:
The optimal hypothesis is the one that minimises the loss function:
Where
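Partial derivatives and gradient methods do not work here, because the threshold function is discontinuous at zero and has zero gradient everywhere else. Instead, we apply the perceptron learning rule.
Perceptron update rule: $w_i \leftarrow w_i + \alpha \, (y - h_{\mathbf{w}}(\mathbf{x})) \, x_i$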
The rule is applied one example at a time, choosing examples at random (as in stochastic gradient descent).
An alternative to the discrete threshold function is the logistic function, which is smooth and differentiable. The process of finding the optimal hypothesis is called logistic regression.
Where
Following the gradient descent algorithm, the update rule is:
The loss function is represented as a composition of functions:
We need to differentiate the loss function using the chain rule to compute the gradients:
The resulting update rule after solving the partial derivatives:
The gradient descent algorithm is applied using this update rule.
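A minimal sketch of logistic regression trained by gradient descent is shown below. It assumes the cross-entropy loss, for which the gradient with respect to the weights is the average of (h − y)·x; with a squared-error loss, the chain rule adds an extra factor h(1 − h) to each term. The function names, hyperparameters, and data are hypothetical.

```python
import numpy as np

def logistic(z):
    """Smooth, differentiable replacement for the hard threshold."""
    return 1.0 / (1.0 + np.exp(-z))

def add_bias(X):
    return np.column_stack([np.ones(len(X)), X])

def train_logistic(X, y, alpha=0.3, epochs=5000):
    """Full-batch gradient descent on the average cross-entropy loss."""
    Xb = add_bias(X)
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        h = logistic(Xb @ w)                 # predicted probabilities
        gradient = Xb.T @ (h - y) / len(y)   # gradient of the cross-entropy
        w -= alpha * gradient                # step against the gradient
    return w

def predict(w, X):
    """Label 1 when the predicted probability is at least 0.5."""
    return (logistic(add_bias(X) @ w) >= 0.5).astype(int)

# Hypothetical one-dimensional data: class 1 for the larger inputs.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

w = train_logistic(X, y)
print(w, predict(w, X))
```

The decision boundary is the input value where the predicted probability equals 0.5, which for this data ends up between 2 and 3.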
Gradient Descent for Logistic Regression
The gradient of the loss function is:
Update rule for
Where
The perceptron is one of the earliest and simplest artificial neural network models for binary classification.
The perceptron was simulated by Frank Rosenblatt in 1957 on an IBM 704 machine as a model for biological neural networks. The first mathematical model of an artificial neuron had been proposed in 1943 by McCulloch and Pitts.
In 1962, Rosenblatt published "Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms".
The perceptron machine was named the Mark I. It was special-purpose hardware that implemented perceptron supervised learning for image recognition.
The perceptron is a linear classifier model (i.e., linear discriminant), with hypothesis space defined by all the functions of the form:
The function
We want to find
We need to derive the perceptron criterion. We know we are seeking a parameter vector
And for features in
Using
The loss function is:
The update rule for a misclassified input is:
The perceptron learning algorithm is similar to stochastic gradient descent.
Initialise weights w randomly
repeat
    for each training example (x, y)
        Compute prediction: y_pred = f(w·Φ(x))
        if y_pred ≠ y then
            Update weights: w = w + αΦ(x)y
until no misclassifications or max iterations
Python Implementation
import numpy as np


class Perceptron:
    """Binary perceptron for labels in {0, 1} with a step activation."""

    def __init__(self, learning_rate=0.01, max_iterations=1000):
        self.learning_rate = learning_rate
        self.max_iterations = max_iterations
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0.0

        for _ in range(self.max_iterations):
            misclassified = 0
            for i in range(n_samples):
                prediction = self.predict(X[i])
                if prediction != y[i]:
                    # (y - prediction) is +1 or -1 and sets the update direction.
                    update = self.learning_rate * (y[i] - prediction)
                    self.weights += update * X[i]
                    self.bias += update
                    misclassified += 1
            # Stop early once a full pass makes no mistakes.
            if misclassified == 0:
                break

    def predict(self, x):
        # Classify a single sample x (a 1-D array of features).
        return 1 if np.dot(self.weights, x) + self.bias >= 0 else 0
XOR Problem:
The perceptron cannot learn the XOR function because it's not linearly separable.
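The limitation can be observed with a short experiment. The sketch below trains a small, self-contained perceptron (a hypothetical helper written for this example, using 0/1 labels and a learning rate of 1) on the AND and XOR truth tables and compares the training accuracy.

```python
import numpy as np

def train_perceptron(X, y, alpha=1.0, max_iterations=1000):
    """Perceptron learning rule for labels in {0, 1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_iterations):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            prediction = 1 if w @ x_i + b >= 0 else 0
            if prediction != y_i:
                w += alpha * (y_i - prediction) * x_i
                b += alpha * (y_i - prediction)
                mistakes += 1
        if mistakes == 0:   # converged: every example classified correctly
            break
    return w, b

def accuracy(w, b, X, y):
    predictions = (X @ w + b >= 0).astype(int)
    return float(np.mean(predictions == y))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1])   # linearly separable
y_xor = np.array([0, 1, 1, 0])   # not linearly separable

acc_and = accuracy(*train_perceptron(X, y_and), X, y_and)
acc_xor = accuracy(*train_perceptron(X, y_xor), X, y_xor)
print(acc_and, acc_xor)
```

On AND the loop stops after a few passes with every example classified correctly. On XOR it runs until `max_iterations`, since no single line can separate the two classes, so at least one of the four points is always misclassified.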
Other limitations: