h is called a hypothesis. We search for h in a hypothesis space H of possible functions.
A consistent hypothesis maps each input to the corresponding ground truth.
We cannot expect an exact match to the ground truth. We look for a best-fit function that generalises well.
A hypothesis h generalises well if it accurately predicts the outputs of unseen inputs (i.e., the test set).
Our goal is to select a hypothesis h that will optimally fit future examples. We assume that future examples will be like past examples (i.e., the stationarity assumption).
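Each example $E_j$ has the same prior probability distribution: $P(E_j) = P(E_{j+1}) = P(E_{j+2}) = \cdots$
Each example is independent of previous examples: $P(E_j) = P(E_j \mid E_{j-1}, E_{j-2}, \ldots)$
Examples that satisfy these equations are independent and identically distributed (i.e., i.i.d.).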
Our goal is to select a hypothesis h that will optimally fit future examples. h is optimally fit if it minimises the error rate
The error rate is the proportion of times that h produces the wrong output for an example
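The error rate defined above can be computed directly from a set of labelled examples. The sketch below is illustrative only: the hypothesis `h` and the example list are made up for the demonstration and do not come from the slides.

```python
def error_rate(h, examples):
    """Proportion of (x, y) examples for which h(x) differs from y."""
    wrong = sum(1 for x, y in examples if h(x) != y)
    return wrong / len(examples)


def h(x):
    """Hypothetical hypothesis: label a number 1 if it is positive, else 0."""
    return 1 if x > 0 else 0


# Hypothetical labelled examples; h disagrees with the last label only.
examples = [(-2, 0), (-1, 0), (1, 1), (3, 0)]

rate = error_rate(h, examples)
print("error rate:", rate)
```

In practice the rate that matters is the one measured on examples that were not used for training.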
Finding a good hypothesis means choosing a good hypothesis space H and then finding the best hypothesis h within it during training.
Loss functions quantify the difference between predicted values and expected values (i.e., the ground truth)
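Absolute value of the difference, for expected value $y$ and predicted value $\hat{y}$: $L_1(y, \hat{y}) = |y - \hat{y}|$
Squared difference: $L_2(y, \hat{y}) = (y - \hat{y})^2$
Zero-One loss (classification error): $L_{0/1}(y, \hat{y}) = 0$ if $y = \hat{y}$, else $1$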
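The three losses listed above can be written as small functions and averaged over a dataset. This is a minimal sketch using the standard textbook definitions; the function names and the sample values are assumptions chosen for illustration.

```python
def absolute_loss(y, y_hat):
    """Absolute value of the difference."""
    return abs(y - y_hat)


def squared_loss(y, y_hat):
    """Squared difference."""
    return (y - y_hat) ** 2


def zero_one_loss(y, y_hat):
    """0 if the prediction is exactly right, 1 otherwise."""
    return 0 if y == y_hat else 1


def empirical_loss(loss, ys, y_hats):
    """Average of a per-example loss over paired true and predicted values."""
    return sum(loss(y, p) for y, p in zip(ys, y_hats)) / len(ys)


# Hypothetical expected and predicted values.
ys = [3.0, 5.0, 2.0]
y_hats = [2.0, 5.0, 4.0]

mean_abs = empirical_loss(absolute_loss, ys, y_hats)
mean_sq = empirical_loss(squared_loss, ys, y_hats)
mean_01 = empirical_loss(zero_one_loss, ys, y_hats)
print(mean_abs, mean_sq, mean_01)
```

The squared loss punishes the single large miss more heavily than the absolute loss does, which is why the two averages differ on the same predictions.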
Underfitting occurs when our hypothesis space H is too simple to capture the true function f
Even the best hypothesis h* in H will have high error because:
where epsilon is some small positive number.
Overfitting occurs when our hypothesis space H is too complex, leading to:
Low empirical loss but high generalisation loss:
The hypothesis h* memorises the training examples instead of learning the true function f.
Regularisation transforms the loss function into a cost function that penalises complexity to avoid overfitting.
The choice of complexity function depends on the hypothesis space H. For polynomials, a good choice is the sum of the squares of the coefficients.
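Given the new cost function $Cost(h) = EmpLoss(h) + \lambda \, Complexity(h)$,
the best hypothesis h* is the one that minimises the cost: $h^* = \arg\min_{h \in H} Cost(h)$,
where λ is a hyperparameter that controls the trade-off between fitting the data and model complexity; it serves as a conversion rate between loss and complexity.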
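To make the trade-off concrete, here is a small sketch for polynomial hypotheses. It assumes the cost is the mean squared loss plus λ times the sum of squared coefficients; the data and both coefficient lists are hypothetical.

```python
def predict(coeffs, x):
    """Evaluate coeffs[0] + coeffs[1]*x + coeffs[2]*x**2 + ..."""
    return sum(c * x ** i for i, c in enumerate(coeffs))


def empirical_loss(coeffs, data):
    """Mean squared loss over (x, y) pairs."""
    return sum((y - predict(coeffs, x)) ** 2 for x, y in data) / len(data)


def complexity(coeffs):
    """Sum of the squares of the coefficients."""
    return sum(c ** 2 for c in coeffs)


def cost(coeffs, data, lam):
    """Empirical loss plus lam times the complexity penalty."""
    return empirical_loss(coeffs, data) + lam * complexity(coeffs)


# Hypothetical training data generated by y = 1 + 2x.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]

# Two hypotheses that both pass through every training point.
simple = [1.0, 2.0]                # y = 1 + 2x
wiggly = [1.0, 12.0, -15.0, 5.0]   # y = 1 + 2x + 5x(x - 1)(x - 2)

lam = 0.1
print("simple:", empirical_loss(simple, data), cost(simple, data, lam))
print("wiggly:", empirical_loss(wiggly, data), cost(wiggly, data, lam))
```

Both hypotheses have zero empirical loss on these points, so only the penalty term separates them, and the cubic is charged heavily for its large coefficients.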
Other ways to mitigate overfitting include feature selection, hyperparameter tuning, and splitting the dataset into training, validation, and test sets (e.g., with cross-validation).
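One common form of cross-validation is k-fold: the data is cut into k folds, and each fold takes a turn as the validation set while the rest is used for training. The helper below is a hypothetical pure-Python sketch of the index bookkeeping only; it assumes the data has already been shuffled.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds over range(n)."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        val_idx = list(range(start, stop))
        train_idx = [i for i in range(n) if i < start or i >= stop]
        yield train_idx, val_idx
        start = stop


folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "val:", val_idx)
```

A model would be trained once per fold and the k validation scores averaged, which gives a steadier estimate than a single split.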
The regression problem involves predicting a continuous numerical value. Regression models approximate a function f that maps input features to a continuous output.
Linear Regression
Training Dataset
Hypothesis Space: All possible linear functions of continuous-valued inputs and outputs.
Hypothesis:
Loss Function:
Cost Function:
Linear Regression
Analytical Solution:
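For a single input feature, the least-squares solution has a simple closed form; with several features the same idea is usually written as the normal equation $w^* = (X^\top X)^{-1} X^\top y$. The sketch below assumes squared loss with no regularisation, and the data points are hypothetical.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = w0 + w1 * x for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    w0 = mean_y - w1 * mean_x
    return w0, w1


# Hypothetical data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w0, w1 = fit_line(xs, ys)
print("w0 =", w0, "w1 =", w1)
```

No learning rate or iteration count is needed here, which is the main attraction of the analytical route when the dataset is small enough.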
Gradient Descent Algorithm:
Initialize w randomly
repeat
    for each w[i] in w
        Compute gradient: g[i] = ∂Loss(w) / ∂w[i]
    for each w[i] in w
        Update weight: w[i] = w[i] - α * g[i]
until convergence
Hyperparameters: Learning rate, number of epochs, and batch size.
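The gradient descent loop can be sketched in runnable form for a one-feature linear model. This version makes several assumptions not stated on the slides: mean squared loss, full-batch updates (every example is used in each step), weights starting at zero rather than random values so the run is reproducible, and a fixed number of epochs in place of a convergence test.

```python
def gradient_descent(xs, ys, alpha=0.05, epochs=2000):
    """Fit y = w0 + w1 * x by full-batch gradient descent on mean squared loss."""
    w0, w1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every example under the current weights.
        errors = [(w0 + w1 * x) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared loss.
        g0 = 2.0 * sum(errors) / n
        g1 = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        # Both gradients are computed before either weight is updated.
        w0 -= alpha * g0
        w1 -= alpha * g1
    return w0, w1


# Hypothetical data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w0, w1 = gradient_descent(xs, ys)
print("w0 =", round(w0, 4), "w1 =", round(w1, 4))
```

A learning rate that is too large makes the updates diverge, and one that is too small needs many more epochs, which is why both appear in the hyperparameter list.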
Linear classifiers are models that separate data into classes using a linear decision boundary. These classifiers are functions that decide whether an input (i.e., a vector of numbers) belongs to a specific class or not.
A decision boundary is a line (or a surface in higher dimensions) that separates data into classes.
The hypothesis is the result of passing a linear function through a threshold function:
The optimal hypothesis is the one that minimises the loss function:
Where
Gradient methods do not work here: the threshold function is discontinuous at 0 and has zero gradient everywhere else. Instead, we apply the perceptron learning rule.
Perceptron update rule:
The rule is applied one example at a time, choosing examples at random (as in stochastic gradient descent).
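A minimal sketch of perceptron training follows. It assumes the usual form of the rule, $w_i \leftarrow w_i + \alpha \, (y - h_w(x)) \, x_i$, with a bias weight stored in `w[0]`; the learning rate, step count, seed, and the AND dataset are all illustrative choices.

```python
import random


def threshold(z):
    """Hard threshold: 1 if z is non-negative, else 0."""
    return 1 if z >= 0 else 0


def predict(w, x):
    """w[0] is the bias weight; w[1:] pair up with the input features."""
    return threshold(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))


def train_perceptron(examples, alpha=0.1, steps=1000, seed=0):
    """Apply the perceptron rule to one randomly chosen example per step."""
    rng = random.Random(seed)
    w = [0.0] * (len(examples[0][0]) + 1)
    for _ in range(steps):
        x, y = rng.choice(examples)
        error = y - predict(w, x)  # 0 when correct, +1 or -1 when wrong
        w[0] += alpha * error
        for i, xi in enumerate(x):
            w[i + 1] += alpha * error * xi
    return w


# Hypothetical linearly separable task: logical AND.
examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

w = train_perceptron(examples)
print("weights:", w)
print("predictions:", [predict(w, x) for x, _ in examples])
```

The weights change only when an example is misclassified, so on linearly separable data the updates eventually stop.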
An alternative to the discrete threshold function is the logistic function, which is smooth and differentiable. The process of finding the optimal hypothesis is called logistic regression.
Where
Gradient Descent for Logistic Regression
The gradient of the loss function is:
Update rule for
Where
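Gradient descent for logistic regression can be sketched as follows. The code assumes the log loss (cross-entropy), whose gradient with respect to each weight is the average of $(h_w(x) - y) \, x_i$; if a different loss is intended, the gradient line changes accordingly. The dataset and hyperparameter values are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, x):
    """w[0] is the bias weight; w[1:] pair up with the input features."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))


def train_logistic(examples, alpha=0.5, epochs=2000):
    """Full-batch gradient descent on the mean log loss (an assumption)."""
    w = [0.0] * (len(examples[0][0]) + 1)
    n = len(examples)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in examples:
            err = predict_proba(w, x) - y
            grad[0] += err / n
            for i, xi in enumerate(x):
                grad[i + 1] += err * xi / n
        w = [wi - alpha * g for wi, g in zip(w, grad)]
    return w


# Hypothetical one-feature task: small values are class 0, large are class 1.
examples = [((0.0,), 0), ((1.0,), 0), ((3.0,), 1), ((4.0,), 1)]

w = train_logistic(examples)
labels = [int(predict_proba(w, x) >= 0.5) for x, _ in examples]
print("weights:", w)
print("labels:", labels)
```

Unlike the hard threshold, the output here is a probability, so the same model also reports how confident it is in each label.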
Neural networks extend the perceptron by introducing multiple layers of interconnected neurons (i.e., multi-layer perceptron), enabling the learning of complex, non-linear relationships in data.
Why go beyond linear?
Architecture (feedforward)
Information flows forward: inputs → hidden units → output.
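The forward flow can be illustrated with a tiny network of two inputs, two hidden units, and one output. The weights below are hand-picked, not learned, so that the network computes XOR, a function that no single linear decision boundary can separate; all names and values are assumptions made for the demonstration.

```python
import math


def sigmoid(z):
    """Logistic activation."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    """One fully connected layer: a weighted sum per unit, then sigmoid."""
    return [sigmoid(b + sum(w * x for w, x in zip(row, inputs)))
            for row, b in zip(weights, biases)]


# Hand-picked weights: hidden unit 1 acts like OR, hidden unit 2 like NAND,
# and the output unit acts like AND of the two hidden units.
W_HIDDEN = [[20.0, 20.0], [-20.0, -20.0]]
B_HIDDEN = [-10.0, 30.0]
W_OUT = [[20.0, 20.0]]
B_OUT = [-30.0]


def forward(x):
    """inputs -> hidden units -> output."""
    hidden = layer(x, W_HIDDEN, B_HIDDEN)
    return layer(hidden, W_OUT, B_OUT)[0]


inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [round(forward(x)) for x in inputs]
print(outputs)
```

Training replaces the hand-picked numbers with values found by gradient descent, but the forward computation stays the same.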
Training a neural network
Hyperparameters: hidden size, learning rate, max iterations.
Precision vs explainability
No model removes uncertainty: there is no Laplace's demon. We report what the model supports and what it cannot support.
Train, validate, test
from sklearn.model_selection import train_test_split

# hold out the test set first (20% of all data)
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

# split the rest into train and validation
# (0.25 of the remaining 80% = 20% of all data, giving a 60/20/20 split)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)
Start simple: a linear model
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
linear = LinearRegression().fit(X_train, y_train)
pred = linear.predict(X_val)
print("R2 :", r2_score(y_val, pred))
print("MAE:", mean_absolute_error(y_val, pred))
More flexible: a small neural network
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor

# fit the scaler on the training data only, so no validation statistics leak in
scaler = StandardScaler().fit(X_train)
mlp = MLPRegressor(hidden_layer_sizes=(8,),
                   max_iter=500, random_state=42)
mlp.fit(scaler.transform(X_train), y_train)
pred = mlp.predict(scaler.transform(X_val))
Choose, then judge once
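# choose on validation, judge once on test
mae_linear = mean_absolute_error(y_val, linear.predict(X_val))
mae_mlp = mean_absolute_error(y_val, mlp.predict(scaler.transform(X_val)))
if mae_linear <= mae_mlp:
    model, name, X_test_ready = linear, "linear", X_test
else:
    # the MLP was trained on scaled inputs, so the test set is scaled too
    model, name, X_test_ready = mlp, "mlp", scaler.transform(X_test)
print("chosen:", name)
print("test R2:", r2_score(y_test, model.predict(X_test_ready)))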
Deliverables