Session 5 - Learning from Data

Learning from Data

Christian Cabrera Jojoa

Senior Research Associate and Affiliated Lecturer

Department of Computer Science and Technology

University of Cambridge

chc79@cam.ac.uk

Last Time

The Data Science Process

Data Science Process

Data Quality

Data quality refers to the state of data in terms of its fitness for a purpose.

This is a multi-dimensional concept:

  • Accuracy
  • Completeness
  • Uniqueness
  • Consistency
  • Timeliness
  • Validity

Common data quality issues include:

1. Missing Values
  • Incomplete records
  • Null or NaN values
  • Empty fields
2. Inconsistencies
  • Format variations
  • Unit mismatches
  • Naming conventions
3. Outliers
  • Extreme values
  • Measurement errors
  • Data entry mistakes
4. Noise
  • Random variations
  • Measurement errors
  • Background interference
5. Bias
  • Sampling bias
  • Selection bias
  • Measurement bias
  • ...


Data Usability

Poor data quality can lead to:

  • Inaccurate model predictions and reduced reliability
  • Biased results
  • Reduced generalisability
  • Increased training time
  • Higher costs
  • Intellectual debt
  • ...

Data Assess

After collecting the data (i.e., data access), we need to perform a data assessment process to understand the data, identify and mitigate data quality issues, uncover patterns, and gain insights.

Data Assess Pipeline

ML Pipelines vs ML-based Systems

Data Assess Pipeline
ML-Based System

Data Cleaning

Process of detecting and correcting (or removing) corrupt or inaccurate records.

Missing Datapoints:

  • Removing rows or columns with missing values
  • Mean imputation
  • Median imputation
  • Regression imputation
  • ...

Outliers:

  • Identify outliers
  • Capping
  • Log transformation
  • ...
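A minimal sketch of some of these cleaning steps with pandas; the DataFrame df and its "price" column are hypothetical:

import numpy as np
import pandas as pd

# Hypothetical data with a missing value and an outlier.
df = pd.DataFrame({"price": [100.0, 120.0, np.nan, 95.0, 4000.0]})

# Mean and median imputation for missing values
# (the median is more robust to outliers).
df["price_mean"] = df["price"].fillna(df["price"].mean())
df["price_median"] = df["price"].fillna(df["price"].median())

# Capping: clip extreme values to the 1st/99th percentiles.
low, high = df["price"].quantile([0.01, 0.99])
df["price_capped"] = df["price"].clip(lower=low, upper=high)

# Log transformation to reduce the influence of extreme values.
df["price_log"] = np.log1p(df["price"])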

Data Preprocessing

Process of transforming raw data into a format suitable for machine learning while ensuring data quality and consistency.

Feature Scaling:

  • Standardisation (Z-score)
  • Min-Max Scaling
  • ...
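A minimal sketch of both scalers using scikit-learn; the matrix X is a toy example:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

# Standardisation (Z-score): zero mean and unit variance per feature.
X_std = StandardScaler().fit_transform(X)

# Min-Max scaling: each feature rescaled to the [0, 1] range.
X_minmax = MinMaxScaler().fit_transform(X)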

Data Augmentation

Process of increasing the size and diversity of our datasets.

Increasing Diversity:

  • Adding controlled noise
  • Random rotations
  • ...

Increasing Size:

  • Interpolating between datapoints
  • Domain-specific transformations
  • GANs
  • ...
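A minimal sketch of two of these ideas on numeric data (toy arrays; image pipelines would use rotations and other domain-specific transforms instead):

import numpy as np

rng = np.random.default_rng(42)
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Increasing diversity: add controlled Gaussian noise.
X_noisy = X + rng.normal(loc=0.0, scale=0.1, size=X.shape)

# Increasing size: interpolate between consecutive datapoints.
lam = rng.uniform(0.0, 1.0)
X_interp = lam * X[:-1] + (1.0 - lam) * X[1:]

# The augmented dataset combines original and synthetic rows.
X_aug = np.vstack([X, X_noisy, X_interp])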

Feature Engineering

Process of creating, transforming, and selecting features in our data, combining domain knowledge and creativity.

Creating New Features:

  • Extracting information from existing features
  • Combining existing features
  • Adding domain knowledge
  • ...

Selecting Features:

  • Correlation matrix
  • Decision trees
  • Principal Component Analysis (PCA)
  • ...
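A minimal sketch combining feature creation and selection; the housing DataFrame is hypothetical:

import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame({
    "sqft": [1000, 1500, 2000, 1200],
    "bedrooms": [2, 3, 4, 2],
    "price": [200000, 290000, 410000, 230000],
})

# Creating a new feature by combining existing ones.
df["sqft_per_bedroom"] = df["sqft"] / df["bedrooms"]

# Correlation matrix to spot redundant or predictive features.
print(df.corr())

# PCA projects the features onto their main directions of variance.
components = PCA(n_components=2).fit_transform(df[["sqft", "bedrooms", "sqft_per_bedroom"]])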

Last Time

How can we guarantee the validity of the data resulting from the assess stage?

This is again a data assess process where we should use:

  • Our understanding of the domain
  • Quality checks (i.e., manual and automated)
  • Visualisation techniques (e.g., PCA)
  • Metrics comparison (e.g., means, variance, etc.)
  • ...
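For instance, a quick metrics comparison between the original and assessed data might look like this (a sketch with hypothetical numeric columns):

import numpy as np

# Hypothetical numeric column before and after the assess stage.
original = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
assessed = np.array([1.0, 2.0, 3.0, 4.0, 4.5])  # outlier capped

print("means:    ", original.mean(), assessed.mean())
print("variances:", original.var(), assessed.var())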

Statistical Tests

Statistical tests are methods used to analyse data. They help determine whether the results of a study are due to chance or whether they reflect a real effect or relationship. These methods enable us to determine the significance of differences or relationships between variables or groups.

One idea is to test our data assumptions. For example, if we assume our data is normally distributed, we can perform normality tests.

Normal Probability Plot - Visnut, CC BY 3.0, via Wikimedia Commons

Some of these tests are the Shapiro-Wilk test, the Kolmogorov-Smirnov test, and the Anderson-Darling test.

Selecting the test depends on our purpose, the nature of our data, and the assumptions of each test.
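A minimal sketch of a normality check with SciPy; the sample here is synthetic Gaussian data:

import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=200)

# Shapiro-Wilk test: H0 is that the sample comes from a normal distribution.
stat, p_value = shapiro(data)
print(f"Shapiro-Wilk p-value: {p_value:.3f}")
# A low p-value (e.g., < 0.05) is evidence against normality.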

Another idea is to use statistical tests to compare the assessed dataset and the original dataset.

We can formulate the following hypotheses:

H0 (Null Hypothesis): The assessed dataset has no significant differences from the original.

H1: The assessed dataset has significant differences from the original.

Ideally, the tests should fail to reject H0.


Given the hypotheses set, the following statistical tests can be applied:

  • T-test: To compare the means of the assessed dataset and the original dataset.
  • ANOVA: To compare the means of multiple groups within the datasets.
  • Chi-Square Test: To determine if there is a significant association between categorical variables.
  • Wilcoxon Rank-Sum Test: To compare the distributions of two independent samples.
  • ...

Selecting the test depends on our purpose, the nature of our data, and the assumptions of each test.


For example, a Kolmogorov-Smirnov two-sample test compares the distributions of the original and assessed data:

from scipy.stats import ks_2samp
stat, p_value = ks_2samp(y_orig, y_aug)
print(f"KS test p-value: {p_value}")

A high p-value means we fail to reject H0, i.e., the test finds no evidence of a significant difference between the two datasets.

Data Address


After assessing the data (i.e., data assess), we need to use the data to address the problem in question. This process includes implementing a Machine Learning algorithm that creates a Machine Learning model.


Data Assess Pipeline

A Machine Learning algorithm is a set of instructions that are used to train a machine learning model. It defines how the model learns from data and makes predictions or decisions. Linear regression, decision trees, and neural networks are examples of machine learning algorithms.

ML Algorithm

A Machine Learning model is a program that is trained on a dataset and used to make predictions or decisions. The goal is to create a trained model that can generalise well to new, unseen data. For example, a trained model could predict house prices based on new input features.

ML Model

A Machine Learning algorithm uses a training process that goes from a specific set of observations to a general rule (i.e., induction). This process adjusts the Machine Learning model's internal parameters to minimise prediction errors. For example, in a linear regression model, the algorithm adjusts the slope and intercept to minimise the difference between predicted and actual values.

Training Process

The parameters of Machine Learning models are adjusted according to the training data (i.e., seen data). For example, when training a model to predict house prices, the training data would include features like square footage, number of bedrooms, and location, along with their actual sale prices. The training data consists of vectors of attribute values.


In classification problems, the prediction (i.e., the model's output) is one of a finite set of values (e.g., sunny/cloudy/rainy or true/false). In regression problems, the model's output is a number.


Predictions can deviate from the expected values. Prediction errors are quantified by a loss function that indicates to the algorithm how far the prediction is from the target value. This difference is used to update the machine learning model's internal parameters to minimise the prediction errors. Common examples include Mean Squared Error for regression problems and Cross-Entropy for classification problems.
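For reference, these two losses take the following standard forms (assuming N examples and, for cross-entropy, one-hot labels over classes c):

\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2
\qquad
\mathrm{CE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c} y_{i,c} \log \hat{y}_{i,c}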


Three types of feedback can be part of the inputs to our training process:

  • Supervised Learning: The model is trained on labelled data to learn a mapping between inputs and the corresponding labels, so that it can make predictions on unseen data.
  • Unsupervised Learning: The model is trained on unlabelled data and must find patterns in it. The goal is to identify hidden structures or groupings in the data.
  • Reinforcement Learning: The model learns by interacting with an environment and receiving rewards or penalties. The goal is to learn a policy that maximises the reward.
Training Process

Supervised Learning

The agent observes input-output pairs and learns a function that maps from input to output (label).

Given a training set of N example input-output pairs

(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)

where each pair was generated by an unknown function f, so that

y_j = f(x_j)

The goal is to discover a function h that approximates the true function f.

h is called a hypothesis. We need to search for h in a hypothesis space H of possible functions.

A consistent hypothesis maps each input to the corresponding ground truth.

We cannot expect an exact match to the ground truth, so we look for a best-fit function that generalises well: a hypothesis h that accurately predicts the outputs of unseen inputs (i.e., the test set).

Our goal is to select a hypothesis h that will optimally fit future examples, assuming future examples will be like past examples (i.e., the stationarity assumption).

Each example has the same prior probability distribution

P(E_j) = P(E_(j+1)) = P(E_(j+2)) = ...

Each example is independent of previous examples

P(E_j) = P(E_j | E_(j-1), E_(j-2), ...)

Examples that satisfy these equations are independent and identically distributed (i.e., iid).

The hypothesis h is optimally fit if it minimises the error rate: the proportion of times that h produces the wrong output for an example, i.e., h(x) ≠ y.

Finding a good hypothesis implies choosing a good hypothesis space H and then finding the best hypothesis h within it during training.

Loss functions quantify the difference between predicted values and expected values (i.e., the ground truth), written L(y, ŷ). Common choices are:

Absolute value of the difference: L_1(y, ŷ) = |y - ŷ|

Squared difference: L_2(y, ŷ) = (y - ŷ)^2

Zero-One loss (classification error): L_0/1(y, ŷ) = 0 if y = ŷ, otherwise 1
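A minimal sketch of these three losses in NumPy; the function names are illustrative:

import numpy as np

def l1_loss(y, y_hat):
    return np.abs(y - y_hat)            # absolute value of the difference

def l2_loss(y, y_hat):
    return (y - y_hat) ** 2             # squared difference

def zero_one_loss(y, y_hat):
    return np.where(y == y_hat, 0, 1)   # 0 if correct, 1 otherwise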

We can generalise the loss by defining the prior probability distribution P(x, y) over examples.

Generalisation loss for a hypothesis h using loss function L:

GenLoss_L(h) = Σ_(x,y) L(y, h(x)) P(x, y)

The best hypothesis h* is the one with the minimum expected generalisation loss:

h* = argmin_(h ∈ H) GenLoss_L(h)

But P(x, y) is unknown in most cases.

As P(x, y) is unknown, we can only estimate an empirical loss on a set of examples E of size N.

Empirical loss for a hypothesis h using loss function L:

EmpLoss_(L,E)(h) = (1/N) Σ_((x,y) ∈ E) L(y, h(x))

The best hypothesis h* is then the one with the minimum empirical loss:

h* = argmin_(h ∈ H) EmpLoss_(L,E)(h)

Underfitted Model - AAStein, CC BY-SA 4.0, via Wikimedia Commons

Underfitting occurs when our hypothesis space H is too simple to capture the true function f.

Even the best hypothesis h* in H will have high error:

min_(h ∈ H) GenLoss_L(h) > ε

where ε is some small positive number.

Overfitting - ThirdOrderLogic, CC BY 4.0, via Wikimedia Commons

Overfitting occurs when our hypothesis space H is too complex, leading to low empirical loss but high generalisation loss:

EmpLoss_(L,E)(h*) ≈ 0 but GenLoss_L(h*) >> EmpLoss_(L,E)(h*)

The hypothesis h* memorises the training examples instead of learning the true function f.

Regularisation transforms the loss function into a cost function that penalises complexity to avoid overfitting.

Choosing the complexity function depends on the hypothesis space H. A good example for polynomials is a function that returns the sum of the squares of the coefficients.

Given the new cost function

Cost(h) = EmpLoss_(L,E)(h) + λ Complexity(h)

The best hypothesis h* is the one that minimises the cost:

h* = argmin_(h ∈ H) Cost(h)

where λ is a hyperparameter that controls the trade-off between fitting the data and model complexity, serving as a conversion rate between loss and complexity.

Alternatives to mitigate overfitting are feature selection, hyperparameter tuning, and splitting the dataset into training, validation, and test sets (e.g., cross-validation).
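As an illustration, ridge regression applies exactly this idea to linear models, using the sum of squared coefficients as the complexity term. A minimal sketch with scikit-learn on synthetic data:

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0.0, 1.0, size=50)

# alpha plays the role of λ: larger values penalise large coefficients more.
model = Ridge(alpha=1.0).fit(X, y)
print(model.coef_, model.intercept_)  # close to [3.0] and 2.0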

Regression

The regression problem involves predicting a continuous numerical value. Regression models approximate a function f that maps input features to a continuous output.

Regression Models

The hypothesis space H includes linear functions of continuous-valued inputs and outputs.

The simplest example is "fitting a straight line". The model learns the coefficients w

h_w(x) = w_1 x + w_0

Hypothesis Space for Regression
Example functions described using a linear model.
Training Data Set
Training dataset.
Regression Fit
Linear regression fit.

Finding the h that best fits the data is called linear regression: finding the values of the weights w_0 and w_1 that minimise the empirical loss, using the L_2 (squared error) loss.

Given the Loss(h_w)

Loss(h_w) = Σ_(j=1)^N L_2(y_j, h_w(x_j)) = Σ_(j=1)^N (y_j - (w_1 x_j + w_0))^2

we want to find

w* = argmin_w Loss(h_w)

We know that the loss is minimised when its partial derivatives with respect to w_0 and w_1 are zero.

∂/∂w_0 Σ_(j=1)^N (y_j - (w_1 x_j + w_0))^2 = 0

∂/∂w_1 Σ_(j=1)^N (y_j - (w_1 x_j + w_0))^2 = 0

Solving the partial derivatives gives a unique solution:

w_1 = (N Σ_j x_j y_j - (Σ_j x_j)(Σ_j y_j)) / (N Σ_j x_j^2 - (Σ_j x_j)^2)

w_0 = (Σ_j y_j - w_1 Σ_j x_j) / N
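A quick numeric check of this closed-form solution on synthetic data (a sketch; variable names are illustrative):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 2.0 + rng.normal(0.0, 1.0, size=100)

N = len(x)
# Closed-form weights for univariate linear regression.
w1 = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / (N * np.sum(x**2) - np.sum(x)**2)
w0 = (np.sum(y) - w1 * np.sum(x)) / N
print(w0, w1)  # close to 2.0 and 3.0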

Convex Function
Weights Space - Convex Loss Function.
Non Convex Function
Non Convex Function - Zachary kaplan, CC BY-SA 4.0, via Wikimedia Commons.


Gradient Descent Algorithm
Gradient Descent Algorithm - Jacopo Bertolotti, CC0, via Wikimedia Commons.

We choose any starting point, compute an estimate of the gradient, and move a small amount in the steepest downhill direction, repeating until we converge on a point in the weight space with (locally) minimal loss.


Gradient Descent Algorithm:
Initialise w randomly
repeat
    for each w[i] in w
        Compute gradient: g = ∂Loss(w)/∂w[i]
        Update weight:   w[i] = w[i] - α * g
until convergence

The size of the step is given by the parameter α, which regulates the behaviour of the gradient descent algorithm. It is a hyperparameter of the regression model we are training, usually called the learning rate.

Continuing with our "straight line" example, the update rule for w_0 and w_1 is

w_i ← w_i - α ∂/∂w_i Loss(w)

The loss function is represented as a composition of functions

Loss(w) = (y - h_w(x))^2, with h_w(x) = w_1 x + w_0

We need to differentiate the loss function step by step using the chain rule to compute the gradients.

Applying the chain rule to the loss function (for a single example)

∂Loss(w)/∂w_i = 2 (y - h_w(x)) · ∂/∂w_i (y - (w_1 x + w_0))

Applying this to w_0 and w_1

∂Loss(w)/∂w_0 = -2 (y - h_w(x))

∂Loss(w)/∂w_1 = -2 (y - h_w(x)) x

The update rule for each weight in the "straight line" example is (folding the 2 into α)

w_0 ← w_0 + α (y - h_w(x))

w_1 ← w_1 + α (y - h_w(x)) x

For N training examples

w_0 ← w_0 + α Σ_j (y_j - h_w(x_j))

w_1 ← w_1 + α Σ_j (y_j - h_w(x_j)) x_j

These updates constitute batch gradient descent. For the "straight line" model, this gradient descent is deterministic.

An epoch is one complete pass through the entire dataset during training. Multiple epochs are usually needed for the model to converge to an optimal solution: too few epochs can cause underfitting, while too many can cause overfitting. The number of epochs is another hyperparameter of the model.


Batch Gradient Descent

The algorithm updates w using the entire dataset in each iteration.

import numpy as np
def batch_gradient_descent(X, y, alpha=0.01, epochs=1000):
    """
    X: numpy array of shape (n_samples, n_features)
    y: numpy array of shape (n_samples,)
    alpha: learning rate
    epochs: number of passes over the data
    Returns: w (numpy array of shape (n_features,))
    """
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for epoch in range(epochs):
        grad = np.zeros(n_features)
        # Accumulate the gradient over the entire dataset.
        for i in range(n_samples):
            x_i = X[i]
            y_i = y[i]
            prediction = np.dot(w, x_i)
            error = prediction - y_i
            for j in range(n_features):
                grad[j] += error * x_i[j]
        # Average the gradient and update the weights once per epoch.
        for j in range(n_features):
            grad[j] /= n_samples
            w[j] -= alpha * grad[j]
    return w


Stochastic Gradient Descent

The algorithm updates w after computing the gradient for each training example.

import numpy as np
def stochastic_gradient_descent(X, y, alpha=0.01, epochs=1000):
    """
    X: numpy array of shape (n_samples, n_features)
    y: numpy array of shape (n_samples,)
    alpha: learning rate
    epochs: number of passes over the data
    Returns: w (numpy array of shape (n_features,))
    """
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for epoch in range(epochs):
        for i in range(n_samples):
            x_i = X[i]
            y_i = y[i]
            prediction = np.dot(w, x_i)
            error = prediction - y_i
            # Update the weights immediately using this single example.
            for j in range(n_features):
                w[j] -= alpha * error * x_i[j]
    return w

Mini-batch Gradient Descent

The algorithm updates w after computing the gradient for a small batch of training examples (batch size m). The batch size is another hyperparameter.

import numpy as np
def mini_batch_gradient_descent(X, y, alpha=0.01, epochs=1000, batch_size=32):
    """
    X: numpy array of shape (n_samples, n_features)
    y: numpy array of shape (n_samples,)
    alpha: learning rate
    epochs: number of passes over the data
    batch_size: number of samples per batch
    Returns: w (numpy array of shape (n_features,))
    """
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for epoch in range(epochs):
        # Shuffle so that batches differ between epochs.
        indices = np.arange(n_samples)
        np.random.shuffle(indices)
        X_shuffled = X[indices]
        y_shuffled = y[indices]
        for start in range(0, n_samples, batch_size):
            end = start + batch_size
            X_batch = X_shuffled[start:end]
            y_batch = y_shuffled[start:end]
            # Vectorised prediction and averaged gradient over the mini-batch.
            y_pred = X_batch @ w
            error = y_pred - y_batch
            grad = X_batch.T @ error / len(X_batch)
            w -= alpha * grad
    return w
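
A quick usage sketch for the three implementations above on synthetic data (a bias column of ones is prepended so that w[0] plays the role of the intercept w_0):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 2.0 + rng.normal(0.0, 1.0, size=100)

# Prepend a bias column so the first weight is the intercept.
X = np.column_stack([np.ones_like(x), x])

w = mini_batch_gradient_descent(X, y, alpha=0.01, epochs=1000, batch_size=16)
print(w)  # close to [2.0, 3.0]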

Regression Fit
Linear regression fit.

Linear Regression

Training Dataset: (x_1, y_1), (x_2, y_2), ..., (x_N, y_N)

Hypothesis Space: All possible linear functions of continuous-valued inputs and outputs.

Hypothesis: h_w(x) = w_1 x + w_0

Loss Function: Loss(h_w) = Σ_(j=1)^N (y_j - h_w(x_j))^2

Cost Function: Cost(h_w) = Loss(h_w) + λ Complexity(h_w)

Analytical Solution:

w_1 = (N Σ_j x_j y_j - (Σ_j x_j)(Σ_j y_j)) / (N Σ_j x_j^2 - (Σ_j x_j)^2)
w_0 = (Σ_j y_j - w_1 Σ_j x_j) / N

Gradient Descent Algorithm:

Initialise w randomly
repeat
    for each w[i] in w
        Compute gradient: g = ∂Loss(w)/∂w[i]
        Update weight:   w[i] = w[i] - α * g
until convergence

Hyperparameters: Learning rate, number of epochs, and batch size.

Conclusions

Overview

  • Statistical tests
  • Supervised Learning
  • Regression Problem
  • Univariate Linear Regression
  • Gradient Descent

Next Time

  • Multivariate Linear Regression
  • Linear Classification
  • Non-parametric Models

Many Thanks!

chc79@cam.ac.uk
