Session 4 - Data Quality

Data Quality

Christian Cabrera Jojoa

Senior Research Associate and Affiliated Lecturer

Department of Computer Science and Technology

University of Cambridge

chc79@cam.ac.uk


Last Time


ML Definition


Data


Data

"Information, especially facts or numbers, collected to be examined and considered and used to help decision-making, or information in an electronic form that can be stored and used by a computer." (Cambridge Dictionary, 2025)


Data Challenges


Data Challenges

No Data

Data Availability

Each ML project is unique and data can be scarce

  • Specific requirements
  • Domain
  • Location
  • Space
  • ...

Data Challenges

Data Usability

ML Models require data in a specific format

  • Structured vs unstructured data
  • Sparse data
  • Legal regulations and privacy
  • Storage technologies
  • ...

Data Challenges

Data Quality Issues

ML Models are data-driven

  • Missing values
  • Duplicate records
  • Incorrect data
  • Outlier values
  • ...

Data Challenges

Data Theatre

Data Bias and Fairness

ML Models are data-driven

  • Sampling bias
  • Historical bias
  • Measurement bias
  • Algorithmic bias
  • Implicit bias
  • ...

Data Challenges

Complex Systems

Data Complexity

Complex and dynamic systems

  • Large systems
  • Data generation speed
  • Current systems architectures
  • Interpretability issues
  • Intellectual debt
  • ...

The Data Science Process


The Data Science Process

Data Science Process

Data Quality


Data Quality

Data quality refers to the state of data in terms of its fitness for a purpose

This is a multi-dimensional concept

  • Accuracy
  • Completeness
  • Uniqueness
  • Consistency
  • Timeliness
  • Validity

Data Quality

  • Data reflects reality
  • Dynamic and changing data
  • Confident decision-making

Accuracy


Data Quality

Metrics

  • Error rate
  • Precision
  • Recall
  • F1 score

Accuracy
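A minimal sketch of these metrics with scikit-learn, assuming y_true holds trusted reference labels and y_pred the values recorded in the dataset (binary labels, both assumed to exist):

import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# y_true: verified reference labels, y_pred: recorded/predicted labels (assumed binary)
error_rate = np.mean(np.asarray(y_true) != np.asarray(y_pred))
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)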


Data Quality

  • All the data is available
  • Critical data vs Optional data
  • Impact of the missing data

Completeness


Data Quality

Metrics

  • Missing value ratio
  • Record completeness
  • Attribute completeness

Completeness
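A minimal pandas sketch of these metrics, assuming df is the dataset under assessment:

# Overall share of missing cells
missing_value_ratio = df.isnull().sum().sum() / df.size
# Share of rows with no missing values (record completeness)
record_completeness = df.notnull().all(axis=1).mean()
# Share of non-missing values per column (attribute completeness)
attribute_completeness = df.notnull().mean()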


Data Quality

  • Duplicated records
  • Combining datasets
  • Trusted data

Uniqueness


Data Quality

Metrics

  • Duplicate ratio
  • Unique value ratio
  • Entity resolution rate

Uniqueness
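A sketch of the first two metrics in pandas; the PassengerId key column is illustrative, and entity resolution across sources usually needs dedicated tooling:

# Share of fully duplicated rows
duplicate_ratio = df.duplicated().mean()
# Share of distinct values in a column that should be a unique key
unique_value_ratio = df['PassengerId'].nunique() / len(df)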


Data Quality

  • Data values do not conflict
  • Link data from multiple sources
  • Data usability

Consistency


Data Quality

Metrics

  • Format consistency score
  • Value consistency ratio
  • Cross-field consistency rate

Consistency
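As a sketch, consistency scores can be approximated with simple rule checks; the Date, Departure, and Arrival columns and the ISO date format are illustrative assumptions:

# Format consistency: share of values matching the expected YYYY-MM-DD pattern
format_consistency = df['Date'].astype(str).str.fullmatch(r'\d{4}-\d{2}-\d{2}').mean()
# Cross-field consistency: a departure should not come after its arrival
cross_field_consistency = (df['Departure'] <= df['Arrival']).mean()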


Data Quality

  • Data available when expected
  • Context-dependent
  • Added data value

Timeliness


Data Quality

Metrics

  • Data freshness
  • Update frequency
  • Processing delay

Timeliness
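A sketch assuming each record carries a last_updated timestamp column:

import pandas as pd

# Data freshness: hours since each record was last updated
freshness_hours = (pd.Timestamp.now() - pd.to_datetime(df['last_updated'])).dt.total_seconds() / 3600
# Share of records refreshed within the last 24 hours (threshold is an assumption)
fresh_within_24h = (freshness_hours <= 24).mean()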


Data Quality

  • Data conforms to a format
  • Heterogeneous, unstructured data
  • Automation

Validity


Data Quality

Metrics

  • Format compliance rate
  • Schema validation score
  • Business rule compliance

Validity
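A sketch of simple validity checks in pandas; the Age column and the 0-120 range are illustrative assumptions:

import pandas as pd

# Format compliance: Age values that can be parsed as numbers
format_compliance = pd.to_numeric(df['Age'], errors='coerce').notnull().mean()
# Business rule compliance: ages must fall within a plausible range
business_rule_compliance = df['Age'].between(0, 120).mean()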


Data Quality

Data Usability

Poor data quality can lead to:

  • Inaccurate model predictions and reduced reliability
  • Biased results
  • Reduced generalisability
  • Increased training time
  • Higher costs
  • Intellectual debt
  • ...

Data Quality

1. Missing Values
  • Incomplete records
  • Null or NaN values
  • Empty fields
2. Inconsistencies
  • Format variations
  • Unit mismatches
  • Naming conventions
3. Outliers
  • Extreme values
  • Measurement errors
  • Data entry mistakes
4. Noise
  • Random variations
  • Measurement errors
  • Background interference
5. Bias
  • Sampling bias
  • Selection bias
  • Measurement bias
  • ...

Data Assess


Data Assess

Data Science Process

Data Assess

After collecting the data (i.e., data access), we need to perform a data assessment process to understand the data, identify and mitigate data quality issues, uncover patterns, and gain insights.


Data Assess

Data Assess Pipeline

ML Pipelines vs ML-based Systems

Data Assess Pipeline
ML-based System

Data Assess

Data Assess Pipeline

Data Cleaning

Process of detecting and correcting (or removing) corrupt or inaccurate records.


Data Cleaning

Missing Values

Missing data points in the dataset

  • Data collection errors
  • System failures
  • Information not available
  • Data entry mistakes

Data Cleaning

Missing Values

Missing data points in the dataset

  • Deletion: Remove rows or columns with missing values
  • Imputation: Fill missing values with estimated values
  • Advanced techniques: Use machine learning models to predict missing values
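A minimal sketch of the first two options, assuming titanic_data is the pandas DataFrame used in the regression example below:

# Deletion: drop the rows where 'Age' is missing
titanic_dropped = titanic_data.dropna(subset=['Age'])

# Imputation: fill missing ages with the median age
titanic_data['Age_Median'] = titanic_data['Age'].fillna(titanic_data['Age'].median())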

Data Cleaning

from sklearn.linear_model import LinearRegression

# Advanced imputation: predict missing ages from other numerical features
features_for_age = ['Pclass', 'SibSp', 'Parch', 'Fare']

# Train on the passengers whose age is known
X_train = titanic_data.dropna(subset=['Age'])[features_for_age]
y_train = titanic_data.dropna(subset=['Age'])['Age']
reg_imputer = LinearRegression()
reg_imputer.fit(X_train, y_train)

# Predict the age of passengers where it is missing
X_missing = titanic_data[titanic_data['Age'].isnull()][features_for_age]
predicted_ages = reg_imputer.predict(X_missing)
titanic_data_reg = titanic_data.copy()
titanic_data_reg.loc[titanic_data_reg['Age'].isnull(), 'Age'] = predicted_ages
titanic_data['Age_Regression'] = titanic_data_reg['Age']

Data Cleaning

Outliers

Data points that significantly deviate from the rest of the data

  • Measurement errors
  • Data entry mistakes
  • Rare but valid observations
  • System malfunctions

Data Cleaning

Outliers


Data points that significantly deviate from the rest of the data

  • Capping: Limit values to a range
  • Log Transformation: Reduce the impact of extreme values
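Capping is shown in the cap_outliers snippet on the next slide; a log transformation can be sketched as follows, assuming df['Fare'] is a non-negative, right-skewed feature:

import numpy as np

# log1p compresses large values while keeping zeros valid
df['Fare_log'] = np.log1p(df['Fare'])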

Data Cleaning

Outliers
def cap_outliers(df, column):
    # Interquartile range (IQR) rule: values beyond 1.5 * IQR from the quartiles are capped
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df[column + '_capped'] = df[column].clip(lower=lower_bound, upper=upper_bound)
    return df

Data Assess

Data Assess Pipeline

Data Preprocessing


Data Preprocessing

Process of transforming raw data into a format suitable for machine learning while ensuring data quality and consistency.


Data Preprocessing

Feature Scaling

Transforming numerical features to a common scale

  • All features contribute equally to the model
  • Algorithms converge faster
  • Features with larger scales do not dominate the model

Data Preprocessing

Feature Scaling

Standardisation (Z-score): Centers data around 0 with unit variance

$z = \frac{x - \mu}{\sigma}$

$x$: data point
$\mu$: dataset mean
$\sigma$: dataset standard deviation

Data Preprocessing

from sklearn.preprocessing import StandardScaler

# num_feat: list of numerical feature columns to standardise
scaler = StandardScaler()
cols = [col + '_st' for col in num_feat]
data[cols] = scaler.fit_transform(data[num_feat])


Data Preprocessing

Feature Scaling

Min-Max scaling: Scales data to a fixed range [0,1]

$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$

$x$: data point
$x_{\min}$: the minimum value of the feature
$x_{\max}$: the maximum value of the feature

Data Preprocessing

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
cols = [col + '_minmax' for col in num_feat]
data[cols] = scaler.fit_transform(data[num_feat])


Data Assess

Data Assess Pipeline

Data Augmentation


Data Augmentation

Process of increasing the size and diversity of our datasets.


Data Augmentation

Increasing diversity to make our models more robust

  • Adding controlled noise
  • Random rotations
  • ...
Figure: Standard Normal Distribution (Inductiveload, Public domain, via Wikimedia Commons)

Data Augmentation

import numpy as np
from skimage.transform import rotate  # assumed source of rotate(); mode='edge' matches scikit-image

def augment_image(img, angle_range=(-15, 15)):
    # Rotate the flattened 20x20 image by a random angle and add light Gaussian noise
    angle = np.random.uniform(angle_range[0], angle_range[1])
    rot = rotate(img.reshape(20, 20), angle, mode='edge')
    noise = np.random.normal(0, 0.05, rot.shape)
    aug = rot + (noise * (rot > 0.1))  # only perturb non-background pixels
    return aug.flatten()

Data Augmentation

Increasing the size of our dataset to improve model generalisation

  • Interpolating between existing data points
  • Applying domain-specific transformations
  • Generating synthetic data using GANs
  • ...

Data Augmentation

import numpy as np

def numerical_smote(data, k=5):
    # data: 1-D NumPy array of numerical values
    aug_data = []
    for i in range(len(data)):
        # Candidate neighbours: all other distinct values of the feature
        uniq_values = np.unique(data[data != data[i]])
        dists = np.abs(uniq_values - data[i])
        k_neigs = uniq_values[np.argsort(dists)[:k]]
        for neig in k_neigs:
            # Interpolate between the original point and the neighbour
            sample = data[i] + np.random.random() * (neig - data[i])
            aug_data.append(sample)
    return np.array(aug_data)

SMOTE (Synthetic Minority Over-sampling Technique)

  • Using the k-nearest neighbours
  • Interpolation between the original data point and the neighbour
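A usage sketch, assuming the titanic_data DataFrame from the earlier examples:

fares = titanic_data['Fare'].dropna().values
synthetic_fares = numerical_smote(fares, k=5)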

Data Assess

Data Assess Pipeline

Feature Engineering

Process of creating, transforming, and selecting features in our data, combining domain knowledge and creativity.


Feature Engineering

Creating new features in our data can help to capture important patterns or relations in the data

  • Extracting information from existing features
  • Combining existing features
  • Adding domain knowledge rules
  • ...
import pandas as pd

# Bin the continuous 'Age' feature into categorical groups (ages above 60 fall outside these bins)
data['AgeGroup'] = pd.cut(data['Age'],
                    bins=[0, 12, 18, 35, 60],
                    labels=['Child',
                            'Teenager',
                            'Young Adult',
                            'Adult'])
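A sketch of combining existing features on the same data; SibSp and Parch count relatives aboard in the Titanic dataset, and the new column name is illustrative:

# Family size: siblings/spouses + parents/children + the passenger themselves
data['FamilySize'] = data['SibSp'] + data['Parch'] + 1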

Feature Engineering

Selecting the most important features in our data reduces dimensionality, prevents overfitting, improves interpretability, and reduces training time

  • Correlation Matrix with Heatmap
  • Decision Trees
  • Principal Component Analysis (PCA)
  • ...

Feature Engineering

Correlation Matrix

The matrix helps to visualise the correlation between features


Feature Engineering

Correlation Matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlation between the numerical features in 'data'
corr_matrix = data.corr()
plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=True, center=0)
plt.title('Feature Correlation Matrix')
plt.show()

Feature Engineering

Feature Importance

A decision tree helps identify the features that contribute most to the prediction. The method splits the dataset into subsets based on the feature that yields the largest information gain; at the end of the process, we obtain an importance score for each feature.


Feature Engineering

Feature Importance
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X, y)
# Importance score of each feature, learned from the splits
importances = tree.feature_importances_

Feature Engineering

Titanic Tree

Feature Engineering

Feature Importance

Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms a set of correlated features into a set of linearly uncorrelated features called principal components. These components capture the most variance in the data, allowing for a reduction in the number of features while retaining most of the information.


Feature Engineering

Feature Importance
import numpy as np
from sklearn.decomposition import PCA

# X_scaled: standardised feature matrix (PCA is sensitive to feature scales)
pca = PCA()
X_pca_transformed = pca.fit_transform(X_scaled)
explained_variance = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance)
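As a follow-up sketch, the cumulative variance can guide how many components to keep; the 95% threshold is an assumption:

# Smallest number of components explaining at least 95% of the variance
n_components_95 = int(np.argmax(cumulative_variance >= 0.95)) + 1
X_reduced = PCA(n_components=n_components_95).fit_transform(X_scaled)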

Data Assess

Data Assess Pipeline

Conclusions

Overview

  • Data Quality
  • Data Assess
  • ML Pipeline
  • ML Pipeline vs ML-based Systems
  • Data Cleaning and Preprocessing
  • Data Augmentation
  • Feature Engineering
Data Assess Pipeline

Conclusions

Next Time

  • Data Address
  • Linear Regression
  • Clustering
Data Assess Pipeline

Many Thanks!

chc79@cam.ac.uk
