"Information, especially facts or numbers, collected to be examined and considered and used to help decision-making, or information in an electronic form that can be stored and used by a computer." (Cambridge Dictionary, 2025)
Data Availability
Each ML project is unique and data can be scarce
Data Usability
ML Models require data in a specific format
Data Quality Issues
ML models are data-driven: poor-quality data produces poor models
Data Bias and Fairness
ML models are data-driven: biases in the data are learned and can be amplified
Data Complexity
Complex and dynamic systems
Data quality refers to the state of data in terms of its fitness for a purpose
This is a multi-dimensional concept
Accuracy: how well the data reflects the real-world entities it describes
Completeness: whether all required values are present
Uniqueness: whether the data is free of duplicate records
Consistency: whether values agree across records and systems
Timeliness: whether the data is up to date and available when needed
Validity: whether values conform to the required format, type, or range
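Several of these dimensions can be quantified directly. A minimal sketch (assuming `df` is a pandas DataFrame) of simple per-column metrics:

import pandas as pd

def quality_report(df):
    # Share of non-missing values (completeness) and of distinct
    # values (uniqueness) per column
    return pd.DataFrame({
        'completeness': df.notna().mean(),
        'uniqueness': df.nunique() / len(df),
    })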
Poor data quality can lead to unreliable models, biased predictions, and misleading insights.
After collecting the data (i.e., data access), we need to perform a data assessment process to understand the data, identify and mitigate data quality issues, uncover patterns, and gain insights.
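In pandas, a first-pass assessment might look like the following sketch (the CSV path is a placeholder for the Titanic dataset used in the examples below):

import pandas as pd

titanic_data = pd.read_csv('titanic.csv')   # placeholder path
titanic_data.info()                         # column types and non-null counts
print(titanic_data.describe())              # summary statistics for numerical columns
print(titanic_data.isnull().sum())          # missing values per column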
Process of detecting and correcting (or removing) corrupt or inaccurate records.
Missing Values
Missing data points in the dataset
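Two simple strategies, sketched below, are dropping incomplete rows or filling gaps with a summary statistic; a model-based alternative (regression imputation) follows.

# Drop rows with a missing 'Age', or impute with the median
titanic_data_drop = titanic_data.dropna(subset=['Age'])
titanic_data['Age_Median'] = titanic_data['Age'].fillna(titanic_data['Age'].median())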
from sklearn.linear_model import LinearRegression

# Regression imputation: predict missing 'Age' values from related features
features_for_age = ['Pclass', 'SibSp', 'Parch', 'Fare']

# Train on the rows where 'Age' is known
X_train = titanic_data.dropna(subset=['Age'])[features_for_age]
y_train = titanic_data.dropna(subset=['Age'])['Age']
reg_imputer = LinearRegression()
reg_imputer.fit(X_train, y_train)

# Predict 'Age' for the rows where it is missing
X_missing = titanic_data[titanic_data['Age'].isnull()][features_for_age]
predicted_ages = reg_imputer.predict(X_missing)

# Store the imputed values in a new column, leaving the original intact
titanic_data_reg = titanic_data.copy()
titanic_data_reg.loc[titanic_data_reg['Age'].isnull(), 'Age'] = predicted_ages
titanic_data['Age_Regression'] = titanic_data_reg['Age']
Outliers
Data points that significantly deviate from the rest of the data
def cap_outliers(df, column):
    # Interquartile range (IQR) rule: values beyond 1.5 * IQR from the
    # quartiles are treated as outliers and capped at the boundaries
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df[column + '_capped'] = df[column].clip(lower=lower_bound, upper=upper_bound)
    return df
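For example, applied to the Titanic 'Fare' column:

titanic_data = cap_outliers(titanic_data, 'Fare')  # adds a 'Fare_capped' column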
Process of transforming raw data into a format suitable for machine learning while ensuring data quality and consistency.
Feature Scaling
Transforming numerical features to a common scale
Standardisation (Z-score): Centers data around 0 with unit variance
from sklearn.preprocessing import StandardScaler

# Standardise each numerical feature (num_feat is a list of column names)
scaler = StandardScaler()
cols = [col + '_st' for col in num_feat]
data[cols] = scaler.fit_transform(data[num_feat])
Min-Max scaling: Scales data to a fixed range [0,1]
from sklearn.preprocessing import MinMaxScaler

# Rescale each numerical feature to the [0, 1] range
scaler = MinMaxScaler()
cols = [col + '_minmax' for col in num_feat]
data[cols] = scaler.fit_transform(data[num_feat])
Process of increasing the size and diversity of our datasets.
Increasing diversity to make our models more robust
import numpy as np
from skimage.transform import rotate

def augment_image(img, angle_range=(-15, 15)):
    # Rotate the (flattened) 20x20 image by a random angle
    angle = np.random.uniform(angle_range[0], angle_range[1])
    rot = rotate(img.reshape(20, 20), angle, mode='edge')
    # Add Gaussian noise, but only on the non-background pixels
    noise = np.random.normal(0, 0.05, rot.shape)
    aug = rot + (noise * (rot > 0.1))
    return aug.flatten()
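Applied across a dataset of flattened 20x20 images (`X_images` here is a hypothetical array of shape (n_samples, 400)), this produces one augmented variant per image:

X_augmented = np.array([augment_image(img) for img in X_images])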
Increasing the size of our dataset to improve model generalisation
def numerical_smote(data, k=5):
    # SMOTE-style interpolation for a 1-D numerical array: for each point,
    # synthesise new samples between it and its k nearest distinct values
    aug_data = []
    for i in range(len(data)):
        uniq_values = np.unique(data[data != data[i]])
        dists = np.abs(uniq_values - data[i])
        k_neigs = uniq_values[np.argsort(dists)[:k]]
        for neig in k_neigs:
            # Interpolate at a random point between the sample and its neighbour
            sample = data[i] + np.random.random() * (neig - data[i])
            aug_data.append(sample)
    return np.array(aug_data)
SMOTE (Synthetic Minority Over-sampling Technique)
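The function above is a one-dimensional illustration of the idea; for imbalanced classification tasks, an off-the-shelf implementation is available in the imbalanced-learn library (a sketch; X and y are the features and class labels):

from imblearn.over_sampling import SMOTE

# Oversample the minority class with synthetic interpolated samples
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)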
Process of creating, transforming, and selecting features in our data, combining domain knowledge and creativity.
Creating new features in our data can help to capture important patterns or relations in the data
# Discretise 'Age' into interpretable bins; note that ages outside
# the (0, 60] range become NaN with these bins
data['AgeGroup'] = pd.cut(data['Age'],
                          bins=[0, 12, 18, 35, 60],
                          labels=['Child',
                                  'Teenager',
                                  'Young Adult',
                                  'Adult'])
Selecting the most important features in our data reduces dimensionality, prevents overfitting, improves interpretability, and reduces training time
The matrix helps to visualise the correlation between features
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise (Pearson) correlation between numerical features
corr_matrix = data.corr(numeric_only=True)
plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=True, center=0)
plt.title('Feature Correlation Matrix')
plt.show()
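One common follow-up, sketched here with a hypothetical threshold of 0.9, is to drop one feature from each highly correlated pair:

import numpy as np

# Keep the upper triangle so each pair is considered once
upper = corr_matrix.abs().where(np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
data_reduced = data.drop(columns=to_drop)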
A decision tree helps in identifying the most important features contributing to the prediction. This method splits the dataset into subsets based on the feature that results in the largest information gain. At the end of the process, we get the importance of each feature
from sklearn.tree import DecisionTreeClassifier

# Fit a decision tree; feature importances are derived from the total
# information gain each feature contributes across the splits
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X, y)
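The learned importances can then be read off and ranked (assuming X is a DataFrame with named columns):

import pandas as pd

importances = pd.Series(tree.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))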
Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms a set of correlated features into a set of linearly uncorrelated features called principal components. These components capture the most variance in the data, allowing for a reduction in the number of features while retaining most of the information.
import numpy as np
from sklearn.decomposition import PCA

# Fit PCA on standardised features and track how much variance each
# principal component explains
pca = PCA()
X_pca_transformed = pca.fit_transform(X_scaled)
explained_variance = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance)
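A common heuristic (not a rule) is to keep the smallest number of components that explains, say, 95% of the variance:

# First index where the cumulative explained variance reaches 95%
n_components = np.argmax(cumulative_variance >= 0.95) + 1
pca_reduced = PCA(n_components=n_components)
X_reduced = pca_reduced.fit_transform(X_scaled)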
Overview
Next Time