"Information, especially facts or numbers, collected to be examined and considered and used to help decision-making, or information in an electronic form that can be stored and used by a computer." (Cambridge Dictionary, 2025)
Data Availability
Each ML project is unique and data can be scarce
Data Usability
ML Models require data in a specific format
Data Quality Issues
ML Models are data-driven
Data Bias and Fairness
ML Models are data-driven
Data Complexity
Complex and dynamic systems
Focus on Operations
The Data Dichotomy
“While data-driven systems are about exposing data, service-oriented architectures are about hiding data.” (Stopford, 2016)
We need to design systems prioritising data!
Data-Oriented Architectures
Data-First Systems
Data-Oriented Architectures
Prioritise Decentralisation
Data-Oriented Architectures
Openness
Does the data even exist?
No, it does not exist!
Primary Data Collection
Yes, it does exist!
Secondary Data Collection
Crash Map Kampala
Dataset creation is a critical process that cannot be skipped if we want to implement Data Science or ML projects
We can assume datasets already exist, but those will likely only enable toy projects with little relevance to our society
Using your own files
The most common are comma-separated values (CSV) files.
import pandas as pd
dataset = pd.read_csv('path/to/your/dataset.csv')
print("Dataset shape:", dataset.shape)
print("\nFirst few rows:")
print(dataset.head())
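CSV is not the only format you might have locally; pandas provides analogous readers for other common formats. A minimal sketch (the file names are hypothetical):
import pandas as pd
excel_data = pd.read_excel('path/to/your/dataset.xlsx')  # requires the openpyxl package
json_data = pd.read_json('path/to/your/dataset.json')
parquet_data = pd.read_parquet('path/to/your/dataset.parquet')  # requires pyarrow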
Using built-in datasets
Different dataset repositories are available online, for example OpenML or TensorFlow Datasets
Using built-in datasets
Different dataset repositories are available online. For example, OpenML is an open-source platform for sharing datasets and experiments.
from sklearn.datasets import fetch_openml
iris = fetch_openml(name='iris', version=1, as_frame=True)
print("Iris dataset shape:", iris.data.shape)
print("\nFirst few rows:")
print(iris.data.head())
Using built-in datasets
import tensorflow as tf
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
print("Training data shape:", x_train.shape)
print("Test data shape:", x_test.shape)
Accessing data via APIs
import requests

url = '<url_of_the_dataset>'
file_name = '<name_of_the_file>'
response = requests.get(url)
if response.status_code == 200:
    # Save the downloaded content to a local file
    with open(file_name, "wb") as file:
        file.write(response.content)
Accessing data via APIs
import requests
import pandas as pd

# Download the UK Price Paid data (2020, part 1) from HM Land Registry
url = 'http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-2020-part1.csv'
response = requests.get(url)
if response.status_code == 200:
    with open('pp-2020-part1.csv', "wb") as file:
        file.write(response.content)

# Note: the Price Paid CSV has no header row, so you may need to pass
# column names to read_csv via the names argument
dataset = pd.read_csv('pp-2020-part1.csv')
print("Dataset shape:", dataset.shape)
print("\nFirst few rows:")
print(dataset.head())
Accessing data via APIs
import requests
import pandas as pd
import zipfile
import io

# Download and extract the Open Postcode Geo dataset (postcode coordinates)
url = 'https://www.getthedata.com/downloads/open_postcode_geo.csv.zip'
response = requests.get(url)
if response.status_code == 200:
    with zipfile.ZipFile(io.BytesIO(response.content)) as zip_ref:
        zip_ref.extractall('open_postcode_geo')

dataset = pd.read_csv('open_postcode_geo/open_postcode_geo.csv')
print("Dataset shape:", dataset.shape)
print("\nFirst few rows:")
print(dataset.head())
Accessing data via APIs
%pip install osmnx
Accessing data via APIs
import osmnx as ox
import pandas as pd
place = "Pasto, Nariño, Colombia"
pois = ox.features_from_place(place, tags={'amenity': True})
pois_df = pd.DataFrame(pois)
print(f"Number of POIs found: {len(pois_df)}")
print("\nSample of POIs:")
print(pois_df[['amenity', 'name']].head())
pois_df.to_csv('pasto_pois.csv', index=False)
Accessing data via APIs
import osmnx as ox
import pandas as pd
place = "Pasto, Nariño, Colombia"
buildings = ox.features_from_place(place, tags={'building': True})
buildings.plot()
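features_from_place returns a GeoDataFrame, so its plot method yields a matplotlib axes that can be customised and saved. A minimal sketch (the file name and styling are illustrative):
import matplotlib.pyplot as plt
ax = buildings.plot(figsize=(8, 8), color='grey')
ax.set_title('Building footprints in Pasto')
plt.savefig('pasto_buildings.png', dpi=150)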
Joining datasets
We can join multiple datasets to improve our data access
For example, the postcode dataset can enrich the UK Price Paid data by adding coordinate information. We should join these datasets on the attribute they share (i.e., the postcode).
import pandas as pd

# Load both datasets (the column names used below, such as 'postcode', 'price',
# 'latitude' and 'longitude', assume headers have been assigned when reading the files)
price_paid = pd.read_csv('pp-2020-part1.csv')
postcodes = pd.read_csv('open_postcode_geo/open_postcode_geo.csv')

# Inner join on the shared 'postcode' column
merged_data = pd.merge(
    price_paid,
    postcodes,
    on='postcode',
    how='inner'
)
print("Original Price Paid dataset shape:", price_paid.shape)
print("Original Postcodes dataset shape:", postcodes.shape)
print("Merged dataset shape:", merged_data.shape)
print("\nSample of merged data:")
print(merged_data[['postcode', 'price', 'latitude', 'longitude']].head())
merged_data.to_csv('price_paid_with_coordinates.csv', index=False)
Web scraping datasets
import requests
from bs4 import BeautifulSoup

def scrape_paragraphs(url):
    try:
        # Identify ourselves with a browser-like User-Agent header
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        }
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        # Collect the text of every paragraph on the page
        data = [p.text for p in soup.find_all('p')]
        return data
    except Exception as e:
        print(f"Error scraping data: {str(e)}")
        return None
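A minimal usage sketch for the helper above (the URL remains a placeholder):
paragraphs = scrape_paragraphs('<url_of_the_website>')
if paragraphs:
    print(paragraphs[:3])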
Creating synthetic datasets
import numpy as np

# Example sizes; the labelling rules below use the first four features
n_samples = 1000
n_features = 5

# Random features, with labels assigned by simple hand-crafted rules
X = np.random.randn(n_samples, n_features)
y = np.zeros(n_samples)
for i in range(n_samples):
    if X[i, 0] + X[i, 1] > 0:
        y[i] = 0
    elif X[i, 2] * X[i, 3] > 0:
        y[i] = 1
    else:
        y[i] = 2
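scikit-learn also provides helpers for generating synthetic data without writing the rules by hand; a minimal sketch using make_classification (the parameter values are illustrative):
from sklearn.datasets import make_classification
X_synth, y_synth = make_classification(n_samples=1000, n_features=5,
                                       n_informative=3, n_classes=3,
                                       random_state=42)
print("Synthetic data shape:", X_synth.shape)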