Big data in the 2000s
Large and complex datasets improve models' statistical power.
Big data developments
In this course, we explore situations where at least one dimension of big data forces a non-trivial choice across the data management process.
Technocentric view
Start from the technologies and tools, then look for a problem that fits the tool.
Technocentric view
Questions before tools:
Talk to or observe the system's stakeholders
MLTRL - Technology Readiness Levels for Machine Learning Systems (Lavin et al., 2022)
Course project
Pick one decision option
Reference question: What does our pipeline say about labour patterns for a defined population, and what would we and would we not recommend on that basis?
Course project definition (PDF)
Cloud setup (client-server)
Service-Oriented Architecture (SOA) is a design pattern in which components provide services to one another through a communication protocol over a network.
Microservices are an architectural style that structures an application as a collection of small, autonomous services. Each microservice is self-contained and exposes a business capability, which can be implemented as an object (as in object-oriented programming).
The concept of "Everything as a Service" (XaaS) extends the principles of SOA and microservices by offering comprehensive services over the internet. XaaS encompasses a wide range of services, including infrastructure, platforms, and software.
AI as a Service (AIaaS) lets us access and expose AI capabilities over the internet. We can integrate AI tools such as machine learning models, natural language processing, and computer vision into our applications by building on SOA and microservices.
from flask import Flask, request, jsonify

app = Flask(__name__)


class SentimentAnalysisService:
    """Wraps a model and maps its numeric score to a sentiment label."""

    def __init__(self, model):
        self.model = model

    def analyze_sentiment(self, text):
        # Scores above 0.5 are positive, below -0.5 negative, anything else neutral.
        sentiment_score = self.model.predict(text)
        if sentiment_score > 0.5:
            return "Positive"
        elif sentiment_score < -0.5:
            return "Negative"
        else:
            return "Neutral"

...

@app.route('/analyze', methods=['POST'])
def analyze():
    # Read the 'text' field from the JSON request body (empty string if absent).
    data = request.get_json()
    text_to_analyze = data.get('text', '')
    sentiment = service.analyze_sentiment(text_to_analyze)
    return jsonify({'sentiment': sentiment})

...
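The service class above can be exercised without a web server or a trained model. The sketch below is only an illustration: `KeywordModel` and its word lists are hypothetical stand-ins for whatever model the elided parts of the slide load, and the class is repeated so the block runs on its own.

```python
class KeywordModel:
    """Hypothetical stand-in for a trained model; returns a score in [-1, 1]."""

    POSITIVE = {"good", "great", "excellent"}
    NEGATIVE = {"bad", "poor", "terrible"}

    def predict(self, text):
        words = text.lower().split()
        pos = sum(w in self.POSITIVE for w in words)
        neg = sum(w in self.NEGATIVE for w in words)
        # Balance of positive and negative hits; 0.0 when no keyword matches.
        return (pos - neg) / max(pos + neg, 1)


class SentimentAnalysisService:
    """Same thresholding logic as the service shown on the slide."""

    def __init__(self, model):
        self.model = model

    def analyze_sentiment(self, text):
        sentiment_score = self.model.predict(text)
        if sentiment_score > 0.5:
            return "Positive"
        elif sentiment_score < -0.5:
            return "Negative"
        else:
            return "Neutral"


service = SentimentAnalysisService(KeywordModel())
labels = [
    service.analyze_sentiment(t)
    for t in ("great service", "terrible and bad", "the bus arrived")
]
print(labels)
```

Because the model is injected through the constructor, the same service can later wrap a real model without changing the endpoint code.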
Focus on Operations:
Data is secondary and hidden behind services' interfaces, making data collection difficult
Centralised deployments are not sustainable and threaten data privacy and ownership
Data-Orientated Architectures make data available by design, which facilitates monitoring and maintenance. Decentralisation supports local data processing, reducing latency and improving privacy by respecting data ownership. Openness enables managing resource-constrained environments by exploiting the computing power of everyday devices (Cabrera et al., 2025).
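One way to picture "data available by design" is a shared, append-only stream that every component writes to and reads from, so that monitoring is just another reader. The sketch below is a toy, in-process illustration under that assumption; `DataBus`, the topic names, and the placeholder labelling component are hypothetical and are not taken from Cabrera et al. (2025).

```python
from collections import defaultdict
from typing import Callable


class DataBus:
    """Hypothetical append-only log shared by all components."""

    def __init__(self) -> None:
        self.log = []                            # every record ever published
        self._subscribers = defaultdict(list)    # topic -> list of handlers

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, record: dict) -> None:
        # Record first, then notify: the data stays visible to everyone.
        self.log.append((topic, record))
        for handler in list(self._subscribers[topic]):
            handler(record)


bus = DataBus()


def label_text(record: dict) -> None:
    # Placeholder component: a real one would call a model here.
    bus.publish("sentiment", {"text": record["text"], "label": "Neutral"})


monitored = []
bus.subscribe("text", label_text)
bus.subscribe("sentiment", monitored.append)  # monitoring is just another reader

bus.publish("text", {"text": "the bus arrived"})
print([topic for topic, _ in bus.log])
```

In contrast with the service-oriented example, no component has to expose an extra interface for its inputs and outputs to be collected; they are already in the log.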
Edge computing
Does the data even exist?
No, it does not exist!
Primary Data Collection
Yes, it does exist!
Secondary Data Collection
Crash Map Kampala
Dataset creation is a critical step that cannot be skipped in a big data project. We can assume that suitable datasets already exist, but such datasets tend to support only toy projects with little relevance to society.
Gran Encuesta Integrada de Hogares (GEIH)
Portal: microdatos.dane.gov.co
Accessing GEIH via Web Crawling
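import requests

# CATALOG_ID, FILE_ID and zip_path (a pathlib.Path) are assumed to be defined earlier.
download_url = (
    "https://microdatos.dane.gov.co/index.php/catalog/"
    f"{CATALOG_ID}/download/{FILE_ID}"
)
# Stream the archive to disk in 1 MiB chunks so it is never held fully in memory.
with requests.get(download_url, stream=True, timeout=600) as resp:
    resp.raise_for_status()
    with zip_path.open("wb") as fh:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            if chunk:
                fh.write(chunk)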
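Once the archive is on disk, a natural next step is to check that it really is a zip file and unpack it. The sketch below uses only the standard library; `extract_archive` is a hypothetical helper, and the demo archive with its `example.csv` member is created on the fly purely for illustration and does not reflect the actual contents of a GEIH download.

```python
import tempfile
import zipfile
from pathlib import Path
from typing import List


def extract_archive(zip_path: Path, out_dir: Path) -> List[str]:
    """Extract zip_path into out_dir and return the member names."""
    # Guard against a download that is not an archive (for example an HTML page).
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is not a valid zip archive")
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(out_dir)
        return zf.namelist()


# Demo on a throwaway archive so the example does not touch the network.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("example.csv", "id;value\n1;10\n")

    members = extract_archive(demo_zip, Path(tmp) / "extracted")
    extracted_text = (Path(tmp) / "extracted" / "example.csv").read_text()
    print(members)
```

In the download snippet above, the same helper could be called with `zip_path` as its first argument.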
OpenStreetMap
Query OSM with Overpass
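import requests

# OVERPASS_ENDPOINTS and OVERPASS_HEADERS are assumed to be defined earlier.
def overpass_post(query: str, *, timeout: int = 90) -> requests.Response:
    # Try each endpoint in turn and return the first successful response.
    for url in OVERPASS_ENDPOINTS:
        r = requests.post(
            url,
            data={"data": query},
            headers=OVERPASS_HEADERS,
            timeout=timeout,
        )
        if r.status_code == 200:
            return r
    # No endpoint returned 200: raise the HTTP error from the last one tried.
    r.raise_for_status()
    return r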
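The `query` argument is a string in Overpass QL. As a rough sketch, the helper below builds a query for nodes with a given `amenity` tag inside a bounding box, and a second helper pulls coordinates out of the JSON that Overpass returns. Both function names, the bounding box values, and the sample payload are illustrative assumptions, not part of the course code; the payload is hand-written to mimic the `elements` structure of an Overpass JSON response.

```python
def build_amenity_query(amenity: str, bbox: tuple, *, timeout: int = 60) -> str:
    """Build an Overpass QL query for nodes tagged amenity=<amenity>.

    bbox follows the Overpass order: (south, west, north, east).
    """
    south, west, north, east = bbox
    return (
        f"[out:json][timeout:{timeout}];"
        f'node["amenity"="{amenity}"]({south},{west},{north},{east});'
        "out body;"
    )


def element_coordinates(payload: dict) -> list:
    """Return (lat, lon) pairs for every node in an Overpass JSON payload."""
    return [
        (el["lat"], el["lon"])
        for el in payload.get("elements", [])
        if el.get("type") == "node"
    ]


# Illustrative bounding box only; adjust it to the area of interest.
query = build_amenity_query("hospital", (0.25, 32.5, 0.4, 32.65))
print(query)

# Hand-written sample shaped like an Overpass response, for illustration.
sample_payload = {
    "elements": [
        {"type": "node", "id": 1, "lat": 0.3, "lon": 32.6,
         "tags": {"amenity": "hospital"}},
        {"type": "way", "id": 2},
    ]
}
coords = element_coordinates(sample_payload)
print(coords)
```

With the function above, the query would be sent as `overpass_post(query)` and the parsed body obtained with `.json()` on the returned response.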
Individual notebooks
Templates and deliverables are available on the Lecture 2 course page.