Session 1 - Introduction to Big Data, methodology, and ecosystems

Introduction to Big Data, methodology, and ecosystems

Christian Cabrera Jojoa

Assistant Research Professor

Department of Computer Science and Technology

University of Cambridge

chc79@cam.ac.uk

Course Structure

We are learning to apply the Big Data process end to end, prioritising the problem and using real data, current tools, and collaborative work. We should continuously reflect on the learning process in relation to our own domains and professional practice.

A combination of theory and practicals, relying strongly on collaborative and deliberative work. Resources are available on the course website.


Async sessions, designed to take 4 hours and to be completed before Saturday

  • Async videos with theoretical concepts and demos of the individual notebook
  • Students go through individual notebooks themselves

Sync sessions on Saturday, 07:00 to 13:00 Colombia time

  • Discussion of the week's readings in groups and plenary
  • Teamwork on the week's notebooks
  • Teamwork on the course project
  • Sharing findings and questions in plenary

Homework

  • Personal learning journal to document readings from relevant literature before Saturday sessions. Reading lists are shared the week before.
  • Consolidated group notebook with the weekly practical. Group notebook explored and developed in Saturday sessions. Due on Mondays.
  • Individual reflections on weekly learning. Due on Tuesdays using the provided template.
  • Course project presented on our final day: Saturday, 27 June.
  • Submissions through Moodle.

Assessment

  • Course mark: 60% formative (weeks 1 to 3) + 40% final project
  • Weekly formative (20% per week):
    • Group notebook 12%, individual reflection PDF 8%
    • Group notebook due on Mondays and individual reflection due on Tuesdays
  • Final project (40% of the course):
    • Group deliverables: 70% of the project:
      • Integrated notebook or source code
      • Pipeline diagram
      • Decision brief
      • Live presentation
    • Individual project reflection: 30% of the project
  • Personal learning journal and Saturday reading discussions are required but not graded
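
As a worked example of the weighting above, the following sketch combines three weekly formative marks and a project mark into one course mark. The function name and the 0 to 100 input scale are assumptions made for illustration; the official calculation is whatever the course regulations state.

```python
def course_mark(weekly, project_group, project_individual):
    """Combine marks (each on a 0-100 scale) using the weights listed above.

    weekly: three (group_notebook, individual_reflection) pairs, one per week.
    """
    if len(weekly) != 3:
        raise ValueError("expected marks for weeks 1 to 3")
    # Each week is worth 20% of the course: 12% notebook + 8% reflection.
    formative = sum(0.12 * notebook + 0.08 * reflection
                    for notebook, reflection in weekly)
    # The project is worth 40% of the course: 70% group, 30% individual.
    project = 0.40 * (0.70 * project_group + 0.30 * project_individual)
    return formative + project


# Hypothetical marks for one student.
example = course_mark([(80, 70), (90, 60), (75, 85)], 85, 90)
```

The formative part contributes at most 60 points and the project at most 40, so a student with full marks everywhere reaches 100.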

A few comments

  • The course follows an active learning approach. Your participation is crucial for everyone to succeed
  • Turn on your camera during Saturday sync sessions unless you have agreed an exception with the instructor
  • Complete async videos and the individual practice notebook before Saturday
  • Bring your learning journal to reading discussions. Reading lists are shared the week before
  • Use your group Microsoft Teams link for in-group work during sync sessions
  • You are free to use AI during the course but declare your use honestly in every reflection. Do not submit wholesale AI-generated reflections
  • Respect data ethics for the datasets we explore

What is Big Data

Industry and research stretched the term as storage, computing, and pipelines changed.

"Data storage is growing at a higher rate than ever before, and coupled with rapidly increasing demand for instant access, will cause great stress on both the physical and the human infrastructure of computing."

(Mashey, 1999)

"Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze."

(Manyika et al., 2011)

"Big Data is a cultural, technological, and scholarly phenomenon that rests on the interplay of technology, analysis, and mythology."

(boyd & Crawford, 2012)

"Big Data consists of extensive datasets that require a scalable architecture for efficient storage, manipulation, and analysis because of data volume, variety, velocity, and/or variability."

(NIST, 2015)

Common dimensions: 3Vs? 4Vs? 5Vs?

  • Volume - data no longer fits comfortably in memory
  • Velocity - data arrives continuously or in bursts
  • Variety - tables, files, streams, and maps mixed together
  • Veracity - quality, bias, and missing values matter
  • Value - data must support a decision or action that matters
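
To make the Volume dimension concrete, here is a minimal sketch of computing a mean in a single pass, so the full dataset never has to sit in memory. The function name and the toy generator are illustrative assumptions, not part of the course notebooks.

```python
from typing import Iterable


def streaming_mean(values: Iterable[float]) -> float:
    """Mean of an iterable, holding only a running count and total."""
    count = 0
    total = 0.0
    for value in values:
        count += 1
        total += value
    if count == 0:
        raise ValueError("no values supplied")
    return total / count


# A generator yields one value at a time, standing in for rows read
# lazily from a large file or a network stream.
readings = (float(i) for i in range(1, 1_000_001))
result = streaming_mean(readings)
```

The same pattern (keep a small running state, discard each record after use) also applies when data arrives continuously, which is the Velocity case.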

Moore's Law

AI History - Learning from Examples (1987 - present)

Figure: AI history timeline, 1940 to 2030. It marks the First AI Winter (1974-1980) and the Second AI Winter (1987-1994), and lists milestones from the Artificial Neuron (McCulloch & Pitts, 1943) to Deep Blue beats Kasparov (IBM, 1997).

AI History - Big Data (2000s)

Figure: AI history timeline, 1940 to 2030. It marks the First AI Winter (1974-1980), the Second AI Winter (1987-1994), and the Big Data period (2000-2012), and lists milestones from the Artificial Neuron (McCulloch & Pitts, 1943) to Deep Blue beats Kasparov (IBM, 1997).

AI History - Machine Learning Age (2001 - present)

Timeline, 1940 to 2030

Periods

  • First AI Winter (1974-1980)
  • Second AI Winter (1987-1994)
  • Big Data (2000-2012)

Milestones

  • Artificial Neuron (McCulloch & Pitts, 1943)
  • Information Theory (Shannon, 1948)
  • Cybernetics (Wiener, 1948)
  • Updating Rule (Hebb, 1949)
  • Computing Machinery and Intelligence (Turing, 1950)
  • SNARC (Minsky, 1951)
  • AI Term (Dartmouth Workshop, 1956)
  • GPS (Newell & Simon, 1957)
  • Advice Taker (McCarthy, 1958)
  • Back-Propagation (Kelley, 1960)
  • Perceptrons (Rosenblatt, 1962)
  • ELIZA (MIT, 1966)
  • ALPAC Report (USA, 1966)
  • DENDRAL (Buchanan, 1969)
  • Perceptrons Book (Minsky & Papert, 1969)
  • PROLOG (1972)
  • MYCIN (Stanford, 1972)
  • Lighthill Report (UK, 1973)
  • FRAMES (1975)
  • Hopfield net (1982)
  • R1 (McDermott, 1982)
  • Parallel Distributed Processing (Rumelhart & McClelland, 1986)
  • Bayesian Networks (Pearl, 1988)
  • Reinforcement Learning (Sutton, 1988)
  • Image Recognition (LeCun et al., 1990)
  • Deep Blue beats Kasparov (IBM, 1997)
  • Deep Learning (Hinton, 2006)
  • Watson wins Jeopardy (2011)
  • AlexNet (Krizhevsky, 2012)
  • GANs (Goodfellow, 2014)
  • AlphaGo beats Lee Sedol (DeepMind, 2016)
  • Transformer (Vaswani, 2017)
  • AlphaFold (DeepMind, 2018)
  • GPT-1 (OpenAI, 2018)
  • BERT (Google, 2019)
  • Chinchilla (DeepMind, 2022)
  • ChatGPT (OpenAI, 2022)
  • LLaMA (Meta AI, 2023)
  • Claude 2 (Anthropic, 2023)
  • phi-3 (Microsoft, 2024)
  • Gemini 1.5 (Google DeepMind, 2024)
  • Qwen3 (Alibaba, 2025)
  • R1 (DeepSeek, 2025)

Big data in the 2000s

Large and complex datasets improve models' statistical power.

  • Social networks
  • Mobile computing
  • Internet of Things
  • Public and official statistics at a national scale
Big Data dimensions

Big data developments

  • Hadoop and HDFS: distributed file systems
  • MapReduce: batch jobs split into scalable steps
  • Apache Spark: distributed processing
  • NoSQL stores: Cassandra, HBase, MongoDB
  • Stream processing: Kafka, Flink for live data
  • Cloud warehouses: BigQuery, Redshift, Snowflake
  • Lakehouse formats: Parquet, Delta Lake, Iceberg
  • Orchestration: Airflow, Prefect for data pipelines
Big Data dimensions
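
The MapReduce idea in the list above can be sketched in plain Python as three small steps. This is a single-machine illustration with hypothetical function names, not Hadoop's actual API; on a cluster, the map and reduce calls would run on different workers.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group the emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


documents = ["big data big decisions", "data pipelines move data"]
# Each document stands in for a file split handled by a separate mapper.
mapped = chain.from_iterable(map_phase(doc) for doc in documents)
counts = reduce_phase(shuffle(mapped))
```

Because each map call only sees its own document and each reduce call only sees one key, both steps can be spread across machines without coordination beyond the shuffle.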

In this course, we explore situations where at least one dimension of big data forces a non-trivial choice across the data management process.

  • Formal labels and Vs are useful, but they are not enough on their own
  • More data is not automatically better understanding (boyd & Crawford, 2012)
  • We start from the problem and the decision, not from a platform logo
  • For us, the most important V is value
Big Data dimensions

Problem First

Technocentric view

Start from the technologies and tools, then look for a problem that fits the tool.

  • Mundane applications for problems we did not know we had
  • Disregard for social and environmental implications
  • Unrealistic expectations and hype
  • Exclusion of diverse perspectives and voices
  • Unsustainable technologies
  • Increased inequality and digital divide
  • Security and privacy concerns
  • ...
ML System?
https://xkcd.com/1838/, CC BY-NC 2.5, via XKCD

The Systems View
AI System

The systems view prioritises the problem before anything else. (Cabrera et al., 2025)

Context and people

Questions before tools:

Talk to or observe the system's stakeholders

  • Who needs a decision and why now?
  • What would count as success?
  • What constraints apply in the real organisation?
  • What data would you need to defend an answer?
  • What harm could a wrong answer cause?
Project canvas

MLTRL - Technology Readiness Levels for Machine Learning Systems (Lavin et al., 2022)

MLTRL Framework

Course project

  • You are an analytics team supporting public policy in Colombia
  • Your readers are non-technical decision makers
  • Your accountability is to the public whose data you handle

Pick one decision option

  • Resource allocation: where to target programmes
  • Risk or early warning: where conditions worsen fastest
  • Monitoring: how an indicator moves over time

Reference question: What does our pipeline say about labour patterns for a defined population, and what would we and would we not recommend on that basis?

Course project definition (PDF)

The Big Data Pipeline and Ecosystem

Pipeline template

Cloud setup (client-server)

  • A server exposes an interface over the network
  • A client consumes that interface to request data or compute
  • Big data pipelines often split work across both sides
Client-server architecture
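
The client-server split described above can be sketched with the Python standard library alone. The handler name and the JSON payload are hypothetical; a real data service would sit behind a proper web framework and authentication.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class RowCountHandler(BaseHTTPRequestHandler):
    """Server side: exposes one read-only endpoint over HTTP."""

    def do_GET(self):
        body = json.dumps({"rows": 3}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), RowCountHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: consume the interface without knowing how it is implemented.
port = server.server_address[1]
connection = HTTPConnection("127.0.0.1", port, timeout=5)
connection.request("GET", "/")
payload = json.loads(connection.getresponse().read())
connection.close()

server.shutdown()
server.server_close()
```

The client only depends on the URL and the shape of the JSON response, which is what lets the server change its storage or compute without breaking its consumers.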

Service-Oriented Architecture (SOA) is a design pattern in which components provide services to one another through a communication protocol over a network.


Microservices are an architectural style that structures an application as a collection of small, autonomous services. Each microservice is self-contained and exposes a business capability, which can be implemented, for example, as an object in object-oriented programming (OOP).


The concept of "Everything as a Service" (XaaS) extends the principles of SOA and microservices by offering comprehensive services over the internet. XaaS encompasses a wide range of services, including infrastructure, platforms, and software.

AI System

AI as a Service (AIaaS) lets us access and expose AI capabilities over the internet. We can integrate AI tools such as machine learning models, natural language processing, and computer vision into our applications by leveraging the features of SOA and microservices.

AI System

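from flask import Flask, request, jsonify

app = Flask(__name__)


class SentimentAnalysisService:
    """Wraps a sentiment model behind a small, service-style interface."""

    def __init__(self, model):
        self.model = model

    def analyze_sentiment(self, text):
        # The thresholds assume the model returns a score between -1 and 1.
        sentiment_score = self.model.predict(text)
        if sentiment_score > 0.5:
            return "Positive"
        elif sentiment_score < -0.5:
            return "Negative"
        else:
            return "Neutral"
...
# `service` is assumed to be a SentimentAnalysisService instance
# created in the elided code above.
@app.route('/analyze', methods=['POST'])
def analyze():
    # Fall back to an empty dict when the body is missing or not valid JSON.
    data = request.get_json(silent=True) or {}
    text_to_analyze = data.get('text', '')
    sentiment = service.analyze_sentiment(text_to_analyze)
    return jsonify({'sentiment': sentiment})
...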
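
A minimal sketch of how the service class above can be exercised without a web server, which is useful for checking the threshold logic in isolation. The `KeywordModel` class is a hypothetical stand-in for a trained model and its word lists are invented for illustration; the service class is repeated so the block runs on its own.

```python
class SentimentAnalysisService:
    """Same logic as the service on the slide, repeated for self-containment."""

    def __init__(self, model):
        self.model = model

    def analyze_sentiment(self, text):
        sentiment_score = self.model.predict(text)
        if sentiment_score > 0.5:
            return "Positive"
        elif sentiment_score < -0.5:
            return "Negative"
        else:
            return "Neutral"


class KeywordModel:
    """Hypothetical model: scores text between -1 and 1 from keyword counts."""

    POSITIVE = {"good", "great", "useful"}
    NEGATIVE = {"bad", "poor", "broken"}

    def predict(self, text):
        words = text.lower().split()
        positive = sum(word in self.POSITIVE for word in words)
        negative = sum(word in self.NEGATIVE for word in words)
        # Avoid dividing by zero when no keyword matches.
        return (positive - negative) / max(positive + negative, 1)


service = SentimentAnalysisService(KeywordModel())
texts = ("great useful tool", "bad broken pipeline", "a data pipeline")
labels = [service.analyze_sentiment(text) for text in texts]
```

Because the model is injected through the constructor, the same service can be tested with a stub like this one and deployed with a real model.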

Focus on Operations:

  • Separation of concerns
  • High availability
  • Scalability
  • Low latency

Data is secondary and hidden behind services' interfaces, making data collection difficult

Centralised deployments are not sustainable and threaten data privacy and ownership

Data-Oriented Architectures make data available by design, which facilitates monitoring and maintenance. Decentralisation supports local data processing, which reduces latency and improves privacy by respecting data ownership. Openness makes it possible to manage resource-constrained environments by exploiting the computing power of everyday devices (Cabrera et al., 2025).


Edge deployment

Edge computing

  • Process and filter close to sensors, devices, or users
  • Edge nodes have limited compute and storage
  • Trade-offs vs cloud: latency, connectivity, cost, governance, and data ownership
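
The "process and filter close to sensors" idea can be sketched as a small function that runs on the edge node and forwards only a summary. The function name, the threshold, and the readings are illustrative assumptions.

```python
from statistics import mean


def summarise_at_edge(readings, threshold):
    """Keep only what the cloud needs: a compact summary plus the outliers."""
    alerts = [reading for reading in readings if reading > threshold]
    return {
        "count": len(readings),
        "mean": mean(readings) if readings else None,
        "alerts": alerts,
    }


# Ten raw sensor readings stay on the device; one small dictionary is sent upstream.
raw = [21.0, 21.5, 22.0, 35.0, 21.0, 20.5, 21.5, 22.5, 36.5, 21.0]
summary = summarise_at_edge(raw, threshold=30.0)
```

The trade-off is visible even at this scale: less data crosses the network, but the cloud side can no longer recompute anything that the summary left out.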

Pipeline template

The Data Science Process

Data Science Process

Data Access

Does the data even exist?

No, it does not exist!

Primary Data Collection

  • Questionnaires
  • Interviews
  • Focus group interviews
  • Surveys
  • Case studies
  • Process analysis
  • Experimental method
  • Statistical method
  • ...

  • Expensive data collection processes
  • Particular methodologies
  • No single method fits all needs

Yes, it does exist!

Secondary Data Collection

  • Published printed sources
  • Books, journals, magazines, newspapers
  • Government records
  • Census data
  • Public sector records
  • Electronic sources (e.g., public datasets, websites)
  • ...

  • Validity and reliability concerns
  • Outdated data
  • Relevance issues

Crash Map Kampala

  • Road traffic accidents are a leading cause of death for the young in many contexts
  • Data was difficult to access, which made any analysis of the problem impossible

Figures: Crash Map Location, Crash Map Date Time, Crash Map Severity, Crash Map Vehicles 2
Crash Map Kampala was an initiative by Michael T. Smith and Bagonza Jimmy Owa Kinyonyi to map the location, date, and severity of vehicle accidents across the city of Kampala. The data was originally stored in police logbooks.

Dataset creation is a critical process that cannot be skipped if we want to implement a big data project. We can assume that datasets already exist, but such datasets will likely only enable toy projects that are not relevant to our society.

Gran Encuesta Integrada de Hogares (GEIH)

  • Published by DANE (Departamento Administrativo Nacional de Estadística)
  • Official basis for labour market statistics in Colombia
  • Monthly ZIPs with labour, demographics, housing, and other modules
  • DANE does not expose an API, so we must crawl its web portal

Accessing GEIH via Web Crawling

  • Each month has a catalog id and file id on the DANE portal
  • Build a direct download URL and stream the ZIP to disk
  • Lecture 1 notebook
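import requests

# CATALOG_ID and FILE_ID identify one monthly release on the DANE portal, and
# zip_path is a pathlib.Path for the local ZIP. All three are assumed to be
# defined beforehand.
download_url = (
    "https://microdatos.dane.gov.co/index.php/catalog/"
    f"{CATALOG_ID}/download/{FILE_ID}"
)
# Stream the response so the whole ZIP is never held in memory.
with requests.get(download_url, stream=True, timeout=600) as resp:
    resp.raise_for_status()
    with zip_path.open("wb") as fh:
        for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
            if chunk:
                fh.write(chunk)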
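
Once a monthly ZIP is on disk, a natural next step is to check which module files it contains before loading any of them. The sketch below uses only the standard library; the file names and contents are invented stand-ins, since the real GEIH archive layout is not described here.

```python
import io
import zipfile


def list_zip_members(zip_source):
    """Return (name, size in bytes) for each file inside a ZIP archive."""
    with zipfile.ZipFile(zip_source) as archive:
        return [
            (info.filename, info.file_size)
            for info in archive.infolist()
            if not info.is_dir()
        ]


# Build a tiny in-memory archive so the sketch runs without any download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("labour.csv", "id;status\n1;employed\n")
    archive.writestr("housing.csv", "id;rooms\n1;3\n")
buffer.seek(0)
members = list_zip_members(buffer)
```

`list_zip_members` accepts a file path as well as a file-like object, so the same call should work on a ZIP saved by a download step.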

OpenStreetMap

  • Collaborative map with tagged points of interest
  • Coverage and quality vary by region
  • OSM adds geographic context
  • OSM exposes the public Overpass API
Map

Query OSM with Overpass

  • POST an Overpass QL query to a public endpoint
  • Set a descriptive User-Agent
  • Lecture 2 notebook
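import requests

# OVERPASS_ENDPOINTS (a list of URLs) and OVERPASS_HEADERS (a dict that sets
# the User-Agent) are assumed to be defined beforehand.
def overpass_post(query: str, *, timeout: int = 90) -> requests.Response:
    """POST an Overpass QL query, trying each endpoint until one answers 200."""
    last_error = None
    for url in OVERPASS_ENDPOINTS:
        try:
            r = requests.post(
                url,
                data={"data": query},
                headers=OVERPASS_HEADERS,
                timeout=timeout,
            )
        except requests.RequestException as exc:
            last_error = exc  # network failure: try the next endpoint
            continue
        if r.status_code == 200:
            return r
        last_error = requests.HTTPError(f"{url} returned {r.status_code}", response=r)
    raise last_error or RuntimeError("OVERPASS_ENDPOINTS is empty")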
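
To show what the query string and the response handling might look like, here is a minimal sketch that builds an Overpass QL query for one amenity inside a bounding box and extracts points from a JSON response. The helper names, the bounding box (roughly around Bogotá), and the sample response are illustrative assumptions; the sample is hand-written so the block runs offline.

```python
def build_amenity_query(amenity, south, west, north, east, timeout=25):
    """Overpass QL for nodes tagged with one amenity inside a bounding box."""
    bbox = f"{south},{west},{north},{east}"
    return (
        f"[out:json][timeout:{timeout}];"
        f'node["amenity"="{amenity}"]({bbox});'
        "out body;"
    )


def extract_points(payload):
    """Pull (name, lat, lon) out of an Overpass JSON response."""
    return [
        (element.get("tags", {}).get("name"), element["lat"], element["lon"])
        for element in payload.get("elements", [])
        if element.get("type") == "node"
    ]


# Illustrative bounding box; adjust the coordinates for real use.
query = build_amenity_query("hospital", 4.45, -74.25, 4.85, -73.95)

# Hand-written sample in the shape Overpass returns for nodes.
sample = {
    "elements": [
        {"type": "node", "id": 1, "lat": 4.6, "lon": -74.1,
         "tags": {"name": "Example Hospital"}},
        {"type": "node", "id": 2, "lat": 4.7, "lon": -74.0},
    ]
}
points = extract_points(sample)
```

Nodes without a `name` tag are kept with `None` as the name, which makes gaps in OSM coverage visible instead of silently dropping them.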

Data Science Process

Pipeline template

Conclusions

Lecture 1

  • Big data as a design problem
  • Problem first and systems view
  • The big data pipeline
  • The "Access" stage

Lecture 2

  • Big data requirements
  • Ethics challenges
  • Data privacy and ownership
  • Data governance

This Week's Notebooks

Individual notebooks

Templates and deliverables are available on the Lecture 2 course page.

Many Thanks!

chc79@cam.ac.uk
