Session 5 - Data ingestion and workflow

Data ingestion and workflow

Christian Cabrera Jojoa

Assistant Research Professor

Department of Computer Science and Technology

University of Cambridge

chc79@cam.ac.uk

Last Time

Week 2 discussion word cloud

Can we predict everything?

"An intellect which ... would know all forces that set nature in motion ... for such an intellect nothing would be uncertain and the future just like the past would be present before its eyes." (Laplace, 1814)

Laplace's demon
Laplace’s demon working away on the natural laws, data and computation necessary for deterministic prediction. (Lawrence, 2026)

"...the curve described by a molecule of air ... is regulated in a manner just as certain as the planetary orbits; the only difference between them is that which comes from our ignorance." (Laplace, 1814)

Why the demon is a myth

  • Perfect prediction would need complete laws, complete data, and unlimited compute
  • In practice all three are incomplete, so we model uncertainty with probability
  • Quantum mechanics suggests that unpredictability is a hard law of nature

We built a lakehouse

  • Raw files landed cheaply, like a lake
  • We queried them with warehouse-style structure
  • A lakehouse is both: low-cost storage plus management features
Data lake, warehouse, and lakehouse

We processed at scale with MapReduce

MapReduce dataflow: input, map, shuffle, reduce
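
As a reminder of the dataflow, here is a minimal single-machine sketch of the three phases applied to a word count. The function names and the two input lines are illustrative assumptions; a real MapReduce engine runs each phase in parallel across many machines.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit one (word, 1) pair per word in a line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)

# Hypothetical input: two lines of text.
lines = ["the lake stores raw data", "the warehouse stores curated data"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```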

Storage and processing are stages of one pipeline — today we connect them with ingestion

Big data pipeline

Data Ingestion

Ingestion is the stage that moves data from its sources into the place where we store, process, and analyse it.

The big data pipeline

Ingestion feeds storage, processing, and analytics:

  • Curate the data we want to model
  • Consolidate the data we want to use (i.e., data staging)

Two questions decide how we ingest any source

  • How much? The volume we must move and store
  • How often and how fresh? The velocity the decision needs

The answers split ingestion into two families: batch and stream.

Batch and ETL

Batch Processing

Batch ingestion collects data into a bounded chunk and processes it on a schedule, for example every night.

Batch processing: bounded input, schedule trigger, ETL run, curated layer

Collect a complete snapshot, then process it on a clock.

Where it comes from

  • The oldest model of computing: mainframes ran scheduled jobs overnight
  • Payroll, billing, statements, and reports were produced in bulk
  • Data warehouses kept the same rhythm: load, then report

Why it still dominates

  • High throughput: move a lot of data per run
  • Latency of hours or days is fine for many decisions
  • We do it once and reuse a stable snapshot (i.e., strong consistency)

Extract, Transform, Load

  • Extract: read the data from its source
  • Transform: clean, join, rename, fix types
  • Load: write it where analysts can use it

ETL is the classic recipe of a batch pipeline. We met it in Lecture 3. Now we automate it.
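
The three steps can be sketched with the standard library alone. This is a minimal illustration, not the lab pipeline: the CSV text, the column names, and the in-memory SQLite target are all assumptions chosen so the example runs anywhere.

```python
import csv
import io
import sqlite3

# Hypothetical raw source: in practice this would be a file or an API response.
RAW_CSV = """device_id,temperature_c
sensor-001, 22.8
sensor-002,not_a_number
sensor-001,23.1
"""

def extract(text):
    """Extract: read rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: fix types and drop rows that cannot be parsed."""
    clean = []
    for row in rows:
        try:
            clean.append((row["device_id"].strip(), float(row["temperature_c"])))
        except ValueError:
            continue  # skip malformed readings
    return clean

def load(rows, conn):
    """Load: write the cleaned rows where analysts can query them."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings (device_id TEXT, temperature_c REAL)"
    )
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
loaded = load(transform(extract(RAW_CSV)), conn)
print(f"loaded {loaded} rows")
```

Keeping each step as its own function is what later lets an orchestrator schedule, retry, and log the steps separately.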

Pipeline template

ETL versus ELT

ETL or ELT? Warehouses transform before load, so data lands clean. Lakehouses load raw data first and transform it afterwards, often at query time. Either way we keep staging and curated layers.
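
To make the ELT ordering concrete, here is a small sketch in which raw values land untouched in a staging table and the cleaning happens in a view that is evaluated when queried. SQLite stands in for the lakehouse engine, and the table, view, and column names are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Load: land the raw values untouched, all as text (the staging layer).
conn.execute("CREATE TABLE staging_readings (device_id TEXT, temperature_c TEXT)")
raw_rows = [
    ("sensor-001", "22.8"),
    ("sensor-002", "not_a_number"),
    ("sensor-001", "23.1"),
]
conn.executemany("INSERT INTO staging_readings VALUES (?, ?)", raw_rows)

# Transform: a view cleans the data each time it is queried (the curated layer).
conn.execute("""
    CREATE VIEW curated_readings AS
    SELECT device_id, CAST(temperature_c AS REAL) AS temperature_c
    FROM staging_readings
    WHERE temperature_c GLOB '[0-9]*'
""")

curated = conn.execute(
    "SELECT * FROM curated_readings ORDER BY temperature_c"
).fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM staging_readings").fetchone()[0]
print(raw_count, curated)
```

The malformed row is still in the staging table, so a better cleaning rule can be applied later without re-ingesting the source.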

A real pipeline has many steps that must run in order. If one fails, we need to know which and recover. Orchestration manages that.

What an orchestrator gives us

  • Declare steps and the order between them (a workflow)
  • Schedule runs (e.g. nightly) without a human
  • Retries on failure and observability: logs of every run
Orchestrated workflow: scheduler, Extract Transform Load Validate, governance sidecar
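
The three features in the list can be imitated in a few lines of plain Python. This is a toy sketch under stated assumptions, not how Airflow, Dagster, or Prefect are implemented: `run_workflow`, the step names, and the flaky step are hypothetical, and scheduling is left out.

```python
from graphlib import TopologicalSorter

def run_workflow(tasks, depends_on, max_retries=2):
    """Run tasks in dependency order, retrying failures and logging every attempt."""
    log = []
    for name in TopologicalSorter(depends_on).static_order():
        for attempt in range(1, max_retries + 2):
            try:
                tasks[name]()
                log.append((name, attempt, "ok"))
                break
            except Exception as exc:
                log.append((name, attempt, f"failed: {exc}"))
        else:
            # Every attempt failed: stop the run and name the broken step.
            raise RuntimeError(f"step {name!r} failed after {max_retries + 1} attempts")
    return log

calls = {"transform": 0}

def flaky_transform():
    """Hypothetical step that fails once, then succeeds."""
    calls["transform"] += 1
    if calls["transform"] == 1:
        raise ValueError("temporary error")

tasks = {
    "extract": lambda: None,
    "transform": flaky_transform,
    "load": lambda: None,
    "validate": lambda: None,
}
# Each step maps to the set of steps that must finish before it.
depends_on = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "validate": {"load"},
}
run_log = run_workflow(tasks, depends_on)
for entry in run_log:
    print(entry)
```

The log records the failed first attempt of the transform step as well as its successful retry, which is the observability the list refers to.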

Declare steps once. Run locally today, on schedule in production.

The tools

  • Apache Airflow — the industry standard, schedules DAGs
  • Dagster — asset-oriented, strong on data lineage
  • Prefect — Pythonic flows; what we use in the lab

We express steps as tasks and connect them in a flow. The same code runs locally today and on a schedule in production.

A Prefect flow

  • Each @task is one ETL step
  • The @flow chains them in order
  • Prefect logs each run and can retry
from prefect import flow, task
from polars import read_parquet, scan_parquet   # assumption: the frame API below is Polars

@task
def extract(paths):              # Extract: open partitions (lazy)
    return scan_parquet(paths)

@task
def aggregate(frame, output):    # Transform + Load: write curated layer
    result = (frame.filter(...)
                   .group_by(keys).agg(...).collect())
    result.write_parquet(output)
    return output

@task
def validate(path):              # Validate: the output must have rows
    assert read_parquet(path).height > 0
    return path

@flow(name="batch_ingest")
def batch_flow(paths, output):   # one flow chains the steps
    path = aggregate(extract(paths), output)
    return validate(path)

Automating a pipeline raises a question from Lecture 2: can we trust and reproduce what it produced? Governance is built into ingestion.

Mechanisms we add to the flow

  • Audit log: who ran what, when, from which source
  • Schema contract: the columns and types a step promises
  • Lineage: trace an output back to its inputs
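import json
from datetime import datetime, timezone

AUDIT_LOG = "audit.jsonl"        # hypothetical path, one JSON record per line

def now_iso():                   # hypothetical helper: current UTC time, ISO 8601
    return datetime.now(timezone.utc).isoformat()

def append_audit(event):
    """One provenance record per pipeline step."""
    record = {**event, "ts": now_iso()}
    with open(AUDIT_LOG, "a") as f:
        f.write(json.dumps(record) + "\n")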

contract = {
    "columns": {"key": "str", "count": "int"},
    "source": "curated/aggregate.parquet",
    "schema_version": 1,
    "retention_days": 365,
}
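
A contract is only useful if a step checks it. The sketch below shows one way to do that for rows held as dictionaries. The names `example_contract`, `TYPE_NAMES`, and `check_contract` are hypothetical, and the mapping from type names to Python types is an assumption about how the contract is meant to be read.

```python
# Hypothetical contract in the same shape as the slide's example.
example_contract = {"columns": {"key": "str", "count": "int"}, "schema_version": 1}

# Assumed mapping from the type names in the contract to Python types.
TYPE_NAMES = {"str": str, "int": int, "float": float}

def check_contract(rows, contract):
    """Return a list of violations; an empty list means the rows honour the contract."""
    expected = contract["columns"]
    problems = []
    for i, row in enumerate(rows):
        if set(row) != set(expected):
            problems.append(f"row {i}: columns {sorted(row)} != {sorted(expected)}")
            continue
        for column, type_name in expected.items():
            if not isinstance(row[column], TYPE_NAMES[type_name]):
                problems.append(f"row {i}: {column} is not {type_name}")
    return problems

good = [{"key": "a", "count": 3}]
bad = [{"key": "a", "count": "3"}, {"key": "b"}]

print(check_contract(good, example_contract))
print(check_contract(bad, example_contract))
```

A validate step could fail the run whenever the returned list is non-empty and write the violations to the audit log.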

Streams

A stream is data that arrives continuously as events over time. It never ends and we cannot wait for a nightly batch.

Where it comes from

  • Web and mobile logs, clickstreams, application events
  • Sensors and IoT devices emitting readings
  • Financial ticks, messages, and news feeds

These systems produce data faster than batch windows can absorb, and decisions need it soon (i.e., in near real time).

Stream processing: unbounded timeline, time windows, stream processor, outputs

Process events as they arrive. Results over time windows.
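
One common way to produce results over time windows is a tumbling window: fixed-size, non-overlapping intervals. The sketch below counts events per one-minute window as they are consumed. The window size, the function names, and the three sample events are assumptions, and it ignores late or out-of-order events, which real stream processors must handle.

```python
from collections import defaultdict
from datetime import datetime

WINDOW_SECONDS = 60  # assumed window size

def window_start(timestamp, size=WINDOW_SECONDS):
    """Return the epoch second at which the event's tumbling window begins."""
    epoch = int(datetime.fromisoformat(timestamp).timestamp())
    return epoch - epoch % size

def count_per_window(events):
    """Consume events one at a time and keep a running count per window."""
    counts = defaultdict(int)
    for event in events:
        counts[window_start(event["timestamp"])] += 1
    return dict(counts)

# Hypothetical events: two in the same minute, one in the next.
events = [
    {"timestamp": "2023-11-07T15:03:21+00:00"},
    {"timestamp": "2023-11-07T15:03:59+00:00"},
    {"timestamp": "2023-11-07T15:04:05+00:00"},
]
window_counts = count_per_window(events)
print(window_counts)
```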

Batch

  • Bounded data, run on a schedule
  • High throughput, latency hours/days
  • Consistent on a snapshot

Stream

  • Unbounded events, processed as they arrive
  • Low latency, results over windows of time
  • Eventual consistency

Production systems run both: batch for the heavy, governed history; stream for the fresh signal. The question is how producers and consumers talk.

In publish/subscribe, producers publish events to a named topic, and consumers subscribe to it. Neither side knows the other.

Publish subscribe: producers, append-only topic log, consumers

A topic is an append-only log. Consumers track their offset.

A topic is an append-only log

  • New events are added at the end, never edited
  • Each event has an offset, its position in the log
  • Events are retained, so consumers can replay
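
The three properties above fit in a small in-memory model. `TopicLog` and `Consumer` are hypothetical classes written for this illustration; a real broker persists the log to disk and applies a retention policy.

```python
class TopicLog:
    """In-memory append-only log: events are added at the end and never edited."""

    def __init__(self):
        self._events = []

    def append(self, event):
        """Store the event and return its offset (its position in the log)."""
        self._events.append(event)
        return len(self._events) - 1

    def read(self, offset):
        """Return every retained event from the given offset onwards."""
        return self._events[offset:]


class Consumer:
    """Each consumer tracks its own offset, so consumers do not affect each other."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        """Return the events not yet seen and advance the offset past them."""
        batch = self.log.read(self.offset)
        self.offset += len(batch)
        return batch


log = TopicLog()
for reading in ("a", "b", "c"):
    log.append(reading)

fast = Consumer(log)
first_poll = fast.poll()      # everything appended so far
log.append("d")
second_poll = fast.poll()     # only the event added since the last poll

late = Consumer(log)          # a new consumer replays from offset 0
replayed = late.poll()
print(first_poll, second_poll, replayed)
```

Because reading never removes events, the late consumer can replay the full history without any change on the producer side.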

Why decouple?

  • Producers and consumers scale and fail independently
  • Add a new consumer without touching producers
  • A buffer absorbs bursts of traffic

The producer side

  • Serialise each event to JSON
  • Send it to the topic
  • The topic is created on first write, if the broker allows auto-creation
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for event in events:
    producer.send("events", event)   # publish to the topic
producer.flush()                     # make sure everything left

The consumer side

  • Read from the start of the log
  • De-duplicate by a stable id
  • Land events in a store for later
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",    # from the start of the log
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

seen = set()
for message in consumer:             # events as they arrive
    event = message.value
    if event["id"] not in seen:      # de-duplicate
        seen.add(event["id"])
        store.insert_one(event)      # land in a document store

Stream events are semi-structured

  • Each event is a JSON document whose fields may vary
  • Their natural home is a document store (e.g. MongoDB), not a rigid table

{
  "id": "f3c7a12d-8d65-4132-b2a1-ef83dbcb7171",
  "type": "temperature_reading",
  "timestamp": "2023-11-07T15:03:21Z",
  "device_id": "sensor-001",
  "temperature_c": 22.8,
  "humidity_percent": 41.3,
  "location": {
      "room": "lab-3A",
      "building": "north-wing"
  }
}
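
Because fields may vary from one event to the next, consumer code should not assume every field is present. A minimal sketch, assuming only `id` is mandatory; the `summarise` function and the second event type are hypothetical.

```python
import json

def summarise(raw):
    """Parse one JSON event and read optional fields without assuming they exist."""
    event = json.loads(raw)
    location = event.get("location", {})
    return {
        "id": event["id"],                       # required: used for de-duplication
        "type": event.get("type", "unknown"),
        "room": location.get("room"),            # None when the field is absent
        "temperature_c": event.get("temperature_c"),
    }

full = (
    '{"id": "e1", "type": "temperature_reading", '
    '"temperature_c": 22.8, "location": {"room": "lab-3A"}}'
)
sparse = '{"id": "e2", "type": "door_opened"}'   # hypothetical event with fewer fields

full_summary = summarise(full)
sparse_summary = summarise(sparse)
print(full_summary)
print(sparse_summary)
```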

A single log on one machine is not enough at scale. Streaming platforms distribute the log and process it in parallel.

Distribution, partitioning, and parallelism

Publish subscribe streaming architecture: producers, partitioned topic across brokers, consumer group

How the architecture scales

  • A topic is split into partitions spread across brokers
  • Each partition is read by exactly one consumer in the group: horizontal parallelism
  • Partitions are replicated: a broker can fail without data loss
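
Partitioning and group assignment can be sketched in a few lines. This is an illustration under assumptions, not Kafka's actual algorithm: CRC32 stands in for the platform's own hash function, round-robin stands in for its assignment strategy, and the partition count and consumer names are made up.

```python
import zlib

NUM_PARTITIONS = 4  # assumed partition count

def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a key to a partition with a stable hash, so one key always lands in one partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def assign(num_partitions, consumers):
    """Round-robin assignment: every partition goes to exactly one consumer in the group."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        assignment[consumers[partition % len(consumers)]].append(partition)
    return assignment

assignment = assign(NUM_PARTITIONS, ["consumer-a", "consumer-b"])
print(assignment)
print(partition_for("sensor-001"))
```

With two consumers and four partitions each consumer reads two partitions. Adding consumers beyond the partition count would leave the extra ones idle, which is why the partition count caps the parallelism of a group.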

The tools

  • Apache Kafka — the distributed log and broker
  • Apache Flink / Spark Streaming — stream processing engines
  • Bytewax — stream processing in pure Python

"Kafka is a distributed messaging system ... for collecting and delivering high volumes of log data with low latency." (Kreps et al., 2011)

Conclusions

  • Ingestion feeds storage, processing, and analytics
  • Volume and velocity choose batch or stream
  • Batch: orchestrated ETL on a schedule
  • Stream: publish/subscribe over a retained log
  • Governance travels with the data
  • Production systems run both modes
Batch and stream ingestion compared side by side

This Week's Notebooks

Many Thanks!

chc79@cam.ac.uk
