Session 4 - Data processing and analysis

Data processing and analysis

Christian Cabrera Jojoa

Assistant Research Professor

Department of Computer Science and Technology

University of Cambridge

chc79@cam.ac.uk

Session 4 - Data processing and analysis

Last Time

Session 4 - Data processing and analysis

Last Time

We built a lakehouse

  • We harmonised messy monthly tables into one clean table
  • We stored it as partitioned Parquet in a small lakehouse
  • Cheap storage, columnar files, query-ready by year and month

As an example, our GEIH lakehouse lives under geih-spine, split into year and month.

Data lake, warehouse, and lakehouse
Session 4 - Data processing and analysis

Last Time

Big data pipeline, process stage

Storing was never the goal

  • We started from a question, not from a tool
  • Access and storage only set the table
  • Today we process: turn stored data into answers

The stored data is neutral. The question, who is employed by department, is answered now.

Session 4 - Data processing and analysis

Last Time

One question, many ways to answer it

  • The same query runs on many processing engines
  • We follow the history that produced them
  • From MapReduce to modern query engines

"Big Data consists of extensive datasets ... that require a scalable architecture for efficient storage, manipulation, and analysis." (NIST, 2015)

We already did storage. Manipulation and analysis are today.

Session 4 - Data processing and analysis

The Need to Process

Session 4 - Data processing and analysis

The Need to Process

Stored data answers nothing on its own. What do we want to know?

Session 4 - Data processing and analysis

The Need to Process

Processing is its own stage

  • Filter: keep only the rows that matter
  • Aggregate: count, sum, and average over groups
  • Derive: compute new facts the raw data did not state

As an example, the lakehouse stores activity as a raw code. Deciding who counts as employed happens here, when we process.

Pipeline template, process stage
Session 4 - Data processing and analysis

The Need to Process

Operational systems to data warehouse to analytics

Early processing: from records to reports

  • First systems just recorded transactions, one at a time
  • Managers wanted summaries, not single records
  • So data was copied into a separate place to analyse

Reading for analysis is a different job from writing transactions. That split is what created processing as a stage.

Session 4 - Data processing and analysis

The Need to Process

Operational systems to data warehouse to analytics

Data warehouses and history

  • A warehouse integrates clean data from many systems
  • It keeps years of history, organised by time
  • This is OLAP: few big reads that scan and group

Keeping history is what made analysis possible. You cannot study a trend you did not store.

Session 4 - Data processing and analysis

The Need to Process

The web breaks the single machine

  • Web 2.0 and 3.0 produced endless, messy streams
  • Clicks, text, maps, and sensors, not neat tables
  • Too much data to process on one computer

The same variety that changed storage now changes compute. One machine can no longer scan it all.

Distributed, decentralised compute
Session 4 - Data processing and analysis

The Need to Process

The answer was to spread the work across many machines. The first widely used recipe for that gave big data its mental model, MapReduce.

Session 4 - Data processing and analysis

Big Data Processing

Session 4 - Data processing and analysis

MapReduce

A recipe for many machines (2004)

  • Google needed to process the whole web
  • The trick: express work as map then reduce
  • The system handles splitting, failures, and machines

You write two small functions. The framework runs them across a cluster.

"MapReduce is a programming model and an associated implementation for processing and generating large data sets." (Dean & Ghemawat, 2008)

Its open-source version, Hadoop MapReduce, started the big data tooling era.

Session 4 - Data processing and analysis

MapReduce

Map, then group by key, then reduce

MapReduce dataflow: input, map, shuffle, reduce
Session 4 - Data processing and analysis

MapReduce

Three phases

  • Map: emit a key-value pair per record
  • Shuffle: group all values by key
  • Reduce: aggregate each group

Here the key is department and the value is the survey weight.

def map_partition(path):
    df = read(path, ["department", "activity", "weight"])
    emp = df[df.activity == 1]            # derive: employed
    return zip(emp.department, emp.weight)  # emit (key, value)

def reduce(key, values):                  # one group at a time
    return {"count": len(values),
            "sum": sum(values)}
Session 4 - Data processing and analysis

MapReduce in R

The usual method first

  • Load the table, then group by a key
  • group_by(...) %>% summarise(...) is the reduce
  • Fine while the data fits in memory

Most of you already write this with dplyr.

library(arrow); library(dplyr)

open_dataset("geih-spine/") %>%
  filter(actividad == 1) %>%          # filter: employed
  collect() %>%
  group_by(dpto) %>%                  # group by key
  summarise(
    employed     = n(),
    weighted_sum = sum(factor_expansion)
  )
Session 4 - Data processing and analysis

MapReduce in R

By hand: one file at a time

  • Map: read one partition, keep employed
  • Reduce: group_by per file
  • lapply over files is the map step

Same pattern as the Python notebook, in R syntax.

map_partition <- function(path) {
  df <- read_parquet(path,
    col_select = c("dpto", "actividad", "factor_expansion"))
  df[df$actividad == 1, c("dpto", "factor_expansion")]  # emit
}

reduce_part <- function(emp) {
  emp %>% group_by(dpto) %>%
    summarise(count = n(), sum = sum(factor_expansion))
}

parts  <- lapply(files, \(f) reduce_part(map_partition(f)))
result <- bind_rows(parts)   # then average by dpto
Session 4 - Data processing and analysis

MapReduce

MapReduce is the mental model, but writing it by hand is verbose. Modern query engines run the same map, group, and reduce for us, and also handle everyday data operations. We meet three: DuckDB, pandas, and Polars.

Session 4 - Data processing and analysis

Query Engines

CRUD: the four basic operations

  • Create: add new rows
  • Read: select and filter rows
  • Update: change values in place
  • Delete: remove rows

Analysis adds two more

  • Group by a key, such as department
  • Aggregate each group: count, sum, average

Every engine offers all of these. Only the syntax changes.

Session 4 - Data processing and analysis

Query Engines

DuckDB: CRUD in SQL

  • An analytical database inside the notebook, no server
  • SQL is the standard language of databases
INSERT INTO people VALUES (101, 5, 1, 1.2);  -- Create
SELECT * FROM people WHERE department = 5;   -- Read
UPDATE people SET activity = 2 WHERE id = 101;  -- Update
DELETE FROM people WHERE weight IS NULL;     -- Delete
Session 4 - Data processing and analysis

Query Engines

DuckDB: group by and aggregate

  • WHERE filters rows
  • GROUP BY is the reduce
  • Reads Parquet files directly
SELECT
    department,
    COUNT(*)    AS employed,
    SUM(weight) AS weighted_sum
FROM read_parquet('geih-spine/**/*.parquet')
WHERE activity = 1          -- filter: employed
GROUP BY department         -- the reduce
ORDER BY department;
Session 4 - Data processing and analysis

Query Engines

pandas: CRUD in memory

  • The familiar DataFrame, loaded in memory
  • Index and boolean masks select rows
df.loc[len(df)] = {"id": 101, "department": 5,
                   "activity": 1, "weight": 1.2}  # Create
df[df["department"] == 5]                         # Read
df.loc[df["id"] == 101, "activity"] = 2           # Update
df = df[df["weight"].notna()]                     # Delete
Session 4 - Data processing and analysis

Query Engines

pandas: group by and aggregate

  • filter with a boolean mask
  • groupby(...).agg(...) is the reduce
  • Great up to a few million rows
df["employed"] = df["activity"] == 1       # derive
(df[df["employed"]]                        # filter
   .groupby("department")                  # group by key
   .agg(employed=("employed", "count"),
        weighted_sum=("weight", "sum")))
Session 4 - Data processing and analysis

Query Engines

Polars: CRUD, lazy and parallel

  • Built for speed, uses all CPU cores
  • Columns are added with with_columns
df = pl.DataFrame({"id": [101], "department": [5],
                   "activity": [1], "weight": [1.2]})  # Create
df.filter(pl.col("department") == 5)                   # Read
df = df.with_columns(employed=pl.col("activity") == 1) # Update
df = df.filter(pl.col("weight").is_not_null())         # Delete
Session 4 - Data processing and analysis

Query Engines

Polars: group by and aggregate

  • Lazy: it builds a plan and waits
  • Nothing runs until collect()
  • The engine optimises the whole query first
(pl.scan_parquet("geih-spine/**/*.parquet")  # lazy, reads nothing yet
   .filter(pl.col("activity") == 1)           # filter
   .group_by("department")                    # group by key
   .agg(pl.len().alias("employed"),
        pl.col("weight").sum().alias("weighted_sum"))
   .collect())                                # run the plan now
Session 4 - Data processing and analysis

Query Engines

One question, four engines, one answer. MapReduce shows the pattern. Pandas, DuckDB, and Polars implement it for us. Which is fastest depends on the data and the machine, so we measure it in the notebook, not on a slide.

Session 4 - Data processing and analysis

Governance

Session 4 - Data processing and analysis

Governance

Aggregating by department is safe: each cell holds thousands of people. But finer results leak. A table of department x age can leave cells with just a few people, and those few can be re-identified.

Session 4 - Data processing and analysis

Governance

k = 5 suppression

  • From lecture 2: hide any cell with fewer than k people
  • Simple, easy to explain to a data owner
  • Cost: small cells disappear, we lose fine detail

Differential privacy

  • From lecture 3: publish every cell, but add calibrated noise
  • A guarantee that no single person changes the result much
  • Cost: every number is slightly wrong, tuned by epsilon
Session 4 - Data processing and analysis

Governance

Suppression VS noise is a real choice, not a default. Hide the small cells, or keep them all but blur every count. Your group decides which fits the project, and writes down why.

Session 4 - Data processing and analysis

Assess and Address

Session 4 - Data processing and analysis

Data Science Methodology

Running pandas, DuckDB, or Polars is not the end of processing. It is how you reach the data. After that you still assess what you got and address the decision question.

Session 4 - Data processing and analysis

Data Science Methodology

Data Science Process
Session 4 - Data processing and analysis

Data Assess

Understand and trust the query result

  • Profile: row counts, missing weights, duplicate keys
  • Validate rules: does activity == 1 match your employ definition?
  • Check coverage: all months / departments present?
  • Governance: quasi-identifiers still in stored files?

Data cleaning

  • Drop or impute missing expansion factors with justification
  • Resolve inconsistent codes across years
  • Document every transform in the notebook or manifest
Session 4 - Data processing and analysis

Data Address

Use the data to answer the policy question

  • Aggregates: weighted employment by department
  • Artifacts: tables, charts, maps for stakeholders
  • Decisions: what we can claim and what we cannot

Address is not only machine learning. Many big-data projects stop at reliable, governed aggregates and indicators.

Later in the course some groups may add models (L6). The pipeline still flows:

Access (ingest) → Assess (quality) → Address (evidence for a decision).

Week 2 group notebook: build the lakehouse, query it, assess outputs, address with a publishable aggregate under k-anonymity rules.

Session 4 - Data processing and analysis

Data Architecture

Session 4 - Data processing and analysis

Data Architecture

Data architecture shows how data moves from sources through storage and processing to the people and tasks that use it — assess and address.

Session 4 - Data processing and analysis

Data Architecture

Typical layers

  • Sources: GEIH CSV, OSM, other APIs
  • Storage format: Parquet, partitions, lakehouse path
  • Processing: harmonise, query engines, MapReduce
  • Consumers: quality checks (assess), analytics & reports (address)

Last week you wrote requirements. This week you sketch architecture — how your group will implement those requirements.

Big data pipeline template
Session 4 - Data processing and analysis

Data Architecture

Reference design (research ICU data platform)

aICU access architecture: sources, harmonisation, query engine, consumers
Session 4 - Data processing and analysis

Conclusions

Session 4 - Data processing and analysis

Conclusions

  • Processing is a pipeline that produces answers
  • Streams pushed compute across many machines
  • MapReduce is the mental model: map, group, reduce
  • Big data engines
  • Governance
  • Assess and address
  • Data architecture
aICU access architecture: sources, harmonisation, query engine, consumers
Session 4 - Data processing and analysis

This Week Notebooks

Session 4 - Data processing and analysis

Many Thanks!

chc79@cam.ac.uk

_script: true

This script will only execute in HTML slides

_script: true