Session 3 - Data storage and management

Data storage and management

Christian Cabrera Jojoa

Assistant Research Professor

Department of Computer Science and Technology

University of Cambridge

chc79@cam.ac.uk

Session 3 - Data storage and management

Last Time

Session 3 - Data storage and management

Last Time

Problem first, not tool first

  • We support public policy decisions in Colombia
  • We started from the question, not from a platform logo
  • We accessed real data, the Gran Encuesta Integrada de Hogares (GEIH) and OpenStreetMap
Context and people
Session 3 - Data storage and management

Last Time

Ethics, privacy, and fairness at the start

  • We audited GEIH for harm, consent, and re-identification risk
  • Quasi-identifiers and k-anonymity framed how we may publish
  • Governance is a requirement, not a final checklist

"Big Data is a cultural, technological, and scholarly phenomenon that rests on the interplay of technology, analysis, and mythology." (boyd & Crawford, 2012)

Session 3 - Data storage and management

Last Time

Big data pipeline
Session 3 - Data storage and management

Last Time

Your words from the week 1 discussion
Week 1 discussion word cloud
Session 3 - Data storage and management

The Need to Store

Session 3 - Data storage and management

The Need to Store

Where does the data live after the program ends?

Session 3 - Data storage and management

The Need to Store

Raw files pile up fast

  • One data file per period, hundreds of MB each (for GEIH, one ZIP per month)
  • Several tables to join on every run
  • Related sources to keep side by side

Reading and joining raw files on every run is slow, fragile, and hard to share with a teammate.

Data quality and usability
Session 3 - Data storage and management

The Need to Store

Storage is its own stage of the pipeline

  • Durability: data outlives the notebook session
  • Shareability: a team reads the same clean dataset
  • Scale: read one month without loading the whole year
  • Cost: compressed bytes are cheaper to keep and move
Pipeline template, store stage
Session 3 - Data storage and management

The Need to Store

How we store data has changed with every era of computing. The right choice depends on who writes, who reads, and what they ask. To understand today's formats, we follow the history that produced them.

Session 3 - Data storage and management

Systems and the Web Evolve

Early information system

Early information systems

  • Banks, governments, and universities recorded transactions
  • Data was structured, repeated, and had to stay consistent
  • One wrong balance or duplicated record was a real problem

The first need was not scale. It was correctness of structured records that many programs shared.

Session 3 - Data storage and management

Systems and the Web Evolve

The web changes who produces data

  • Web 1.0: few publishers, many readers, static pages
  • Web 2.0: everyone writes, social networks, comments, photos
  • Web 3.0: linked and machine-readable data, maps, sensors
First web server at CERN
The NeXT computer used by Tim Berners-Lee as the first web server. Photo by Coolcaesar, CC BY-SA 3.0, via Wikimedia Commons
Session 3 - Data storage and management

Systems and the Web Evolve

From structured to messy

  • Tables of transactions still exist and still matter
  • But now we also store text, images, clickstreams, and maps
  • This is the variety dimension of big data from lecture 1

"Big Data consists of extensive datasets that require a scalable architecture for efficient storage, manipulation, and analysis because of data volume, variety, velocity, and/or variability." (NIST, 2015)

{
  "type": "node",
  "id": 1234567,
  "lat": 1.2136, "lon": -77.2811,
  "tags": {
    "amenity": "school",
    "name": "School 1"
  }
}
A semi-structured OpenStreetMap node. No fixed table, just nested keys and values.
Session 3 - Data storage and management

Systems and the Web Evolve

Different needs produced different databases. Next we look at the two big families that answer them, relational and non-relational.

Session 3 - Data storage and management

Databases

Session 3 - Data storage and management

Relational Databases

id_householdorderage
A-1140
A-1212
B-3129

Rows, tables, and keys

  • Data lives in tables of rows and typed columns
  • Tables link through keys, no value is repeated needlessly
  • We query with SQL, a declarative language

As an example, GEIH is relational in spirit. A household table and a person table share the keys DIRECTORIO, HOGAR, and ORDEN.

Session 3 - Data storage and management

Relational Databases

Labour table

id_householdorderactivity
A-111
A-122

Demographics table

id_householdorderage
A-1140
A-1212
Session 3 - Data storage and management

Relational Databases

Why institutions trusted them: ACID

  • Atomicity: a transaction happens fully or not at all
  • Consistency: the data always satisfies its rules
  • Isolation: concurrent users do not corrupt each other
  • Durability: a committed change survives a crash

This is the world of OLTP, online transaction processing. Many small reads and writes that must never lose money or break a rule.

Session 3 - Data storage and management

Relational Databases

Relational systems are excellent when data is structured and consistency is sacred. The web then produced data that did not fit neat tables, and that broke some of these assumptions.

Session 3 - Data storage and management

Non-Relational Databases

When tables stop fitting

  • Social and web data is huge, sparse, and changes shape
  • One global table on one machine cannot keep up
  • NoSQL stores relax the strict relational model to scale out
Non-relational database families
Session 3 - Data storage and management

Non-Relational Databases

Four common families

  • Key-value: a giant dictionary, fast lookups (Redis, DynamoDB)
  • Document: nested JSON per record (MongoDB)
  • Column-family: wide sparse rows at scale (Cassandra, HBase)
  • Graph: nodes and edges for relationships (Neo4j)

Each family optimises for a shape of data and a pattern of access. None is a universal replacement for the others.

Non-relational database families
Session 3 - Data storage and management

Non-Relational Databases

Distributed, decentralised storage

The CAP trade-off

Once data is spread across many machines, a store cannot fully guarantee all three at once.

  • Consistency: every read sees the latest write
  • Availability: every request gets an answer
  • Partition tolerance: it survives a broken network

When the network can split, you must choose between consistency and availability. (Gilbert & Lynch, 2002) Banks lean to consistency; a social feed leans to availability.

Session 3 - Data storage and management

Non-Relational Databases

Relational or not, every store makes one core bet about read VS write. That bet is what decides the format we use for GEIH.

Session 3 - Data storage and management

Read VS Write

Two opposite jobs

  • OLTP: many small writes, read whole records, one at a time
  • OLAP: few big reads, scan one column over millions of rows

Our GEIH question is OLAP. We ask for employment by department, not for one person's full record.

The layout that is fast to write a row is slow to scan a column, and the reverse.

Session 3 - Data storage and management

Read VS Write

Row versus columnar storage layout
Session 3 - Data storage and management

Read VS Write

Why columns compress so well

  • Values in one column share a type and look alike
  • Dictionary encoding: store each distinct value once, then small codes
  • Run-length encoding: store a repeated value as value times count

A gender column of millions of 1s and 2s collapses to almost nothing. Smaller bytes mean faster reads and cheaper storage.

Session 3 - Data storage and management

Big Data Storage

Session 3 - Data storage and management

Big Data Storage History

One web, too much data for one machine

By the early 2000s a single server could not hold or serve the web. Google had to store and index the whole internet, so it built its own stack.

Session 3 - Data storage and management

Big Data Storage History

Google File System (2003)

  • Spread one huge file across many cheap machines
  • Replicate blocks so failure is normal, not fatal
  • Optimised for large streaming reads, not tiny edits

"Component failures are the norm rather than the exception." (Ghemawat et al., 2003)

The open-source version of this idea became Hadoop HDFS.

Distributed, decentralised storage
Session 3 - Data storage and management

Big Data Storage History

Bigtable (2006)

  • A distributed table for billions of rows
  • Sparse, wide, and sorted by key
  • Inspired Cassandra and HBase

"A Bigtable is a sparse, distributed, persistent multidimensional sorted map." (Chang et al., 2008)

Non-relational database families
Session 3 - Data storage and management

Big Data Storage History

Dremel (2010) leads to Parquet

  • Columnar storage for fast analytics at web scale
  • Scan a few columns over trillions of rows in seconds
  • Its column format inspired Apache Parquet, our format

"Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data." (Melnik et al., 2010)

Non-relational database families
Session 3 - Data storage and management

Big Data Storage History

We will use Parquet in our practical, the open descendant of this lineage. Getting messy sources into that format is an ETL job. Extract, transform, load.

Session 3 - Data storage and management

ETL

Extract, Transform, Load

  • Extract: get the data from its source
  • Transform: clean, join, rename, fix types
  • Load: write it where analysts can use it

ETL is the classic recipe that moves data from messy sources into a clean store. Storing and processing are both part of it.

Pipeline template
Session 3 - Data storage and management

ETL

Our course pipeline is ETL

  • Extract: we crawled DANE for GEIH in week 1
  • Transform: we harmonise the monthly tables today
  • Load: we write partitioned Parquet to Drive
Pipeline template
Session 3 - Data storage and management

ETL

ETL versus ELT

ETL or ELT? Warehouses transform before load, so data lands clean. Lakehouses load raw and transform at query time, when an engine reads it.

Session 3 - Data storage and management

Partitioning

Partition tree by key

Split the dataset by a key

  • Store each slice in its own folder named key=value
  • For example, we can split by year and month
  • The folder name is data, not just a label

A query for one month reads only that folder and skips the rest. This is called partition pruning.

Session 3 - Data storage and management

Partitioning

Hive-style partition tree
Session 3 - Data storage and management

Partitioning

In the notebook

  • One file per month under year and month
  • Written once, read many times
  • Lecture 3 notebook
part_dir = PROCESSED_DIR / f"year={y}" / f"month={m:02d}"
part_file = part_dir / "part-000.parquet"

part_dir.mkdir(parents=True, exist_ok=True)
sample.to_parquet(part_file, index=False)
Session 3 - Data storage and management

Harmonisation

raw codereadable
P6040age
P3271gender
P6240activity
FEX_C18factor_expansion

Make the data comparable and readable

  • Join the labour and demographics tables per person
  • Rename source codes to readable names
  • Fix types, text to numbers where needed
  • Derive the partition keys, year and month

The goal is a clean table that any teammate can read without the source codebook open.

Session 3 - Data storage and management

Harmonisation

DANE code

P6040age
P3271gender
P6240activity
FEX_C18factor_expansion

Readable name

The same value, now with a name a person understands. We keep actividad as the raw code and interpret it later, during processing.

Session 3 - Data storage and management

Lakehouse

Data lake, warehouse, and lakehouse
Session 3 - Data storage and management

Lakehouse

"A lakehouse is a data management system based on low-cost and directly accessible storage that also provides traditional analytical DBMS management features." (Armbrust et al., 2021)

Session 3 - Data storage and management

Conclusions

Session 3 - Data storage and management

Conclusions

  • Storage is a pipeline stage with its own choices
  • Relational and NoSQL answer different needs
  • Read VS write decides row or columnar layout
  • Parquet stores columns, compresses, and reads fast
  • Partitioning and harmonisation make the data usable
  • The lakehouse keeps lake flexibility with warehouse guarantees
Pipeline template
Session 3 - Data storage and management

This Week Notebooks

Session 3 - Data storage and management

Many Thanks!

chc79@cam.ac.uk

_script: true

This script will only execute in HTML slides

_script: true