Session 2 - Ethics, privacy, and foundations of data governance

Ethics, privacy, and foundations of data governance

Christian Cabrera Jojoa

Assistant Research Professor

Department of Computer Science and Technology

University of Cambridge

chc79@cam.ac.uk

Last Time

Context and people

Big data pipeline

Big Data Requirements

  • Scalability: Elastic compute and storage at scale
  • Availability: Fault tolerance and replication
  • Latency: Network bandwidth and distributed execution
  • Security: Identity and access control
  • Interoperability: Service interfaces (APIs) for clients and pipelines
  • Ethics, privacy, and fairness
Client-server cloud setup

Apache Spark
Big Data tools — via Medium
  • Distributed file systems and batch engines (Hadoop era)
  • In-memory and stream processing (Spark, Flink, Kafka)
  • Cloud object storage, warehouses, and lakehouse platforms
  • Orchestration, catalogues, and observability tools

Most investment targets scale and speed...

Data governance is the formulation of policy to optimize, secure, and leverage information as an enterprise asset by aligning the objectives of multiple functions. (Khatri & Brown, 2010)


Existing efforts include GDPR-style regulation, public-sector frameworks such as the UK Data and AI Ethics Framework (transparency, accountability, fairness, privacy, trade-offs), data catalogues, lineage tracking, and identity and access management (IAM). These efforts are still less mature than the technical stack and are often reactive, arriving after harm has occurred.

Data governance flows
Data flow diagram example, CC BY-SA 3.0, via Wikimedia Commons

Mark Zuckerberg stands and faces the audience as he testifies during the Senate hearing on online child sexual exploitation at the US Capitol in Washington DC. Photograph: Evelyn Hockstein/Reuters

Governance after consequences

  • Public hearings follow harm, leaks, and election controversies
  • Policy and platform changes arrive late in the pipeline
  • Reactive fixes differ from designing requirements in from the start
  • Frameworks such as the UK Data and AI Ethics Framework aim to bridge principles and practice earlier

"Surveillance capitalism unilaterally claims human experience as free raw material for translation into behavioral data." (Zuboff, 2019)

Problem first, not ethics last

  • Ethics is often taught as a final checklist before deployment
  • In this course we place ethics, privacy, and fairness at the start of the pipeline
  • Next: define what these terms mean and where they apply
ML System?
https://xkcd.com/1838/, CC BY-NC 2.5, via XKCD

Pipeline template

Ethical Requirements

Universal Declaration of Human Rights (UDHR)

"We put forth the doctrine of universal human rights as a globally salient set of values for explicit value alignment in responsible AI." (Prabhakaran et al., 2022)

People affected by data systems

First-order challenges — two types

Afroogh et al. (2024) split the value challenges facing AI systems into two types:

  • Ethical–legal: fairness, privacy, accountability — moral and legal rights and duties
  • Multi-dimensional: trust, transparency, explainability — social acceptance, intelligibility, and confidence in the system

Both types are first-order: they shape how data projects affect people directly.

Scale of justice
Scale of justice, CC BY-SA 3.0, via Wikimedia Commons

Map of the debate (Mittelstadt et al., 2016)

  • Discrimination and unfair treatment of groups
  • Autonomy and loss of meaningful choice
  • Non-transparency and opaque decisions
  • Privacy and unwanted exposure
  • Plus procedural, psychological, and economic harms
Accountability and law
Gavel icon, public domain, via Wikimedia Commons

Trade-offs: decisions are not easy

First-order values can pull in opposite directions. For example, when responding to a disaster, tracing phones may help rescue efforts reach people fairly but challenges their privacy. (Afroogh et al., 2024)


Public-sector guidance asks teams to document tensions, consult human-rights frameworks, and revisit choices as the project evolves. (UK Data and AI Ethics Framework)

Privacy
Lock icon, CC BY-SA 4.0, via Wikimedia Commons

Privacy (ethical–legal)

Control over appropriate flows of personal information in context. Not only secrecy, but who receives what, when, and for what purpose.

(Nissenbaum, 2004)

UDHR Art. 12 (Prabhakaran et al., 2022)

Fairness

Fairness (ethical–legal)

Just and equitable treatment across groups. In algorithms, asks whether similar people are treated similarly and whether historical bias is reproduced.

Linked to non-discrimination (UDHR) and justice in (Mittelstadt et al., 2016)
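
One common way to make "treated similarly across groups" measurable is to compare the rate of favourable decisions between groups, often called demographic parity. The sketch below is only illustrative: the records, the group labels, and the `selection_rates` helper are hypothetical and do not come from the lecture or from any survey data.

```python
from collections import defaultdict


def selection_rates(records):
    """Return the share of positive decisions per group.

    `records` is an iterable of (group, decision) pairs, with decision 0 or 1.
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, decision in records:
        totals[group] += 1
        positives[group] += decision
    return {group: positives[group] / totals[group] for group in totals}


# Made-up decisions for two hypothetical groups.
records = [
    ("A", 1), ("A", 1), ("A", 0), ("A", 1),
    ("B", 1), ("B", 0), ("B", 0), ("B", 0),
]

rates = selection_rates(records)
# Demographic parity gap: difference between the highest and lowest rate.
parity_gap = max(rates.values()) - min(rates.values())
print(rates, parity_gap)
```

A gap of zero satisfies only this one notion of fairness. Other definitions exist and can conflict with it, so the metric is a starting point for discussion rather than a verdict.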

Accountability (ethical–legal)

Clear answer to "who is responsible for the way the system works and for harm it causes?" Includes documentation, audit trails, and remedial action.

(Prabhakaran et al., 2022)

Accountability

Trust between people and systems

Trust (multi-dimensional)

Confidence that a data system is competent, honest, and used for legitimate purposes. Distrust grows with opaque systems, security failures, and unfair outcomes. Linked to privacy and fairness in (Afroogh et al., 2024)

Transparency (multi-dimensional)

Making methods, data limits, and uncertainty visible to stakeholders. Can support accountability, but full openness may conflict with privacy.

Harm type "non-transparency" in (Mittelstadt et al., 2016)

Transparency
Globe icon, CC BY-SA 3.0, via Wikimedia Commons

Explainable system

Explainability (multi-dimensional)

Ability to describe how a system reached an output. Needed when "black box" systems affect high-stakes decisions and when auditors or affected people must understand limits.

Closely tied to transparency and accountability in (Afroogh et al., 2024)

Consent
Signature icon, CC BY-SA 3.0, via Wikimedia Commons

Consent (in practice)

Meaningful agreement to how data is collected and used. Secondary uses, such as derived statistics and linked map data, rarely re-ask each respondent for consent. Autonomy harm in (Mittelstadt et al., 2016). Purpose limitation in (UK Data and AI Ethics Framework).

Ethics in the Pipeline

Where should we consider ethical requirements?

Only at the end, before the decision brief? Or at every stage?

Transversal

Ethics, privacy, and fairness apply across the whole pipeline, not only in analytics or the brief.

Pipeline with governance thread

Data science process

Data Assess

After collecting the data (i.e., data access), we need to perform a data assessment process to understand the data, identify and mitigate data quality issues, uncover patterns, and gain insights.

Data Assess Pipeline

Data Cleaning

Process of detecting and correcting (or removing) corrupt or inaccurate records.
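
As a rough illustration of detecting and correcting bad records, the sketch below removes exact duplicates and marks implausible ages as missing. The table, the column names `person_id` and `age`, and the 0 to 120 plausibility range are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical raw records (made-up values, not survey data).
raw = pd.DataFrame({
    "person_id": [1, 2, 2, 3, 4],
    "age": [34, 51, 51, -5, 130],
})

# 1. Remove exact duplicate rows.
clean = raw.drop_duplicates().copy()

# 2. Mark ages outside an assumed plausible range as missing rather than
#    deleting the row, so the amount of bad data stays visible.
valid_age = clean["age"].between(0, 120)
clean["age"] = clean["age"].where(valid_age)

# Keep simple counts of what was changed, for the audit trail.
n_duplicates = len(raw) - len(raw.drop_duplicates())
n_invalid_age = int((~valid_age).sum())
print(clean)
print(n_duplicates, n_invalid_age)
```

Recording how many rows were dropped or altered, and why, supports the accountability requirement discussed earlier.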

Data Preprocessing

Process of transforming raw data into a format suitable for machine learning while ensuring data quality and consistency.
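
Two frequent preprocessing steps are standardising numeric columns and one-hot encoding categorical ones. The following is a minimal sketch with a made-up table; the names `df_raw`, `income`, and `area` are hypothetical.

```python
import pandas as pd

# Hypothetical raw table (made-up values).
df_raw = pd.DataFrame({
    "income": [100.0, 200.0, 300.0],
    "area": ["urban", "rural", "urban"],
})

# Standardise the numeric column to zero mean and unit variance
# (population standard deviation, ddof=0).
mean = df_raw["income"].mean()
std = df_raw["income"].std(ddof=0)
df_raw["income_z"] = (df_raw["income"] - mean) / std

# One-hot encode the categorical column.
features = pd.get_dummies(df_raw[["income_z", "area"]], columns=["area"])
print(features)
```

In a real project the mean and standard deviation would usually be computed on the training split only and then reused for new data.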

Data Augmentation

Process of increasing the size and diversity of our datasets.

Feature Engineering

Process of creating, transforming, and selecting features in our data, combining domain knowledge and creativity.

Data quality carries ethical requirements

  • Missing, biased, or mis-labelled data can harm groups silently
  • Assess is where we inspect structure, coverage, and risk before analytics
  • Veracity is a Big Data dimension with direct policy consequences
Data assess pipeline
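
One way to check whether missing data affects some groups more than others is to compute the missing rate per group next to each group's share of the sample. This is a sketch on made-up data; the `survey` table and its `area` and `income` columns are hypothetical.

```python
import pandas as pd

# Hypothetical survey extract (made-up values).
survey = pd.DataFrame({
    "area": ["urban", "urban", "urban", "rural", "rural", "rural"],
    "income": [1200.0, 900.0, 1500.0, None, 700.0, None],
})

# Share of missing income values within each group.
missing_rate = survey["income"].isna().groupby(survey["area"]).mean()

# Share of the sample that each group represents (coverage).
coverage = survey["area"].value_counts(normalize=True)

print(missing_rate)
print(coverage)
```

Here both groups are equally represented, yet two thirds of the rural income values are missing. An overall missing rate would hide that difference.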

Quasi-identifiers in GEIH

  • Department, age, sex, education, occupation combinations
  • List sensitive columns before any publishable aggregate
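import pandas as pd

# Candidate quasi-identifier columns; df is the GEIH survey DataFrame loaded earlier.
CANDIDATOS_QI = ["DPTO", "MPIO", "AREA", "P6040", "P6240", "P6160"]
# Keep only the candidates that exist in this extract.
presentes = [c for c in CANDIDATOS_QI if c in df.columns]
# Distinct values per column: columns with many values single out respondents more easily.
tabla_qi = pd.DataFrame({
    "columna": presentes,
    "valores_unicos": [df[c].nunique() for c in presentes],
})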
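
Counting distinct values per column does not show how risky the columns are in combination. A common complementary check is k-anonymity: group by all quasi-identifiers together and look at the smallest group. The sketch below uses a made-up `toy` table that reuses three of the column names above; the values are invented for illustration.

```python
import pandas as pd

# Made-up records using three of the quasi-identifier column names.
toy = pd.DataFrame({
    "DPTO": ["05", "05", "05", "11", "11"],
    "AREA": ["1", "1", "1", "1", "2"],
    "P6040": [30, 30, 30, 45, 45],
})

qi = ["DPTO", "AREA", "P6040"]

# Number of rows sharing each combination of quasi-identifier values.
group_sizes = toy.groupby(qi).size()

# The table is k-anonymous for k equal to the smallest group size.
k = int(group_sizes.min())

# Combinations that identify exactly one row.
n_unique_rows = int((group_sizes == 1).sum())
print(group_sizes)
print(k, n_unique_rows)
```

A result of k = 1 means at least one respondent is unique on these columns and could be singled out if the combination were published.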

Disclosure preview

  • Count rows in small department × age cells
  • Alert when a cell has very few respondents
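# Work on a copy with only department (DPTO) and age (P6040).
work = df[["DPTO", "P6040"]].copy()
# include_lowest=True keeps age 0 in the first band instead of turning it into NaN.
work["banda_edad"] = pd.cut(work["P6040"], bins=[0, 17, 29, 44, 59, 120], include_lowest=True)
# Respondents per non-empty department × age-band cell.
celdas = work.groupby(["DPTO", "banda_edad"], observed=True).size()
# Smallest cell: a very low value signals disclosure risk.
celda_min = celdas.min()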
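
The alert mentioned in the bullets can be sketched as a threshold check on the cell counts. The threshold `MIN_CELL_SIZE = 5`, the `toy` table, and the names `celdas_pequenas` and `publicable` are assumptions for illustration; the appropriate threshold depends on the disclosure policy of whoever publishes the data.

```python
import pandas as pd

MIN_CELL_SIZE = 5  # hypothetical threshold, not an official rule

# Made-up respondents: six in one department, two in another.
toy = pd.DataFrame({
    "DPTO": ["05"] * 6 + ["11"] * 2,
    "P6040": [20, 22, 25, 27, 28, 29, 70, 75],
})
toy["banda_edad"] = pd.cut(
    toy["P6040"], bins=[0, 17, 29, 44, 59, 120], include_lowest=True
)

# Respondents per non-empty department x age-band cell.
celdas = toy.groupby(["DPTO", "banda_edad"], observed=True).size()

# Cells below the threshold trigger the alert.
celdas_pequenas = celdas[celdas < MIN_CELL_SIZE]
if not celdas_pequenas.empty:
    print(f"Alert: {len(celdas_pequenas)} cell(s) below {MIN_CELL_SIZE} respondents")

# Suppress small cells (set to missing) before any publishable aggregate.
publicable = celdas.where(celdas >= MIN_CELL_SIZE)
print(publicable)
```

Suppressing a cell is only one option. Merging age bands or departments until every cell reaches the threshold keeps more information in the published table.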

Join survey and map data

  • Overpass query for POIs by department
  • Compare with GEIH sample counts
  • State limits: coverage, consent, ecological fallacy
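# Overpass QL: fetch school nodes inside one OSM area (area ID kept as on the slide).
# Only nodes are queried; schools mapped as ways or relations are not counted.
query = """
[out:json][timeout:60];
area(3601380130)->.a;
node["amenity"="school"](area.a);
out;
"""
# overpass_post is assumed to be a helper defined in the course notebook that POSTs the query.
poi_count = len(overpass_post(query).json()["elements"])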
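
To compare POI counts with survey sample counts, one option is an outer join on the department code with a merge indicator, so that departments present in only one source become visible. The tables, counts, and column names other than `DPTO` below are made up for illustration.

```python
import pandas as pd

# Hypothetical survey sample counts per department (made-up numbers).
sample_counts = pd.DataFrame({
    "DPTO": ["05", "11", "91"],
    "n_respondents": [1200, 2500, 40],
})

# Hypothetical POI counts per department from map data (made-up numbers).
poi_counts = pd.DataFrame({
    "DPTO": ["05", "11", "99"],
    "n_schools": [300, 450, 12],
})

# Outer join keeps departments that appear in only one source;
# indicator=True adds a _merge column saying where each row came from.
comparison = sample_counts.merge(poi_counts, on="DPTO", how="outer", indicator=True)

# Departments without a match are coverage gaps to state as limits.
unmatched = comparison[comparison["_merge"] != "both"]
print(comparison)
print(unmatched)
```

Even for matched departments, the joined table describes areas and not individuals, so conclusions about respondents drawn from department-level POI counts risk the ecological fallacy named above.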

Data science process

Big data pipeline

Conclusions

Lecture 2

  • Ethics as a crucial requirement
  • Human rights at the center
  • Transversal governance
  • Data Assess

Lecture 3

  • Data formats
  • Data structures
  • Storage

This Week Notebooks

Week 1 Deliverables

Group zip by Monday

  • Executed week 1 group notebook
  • manifest.json
  • Map points by department file

Individual reflection by Tuesday

  • Covers Lecture 1 and Lecture 2 prompts (R1–R5)
  • Includes ungraded week feedback (R6)
  • Use the week 1 reflection template on the course page

Saturday 6 June, 07:00 to 13:00 Colombia time

  • Shared block for Lecture 1 and Lecture 2
  • Reading discussion, group notebooks, project clinic, plenary
  • Full timetable: Week 1 session plan (PDF)

Many Thanks!

chc79@cam.ac.uk
