Data ingestion and workflow

Description: Lecture 5 introduces batch and stream ingestion. Once our data is harmonised in a lakehouse, we need to create data artefacts and views that feed our analytic tasks. These artefacts can be built offline on a schedule (batch) or continuously as data arrives (streaming), depending on the nature of the data. The lecture covers both approaches and the production platforms and tools that support them.
Department: Departamento de Matemáticas y Estadística - Facultad de Ciencias Exactas y Naturales
Institution: Universidad de Nariño
Date: June 20, 2026
Hours: 5
From: 07:00 am
To: 01:00 pm
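
As a taste of the batch versus streaming distinction in the description, here is a minimal sketch in plain Python. It is not taken from the lecture material: the `Event` record shape, the function names, and the sample readings are all hypothetical, and a production pipeline would rely on platforms such as those covered in the lecture rather than in-memory lists.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical record shape: (sensor_id, reading). Not from the course material.
Event = Tuple[str, float]


def batch_totals(events: List[Event]) -> Dict[str, float]:
    """Batch: the full dataset is already stored, so one scheduled pass builds the view."""
    totals: Dict[str, float] = defaultdict(float)
    for key, value in events:
        totals[key] += value
    return dict(totals)


def stream_totals(events: Iterable[Event]) -> Iterator[Dict[str, float]]:
    """Streaming: keep running state and emit an updated view after every event."""
    totals: Dict[str, float] = defaultdict(float)
    for key, value in events:
        totals[key] += value
        yield dict(totals)  # snapshot of the view as of this event


# Illustrative data only.
events: List[Event] = [("a", 1.0), ("b", 2.0), ("a", 3.0)]

batch_view = batch_totals(events)                 # computed once, after all data has landed
stream_views = list(stream_totals(iter(events)))  # one view per arriving event

print(batch_view)
print(stream_views[-1])
```

Both paths converge on the same totals once every event has been seen. The difference is when a usable view exists: after the scheduled run for batch, after each event for streaming.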

HTML slides | Notebook - Individual | Notebook - Group | Back to course

Week 3 links

Week 3 hub

Resources

Introductory Python course (optional video)
Apache Kafka documentation (reference)

References

Zaharia, M., et al. (2016). Apache Spark: a unified engine for big data processing. Communications of the ACM, 59(11), 56–65.
Jarrahi, M. H., et al. (2023). The Principles of Data-Centric AI. (Course PDF.)
Kreps, J., Narkhede, N., & Rao, J. (2011). Kafka: a distributed messaging system for log processing. (Recommended async.)