Session 5 - Data ingestion and workflow

Batch Processing

A Prefect flow

Each @task is one ETL step
The @flow chains the tasks in order
Prefect logs each run and can retry failed tasks when retries are configured
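
To make "logs each run and can retry" concrete, here is a minimal standard-library sketch of what a task decorator with retries roughly does. It is not Prefect's implementation: `toy_task`, `toy_flow`, `flaky_extract` and the partition file names are hypothetical names used only for illustration.

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("toy_orchestrator")


def toy_task(retries=0, delay_seconds=0.0):
    """Hypothetical stand-in for an orchestrator's task decorator."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            attempts = retries + 1  # first try plus the allowed retries
            for attempt in range(1, attempts + 1):
                logger.info("%s: attempt %d/%d", fn.__name__, attempt, attempts)
                try:
                    result = fn(*args, **kwargs)
                except Exception as exc:
                    logger.warning("%s: attempt %d failed: %s",
                                   fn.__name__, attempt, exc)
                    if attempt == attempts:
                        raise  # out of retries: surface the error
                    time.sleep(delay_seconds)
                else:
                    logger.info("%s: completed", fn.__name__)
                    return result
        return wrapper
    return decorator


calls = {"count": 0}  # counts how often the flaky step was invoked


@toy_task(retries=2)
def flaky_extract():
    """Fails twice, then succeeds, to exercise the retry path."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("source temporarily unavailable")
    return ["part-0.parquet", "part-1.parquet"]


def toy_flow():
    """Chains the steps; here there is only one step."""
    return flaky_extract()
```

Calling `toy_flow()` logs two failed attempts and one successful one. A real orchestrator adds run history, scheduling and a UI on top of this basic loop.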

from polars import read_parquet, scan_parquet   # assumed: these helpers come from Polars
from prefect import flow, task

@task
def extract(paths):              # Extract: open partitions (lazy)
    return scan_parquet(paths)

@task
def aggregate(frame, output):    # Transform + Load: write curated layer
    result = (frame.filter(...)
                   .group_by(keys).agg(...).collect())
    result.write_parquet(output)
    return output

@task
def validate(path):              # Validate: the output must have rows
    assert read_parquet(path).height > 0
    return path

@flow(name="batch_ingest")
def batch_flow(paths, output):   # one flow chains the steps
    path = aggregate(extract(paths), output)
    return validate(path)

Last Time

Data Ingestion

Batch and ETL

Batch Processing

Streams

Conclusions

This Week's Notebooks

Many Thanks!