Session 4 - Data processing and analysis

MapReduce in R

By hand: one file at a time

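Map: read one partition and keep only the employed rows (actividad == 1)
Reduce: group_by(dpto) within each file, emitting a count and a sum
lapply over files is the map step: the same function runs on every partition independently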

Same pattern as the Python notebook, in R syntax.

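library(arrow)   # read_parquet()
library(dplyr)   # group_by(), summarise(), bind_rows()

# `files` is assumed to be a character vector of Parquet partition paths.

# Map: read only the needed columns and keep employed rows (actividad == 1).
map_partition <- function(path) {
  df <- read_parquet(path,
    col_select = c("dpto", "actividad", "factor_expansion"))
  # which() drops rows where actividad is NA instead of returning NA rows
  df[which(df$actividad == 1), c("dpto", "factor_expansion")]  # emit
}

# Reduce: per-file count and sum of factor_expansion by dpto.
reduce_part <- function(emp) {
  emp %>% group_by(dpto) %>%
    summarise(count = n(), sum = sum(factor_expansion))
}

parts  <- lapply(files, \(f) reduce_part(map_partition(f)))
result <- bind_rows(parts)   # one row per (file, dpto); then average by dpto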
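
The final "average by dpto" step can be sketched as follows. This is a minimal illustration, not the notebook's code: `part_a`, `part_b`, `combine_parts`, and the output column names are hypothetical, and the toy values stand in for what `reduce_part()` would return on two files. It assumes the goal is the mean of factor_expansion per dpto. Emitting both a count and a sum per file allows the overall mean to be recovered exactly. Averaging the per-file means would weight small and large files equally.

```r
library(dplyr)

# Hypothetical partial results: one tibble per file, as reduce_part() would emit.
part_a <- tibble(dpto = c(1, 2), count = c(2L, 1L), sum = c(30, 5))
part_b <- tibble(dpto = c(1, 3), count = c(1L, 4L), sum = c(15, 40))

# Final reduce: add up counts and sums across files, then divide once.
combine_parts <- function(parts) {
  bind_rows(parts) %>%
    group_by(dpto) %>%
    summarise(n_rows = sum(count), total = sum(sum)) %>%
    mutate(mean_factor = total / n_rows)
}

result <- combine_parts(list(part_a, part_b))
print(result)
```

Because counts and sums are both additive, the same function works no matter how many partitions feed it or in what order they arrive.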

Data processing and analysis

Last Time

The Need to Process

Big Data Processing

MapReduce

MapReduce in R

Query Engines

Governance

Assess and Address

Data Science Methodology

Data Assess

Data Address

Data Architecture

Conclusions

This Week Notebooks

Many Thanks!