Session 6 - Analytics and visualisation

Analytics and visualisation

Christian Cabrera Jojoa

Assistant Research Professor

Department of Computer Science and Technology

University of Cambridge

chc79@cam.ac.uk

Last Time

Batch ingestion

  • Bounded data, processed on a schedule
  • ETL steps wired into an orchestrated flow
  • Governance travels with it: audit log, schema contract
ETL versus ELT

Stream ingestion

  • Unbounded events, processed as they arrive
  • Publish / subscribe decouples producers and consumers
  • A partitioned log gives parallelism and replay
Publish/subscribe streaming architecture

Analytics

Data Analytics

We can now move data in and store it. The next step is to turn it into evidence through visualisation, dashboards, and machine learning.

Big data pipeline, analytics stage

Analytics is its own stage of the pipeline

  • Visualise: charts and maps that humans can read
  • Dashboard: several views for one decision audience
  • Model: predict or summarise with limits and governance

From processing to decision

  • Assess: establish trust in the curated layer by checking its schema, coverage, and bias
  • Address: answer the decision question with labelled evidence
Data science methodology: Access, Assess, Address

Visualisation

A chart is an argument for a decision. Visualisation turns numbers into something a person can reason about.

Choose the chart for the question

  • Trend over time? a line chart
  • Compare categories? a bar chart
  • Relationship between two variables? a scatter plot
  • Distribution? a histogram or box plot
Normal distribution histogram and normal probability plot - Visnut, CC BY 3.0, via Wikimedia Commons

Honest charts

  • Start axes at a sensible baseline; do not exaggerate
  • Label units, the source, and the date
  • Show uncertainty when it matters

A misleading chart is a governance failure, not a style choice.

A dashboard brings several charts together for an audience. What you choose to show and what you hide is a governance decision.

Evidence layers

  • Official layer: the result we stand behind
  • Context layer: monitoring and exploration, clearly labelled not official

Governance controls

  • Different audiences see different panels
  • Suppress small groups to protect people
  • State the source and refresh date on every panel
Cities-Board
Cities-Board: A Framework to Automate the Development of Smart Cities Dashboards. (Rojas, 2020)
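
The "suppress small groups" control can be sketched with pandas. This is only an illustration under stated assumptions: the `district` and `employed` columns, the records themselves, and the `MIN_GROUP_SIZE` threshold are all hypothetical, and the real threshold is a governance decision rather than a technical one.

```python
import pandas as pd

MIN_GROUP_SIZE = 5  # hypothetical threshold, not taken from the slides

# Made-up individual-level records for three districts.
records = pd.DataFrame({
    "district": ["A"] * 12 + ["B"] * 3 + ["C"] * 7,
    "employed": [1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1,
                 1, 0, 1,
                 0, 1, 1, 0, 1, 1, 1],
})

# One row per district: group size and employment rate.
panel = (records.groupby("district")["employed"]
         .agg(n="size", rate="mean")
         .reset_index())

# Blank out the statistic for groups too small to publish safely.
panel["rate"] = panel["rate"].where(panel["n"] >= MIN_GROUP_SIZE)
panel["suppressed"] = panel["n"] < MIN_GROUP_SIZE
print(panel)
```

Keeping an explicit `suppressed` flag lets the panel say that a value was withheld, rather than silently showing a gap.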

Two layers, one figure

  • Each panel names its layer
  • The context panel is marked not official
  • The title carries the source
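# Assumes y_val (observed values), pred (predictions) and n_context
# (a count for the context panel) were computed in earlier steps.
from plotly.subplots import make_subplots
import plotly.graph_objects as go

# two stacked panels; each subplot title names its evidence layer
dash = make_subplots(rows=2, cols=1, subplot_titles=[
    "Official layer: observed vs predicted",
    "Context layer: monitoring (not official)",
])
# official layer: observed values against model predictions
dash.add_trace(go.Scatter(x=y_val, y=pred, mode="markers"),
               row=1, col=1)
# context layer: a single bar showing the n_context count
dash.add_trace(go.Bar(x=["events stored"], y=[n_context]),
               row=2, col=1)
dash.update_layout(
    title_text="Evidence dashboard - official + context",
)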

Data Address

Data science methodology: Access, Assess, Address

After assessing the data (the Assess stage), we use it to address the problem in question. This stage includes applying a machine learning algorithm to train a machine learning model.

Data Assess Pipeline

A Machine Learning algorithm is a set of instructions that are used to train a machine learning model. It defines how the model learns from data and makes predictions or decisions. Linear regression, decision trees, and neural networks are examples of machine learning algorithms.

ML Algorithm

A Machine Learning model is a program that is trained on a dataset and used to make predictions or decisions. The goal is to create a trained model that can generalise well to new, unseen data. For example, a trained model could predict house prices based on new input features.

ML Model

A machine learning algorithm runs a training process that goes from a specific set of observations to a general rule (i.e., induction). This process adjusts the model's internal parameters to minimise prediction errors. For example, in a linear regression model, the algorithm adjusts the slope and intercept to minimise the difference between predicted and actual values.

Training Process
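
As a minimal sketch of that induction step, the snippet below fits a slope and intercept to a handful of observations with NumPy's least-squares polynomial fit. The floor areas and prices are made up for illustration.

```python
import numpy as np

# Made-up observations: floor area (m^2) and sale price (thousands).
area = np.array([50.0, 70.0, 90.0, 110.0, 130.0])
price = np.array([150.0, 200.0, 255.0, 300.0, 350.0])

# Least-squares fit of a straight line: price ~ slope * area + intercept.
slope, intercept = np.polyfit(area, price, deg=1)

# Apply the learned rule to an unseen input.
predicted = slope * 100.0 + intercept
print(f"slope={slope:.3f}, intercept={intercept:.3f}, predicted={predicted:.1f}")
```

The two numbers returned by the fit are the model's internal parameters; everything the model "knows" about the data is stored in them.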

The parameters of Machine Learning models are adjusted according to the training data (i.e., seen data). For example, when training a model to predict house prices, the training data would include features like square footage, number of bedrooms, and location, along with their actual sale prices. The training data consists of vectors of attribute values.

Training Process

In classification problems, the prediction (i.e., the model's output) is one of a finite set of values (e.g., sunny/cloudy/rainy or true/false). In regression problems, the model's output is a number.

ML Model

Predictions can deviate from the expected values. Prediction errors are quantified by a loss function that indicates to the algorithm how far the prediction is from the target value. This difference is used to update the machine learning model's internal parameters to minimise the prediction errors. Common examples include Mean Squared Error for regression problems and Cross-Entropy for classification problems.

Training Process
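
The two loss functions named above can be written in a few lines of NumPy. This is a sketch, assuming binary labels for the cross-entropy case; the function names and sample values are illustrative.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average squared difference, for regression targets."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average negative log-likelihood for 0/1 labels and predicted probabilities."""
    y_true = np.asarray(y_true, float)
    p = np.clip(np.asarray(p_pred, float), eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

mse = mean_squared_error([3.0, 5.0], [2.0, 7.0])       # (1 + 4) / 2
bce_good = binary_cross_entropy([1, 0], [0.9, 0.1])    # confident and right
bce_bad = binary_cross_entropy([1, 0], [0.1, 0.9])     # confident and wrong
print(mse, bce_good, bce_bad)
```

Cross-entropy punishes confident wrong answers much more heavily than hesitant ones, which is why `bce_bad` is far larger than `bce_good`.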

Three types of feedback can be part of the inputs in our training process:

  • Supervised Learning: The model is trained on labelled data to learn a mapping between inputs and the corresponding labels, so the model can make predictions on unseen data.
  • Unsupervised Learning: The model is trained on unlabelled data, and it must find patterns in the data. The goal is to identify hidden structures or groupings in the data.
  • Reinforcement Learning: The model learns by interacting with an environment and receiving rewards or penalties. The goal is to learn a policy that maximises the reward.
Training Process

Supervised Learning

The agent observes input-output pairs and learns a function that maps from input to output (label).

Given a training set of N example input-output pairs

$(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$

Each pair was generated by an unknown function f

$y = f(x)$

The goal is to discover a function h that approximates the true function f.

h is called a hypothesis. We search for h in a hypothesis space H of possible functions

$h \in H$

A consistent hypothesis maps each training input to its ground truth

$h(x_i) = y_i \quad \text{for all } i = 1, \ldots, N$

We cannot expect an exact match to the ground truth. We look for a best-fit function that generalises well, that is, a hypothesis h that accurately predicts the outputs for unseen inputs (i.e., the test set).

Our goal is to select a hypothesis h that will optimally fit future examples. We assume that future examples will be like past examples (i.e., the stationarity assumption)

Each example $E_j$ has the same prior probability distribution

$P(E_j) = P(E_{j+1}) = P(E_{j+2}) = \cdots$

Each example is independent of previous examples

$P(E_j) = P(E_j \mid E_{j-1}, E_{j-2}, \ldots)$

Examples that satisfy these equations are independent and identically distributed (i.e., iid).

Our goal is to select a hypothesis h that will optimally fit future examples. h is an optimal fit if it minimises the error rate

The error rate is the proportion of times that h produces the wrong output for an example (x, y), that is

$h(x) \neq y$

Finding a good hypothesis means choosing a good hypothesis space H and then finding the best hypothesis h within it during training.

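Loss functions quantify the difference between predicted values $\hat{y}$ and expected values $y$ (i.e., the ground truth)

Absolute value of the difference

$L_1(y, \hat{y}) = |y - \hat{y}|$

Squared difference

$L_2(y, \hat{y}) = (y - \hat{y})^2$

Zero-one loss (classification error)

$L_{0/1}(y, \hat{y}) = 0 \text{ if } y = \hat{y}, \text{ else } 1$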
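
To see how these three losses score the same hypothesis, the sketch below averages each one over a small set of examples. The hypothesis `h`, the data, and the helper `empirical_loss` are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

def absolute(y, y_hat):
    return abs(y - y_hat)

def squared(y, y_hat):
    return (y - y_hat) ** 2

def zero_one(y, y_hat):
    return 0.0 if y == y_hat else 1.0

def empirical_loss(loss, h, xs, ys):
    """Average loss of hypothesis h over a set of examples."""
    return float(np.mean([loss(y, h(x)) for x, y in zip(xs, ys)]))

def h(x):
    return 2 * x                # a hypothetical hypothesis

xs = [1, 2, 3, 4]
ys = [2, 5, 6, 6]               # made-up ground truth

for loss in (absolute, squared, zero_one):
    print(loss.__name__, empirical_loss(loss, h, xs, ys))
```

The squared loss weighs the single large miss more heavily than the absolute loss does, while the zero-one loss only counts how many predictions were wrong.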

Underfitted Model
Underfitted Model - AAStein, CC BY-SA 4.0 , via Wikimedia Commons

Underfitting occurs when our hypothesis space H is too simple to capture the true function f


Even the best hypothesis h* in H will have high error because:



where epsilon is some small positive number.

Overfitting
Overfitting - ThirdOrderLogic, CC BY 4.0 , via Wikimedia Commons

Overfitting occurs when our hypothesis space H is too complex, leading to:

Low empirical loss but high generalisation loss:

$\text{EmpLoss}(h^*) \ll \text{GenLoss}(h^*)$

The hypothesis h* memorises the training examples instead of learning the true function f.
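
A small experiment can make the contrast between hypothesis spaces concrete. The sketch below fits polynomials of increasing degree to noisy samples of an assumed true function (a sine curve, chosen only for illustration) and reports training and validation error for each.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(-1.0, 1.0, n)
    y = np.sin(3.0 * x) + rng.normal(0.0, 0.2, n)   # noisy samples of an assumed f
    return x, y

x_train, y_train = make_data(20)
x_val, y_val = make_data(200)

def mse(model, x, y):
    return float(np.mean((model(x) - y) ** 2))

results = {}
for degree in (1, 3, 12):
    model = Polynomial.fit(x_train, y_train, deg=degree)   # least-squares fit
    results[degree] = (mse(model, x_train, y_train), mse(model, x_val, y_val))
    print(f"degree {degree:2d}: train={results[degree][0]:.4f}, "
          f"validation={results[degree][1]:.4f}")
```

Training error can only go down as the degree grows, because each larger space contains the smaller ones. Validation error typically falls and then rises again, which is the signature of overfitting.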

Regularisation transforms the loss function into a cost function that penalises complexity to avoid overfitting

$\text{Cost}(h) = \text{EmpLoss}(h) + \lambda \, \text{Complexity}(h)$

The choice of complexity function depends on the hypothesis space H. For polynomials, a good example is a function that returns the sum of the squares of the coefficients.

Given the new cost function

$\text{Cost}(h) = \text{EmpLoss}(h) + \lambda \, \text{Complexity}(h)$

The best hypothesis h* is the one that minimises the cost:

$h^* = \arg\min_{h \in H} \text{Cost}(h)$

where λ is a hyperparameter that controls the trade-off between fitting the data and model complexity; it serves as a conversion rate between loss and complexity.

Other ways to mitigate overfitting include feature selection, hyperparameter tuning, and splitting the dataset into training, validation, and test sets (e.g., cross-validation).
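
One way to try this out is ridge regression, which penalises the sum of squared coefficients. The sketch below uses scikit-learn's `Ridge`, whose `alpha` parameter plays the role of λ here. The data, the polynomial degree, and the three λ values are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(30, 1))
y = np.sin(3.0 * x[:, 0]) + rng.normal(0.0, 0.2, size=30)

# Degree-9 polynomial features give a deliberately flexible hypothesis space.
features = PolynomialFeatures(degree=9, include_bias=False).fit_transform(x)

norms = {}
for lam in (0.001, 1.0, 100.0):
    # scikit-learn calls the penalty weight `alpha`.
    model = Ridge(alpha=lam).fit(features, y)
    norms[lam] = float(np.sum(model.coef_ ** 2))
    print(f"lambda={lam}: sum of squared coefficients = {norms[lam]:.4f}")
```

A larger λ shrinks the coefficients towards zero, trading some fit on the training data for a simpler function.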

The regression problem involves predicting a continuous numerical value. Regression models approximate a function f that maps input features to a continuous output.

Regression Fit
Linear regression fit.

Linear Regression

Training Dataset

Hypothesis Space: All possible linear functions of continuous-valued inputs and outputs.

Hypothesis:

Loss Function:

Cost Function:
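
One common way to fill in these entries is sketched below. It assumes a squared-error loss and the sum-of-squares complexity penalty mentioned earlier; the symbols $w$, $x_j$, $y_j$, $N$, $n$ and $\lambda$ are notation chosen here and may differ from the original slide.

```latex
\begin{align*}
\text{Training dataset:} \quad & \{(x_j, y_j)\}_{j=1}^{N} \\
\text{Hypothesis:} \quad & h_w(x) = w_0 + w_1 x_1 + \cdots + w_n x_n \\
\text{Loss function:} \quad & \text{Loss}(h_w) = \sum_{j=1}^{N} \bigl(y_j - h_w(x_j)\bigr)^2 \\
\text{Cost function:} \quad & \text{Cost}(h_w) = \text{Loss}(h_w) + \lambda \sum_{i} w_i^2
\end{align*}
```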

Gradient Descent Algorithm
Gradient Descent Algorithm - Jacopo Bertolotti, CC0, via Wikimedia Commons.

Linear Regression

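Analytical Solution:

$w^* = (X^\top X)^{-1} X^\top y$

where X is the matrix of inputs (one row per example) and y is the vector of outputs.

Gradient Descent Algorithm:

Initialise w randomly
repeat
    for each w[i] in w
        Compute partial derivative: g[i] = ∂Loss(w)/∂w[i]
    for each w[i] in w
        Update weight: w[i] = w[i] - α * g[i]
until convergence

Hyperparameters: learning rate (α), number of epochs, and batch size.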
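
The two routes can be compared on synthetic data. This sketch assumes a mean squared error loss and full-batch updates, and it starts from zero weights rather than random ones so the run is reproducible; the data and the values of `alpha` and `epochs` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.uniform(0.0, 1.0, 100)])  # bias column + one feature
y = 2.0 + 3.0 * X[:, 1] + rng.normal(0.0, 0.1, 100)

# Analytical solution via least squares (numerically safer than inverting X^T X).
w_exact, *_ = np.linalg.lstsq(X, y, rcond=None)

# Batch gradient descent on the mean squared error.
alpha, epochs = 0.3, 5000
w = np.zeros(2)
for _ in range(epochs):
    gradient = 2.0 / len(y) * X.T @ (X @ w - y)   # all partial derivatives at once
    w = w - alpha * gradient                      # simultaneous update of every weight

print("analytical:      ", w_exact)
print("gradient descent:", w)
```

Both approaches land on the same weights here. Gradient descent matters when the dataset is too large for the analytical route or when the model has no closed-form solution.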

Linear classifiers are models that separate data into classes using a linear decision boundary. These classifiers are functions that decide whether an input (i.e., a vector of numbers) belongs to a specific class.

SVM Separating Hyperplanes
SVM Separating Hyperplanes - Cyc, Public domain, via Wikimedia Commons.

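A decision boundary is a line (or a surface in higher dimensions) that separates data into classes.

The hypothesis is the result of passing a linear function through a threshold function:

$h_w(x) = \text{Threshold}(w \cdot x), \quad \text{Threshold}(z) = 1 \text{ if } z \geq 0, \text{ else } 0$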

Regression Fit
Threshold Function.

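The optimal hypothesis is the one that minimises the loss function:

$h^* = \arg\min_{h \in H} \text{Loss}(h)$

where H is the hypothesis space.

Partial derivatives and gradient methods do not work here: the threshold function is discontinuous at zero and flat everywhere else, so its gradient gives no direction to follow. Instead, we apply the perceptron learning rule.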

Perceptron update rule:

$w_i \leftarrow w_i + \alpha \, (y - h_w(x)) \, x_i$

where α is the learning rate. The rule is applied one example at a time, choosing examples at random (as in stochastic gradient descent).
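
A minimal sketch of the perceptron rule on a toy problem follows. The data (logical OR), the learning rate of 1.0, and the number of steps are assumptions made for illustration; the rule is only guaranteed to settle when the classes are linearly separable, as they are here.

```python
import numpy as np

def threshold(z):
    """Hard threshold: 1 when z >= 0, otherwise 0."""
    return np.where(z >= 0, 1.0, 0.0)

# Toy linearly separable data: logical OR, with a leading 1 for the bias weight.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([0, 1, 1, 1], dtype=float)

rng = np.random.default_rng(0)
w = np.zeros(3)
alpha = 1.0

for _ in range(1000):
    j = rng.integers(len(y))               # pick one example at random
    error = y[j] - threshold(X[j] @ w)     # 0 if correct, +1 or -1 if wrong
    w = w + alpha * error * X[j]           # perceptron update rule

predictions = threshold(X @ w)
print("weights:", w)
print("predictions:", predictions)
```

The weights only change when the current example is misclassified, so once every example is classified correctly the updates stop on their own.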

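An alternative to the discrete threshold function is the logistic function, which is smooth and differentiable. The process of finding the optimal hypothesis is called logistic regression.

$h_w(x) = \text{Logistic}(w \cdot x) = \frac{1}{1 + e^{-w \cdot x}}$

$h^* = \arg\min_{h \in H} \text{Loss}(h)$

where H is the hypothesis space.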

Logistic Curve
Logistic Curve - Qef, Public domain, via Wikimedia Commons.

Gradient Descent Algorithm
Gradient Descent Algorithm - Jacopo Bertolotti, CC0, via Wikimedia Commons.

Gradient Descent for Logistic Regression

The gradient of the loss function is:



Update rule for each weight $w_i$:

$w_i \leftarrow w_i - \alpha \, \frac{\partial}{\partial w_i} \text{Loss}(w)$

where α is the learning rate.
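
The sketch below runs gradient descent for logistic regression under one specific assumption: the loss is the mean cross-entropy, whose gradient is $X^\top (h_w(X) - y) / N$. A squared-error loss would give a different gradient. The synthetic data and the values of `alpha` and the iteration count are illustrative.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n = 200
feature = rng.normal(0.0, 1.0, n)
X = np.column_stack([np.ones(n), feature])                  # bias column + one feature
y = (feature + rng.normal(0.0, 0.5, n) > 0).astype(float)   # noisy synthetic labels

def cross_entropy(w):
    p = np.clip(logistic(X @ w), 1e-12, 1 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

alpha = 0.5
w = np.zeros(2)
loss_before = cross_entropy(w)
for _ in range(500):
    gradient = X.T @ (logistic(X @ w) - y) / n   # gradient of the mean cross-entropy
    w = w - alpha * gradient                     # gradient descent update
loss_after = cross_entropy(w)

accuracy = float(np.mean((logistic(X @ w) >= 0.5) == (y == 1)))
print(f"loss {loss_before:.3f} -> {loss_after:.3f}, training accuracy {accuracy:.2f}")
```

Because the labels are noisy, the classes overlap and the loss cannot reach zero; the weights settle at finite values instead of growing without bound.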

Neural networks extend the perceptron by introducing multiple layers of interconnected neurons (i.e., multi-layer perceptron), enabling the learning of complex, non-linear relationships in data.

Two-layer neural network
Two-layer neural network — (Bishop, 2006).

Why go beyond linear?

  • Linear / logistic: fast, interpretable, limited shape
  • Small MLP: more flexible, harder to explain
  • Training is costly (many weights); inference is cheap once trained

Architecture (feedforward)

  • Input layer: features (e.g. department, month, year)
  • Hidden layer(s): weighted sum + nonlinear activation
  • Output layer: prediction (e.g. employment rate)

Information flows forward: inputs → hidden units → output.
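
The forward flow can be sketched in NumPy for a network with one hidden layer. The layer sizes, the tanh activation, and the random untrained weights are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_hidden = 3, 4

# Randomly initialised weights and biases (untrained, for illustration only).
W1 = rng.normal(0.0, 0.5, size=(n_inputs, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(0.0, 0.5, size=n_hidden)
b2 = 0.0

def forward(X):
    hidden = np.tanh(X @ W1 + b1)   # hidden layer: weighted sum + nonlinear activation
    return hidden @ W2 + b2         # output layer: linear, suited to a regression target

X = np.array([[0.2, 0.5, 1.0],
              [0.9, 0.1, 0.0]])     # two made-up examples with three features each
output = forward(X)
print(output.shape, output)
```

Without the nonlinear activation, the two layers would collapse into a single linear function, and the network could not represent anything a linear model cannot.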

Gradient descent
Gradient descent — Jacopo Bertolotti, CC0, via Wikimedia Commons.

Training a neural network

  • Feedforward: pass inputs forward layer by layer, applying weights and activation functions, to compute the final prediction
  • Loss function: as before, compare the prediction with the actual value
  • Backpropagation: apply the chain rule to compute gradients through the layers efficiently, then update the weights by gradient descent

Hyperparameters: hidden size, learning rate, max iterations.
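
The three steps can be sketched end to end for a network with one hidden layer. This is a minimal illustration, assuming a tanh activation, a mean squared error loss, and full-batch gradient descent; the data, layer sizes, learning rate, and iteration count are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]           # made-up regression target

params = {
    "W1": rng.normal(0.0, 0.5, size=(3, 4)), "b1": np.zeros(4),
    "W2": rng.normal(0.0, 0.5, size=4),      "b2": np.zeros(1),
}

def loss_and_grads(p):
    # Feedforward
    hidden = np.tanh(X @ p["W1"] + p["b1"])
    pred = hidden @ p["W2"] + p["b2"][0]
    # Loss function: mean squared error
    error = pred - y
    loss = float(np.mean(error ** 2))
    # Backpropagation: chain rule, from the output back to the first layer
    d_pred = 2.0 * error / len(y)
    d_hidden = np.outer(d_pred, p["W2"]) * (1.0 - hidden ** 2)  # tanh'(z) = 1 - tanh(z)^2
    grads = {
        "W2": hidden.T @ d_pred, "b2": np.array([d_pred.sum()]),
        "W1": X.T @ d_hidden,    "b1": d_hidden.sum(axis=0),
    }
    return loss, grads

loss_start, _ = loss_and_grads(params)
for _ in range(300):                                      # max iterations
    _, grads = loss_and_grads(params)
    for name in params:
        params[name] = params[name] - 0.05 * grads[name]  # learning rate 0.05
loss_end, _ = loss_and_grads(params)
print(f"loss {loss_start:.4f} -> {loss_end:.4f}")
```

Each gradient reuses quantities already computed during the forward pass (`hidden`, `error`), which is what makes backpropagation efficient compared with differentiating each weight separately.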

Accuracy vs explainability

  • The neural network may be more accurate
  • The linear model is easier to explain to a decision-maker
  • For a public decision, explainability often wins

No model removes uncertainty: there is no Laplace's demon. We report what the model supports and what it cannot support.

Regression Fit Two-layer neural network
Linear Models vs Neural Networks

Train, validate, test

  • Train: fit the model
  • Validation: choose between models
  • Test: judge once, at the end
from sklearn.model_selection import train_test_split

# hold out 20% of the data as the test set first
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

# split the remaining 80% into train and validation;
# 0.25 of the rest is 20% of the total, giving a 60/20/20 split
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

Start simple: a linear model

  • Transparent and fast
  • Report R² and MAE on validation
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error

linear = LinearRegression().fit(X_train, y_train)
pred = linear.predict(X_val)

print("R2 :", r2_score(y_val, pred))
print("MAE:", mean_absolute_error(y_val, pred))

More flexible: a small neural network

  • Scale the inputs first (learn scale on train only)
  • An MLP can fit non-linear shapes
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor

scaler = StandardScaler().fit(X_train)
mlp = MLPRegressor(hidden_layer_sizes=(8,),
                   max_iter=500, random_state=42)
mlp.fit(scaler.transform(X_train), y_train)
pred = mlp.predict(scaler.transform(X_val))

Choose, then judge once

  • Pick the model with lower validation error
  • Report test error only at the end
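# validation MAE for each candidate (both were fitted on the train split)
mae_linear = mean_absolute_error(y_val, linear.predict(X_val))
mae_mlp = mean_absolute_error(y_val, mlp.predict(scaler.transform(X_val)))

# choose on validation, judge once on test
if mae_linear <= mae_mlp:
    model, name, X_test_ready = linear, "linear", X_test
else:
    # the MLP was trained on scaled inputs, so scale the test set the same way
    model, name, X_test_ready = mlp, "mlp", scaler.transform(X_test)

print("chosen:", name)
print("test R2:", r2_score(y_test, model.predict(X_test_ready)))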

Conclusions

  • Analytics turns ingested data into evidence
  • Dashboards are governance: layers, audiences, suppression
  • The Address stage can include prediction models
  • Supervised learning
Data science methodology: Access, Assess, Address
Big data pipeline, analytics stage

This Week Notebooks

Next Week Project

Deliverables

  • Group presentation in our next session: ~20 min (slides + live demo) and ~10 min Q&A.
  • Zip file with code that implements the pipeline, PDF project report, and governance files.
  • Individual reflection on final project

Many Thanks!

chc79@cam.ac.uk
