Session 9 - The Transformer Architecture

The Transformer Architecture

Christian Cabrera Jojoa

Senior Research Associate and Affiliated Lecturer

Department of Computer Science and Technology

University of Cambridge

chc79@cam.ac.uk

Session 9 - The Transformer Architecture

Last Time

Session 9 - The Transformer Architecture

Hyperparameter Tuning

Deep Neural Network
Deep Neural Network with multiple hidden layers - QuantuMechaniX8, CC0, via Wikimedia Commons

There is no single, straightforward answer for determining the "right" hyperparameter values. However, experts follow similar steps:

  1. Become one with the data
  2. Set up the end-to-end training/evaluation skeleton
  3. Start with a simple model: for most tasks, a fully connected neural network with one hidden layer.
  4. Implement a complex model that overfits and regularise
  5. Tune the hyperparameters
  6. Continue training

See more in Karpathy's recipe and Tobin's lecture.

Session 9 - The Transformer Architecture

Reinforcement Learning

Reinforcement learning is a computational approach to learning from interaction with an environment to achieve a goal through trial and error. An agent learns to make decisions by receiving rewards or penalties (i.e., reinforcements) for its actions.

Session 9 - The Transformer Architecture

Reinforcement Learning

Search Agent Uninformed
Agent in an uninformed search problem.

The agent aims to maximise the rewards from its actions. Rewards can be immediate or sparse.

Providing a reward signal is easier than providing labelled examples (as in supervised learning).

RL Framework
Reinforcement Learning Framework - Megajuice, CC0, via Wikimedia Commons.
Session 9 - The Transformer Architecture

Reinforcement Learning

RL Framework
Reinforcement Learning Framework - Megajuice, CC0, via Wikimedia Commons.

The RL framework is composed of:


  • Agent: The learner and decision maker
  • Environment: The world in which the agent operates
  • State: Current situation of the environment
  • Action: What the agent can do
  • Reward: Feedback from the environment

The environment is stochastic, meaning that the outcomes of actions taken by the agent in each state are not deterministic.

Session 9 - The Transformer Architecture

Reinforcement Learning

MDP Process
Markov Decision Process - waldoalvarez, CC BY-SA 4.0 , via Wikimedia Commons.

Markov Decision Process (MDP):

A mathematical framework for modelling sequential decision problems for fully observable, stochastic environments. The outcomes are partly random and partly under the control of a decision maker.

Session 9 - The Transformer Architecture

Reinforcement Learning

RL Framework
Reinforcement Learning Framework - Megajuice, CC0, via Wikimedia Commons.

An MDP is a 4-tuple:

$$(S, A, P, R)$$

Where:

  • $S$ is a set of states with initial state $s_0$
  • $A(s)$ is a set of actions available in each state $s$
  • $P(s' \mid s, a)$ is a transition model that gives the probability of reaching $s'$ if the agent is in $s$ and performs action $a$
  • $R(s, a, s')$ is the reward function that gives the reward for every transition from $s$ to $s'$ through $a$

Session 9 - The Transformer Architecture

Reinforcement Learning


The transition model describes the outcome of each action in each state. Since the outcome is stochastic, we write:

$$P(s' \mid s, a)$$

Transitions are Markovian: the probability of reaching $s'$ from $s$ depends only on $s$ and not on the history of earlier states.

Uncertainty once again brings MDPs closer to reality when compared against deterministic approaches.

MDP Process
Markov Decision Process - waldoalvarez, CC BY-SA 4.0 , via Wikimedia Commons.
Session 9 - The Transformer Architecture

Reinforcement Learning


From every transition, the agent receives a reward:

$$r = R(s, a, s')$$

The agent wants to maximise the sum of the received rewards (i.e., the utility function):

$$U([s_0, a_0, s_1, a_1, s_2, \ldots]) = \sum_{t} R(s_t, a_t, s_{t+1})$$

The utility function depends on a sequence of states and actions, named the environment history.

MDP Process
Markov Decision Process - waldoalvarez, CC BY-SA 4.0 , via Wikimedia Commons.
Session 9 - The Transformer Architecture

Reinforcement Learning


The solution to this problem is called a policy: a mapping from states to actions that specifies what the agent should do in any state it might reach.

Deterministic Policy:

$$\pi(s) = a$$

Stochastic Policy:

$$\pi(a \mid s)$$

where $\pi(a \mid s)$ is the probability of taking action $a$ in state $s$.

RL Policy
Reinforcement Learning Policy.
Session 9 - The Transformer Architecture

Reinforcement Learning


The quality of a policy in a given state is measured by the expected utility of the possible environment histories generated by that policy. We can compute the utility of a state sequence using additive (discounted) rewards as follows:

$$U([s_0, a_0, s_1, a_1, \ldots]) = \sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t, s_{t+1})$$

where $\gamma \in [0, 1]$ is the discount factor that determines the importance of future rewards.

The expected utility of executing the policy $\pi$, starting in state $s$, is given by:

$$U^{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t, s_{t+1})\right]$$

The expectation $\mathbb{E}$ is with respect to the probability distribution over state sequences determined by $s$ and $\pi$.
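
As a quick illustration, here is a minimal sketch (in Python, with made-up rewards) of the discounted utility computation:

```python
# A minimal sketch: the discounted utility of a reward sequence (gamma = 0.9).
# The reward values are made up for illustration.
def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# A reward of 1.0 received three steps in the future is worth gamma^3 = 0.729:
print(discounted_return([0.0, 0.0, 0.0, 1.0]))
```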

Session 9 - The Transformer Architecture

Reinforcement Learning


We can compare policies at a given state using their expected utilities:

$$U^{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t, s_{t+1})\right]$$

The expectation $\mathbb{E}$ is with respect to the probability distribution over state sequences determined by $s$ and $\pi$.

The goal is to select the policy that maximises the expected utility:

$$\pi^{*}_{s} = \operatorname*{arg\,max}_{\pi} U^{\pi}(s)$$

The policy $\pi^{*}_{s}$ recommends an action for every state in the sequence starting in state $s$.

Session 9 - The Transformer Architecture

Reinforcement Learning


The utility function allows the agent to select actions using the principle of maximum expected utility. The agent chooses the action that maximises the reward for the next step plus the expected discounted utility of the subsequent state:

$$\pi^{*}(s) = \operatorname*{arg\,max}_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\,\big[ R(s, a, s') + \gamma\, U(s') \big]$$

The utility of a state is the expected reward for the next transition plus the discounted utility of the next state, assuming that the agent chooses the optimal action. The utility of a state $s$ is given by:

$$U(s) = \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\,\big[ R(s, a, s') + \gamma\, U(s') \big]$$

This is called the Bellman Equation, after Richard Bellman (1957).
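
The Bellman equation can be applied directly as an update rule. Below is a minimal value-iteration sketch; the two-state MDP, its transition probabilities, and its rewards are made up for illustration:

```python
# Value iteration on a toy two-state MDP (all numbers are illustrative).
GAMMA = 0.9
STATES = ["s0", "s1"]
ACTIONS = ["a", "b"]
# P[(s, a)] = list of (next_state, probability)
P = {("s0", "a"): [("s0", 0.5), ("s1", 0.5)],
     ("s0", "b"): [("s0", 1.0)],
     ("s1", "a"): [("s1", 1.0)],
     ("s1", "b"): [("s0", 0.5), ("s1", 0.5)]}
# R[(s, a, s')] = reward for that transition
R = {("s0", "a", "s0"): 0.0, ("s0", "a", "s1"): 5.0, ("s0", "b", "s0"): 1.0,
     ("s1", "a", "s1"): 2.0, ("s1", "b", "s0"): 0.0, ("s1", "b", "s1"): 2.0}

U = {s: 0.0 for s in STATES}
for _ in range(100):
    # Bellman update: best action's expected reward plus discounted utility
    U = {s: max(sum(p * (R[(s, a, s2)] + GAMMA * U[s2]) for s2, p in P[(s, a)])
                for a in ACTIONS)
         for s in STATES}
print(U)  # converged utilities U(s)
```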

Session 9 - The Transformer Architecture

Reinforcement Learning

Value Function
Value Function.
Session 9 - The Transformer Architecture

Reinforcement Learning


Another important quantity is the action-utility function, or Q-function: the expected utility of taking a given action in a given state:

$$Q(s, a) = \sum_{s'} P(s' \mid s, a)\,\big[ R(s, a, s') + \gamma \max_{a'} Q(s', a') \big]$$

The Q-function tells us how good it is to take action $a$ in state $s$.

The optimal policy can be extracted from $Q$ as follows:

$$\pi^{*}(s) = \operatorname*{arg\,max}_{a} Q(s, a)$$
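
Q-Learning estimates this function from experience alone, without knowing $P$ or $R$. A minimal tabular sketch, assuming a toy chain environment written only for illustration:

```python
import random
from collections import defaultdict

class ChainEnv:
    """Toy environment (made up): positions 0..4; reaching 4 yields reward 1."""
    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):  # action 0 = left, 1 = right
        self.pos = max(0, min(4, self.pos + (1 if action == 1 else -1)))
        done = self.pos == 4
        return self.pos, (1.0 if done else 0.0), done

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.5, n_actions=2):
    Q = defaultdict(float)  # Q[(state, action)], initialised to 0
    for _ in range(episodes):
        state, done, steps = env.reset(), False, 0
        while not done and steps < 100:
            if random.random() < epsilon:  # explore
                action = random.randrange(n_actions)
            else:                          # exploit the current estimates
                action = max(range(n_actions), key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            best_next = max(Q[(next_state, a)] for a in range(n_actions))
            # Temporal-difference update towards reward + discounted best next Q
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state, steps = next_state, steps + 1
    return Q

Q = q_learning(ChainEnv())
print({s: max(range(2), key=lambda a: Q[(s, a)]) for s in range(5)})  # greedy policy
```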

Session 9 - The Transformer Architecture

Reinforcement Learning

Q-Matrix
Q-Matrix
Session 9 - The Transformer Architecture

Reinforcement Learning

RL Framework
Reinforcement Learning Framework - Megajuice, CC0, via Wikimedia Commons.

Model-Based RL Agent

  • Knows the transition model and reward function
  • Can simulate outcomes before taking actions
  • Examples: Value Iteration, Policy Iteration

Model-Free RL Agent

  • Unknown transition model and reward function
  • Cannot simulate outcomes
  • Examples: Q-Learning, DQN
Session 9 - The Transformer Architecture

Reinforcement Learning

Policy Iteration (value function): model-based, with guaranteed convergence for finite and discrete problems.

MDP Process
Markov Decision Process - waldoalvarez, CC BY-SA 4.0, via Wikimedia Commons.

Q-Value (Q-function): model-free and simple, for small and discrete problems.

Q-Matrix
Transition Matrix (Q).

DQN (neural network): model-free and complex, for large and continuous problems.



Deep Neural Network
Deep Neural Network with multiple hidden layers - QuantuMechaniX8, CC0, via Wikimedia Commons
Session 9 - The Transformer Architecture

Reinforcement Learning

DQN Exercise Scenario

RL Policy
Session 9 - The Transformer Architecture

Reinforcement Learning

DQN Results
Deep Q-Network (DQN) Results - ε-decay = 0.9999
Session 9 - The Transformer Architecture

Reinforcement Learning

DQN Results
DQN Results
Deep Q-Network (DQN) Results - ε-decay = 0.9999
Session 9 - The Transformer Architecture

The Transformer Architecture

Session 9 - The Transformer Architecture

The Transformer Architecture

The Transformer is a deep neural network architecture based on the multi-head attention mechanism, introduced by researchers at Google (Vaswani et al., 2017). The original goal was to improve machine translation tasks based on language modelling.

Session 9 - The Transformer Architecture

The Transformer Architecture

Transformer Architecture
Transformer Architecture - dvgodoy, CC BY 4.0 , via Wikimedia Commons
Session 9 - The Transformer Architecture

The Transformer Architecture

In Natural Language Processing (NLP), language modelling uses machine learning models (e.g., deep learning) to predict the next token in a sentence.

  • Autoencoding tasks to fill in missing words (i.e., masked tokens)
  • Autoregressive tasks to generate the next token in a sentence
Transformer Architecture
Transformer Architecture - dvgodoy, CC BY 4.0 , via Wikimedia Commons
Session 9 - The Transformer Architecture

The Transformer Architecture

The main idea is to pay attention to the context of each word in a sentence when modelling language. For example, if the context is "Thanks for all the" and we want to know how likely the next word is "fish":

$$P(\text{fish} \mid \text{Thanks for all the})$$
Transformer Architecture
Transformer Architecture - dvgodoy, CC BY 4.0 , via Wikimedia Commons
Session 9 - The Transformer Architecture

The Transformer Architecture

The main idea is to pay attention to the context of each word in a sentence when modelling language. For example, if the context is "Thanks for all the" and we want to know how likely the next word is "fish":

$$P(\text{fish} \mid \text{Thanks for all the})$$

We want to discover the probability distribution over a vocabulary for the next word in a sequence:

$$P(w_t \mid w_{1:t-1})$$

where $w_{1:t-1}$ is the sequence of words previous to $w_t$.

Transformer Architecture
Transformer Architecture - dvgodoy, CC BY 4.0 , via Wikimedia Commons
Session 9 - The Transformer Architecture

The Transformer Architecture

The transformer architecture solves this problem by:

  1. Tokenisation: Convert sentence into tokens.
  2. Input and Positional Embedding: Convert input tokens into ordered embedded vectors.
  3. Self-Attention: Determine the relevance of each word to others in the sequence.
  4. Feed-Forward Neural Network: Pass the attention outputs through a feed-forward neural network to consolidate learnt patterns.
  5. Residual Connections and Layer Normalisation: Apply residual connections and layer normalisation to stabilise and improve training.
  6. Output Layer: Use a linear layer followed by a softmax function to generate the final output probabilities.
Transformer Decoder
Transformer Decoder Architecture - (Jurafsky et al., 2025)
Session 9 - The Transformer Architecture

The Transformer Architecture

1. Tokenisation: Convert sentence into tokens:

Transformer Decoder
Transformer Decoder Architecture - (Jurafsky et al., 2025)
Session 9 - The Transformer Architecture

The Transformer Architecture

1. Tokenisation: Convert sentence into tokens:

For example, consider the sentence: "So long and thanks for".

Tokenisation of this sentence would result in the following tokens:

  • "So"
  • "long"
  • "and"
  • "thanks"
  • "for"

Each word in the sentence is treated as an individual token.
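
A minimal sketch of this word-level tokenisation; production models instead use subword tokenisers (e.g., BPE), which split rare words into smaller units:

```python
# Word-level tokenisation by whitespace, matching the example above.
sentence = "So long and thanks for"
tokens = sentence.split()
print(tokens)  # ['So', 'long', 'and', 'thanks', 'for']
```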

Transformer Decoder
Transformer Decoder Architecture - (Jurafsky et al., 2025)
Session 9 - The Transformer Architecture

The Transformer Architecture

2. Input and Positional Embedding: Convert input tokens into ordered embedded vectors:

Transformer Decoder
Transformer Decoder Architecture - (Jurafsky et al., 2025)
Session 9 - The Transformer Architecture

The Transformer Architecture

2. Input and Positional Embedding: Convert input tokens into ordered embedded vectors:

Consider the tokens from the previous example: "So", "long", "and", "thanks", "for". Each token is converted into a vector using an embedding matrix. For instance:

  • "So" -> [0.1, 0.3, 0.5, 0.7]
  • "long" -> [0.2, 0.4, 0.6, 0.8]
  • "and" -> [0.3, 0.5, 0.7, 0.9]
  • "thanks" -> [0.4, 0.6, 0.8, 1.0]
  • "for" -> [0.5, 0.7, 0.9, 1.1]

The embedding matrix has as many rows as words in a predefined vocabulary and as many columns as dimensions describing a word.
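
A minimal embedding-lookup sketch, reusing the toy 5-word vocabulary and 4-dimensional vectors from the example above:

```python
import numpy as np

vocab = {"So": 0, "long": 1, "and": 2, "thanks": 3, "for": 4}
E = np.array([[0.1, 0.3, 0.5, 0.7],   # one row per vocabulary word,
              [0.2, 0.4, 0.6, 0.8],   # one column per embedding dimension
              [0.3, 0.5, 0.7, 0.9],
              [0.4, 0.6, 0.8, 1.0],
              [0.5, 0.7, 0.9, 1.1]])

tokens = ["So", "long", "and", "thanks", "for"]
X = E[[vocab[t] for t in tokens]]     # look up one embedding row per token
print(X.shape)                        # (5, 4): sequence length x dimensions
```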

Transformer Decoder
Transformer Decoder Architecture - (Jurafsky et al., 2025)
Session 9 - The Transformer Architecture

The Transformer Architecture

2. Input and Positional Embedding: Convert input tokens into ordered embedded vectors:

Positional encoding is then added to these vectors to incorporate the order of the tokens. For example:

  • "So" -> [0.1, 0.3, 0.5, 0.7] + [0.0, 0.1, 0.2, 0.3]
  • "long" -> [0.2, 0.4, 0.6, 0.8] + [0.1, 0.2, 0.3, 0.4]
  • "and" -> [0.3, 0.5, 0.7, 0.9] + [0.2, 0.3, 0.4, 0.5]
  • "thanks" -> [0.4, 0.6, 0.8, 1.0] + [0.3, 0.4, 0.5, 0.6]
  • "for" -> [0.5, 0.7, 0.9, 1.1] + [0.4, 0.5, 0.6, 0.7]

Positional embeddings can be initialised randomly, with each position having its own representation; they are then updated during training.

The resulting vectors are used as input to the transformer model, capturing both the meaning and the position of each word.
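
A minimal sketch of adding (randomly initialised, learnable) positional embeddings to the token embeddings; all values are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 5, 4                  # 5 tokens, 4 dimensions, as above
X = rng.normal(size=(seq_len, d_model))  # token embeddings (stand-in values)
P = rng.normal(size=(seq_len, d_model))  # one positional vector per position
X = X + P                                # element-wise sum: meaning + position
print(X.shape)                           # (5, 4)
```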

Transformer Decoder
Transformer Decoder Architecture - (Jurafsky et al., 2025)
Session 9 - The Transformer Architecture

The Transformer Architecture

3. Self-Attention: Determine the relevance of each word to others in the sequence:

Transformer Decoder
Transformer Decoder Architecture - (Jurafsky et al., 2025)
Session 9 - The Transformer Architecture

The Transformer Architecture

3. Self-Attention: Determine the relevance of each word to others in the sequence:

The meaning of a word represented by the embeddings is influenced by the previous words. We need a mechanism (i.e., a head) to transform the initial meaning of the words accordingly:

$$a_i = \sum_{j \le i} \alpha_{ij}\, x_j$$

$\alpha_{ij}$ is the attention weight, indicating the importance of the $j$-th word in the sequence to the $i$-th word.

Transformer Decoder
Transformer Decoder Architecture - (Jurafsky et al., 2025)
Session 9 - The Transformer Architecture

The Transformer Architecture

3. Self-Attention: Determine the relevance of each word to others in the sequence:

The meaning of a word represented by the embeddings is influenced by the previous words. We need a mechanism (i.e., a head) to transform the initial meaning of the words accordingly:

$$a_i = \sum_{j \le i} \alpha_{ij}\, x_j$$

$\alpha_{ij}$ is the attention weight, indicating the importance of the $j$-th word in the sequence to the $i$-th word.

In the self-attention mechanism, each input embedding can play three distinct roles: query, key, and value.

  • Query: As the current element compared to preceding inputs.
  • Key: As a preceding input compared to the current element.
  • Value: As a value of an element that gets weighted and summed up to compute the output of the current element.

We define three matrices to project each input $x_i$ into a representation of each role:

$$q_i = x_i W^{Q}, \qquad k_i = x_i W^{K}, \qquad v_i = x_i W^{V}$$
Session 9 - The Transformer Architecture

The Transformer Architecture

3. Self-Attention: Determine the relevance of each word to others in the sequence:

The meaning of a word represented by the embeddings is influenced by the previous words. We need a mechanism (i.e., a head) to transform the initial meaning of the words accordingly:

$$\text{head} = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V, \qquad A = \text{head}\; W^{O}$$

$W^{O}$ reshapes the output of the head.

In the self-attention mechanism, each input embedding can play three distinct roles: query, key, and value.

  • Query: As the current element compared to preceding inputs.
  • Key: As a preceding input compared to the current element.
  • Value: As a value of an element that gets weighted and summed up to compute the output of the current element.

We define three matrices to project the whole input $X$ into a representation of each role:

$$Q = X W^{Q}, \qquad K = X W^{K}, \qquad V = X W^{V}$$
Session 9 - The Transformer Architecture

The Transformer Architecture

3. Self-Attention: Determine the relevance of each word to others in the sequence:

The meaning of a word represented by the embeddings is influenced by the previous words. We need a mechanism (i.e., a head) to transform the initial meaning of the words accordingly:

$$\text{head} = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V, \qquad A = \text{head}\; W^{O}$$

$W^{O}$ reshapes the output of the head.

Transformer Decoder
Transformer Decoder Architecture - (Jurafsky et al., 2025)
Session 9 - The Transformer Architecture

The Transformer Architecture

3. Self-Attention: Determine the relevance of each word to others in the sequence:

This is a multi-head attention mechanism, where each head has its own set of key, query, and value matrices:

$$\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^{O}, \qquad \text{head}_i = \operatorname{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i$$

Each head focuses on different aspects of the language: one head can focus on the relationship between adjectives and nouns, another on the relationship between verbs and subjects. These relationships transform the meaning of the input and are learnt from the data as model parameters. The additional meaning is added to the original input through a residual connection. A minimal sketch of the computation follows.
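
A minimal NumPy sketch of causal multi-head self-attention with randomly initialised weights (real models learn $W^Q$, $W^K$, $W^V$, and $W^O$ from data):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)             # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, n_heads, rng):
    seq_len, d_model = X.shape
    d_k = d_model // n_heads
    heads = []
    for _ in range(n_heads):
        W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        scores = Q @ K.T / np.sqrt(d_k)               # scaled dot products
        mask = np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)
        scores[mask] = -np.inf                        # causal mask: no future tokens
        heads.append(softmax(scores) @ V)             # weighted sum of values
    W_O = rng.normal(size=(d_model, d_model))
    return np.concatenate(heads, axis=-1) @ W_O       # project back to d_model

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                           # 5 tokens, d_model = 8
print(multi_head_attention(X, n_heads=2, rng=rng).shape)  # (5, 8)
```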

Transformer Decoder
Transformer Decoder Architecture - (Jurafsky et al., 2025)
Session 9 - The Transformer Architecture

The Transformer Architecture

4. Feed-Forward Neural Network: Pass the attention outputs through a feed-forward neural network to consolidate learnt patterns:

Transformer Decoder
Transformer Decoder Architecture - (Jurafsky et al., 2025)
Session 9 - The Transformer Architecture

The Transformer Architecture

4. Feed-Forward Neural Network: Pass the attention outputs through a feed-forward neural network to consolidate learnt patterns:

A fully connected two-layer network:

$$\text{FFN}(x) = \operatorname{ReLU}(x W_1 + b_1)\, W_2 + b_2$$

The input is first transformed by a linear layer with weights $W_1$ and bias $b_1$. The ReLU activation function is then applied to introduce non-linearity, followed by another linear transformation using weights $W_2$ and bias $b_2$. The FFN helps the model learn complex patterns by combining the attention outputs in a non-linear manner.
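
A minimal sketch of the position-wise feed-forward network with random stand-in weights; each token vector is transformed independently:

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    hidden = np.maximum(0, x @ W1 + b1)   # linear layer followed by ReLU
    return hidden @ W2 + b2               # second linear transformation

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                     # the inner layer is usually wider
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
x = rng.normal(size=(5, d_model))         # 5 token vectors
print(ffn(x, W1, b1, W2, b2).shape)       # (5, 8)
```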

Transformer Decoder
Transformer Decoder Architecture - (Jurafsky et al., 2025)
Session 9 - The Transformer Architecture

The Transformer Architecture

5. Residual Connections and Layer Normalisation:

Transformer Decoder
Transformer Decoder Architecture - (Jurafsky et al., 2025)
Session 9 - The Transformer Architecture

The Transformer Architecture

5. Residual Connections and Layer Normalisation:

Residual connections are used at different stages of the process to retain what the word originally meant whilst enriching it with context.

Transformer Decoder
Transformer Decoder Architecture - (Jurafsky et al., 2025)
Session 9 - The Transformer Architecture

The Transformer Architecture

5. Residual Connections and Layer Normalisation:

Residual connections are used at different stages of the process to retain what the word originally meant whilst enriching it with context.

Layer normalisation is applied to keep the activation values in a range that facilitates gradient descent:

$$\text{LayerNorm}(x) = \gamma\, \frac{x - \mu}{\sigma + \epsilon} + \beta$$

In the equation above, $x$ is the input vector, $\mu$ is the mean of the input, and $\sigma$ is the standard deviation. $\epsilon$ is a small constant added for numerical stability. $\gamma$ and $\beta$ are learnable parameters that scale and shift the normalised value, allowing the model to learn the optimal scale and shift for each feature.
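
A minimal sketch of layer normalisation matching the equation above:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)    # mean over the feature dimension
    sigma = x.std(axis=-1, keepdims=True)  # standard deviation
    return gamma * (x - mu) / (sigma + eps) + beta

x = np.array([[1.0, 2.0, 3.0, 4.0]])
gamma, beta = np.ones(4), np.zeros(4)      # learnable scale and shift
print(layer_norm(x, gamma, beta))          # normalised, then scaled and shifted
```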

Transformer Decoder
Transformer Decoder Architecture - (Jurafsky et al., 2025)
Session 9 - The Transformer Architecture

The Transformer Architecture

6. Output Layer: Use a linear layer followed by a softmax function to generate the final output probabilities:

Transformer Decoder
Transformer Decoder Architecture - (Jurafsky et al., 2025)
Session 9 - The Transformer Architecture

The Transformer Architecture

6. Output Layer: Use a linear layer followed by a softmax function to generate the final output probabilities:

The linear layer applies a learnt weight matrix to the final hidden state, decoding the high-dimensional representation of the input sequence to a vector of logits, one for each possible output token.

Transformer Decoder
Transformer Decoder Architecture - (Jurafsky et al., 2025)
Session 9 - The Transformer Architecture

The Transformer Architecture

6. Output Layer: Use a linear layer followed by a softmax function to generate the final output probabilities:

The linear layer applies a learnt weight matrix to the final hidden state, decoding the high-dimensional representation of the input sequence to a vector of logits, one for each possible output token.

The softmax function is applied to these logits to convert them into probabilities. The softmax function ensures that the output values are between 0 and 1 and that they sum up to 1, making them interpretable as probabilities. This step creates a probability distribution over the vocabulary:

$$\operatorname{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{|V|} e^{z_j}}$$
Transformer Decoder
Transformer Decoder Architecture - (Jurafsky et al., 2025)
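
A minimal sketch of the output layer, assuming a toy 5-word vocabulary and random stand-in weights:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())                     # subtract max for stability
    return e / e.sum()

rng = np.random.default_rng(0)
d_model, vocab_size = 4, 5
h = rng.normal(size=d_model)                    # final hidden state of last token
W_U = rng.normal(size=(d_model, vocab_size))    # learnt unembedding matrix
logits = h @ W_U                                # one logit per vocabulary word
probs = softmax(logits)                         # distribution over the vocabulary
print(probs, probs.sum())                       # values in (0, 1) summing to 1
```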
Session 9 - The Transformer Architecture


The Transformer Architecture

Transformer Decoder
Transformer Decoder Architecture - (Jurafsky et al., 2025)
Session 9 - The Transformer Architecture

The Transformer Architecture

Training Process:

The training process of a Transformer model involves:

  • Loss Calculation: Compare the model's predictions to the actual outputs using a loss function, typically cross-entropy loss for classification tasks.
  • Backpropagation: Compute the gradients of the loss with respect to the model parameters to adjust the model's weights.
Deep Neural Network
Deep Neural Network with multiple hidden layers - QuantuMechaniX8, CC0, via Wikimedia Commons
Session 9 - The Transformer Architecture

The Transformer Architecture

Training Process:

The training process of a Transformer model involves:

  • Optimisation: An optimisation algorithm, such as Adam, is used to update the model's weights based on the computed gradients.
  • Iteration: The forward pass, loss calculation, backpropagation, and optimisation steps are repeated for a number of epochs or until the model converges to a satisfactory performance level.
Gradient Descent Algorithm
Gradient Descent Algorithm - Jacopo Bertolotti, CC0, via Wikimedia Commons.
Session 9 - The Transformer Architecture

The Transformer Architecture

Training Process:

The training process of a Transformer model involves:

  • Optimisation: An optimisation algorithm, such as Adam, is used to update the model's weights based on the computed gradients.
  • Iteration: The forward pass, loss calculation, backpropagation, and optimisation steps are repeated for a number of epochs or until the model converges to a satisfactory performance level.

Throughout this process, various techniques such as dropout and learning rate scheduling may be employed to improve model performance and prevent overfitting.
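
A minimal sketch of a single training step, assuming PyTorch and a stand-in linear model in place of the full Transformer:

```python
import torch
import torch.nn.functional as F

vocab_size, d_model = 100, 16
model = torch.nn.Linear(d_model, vocab_size)   # stand-in for the Transformer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

hidden = torch.randn(8, d_model)               # 8 final hidden states (stand-ins)
targets = torch.randint(0, vocab_size, (8,))   # the actual next tokens

logits = model(hidden)                         # forward pass
loss = F.cross_entropy(logits, targets)        # loss calculation
loss.backward()                                # backpropagation: compute gradients
optimizer.step()                               # optimisation: Adam weight update
optimizer.zero_grad()                          # reset gradients for the next step
```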

Gradient Descent Algorithm
Gradient Descent Algorithm - Jacopo Bertolotti, CC0, via Wikimedia Commons.
Session 9 - The Transformer Architecture

The Transformer Architecture

Training Process:

The training process of a Transformer model can involve RL techniques for specific purposes:

  • Reinforcement Learning from Human Feedback to align the model with human preferences.
  • Task-Specific RL for fine-tuning for specific objectives like dialogue quality, summarisation, etc.
RL Framework
Reinforcement Learning Framework - Megajuice, CC0, via Wikimedia Commons.
Session 9 - The Transformer Architecture

Large Language Models (LLMs)

Session 9 - The Transformer Architecture

Large Language Models (LLMs)

Session 9 - The Transformer Architecture

Large Language Models (LLMs)

Large Language Models (LLMs) are AI models designed to understand, generate, and manipulate human language. They are built using deep learning techniques, are usually based on the Transformer architecture, and are trained on vast amounts of data to capture the complexity of human language. LLMs can perform a wide range of language tasks (e.g., text generation, classification, etc.).

Session 9 - The Transformer Architecture

AI History - Machine Learning Age (2001 - present)

1940 1950 1960 1970 1980 1990 2000 2010 2020 2030 First AI Winter (1974-1980) Second AI Winter (1987-1994) Big Data (2000-2012) Artificial Neuron (McCulloch & Pitts, 1943) Information Theory (Shannon, 1948) Cybernetics (Wiener, 1948) Updating Rule (Hebb, 1949) Computing Machinery and Intelligence (Turing, 1950) SNARC (Minsky, 1951) AI Term (Dartmouth Workshop, 1956) GPS (Newell & Simon, 1957) Advice Taker (McCarthy, 1958) Back-Propagation (Kelley, 1960) Perceptrons (Rosenblatt, 1962) ELIZA (MIT, 1966) ALPAC Report (USA, 1966) DENDRAL (Buchanan, 1969) Perceptrons Book (Minsky & Papert, 1969) PROLOG (1972) MYCIN (Stanford, 1972) Lighthill Report (UK, 1973) FRAMES (1975) Hopfield Net (1982) R1 (McDermott, 1982) Parallel Distributed Processing (Rumelhart & McClelland, 1986) Bayesian Networks (Pearl, 1988) Reinforcement Learning (Sutton, 1988) Image Recognition (LeCun et al., 1990) Deep Blue beats Kasparov (IBM, 1997) Deep Learning (Hinton, 2006) Watson wins Jeopardy (2011) AlexNet (Krizhevsky, 2012) GANs (Goodfellow, 2014) AlphaGo beats Lee Sedol (DeepMind, 2016) Transformer (Vaswani, 2017) AlphaFold (DeepMind, 2018) GPT-1 (OpenAI, 2018) BERT (Google, 2019) Chinchilla (DeepMind, 2022) ChatGPT (OpenAI, 2022) LLaMA (Meta AI, 2023) Claude 2 (Anthropic, 2023) phi-3 (Microsoft, 2024) Gemini 1.5 (Google DeepMind, 2024) Qwen3 (Alibaba, 2025) R1 (DeepSeek, 2025)
Session 9 - The Transformer Architecture

Large Language Models (LLMs)

Transformer Architecture
Transformer Architecture - dvgodoy, CC BY 4.0 , via Wikimedia Commons
Session 9 - The Transformer Architecture

Large Language Models (LLMs)

  • GPT-3: Known for generating human-like text, it can perform tasks such as translation, question answering, and text completion.
  • BERT: Excels in understanding the context of words in a sentence, making it ideal for tasks like sentiment analysis and named entity recognition.
  • T5 (Text-to-Text Transfer Transformer): Converts all NLP tasks into a text-to-text format, enabling it to handle tasks like summarisation and translation.
  • RoBERTa: An optimised version of BERT, it improves performance on tasks like text classification and language inference.
  • ...
Transformer Architecture
Transformer Architecture - dvgodoy, CC BY 4.0 , via Wikimedia Commons
Session 9 - The Transformer Architecture

Large Language Models (LLMs)

Training FLOP
Large-Scale AI Models Training - Epoch AI, CC BY 4.0 , via Wikimedia Commons
Session 9 - The Transformer Architecture

Large Language Models (LLMs)

Training Cost
Estimated Cost - Stanford Institute for Human-Centered Artificial Intelligence (permission obtained by email from the AI index research manager), CC BY-SA 4.0 , via Wikimedia Commons
Session 9 - The Transformer Architecture

Large Language Models (LLMs)

Fine-tuning continues the training of a pre-trained LLM (e.g., GPT-3 or BERT) so it can perform tasks in a particular domain (e.g., healthcare).

Session 9 - The Transformer Architecture

Large Language Models (LLMs)

Fine Tuning Process
Fine Tuning Process
Session 9 - The Transformer Architecture

Large Language Models (LLMs)

It is similar to standard neural network training (a minimal code sketch follows the list):

  1. Data Collection: Gather a large and relevant dataset for the specific domain or task.
  2. Pre-processing: Clean and pre-process the data to ensure it is in a suitable format for training.
  3. Model Selection: Choose a pre-trained LLM that is most suitable for the task at hand.
  4. Supervised Learning: Prompt engineering, error calculation, and adjusting weights using gradient descent and backpropagation.
  5. Evaluation: Assess the performance of the fine-tuned model using appropriate metrics and validation datasets.
  6. Deployment: Deploy the fine-tuned model for use in real-world applications.
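
A minimal fine-tuning sketch, assuming the Hugging Face Transformers library; the model name, toy texts, and labels are illustrative:

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

texts = ["The treatment was effective.", "The patient showed no improvement."]
labels = [1, 0]                                   # toy domain-specific labels

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encodings = tokenizer(texts, truncation=True, padding=True)

class ToyDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.labels)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)            # pre-trained LLM to adapt
args = TrainingArguments(output_dir="out", num_train_epochs=1,
                         per_device_train_batch_size=2)
Trainer(model=model, args=args,
        train_dataset=ToyDataset(encodings, labels)).train()
```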
Fine Tuning Process
Fine Tuning Process
Session 9 - The Transformer Architecture

Large Language Models (LLMs)

It is similar to standard neural network training:

  1. Data Collection: Gather a large and relevant dataset for the specific domain or task.
  2. Pre-processing: Clean and pre-process the data to ensure it is in a suitable format for training.
  3. Model Selection: Choose a pre-trained LLM that is most suitable for the task at hand.
  4. Supervised Learning: Prompt engineering, error calculation, and adjusting weights using gradient descent and backpropagation.
  5. Evaluation: Assess the performance of the fine-tuned model using appropriate metrics and validation datasets.
  6. Deployment: Deploy the fine-tuned model for use in real-world applications.
Data Science Process
Data Science Process
Session 9 - The Transformer Architecture

Large Language Models (LLMs)

Data Assess Pipeline
Session 9 - The Transformer Architecture

Large Language Models (LLMs)

  • Dataset Size: Typically requires tens of gigabytes to terabytes of data.
  • RAM: Depends on the model size (i.e., number of parameters). At least 64GB or 128GB of RAM.
  • GPU: High-performance GPUs such as NVIDIA A100 or V100 for faster training times.
  • CPU: Multi-core processors, ideally with 16 cores or more, to handle data preprocessing and other tasks.
  • Disk Space: Sufficient storage, often in the range of several terabytes, to accommodate datasets and model checkpoints.
  • Network Bandwidth: High-speed internet connection for downloading datasets and model updates.
Fine Tuning Process
Fine Tuning Process
Session 9 - The Transformer Architecture

Large Language Models (LLMs)

Retrieval-Augmented Generation (RAG) is an alternative to fine-tuning that combines pre-trained LLMs with external knowledge sources. Instead of adapting the model to a specific domain, RAG retrieves relevant information from a database or knowledge base to enhance the model's responses in real-time.

Session 9 - The Transformer Architecture

Large Language Models (LLMs)

The process combines approaches from symbolic AI and databases (a minimal code sketch follows the list):

  1. Data Collection: Gather a knowledge base that the RAG system can query.
  2. Pre-processing: Organise the knowledge base to ensure efficient retrieval and LLM integration.
  3. Model Selection: Choose a pre-trained LLM that can integrate with the retrieval system.
  4. Retrieval Integration: Use the knowledge base together with the LLM to respond to queries.
  5. Evaluation: Assess the performance of the RAG system by using appropriate metrics.
  6. Deployment: Deploy the RAG system for real-time applications, ensuring it can access and retrieve information efficiently.
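
A minimal RAG sketch; the embed() function and the toy documents are made-up stand-ins (real systems use learned embeddings, a vector database, and an actual LLM API):

```python
import numpy as np

documents = [
    "Aspirin is commonly used to reduce fever and relieve pain.",
    "The University of Cambridge was founded in 1209.",
]

def embed(text):
    # Hypothetical embedding: hash words into a fixed-size unit vector.
    vec = np.zeros(64)
    for word in text.lower().split():
        vec[hash(word) % 64] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

doc_vectors = np.stack([embed(d) for d in documents])

def retrieve(query, k=1):
    scores = doc_vectors @ embed(query)        # cosine similarity (unit vectors)
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

query = "What is aspirin used for?"
context = "\n".join(retrieve(query))
prompt = (f"Answer using only the context below.\n\n"
          f"CONTEXT:\n{context}\n\nQUESTION: {query}")
print(prompt)  # this assembled prompt would then be sent to the LLM
```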
RAG Process
RAG Process
Session 9 - The Transformer Architecture

Large Language Models (LLMs)

The process combines approaches from symbolic AI and databases:

  1. Data Collection: Gather a knowledge base that the RAG system can query.
  2. Pre-processing: Organise the knowledge base to ensure efficient retrieval and LLM integration.
  3. Model Selection: Choose a pre-trained LLM that can integrate with the retrieval system.
  4. Retrieval Integration: Use the knowledge base together with the LLM to respond to queries.
  5. Evaluation: Assess the performance of the RAG system by using appropriate metrics.
  6. Deployment: Deploy the RAG system for real-time applications, ensuring it can access and retrieve information efficiently.
Data Science Process
Data Science Process
Session 9 - The Transformer Architecture

Large Language Models (LLMs)

Data Assess Pipeline
Session 9 - The Transformer Architecture

Large Language Models (LLMs)

  • Knowledge Base Size: A comprehensive knowledge base can range from several gigabytes to terabytes.
  • RAM: A minimum of 64GB is recommended to facilitate retrieval operations.
  • GPU: Mid-range GPUs can suffice, but high-performance options like NVIDIA A100 can enhance performance.
  • CPU: Multi-core processors for managing data preprocessing and retrieval tasks.
  • Disk Space: Several terabytes may be necessary, depending on the data volume.
  • Network Bandwidth: A high-speed internet connection for accessing external data sources and updating the knowledge base as needed.
RAG Process
RAG Process
Session 9 - The Transformer Architecture

Large Language Models (LLMs)

Prompt Engineering is a lightweight alternative to fine-tuning and RAG. Instead of changing model weights or building a retrieval pipeline, we craft instructions, examples, and constraints (the “prompt”) so that a frozen LLM performs the desired task.

Session 9 - The Transformer Architecture

Large Language Models (LLMs)

Typical prompt-engineering workflow:

  1. Task definition: Specify what output format and style you need.
  2. Baseline prompt: Write a clear instruction (zero-shot) or add 1-5 examples (few-shot).
  3. Iterate and test: Evaluate outputs, add system messages, or reorder examples to reduce errors and bias.
  4. Guardrails: Include refusals, safety clauses, or value alignment statements.
  5. Automation: Use prompt templates or tools like LangChain/LlamaIndex to inject dynamic context.
  6. Deployment: Store the prompt with version control and monitor performance over time.
Prompt Engineering Process
Prompt Engineering Process
Session 9 - The Transformer Architecture

Large Language Models (LLMs)

expected_format = """
Return your answer in JSON with the following keys:
{
  "title": string,          # concise headline (≤ 12 words)
  "summary": [string, ...]  # 3–5 bullet points
}
"""

article = """<ARTICLE TEXT HERE>"""

prompt = f"""
You are a helpful assistant.

TASK: Summarise the article below.

OUTPUT FORMAT (baseline)
{expected_format}

ARTICLE
""" + article
Prompt Engineering Process
Prompt Engineering Process
Session 9 - The Transformer Architecture

Large Language Models (LLMs)

# Illustrative pseudo-code: the client library, model name, and call
# signature mimic a generic completion API; a real SDK will differ.
import gemini

gemini.api_key = "YOUR_API_KEY"

response = gemini.Completion.create(
    engine="gemini-001",   # hypothetical model identifier
    prompt=prompt,         # the prompt assembled on the previous slide
    max_tokens=150,        # cap on the number of generated tokens
    temperature=0.7        # sampling temperature: higher = more varied output
)
print(response.choices[0].text.strip())
Prompt Engineering Process
Prompt Engineering Process
Session 9 - The Transformer Architecture

Agentic AI

Session 9 - The Transformer Architecture

Agentic AI

Agentic AI refers to systems that can make autonomous decisions and take actions to achieve specific goals. The term became popular with the emergence of LLMs and Generative AI.

Session 9 - The Transformer Architecture

Agentic AI

AI Tasks Capabilities
AI Tasks Capabilities - METR, CC BY 4.0 , via Wikimedia Commons
Session 9 - The Transformer Architecture

Agentic AI

Session 9 - The Transformer Architecture

Agentic AI

Agent
AI Agent

AI Agents have existed for decades, with active research communities and open challenges:

Session 9 - The Transformer Architecture

Agentic AI

Agent
AI Agent

AI Agents have existed for decades, with active research communities and open challenges:

  • Reactive Agents: These agents perceive their environment and respond to changes.
  • Deliberative Agents: These agents use symbolic reasoning and planning to make decisions.
  • Hybrid Agents: Combining reactive and deliberative approaches.
  • Multi-agent Systems: Systems where multiple agents interact or work together.
  • Self-adaptive Systems: These systems can modify their behaviour in response to changes in their environment or internal state.

The difference is that, in Agentic AI frameworks, the decisions are made by LLM-based agents.

Session 9 - The Transformer Architecture

Agentic AI

Agent
AI Agent

AI Agents have existed for decades, with active research communities and open challenges:

  • Complex decision making processes
  • Uncertainty management
  • Agent coordination and scalability
  • Algorithms robustness in dynamic environments
  • Decision-making guarantees
  • Control theory
  • Social impact
  • ...

The inclusion of LLM-based agents exacerbates some of these challenges.

Session 9 - The Transformer Architecture

Agentic AI

Session 9 - The Transformer Architecture

Large Language Models (LLMs)

Session 9 - The Transformer Architecture


Large Language Models (LLMs)

How sensitive are LLMs to changes in prompts when generating code?


'''
Write Python code to solve the following problem:

Given two sorted arrays nums1 and nums2 of size m and n
respectively, return the median of the two sorted
arrays. The overall run time complexity should be
O(log (m+n)).
Constraints:
0 <= m <= 1000
0 <= n <= 1000
1 <= m + n <= 2000
-10^6 <= nums1[i], nums2[i] <= 10^6
'''

One change in the prompt was obfuscating the text by adding random noise:


'''
Write Python code to solve the following problem:

hJFFm two a8Gh4E se5Zuz GIJe! and mj<Cq of q&Xs m
and n %FE)2Xt(G#Oy, t$5 tDh the k$rJQH of the two
EPEˆ@W xGeX %E. The (gSFq<: run F(K@ DkN(;ss9r7 W
Bij>v be O (log (m + n) ). 
K@jb$T = = n 
0 <= m <= w000 
0 <= n <= w000 
1 <= m + n <= 1000 
- 106 <= GHnZ@ [i ], jk,e@ [i] <= 106
'''
Session 9 - The Transformer Architecture

Large Language Models (LLMs)

Session 9 - The Transformer Architecture


Large Language Models (LLMs)


What can we conclude?


  • Are LLMs smarter than humans as they produce the correct answer even with obfuscated text?
  • Can LLM agents create their own language to fool us and dominate the world?
  • ...
Session 9 - The Transformer Architecture

Large Language Models (LLMs)

Session 9 - The Transformer Architecture


Large Language Models (LLMs)


The actual conclusions are a bit more boring:


  • LLMs show signs of overfitting. They memorise the training data and generate text accordingly: they relate patterns in the input to the training set and assign probabilities to the next tokens.
  • LLM agents can communicate and exchange messages in formats humans cannot read. This threatens how we control autonomous systems, as they are not transparent.
  • We would expect a human to point out the issues in the text before even trying to provide an answer. Who is right?
Session 9 - The Transformer Architecture

Conclusions

Session 9 - The Transformer Architecture

Conclusions

Session 9 - The Transformer Architecture

Conclusions

Overview

  • Transformer Architecture
  • Self-Attention Mechanism
  • Multi-Head Attention
  • Large Language Models
  • Fine-tuning, RAG, and Prompting
  • Agentic AI
Session 9 - The Transformer Architecture

Conclusions

Overview

  • Transformer Architecture
  • Self-Attention Mechanism
  • Multi-Head Attention
  • Large Language Models
  • Fine-tuning, RAG, and Prompting
  • Agentic AI

Next Time

  • ML Model Deployment
  • MLOps
  • AI as a Service
  • AI-based Systems
Session 9 - The Transformer Architecture

Many Thanks!

chc79@cam.ac.uk
