Session 8 - Reinforcement Learning

Reinforcement Learning

Christian Cabrera Jojoa

Senior Research Associate and Affiliated Lecturer

Department of Computer Science and Technology

University of Cambridge

chc79@cam.ac.uk

Last Time

ML Definition

The Data Science Process

Data Science Process

Machine Learning Pipeline

Data Assess Pipeline

The Perceptron

Perceptron Overview - Andreas Maier, CC BY 4.0, via Wikimedia Commons

Neural Networks

Neural networks extend the perceptron by introducing multiple layers of interconnected neurons (i.e., multi-layer perceptron), enabling the learning of complex, non-linear relationships in data.


Neural Networks

Neural Network
Neural Network Architecture.

Neural Networks:


  • Fix the number of basis functions in advance
  • Basis functions are adaptive and their parameters can be updated during training

Training is costly but inference is cheap.


Neural Networks

Neural Network
Neural Network Architecture.

Key Components:

  • Input Layer: Receives the input features
  • Hidden Layers: Process information through weighted connections
  • Output Layer: Produces the final prediction
  • Activation Functions: Introduce non-linearity
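To make these components concrete, here is a minimal sketch (using PyTorch as an illustrative choice; the layer sizes are assumptions, not from the slides) of a network with an input layer, one hidden layer, a non-linear activation, and an output layer:

import torch
import torch.nn as nn

# A small feedforward network: 4 input features, one hidden layer of 16
# units with a non-linear activation, and 3 output units.
model = nn.Sequential(
    nn.Linear(4, 16),   # input layer -> hidden layer (weighted connections)
    nn.ReLU(),          # activation function: introduces non-linearity
    nn.Linear(16, 3),   # hidden layer -> output layer (final prediction)
)

x = torch.randn(8, 4)   # a batch of 8 example inputs
y = model(x)            # forward pass
print(y.shape)          # torch.Size([8, 3])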

Neural Networks

Feedforward Neural Network:

The overall network function combines these stages. For sigmoidal output-unit activation functions, it takes the form:

$$y_k(\mathbf{x}, \mathbf{w}) = \sigma\left(\sum_{j=1}^{M} w_{kj}^{(2)}\, h\left(\sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}\right) + w_{k0}^{(2)}\right)$$

The bias parameters can be absorbed into the weights by introducing an extra input x₀ = 1:

$$y_k(\mathbf{x}, \mathbf{w}) = \sigma\left(\sum_{j=0}^{M} w_{kj}^{(2)}\, h\left(\sum_{i=0}^{D} w_{ji}^{(1)} x_i\right)\right)$$

The activation functions h(·) and σ(·) are continuous functions, so the neural network is differentiable with respect to the parameters w.

Two-layer Neural Network - (Bishop, 2006).

Neural Networks

Sigmoid Function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Range: (0, 1)

Derivative: σ′(x) = σ(x)(1 − σ(x))

Logistic Curve - Qef, Public domain, via Wikimedia Commons.

Hyperbolic Tangent:

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

Range: (−1, 1)

Derivative: tanh′(x) = 1 − tanh²(x)

Hyperbolic Tangent - Geek3, CC BY-SA 3.0, via Wikimedia Commons.

ReLU (Rectified Linear Unit):

$$\mathrm{ReLU}(x) = \max(0, x)$$

Range: [0, ∞)

Derivative: 0 for x < 0 and 1 for x > 0 (undefined at x = 0)

Ramp Function - Qef, Public domain, via Wikimedia Commons.
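As a quick illustration of these definitions and their derivatives, here is a minimal NumPy sketch (the function names are mine, not from the slides):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # σ'(x) = σ(x)(1 − σ(x))

def tanh_deriv(x):
    return 1.0 - np.tanh(x) ** 2  # tanh'(x) = 1 − tanh²(x)

def relu(x):
    return np.maximum(0.0, x)

def relu_deriv(x):
    return (x > 0).astype(float)  # 0 for x < 0, 1 for x > 0

x = np.linspace(-3, 3, 7)
print(sigmoid(x), np.tanh(x), relu(x), sep="\n")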

Neural Networks

Training Process:

Given a training set of N example input-output pairs:

$$(\mathbf{x}_1, \mathbf{y}_1), (\mathbf{x}_2, \mathbf{y}_2), \ldots, (\mathbf{x}_N, \mathbf{y}_N)$$

Each pair was generated by an unknown function f:

$$\mathbf{y} = f(\mathbf{x})$$

We want to find a hypothesis h, with parameters w, that minimises the error function, e.g. the sum-of-squares error:

$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left\| h(\mathbf{x}_n; \mathbf{w}) - \mathbf{y}_n \right\|^2$$

Two-layer Neural Network - (Bishop, 2006).

Neural Networks

Gradient Descent Algorithm
Gradient Descent Algorithm - Jacopo Bertolotti, CC0, via Wikimedia Commons.

We choose any starting point, compute an estimate of the gradient, and move a small amount in the steepest downhill direction, repeating until we converge on a point in weight space with (locally) minimal loss.


Gradient Descent Algorithm:
Initialise w
repeat
    for each w[i] in w
        Compute gradient: g = ∇Loss(w[i])
        Update weight:   w[i] = w[i] - α * g
until convergence

The size of each step is given by the parameter α, which regulates the behaviour of the gradient descent algorithm. It is a hyperparameter of the model we are training, usually called the learning rate.
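A minimal, self-contained Python sketch of this update loop; the quadratic loss below is just an illustrative stand-in:

import numpy as np

def loss(w):
    # Illustrative convex loss with minimum at w = (2, -1)
    return (w[0] - 2.0) ** 2 + (w[1] + 1.0) ** 2

def grad(w):
    # Analytic gradient of the loss above
    return np.array([2.0 * (w[0] - 2.0), 2.0 * (w[1] + 1.0)])

alpha = 0.1                   # learning rate
w = np.array([5.0, 5.0])      # any starting point
for step in range(200):
    g = grad(w)               # estimate of the gradient
    w = w - alpha * g         # move a small amount downhill
    if np.linalg.norm(g) < 1e-6:
        break                 # convergence

print(w)  # ≈ [ 2. -1.]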


Neural Networks

Two-layer Neural Network - (Bishop, 2006).

Error Backpropagation Algorithm:

1. Forward Pass: Compute all activations and outputs for an input vector.

2. Error Evaluation: Evaluate the error for all the output units using:

$$\delta_k = y_k - t_k$$

where y_k is the network output and t_k the corresponding target.

3. Backward Pass: Backpropagate the errors to each hidden unit in the network using:

$$\delta_j = h'(a_j) \sum_{k} w_{kj}\, \delta_k$$

4. Derivatives Evaluation: Evaluate the derivatives of the error with respect to each parameter using:

$$\frac{\partial E_n}{\partial w_{ji}} = \delta_j z_i$$
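A compact NumPy sketch of these four steps for a two-layer network with tanh hidden units and sigmoidal outputs paired with cross-entropy error (so that δ_k = y_k − t_k); biases are omitted and the sizes are illustrative:

import numpy as np

rng = np.random.default_rng(0)
D, M, K = 3, 5, 2                        # input, hidden, output sizes
W1 = rng.normal(scale=0.5, size=(M, D))  # first-layer weights
W2 = rng.normal(scale=0.5, size=(K, M))  # second-layer weights

x = rng.normal(size=D)                   # one input vector
t = np.array([1.0, 0.0])                 # its target

# 1. Forward pass: compute all activations and outputs
a1 = W1 @ x                              # hidden pre-activations
z = np.tanh(a1)                          # hidden activations h(a)
a2 = W2 @ z                              # output pre-activations
y = 1.0 / (1.0 + np.exp(-a2))            # sigmoidal outputs

# 2. Error evaluation at the outputs
delta_k = y - t                          # δ_k = y_k − t_k

# 3. Backward pass to the hidden units
delta_j = (1.0 - z ** 2) * (W2.T @ delta_k)   # δ_j = h'(a_j) Σ_k w_kj δ_k

# 4. Derivatives with respect to each parameter
dW2 = np.outer(delta_k, z)               # ∂E/∂w_kj = δ_k z_j
dW1 = np.outer(delta_j, x)               # ∂E/∂w_ji = δ_j x_i

# One gradient-descent update with learning rate α
alpha = 0.1
W1 -= alpha * dW1
W2 -= alpha * dW2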


Neural Networks

Two-layer Neural Network
Two-layer Neural Network - (Bishop, 2006).

Gradient Descent Update Rule:

$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \alpha\, \nabla E\!\left(\mathbf{w}^{(\tau)}\right)$$

where α is the learning rate and τ indexes the iteration.

Deep Learning

Deep Neural Network
Deep Neural Network with multiple hidden layers - QuantuMechaniX8, CC0, via Wikimedia Commons

Hyperparameter Tuning

Hyperparameter tuning is the process of optimising the parameters that govern the training of a machine learning model, such as the learning rate, number of layers, batch size, and number of epochs, to improve its performance and accuracy.


Hyperparameter Tuning

Deep Neural Network
Deep Neural Network with multiple hidden layers - QuantuMechaniX8, CC0, via Wikimedia Commons

There is no single recipe for choosing the "right" hyperparameter values, but experts tend to follow similar steps:

  1. Become one with the data
  2. Set up the end-to-end training/evaluation skeleton
  3. Start with a simple model: for most tasks, a fully-connected neural network with one hidden layer
  4. Implement a model complex enough to overfit, then regularise
  5. Tune the hyperparameters (see the sketch below)
  6. Continue training

See more in Karpathy's recipe and Tobin's lecture.
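A minimal sketch of step 5 using random search over a few hyperparameters; the search space values are illustrative, and `train_and_evaluate` stands in for the end-to-end skeleton from step 2:

import random

# Hypothetical search space; the names and values are illustrative only.
search_space = {
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3, 1e-2],
    "num_layers":    [1, 2, 3],
    "batch_size":    [32, 64, 128],
    "num_epochs":    [10, 20, 50],
}

def random_search(train_and_evaluate, n_trials=20, seed=0):
    # `train_and_evaluate(config)` is the training/evaluation skeleton:
    # it trains a model with the given hyperparameters and returns a
    # validation score (higher is better).
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = {name: rng.choice(values) for name, values in search_space.items()}
        score = train_and_evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score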


Hyperparameter Tuning

In the process, you should combine the methods discussed previously, such as preprocessing, feature engineering, input normalisation, and parameter initialisation.

Data Assess Pipeline

Reinforcement Learning


Reinforcement Learning

Reinforcement learning is a computational approach to learning from interaction with an environment to achieve a goal through trial and error. An agent learns to make decisions by receiving rewards or penalties (i.e., reinforcements) for its actions.


Reinforcement Learning

Search Agent Uninformed
Agent in an uninformed search problem.

The agent aims to maximise the rewards from its actions. Rewards can be immediate or sparse.

Providing a reward signal is easier than providing labelled examples (i.e., supervised learning).

RL Framework
Reinforcement Learning Framework - Megajuice, CC0, via Wikimedia Commons.

Reinforcement Learning

RL Framework
Reinforcement Learning Framework - Megajuice, CC0, via Wikimedia Commons.

The RL framework is composed of:


  • Agent: The learner and decision maker
  • Environment: The world in which the agent operates
  • State: Current situation of the environment
  • Action: What the agent can do
  • Reward: Feedback from the environment

The environment is stochastic, meaning that the outcomes of actions taken by the agent in each state are not deterministic.


Reinforcement Learning

MDP Process
Markov Decision Process - waldoalvarez, CC BY-SA 4.0 , via Wikimedia Commons.

Markov Decision Process (MDP):

A mathematical framework for modelling sequential decision problems in fully observable, stochastic environments. The outcomes are partly random and partly under the control of a decision maker.


Reinforcement Learning

RL Framework
Reinforcement Learning Framework - Megajuice, CC0, via Wikimedia Commons.

An MDP is a 4-tuple:

$$(S, A, P, R)$$

Where:

S is a set of states, with initial state s₀
A(s) is the set of actions available in each state s
P(s′ | s, a) is a transition model that gives the probability of reaching state s′ if the agent is in state s and performs action a
R(s, a, s′) is the reward function that gives the reward for every transition from s to s′ through action a
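As a concrete illustration, a small MDP can be written down directly as Python dictionaries; the states, actions, and numbers below are made up for the example:

# Toy 2-state MDP (all values are illustrative only)
states = ["s0", "s1"]
actions = {"s0": ["stay", "go"], "s1": ["stay", "go"]}

# Transition model P[s][a] = {s': probability of reaching s'}
P = {
    "s0": {"stay": {"s0": 0.9, "s1": 0.1}, "go": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s0": 0.1, "s1": 0.9}, "go": {"s0": 0.7, "s1": 0.3}},
}

# Reward function R[s][a][s'] = reward for the transition s --a--> s'
R = {
    "s0": {"stay": {"s0": 0.0, "s1": 1.0}, "go": {"s0": 0.0, "s1": 2.0}},
    "s1": {"stay": {"s0": -1.0, "s1": 0.5}, "go": {"s0": 0.0, "s1": 0.0}},
}

gamma = 0.9  # discount factor (introduced on a later slide)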


Reinforcement Learning


The transition model describes the outcome of each action in each state. Since the outcome is stochastic, we write:

$$P(s' \mid s, a)$$

the probability of reaching state s′ if action a is taken in state s.

Transitions are Markovian: the probability of reaching s′ from s depends only on s and the action a, not on the history of earlier states.

Uncertainty once again brings MDPs closer to reality when compared against deterministic approaches.

MDP Process
Markov Decision Process - waldoalvarez, CC BY-SA 4.0 , via Wikimedia Commons.

Reinforcement Learning


From every transition the agent receives a reward:

$$r = R(s, a, s')$$

The agent wants to maximise the sum of the received rewards (i.e., the utility function):

$$U([s_0, a_0, s_1, a_1, s_2, \ldots]) = R(s_0, a_0, s_1) + R(s_1, a_1, s_2) + R(s_2, a_2, s_3) + \cdots$$

The utility function depends on a sequence of states and actions, called the environment history.

MDP Process
Markov Decision Process - waldoalvarez, CC BY-SA 4.0 , via Wikimedia Commons.

Reinforcement Learning


The solution to this problem is called a policy, which specifies what the agent should do in any state it might reach. A policy is a mapping from states to actions that tells the agent what to do in each state:

$$\pi : S \rightarrow A$$

Deterministic Policy:

$$a = \pi(s)$$

Stochastic Policy:

$$a \sim \pi(a \mid s)$$

where π(a | s) is the probability of taking action a in state s.

MDP Process
Markov Decision Process - waldoalvarez, CC BY-SA 4.0 , via Wikimedia Commons.

Reinforcement Learning


The quality of a policy in a given state is measured by the expected utility of the possible environment histories generated by that policy. We can compute the utility of state sequences using additive (discounted) rewards as follows:

$$U([s_0, a_0, s_1, \ldots]) = R(s_0, a_0, s_1) + \gamma R(s_1, a_1, s_2) + \gamma^2 R(s_2, a_2, s_3) + \cdots$$

where γ ∈ [0, 1] is the discount factor that determines the importance of future rewards.

The expected utility of executing the policy π, starting in state s, is given by:

$$U^{\pi}(s) = E\!\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t, s_{t+1})\right]$$

The expectation E is with respect to the probability distribution over state sequences determined by s and π.
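A tiny sketch of the discounted sum for a concrete (made-up) reward sequence:

# Discounted return: r0 + γ·r1 + γ²·r2 + ...
def discounted_return(rewards, gamma=0.9):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 0.0, 5.0], gamma=0.9))  # 1.0 + 0.9³·5.0 = 4.645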


Reinforcement Learning


We can compare policies at a given state using their expected utilities:

$$U^{\pi}(s) = E\!\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t, s_{t+1})\right]$$

where the expectation is with respect to the probability distribution over state sequences determined by s and π.

The goal is to select the policy that maximises the expected reward:

$$\pi^{*}_{s} = \operatorname*{argmax}_{\pi} U^{\pi}(s)$$

The policy π*ₛ recommends an action for every state in the sequence starting in state s.


Reinforcement Learning


The utility function allows the agent to select actions using the principle of maximum expected utility. The agent chooses the action that maximises the reward for the next step plus the expected discounted utility of the subsequent state:

$$\pi^{*}(s) = \operatorname*{argmax}_{a} \sum_{s'} P(s' \mid s, a)\,\big[R(s, a, s') + \gamma\, U(s')\big]$$

The utility of a state is the expected reward for the next transition plus the discounted utility of the next state, assuming that the agent chooses the optimal action. The utility of a state s is given by:

$$U(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\big[R(s, a, s') + \gamma\, U(s')\big]$$

This is called the Bellman Equation, after Richard Bellman (1957).

Reinforcement Learning


Another important quantity is the action-utility function or Q-function, which is the expected utility of taking a given action in a given state:

$$Q(s, a) = \sum_{s'} P(s' \mid s, a)\,\big[R(s, a, s') + \gamma\, U(s')\big]$$

Q(s, a) relates to the utility function as follows:

$$U(s) = \max_{a} Q(s, a)$$

Then we have the Bellman equation for the Q-function:

$$Q(s, a) = \sum_{s'} P(s' \mid s, a)\,\big[R(s, a, s') + \gamma \max_{a'} Q(s', a')\big]$$

The Q-function tells us how good it is to take action a in state s.

The optimal policy can be extracted from Q as follows:

$$\pi^{*}(s) = \operatorname*{argmax}_{a} Q(s, a)$$


Reinforcement Learning

RL Framework
Reinforcement Learning Framework - Megajuice, CC0, via Wikimedia Commons.

Model-Based RL Agent:

  • Knows the transition model and reward function
  • Can simulate outcomes before taking actions
  • Examples: Value Iteration, Policy Iteration

Model-Free RL Agent:

  • Does not know the transition model or reward function
  • Cannot simulate outcomes
  • Examples: Q-Learning, DQN


Reinforcement Learning



Policy Iteration is a model-based reinforcement learning algorithm that alternates between policy evaluation and policy improvement to find the optimal policy.

Finds the optimal policy through iterative refinement:

  • Requires a model of the environment (transition and reward functions)
  • Guaranteed to converge to the optimal policy
  • Works with finite state and action spaces
Stochastic Matrix
Stochastic Matrix.

Reinforcement Learning



Policy Iteration consists of two main phases that alternate until convergence:


1. Policy Evaluation: Compute the value function for the current policy π:

$$V^{\pi}(s) = \sum_{s'} P(s' \mid s, \pi(s))\,\big[R(s, \pi(s), s') + \gamma\, V^{\pi}(s')\big]$$

2. Policy Improvement: Update the policy to be greedy with respect to the current value function:

$$\pi(s) \leftarrow \operatorname*{argmax}_{a} \sum_{s'} P(s' \mid s, a)\,\big[R(s, a, s') + \gamma\, V^{\pi}(s')\big]$$


Reinforcement Learning

Policy Iteration Algorithm:

The algorithm iteratively improves the policy by alternating between evaluation and improvement steps until convergence to the optimal policy.



The optimal policy maximises the expected cumulative reward by selecting actions that lead to the highest Q-values.

Policy iteration guarantees convergence to the optimal policy through the principle of policy improvement.

MDP Process
Markov Decision Process - waldoalvarez, CC BY-SA 4.0, via Wikimedia Commons.

Reinforcement Learning

Policy Iteration Algorithm:
Initialise policy π randomly
Repeat until convergence:
    // Policy Evaluation
    Repeat until convergence:
        For each state s:
            V(s) = Σ P(s'|s,π(s)) [R(s,π(s),s') + γV(s')]
    // Policy Improvement
    policy_stable = true
    For each state s:
        old_action = π(s)
        π(s) = argmaxₐ Σ P(s'|s,a) [R(s,a,s') + γV(s')]
        If old_action ≠ π(s):
            policy_stable = false
    If policy_stable:
        break

Convergence Properties:

  • Monotonic Improvement: Each iteration improves the policy
  • Finite Convergence: Guaranteed to converge in finite steps
  • Optimal Policy: Converges to the optimal policy π*
  • Bellman Optimality: Final policy satisfies Bellman optimality equations
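A minimal Python sketch of this algorithm, reusing the dictionary representation of an MDP sketched earlier (the arguments states, actions, P, R, gamma are assumed to be given in that form):

def policy_iteration(states, actions, P, R, gamma=0.9, eval_tol=1e-8):
    # Start from an arbitrary deterministic policy and zero values
    policy = {s: actions[s][0] for s in states}
    V = {s: 0.0 for s in states}

    def backup(s, a):
        # One-step lookahead: Σ_s' P(s'|s,a) [R(s,a,s') + γ V(s')]
        return sum(p * (R[s][a][s2] + gamma * V[s2]) for s2, p in P[s][a].items())

    while True:
        # Policy Evaluation: iterate the expectation backup until convergence
        while True:
            delta = 0.0
            for s in states:
                v_new = backup(s, policy[s])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < eval_tol:
                break
        # Policy Improvement: act greedily with respect to V
        policy_stable = True
        for s in states:
            best = max(actions[s], key=lambda a: backup(s, a))
            if best != policy[s]:
                policy[s] = best
                policy_stable = False
        if policy_stable:
            return policy, V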

Reinforcement Learning


Policy Iteration Limitations:

  • Model Dependency: Requires complete knowledge of the environment's transition and reward functions.
  • Computational Cost: Policy evaluation can be expensive for large state spaces, requiring iterative computation.
  • Discrete Spaces: Designed for finite state and action spaces, not suitable for continuous environments.
  • Memory Requirements: Needs to store value functions and policies for all states.
Stochastic Matrix
Stochastic Matrix.

Reinforcement Learning

Q-learning is a model-free reinforcement learning algorithm that learns the optimal action-value function directly from experience by interacting with the environment.

Finds the optimal policy by learning the Q-function:

  • Does not require a model of the environment (transition or reward function)
  • Can be used in stochastic and unknown environments
Q-Matrix
Transition Matrix (Q).

Reinforcement Learning



At each step, the agent updates its estimate of the Q-function for a state and action using the observed reward and the maximum estimated value of the next state. The update rule is defined as follows:

$$Q(s, a) \leftarrow Q(s, a) + \alpha\,\big[R(s, a, s') + \gamma \max_{a'} Q(s', a') - Q(s, a)\big]$$

where:
α is the learning rate
γ is the discount factor
R(s, a, s′) is the observed reward
s′ is the next state after taking action a in s


Reinforcement Learning

Q-Learning Algorithm:
Initialise Q(s, a) arbitrarily for all s, a
Repeat (for each episode):
    Initialise state s
    Repeat (for each step of episode):
        a ← π(s)                  // e.g., ε-greedy with respect to Q
        Take action a, observe next state s'
        r ← R(s, a, s')
        Q(s, a) ← update(Q, s, a, s', r)
        s ← s'
    until s is terminal

The agent must balance exploring new actions to discover their value and exploiting known actions to maximise reward.

ε-Greedy Policy: with probability ε choose a random action from A(s); otherwise (with probability 1 − ε) choose argmaxₐ Q(s, a).

Q-Matrix
Transition Matrix (Q).
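A minimal tabular Q-learning sketch with an ε-greedy policy; it assumes an environment object with reset() and step(action) methods (a Gym-style interface, which is not part of the slides):

import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    # Tabular Q-learning. Assumes env.reset() returns a state and
    # env.step(a) returns (next_state, reward, done).
    Q = defaultdict(float)  # Q[(s, a)], implicitly initialised to 0

    def epsilon_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)              # explore
        return max(actions, key=lambda a: Q[(s, a)])   # exploit

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(s)
            s_next, r, done = env.step(a)
            # Q-learning update: Q(s,a) ← Q(s,a) + α [r + γ max_a' Q(s',a') − Q(s,a)]
            best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q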

Reinforcement Learning


Q-Learning Limitations:

  • Scalability: Q-learning struggles with large state spaces as it requires a Q-value for every state-action pair, leading to high memory usage.
  • Generalisation: Q-learning does not generalise well to unseen states since it relies on a discrete Q-table.
  • Continuous State Spaces: Q-learning is not suitable for environments with continuous state spaces as it requires discretisation, which can lead to loss of information.
  • Sample Inefficiency: Q-learning can be sample-inefficient, requiring many interactions with the environment to learn an optimal policy.
Q-Matrix
Transition Matrix (Q).

Reinforcement Learning

Deep Q-Networks (DQN) extend Q-learning to environments with large or continuous state spaces, replacing the Q-table with a neural network that approximates the Q-function:

$$Q(s, a; \mathbf{w}) \approx Q^{*}(s, a)$$

A main network approximates Q(s, a; w), where w are its trainable parameters. The input of the network is the state and the output is a Q-value for each possible action.

DQNs store past experiences in a replay buffer. During training, a minibatch of experiences is sampled from the buffer, and a separate target network is used to compute the target Q-values during updates.

Deep Neural Network
Deep Neural Network with multiple hidden layers - QuantuMechaniX8, CC0, via Wikimedia Commons

Reinforcement Learning

DQN Algorithm:
Initialise replay buffer D
Initialise Q-network with random weights w
Initialise target Q-network with weights w⁻ = w
Repeat (for each episode):
    Initialise state s
    Repeat (for each step of episode):
        a ← π(s)                  // e.g., ε-greedy with respect to Q(s, ·; w)
        Take action a, observe next state s'
        r ← R(s, a, s')
        Store (s, a, r, s') in D
        Sample a mini_batch of (s, a, r, s') from D:
            y ← r + γ maxₐ' Q(s', a'; w⁻)   for each sampled experience
            w ← gradient_descent((y - Q(s, a; w))² over the mini_batch)
        Every C steps, update w⁻ ← w
        s ← s'
    until s is terminal

ε-Greedy Policy: with probability ε choose a random action; otherwise (with probability 1 − ε) choose argmaxₐ Q(s, a; w).
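A compact PyTorch sketch of the core DQN pieces (main and target networks, replay buffer, ε-greedy action selection, and one minibatch update); the network sizes, hyperparameters, and transition format are illustrative assumptions, not part of the slides:

import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    # Maps a state vector to one Q-value per action.
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x):
        return self.net(x)

state_dim, n_actions = 4, 2                   # illustrative sizes
gamma, epsilon, batch_size = 0.99, 0.1, 32    # illustrative hyperparameters

q_net = QNetwork(state_dim, n_actions)        # main network (weights w)
target_net = QNetwork(state_dim, n_actions)   # target network (weights w⁻)
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay_buffer = deque(maxlen=10_000)          # stores (s, a, r, s', done) tuples

def epsilon_greedy(state):
    if random.random() < epsilon:
        return random.randrange(n_actions)                        # explore
    with torch.no_grad():
        q = q_net(torch.as_tensor(state, dtype=torch.float32))
        return int(torch.argmax(q).item())                        # exploit

def dqn_update():
    # One gradient step on a sampled minibatch, using the target network.
    if len(replay_buffer) < batch_size:
        return
    batch = random.sample(replay_buffer, batch_size)
    s, a, r, s2, done = map(np.asarray, zip(*batch))
    s = torch.as_tensor(s, dtype=torch.float32)
    a = torch.as_tensor(a, dtype=torch.int64).unsqueeze(1)
    r = torch.as_tensor(r, dtype=torch.float32)
    s2 = torch.as_tensor(s2, dtype=torch.float32)
    done = torch.as_tensor(done, dtype=torch.float32)

    q_sa = q_net(s).gather(1, a).squeeze(1)                       # Q(s, a; w)
    with torch.no_grad():                                         # y = r + γ max_a' Q(s', a'; w⁻)
        y = r + gamma * (1.0 - done) * target_net(s2).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, y)                        # (y − Q(s, a; w))²

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Every C steps in the outer loop: target_net.load_state_dict(q_net.state_dict())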


Reinforcement Learning

Policy Iteration:

  • Representation: value function
  • Model-based, with guaranteed convergence for finite and discrete problems

Markov Decision Process - waldoalvarez, CC BY-SA 4.0, via Wikimedia Commons.

Q-Learning:

  • Representation: Q-function
  • Model-free and simple, for small and discrete problems

Transition Matrix (Q).

DQN:

  • Representation: neural network
  • Model-free and complex, for large and continuous problems

Deep Neural Network with multiple hidden layers - QuantuMechaniX8, CC0, via Wikimedia Commons


Reinforcement Learning

AlphaGo Zero
Self-play reinforcement learning in AlphaGo Zero - Silver et al., 2017.

AlphaGo Zero combines key RL concepts:

  • Self-play environment: Agent plays against itself (no human data needed)
  • Policy network: Learns π(s) → probability distribution over actions
  • Value network: Learns V(s) → probability of winning from state s
  • Monte Carlo Tree Search: Uses policy/value to guide search
  • Temporal difference learning: Updates based on game outcomes
  • Experience replay: Stores and learns from self-play games
AI History - Machine Learning Age (2001 - present)

Timeline of AI milestones, 1940s-2020s: from the artificial neuron (McCulloch & Pitts, 1943), the Dartmouth Workshop (1956), and the expert-systems era, through the two AI winters, to deep learning (Hinton, 2006), AlexNet (2012), AlphaGo (2016), the Transformer (2017), and today's large language models (ChatGPT, 2022; LLaMA, 2023; Claude 2, 2023; Gemini 1.5, 2024; DeepSeek R1, 2025).


Conclusions

Overview

  • Hyperparameter Tuning
  • Reinforcement Learning
  • Markov Decision Process
  • Model-based RL
  • Model-free RL

Next Time

  • Attention Architecture
  • Transformers
  • Large Language Models (LLMs)
  • LLM APIs
  • Prompt Engineering
Many Thanks!

chc79@cam.ac.uk