Neural Networks:
Neural networks extend the perceptron by introducing multiple layers of interconnected neurons (i.e., the multi-layer perceptron), enabling the learning of complex, non-linear relationships in data.
Training is costly but inference is cheap.
Key Components: neurons (units), weights and biases, activation functions, and layers (input, hidden, and output).
Feedforward Neural Network:
The overall network function combines these stages. For a sigmoidal output-unit activation function, it takes the form:
    y_k(x, w) = σ( Σ_j w_kj⁽²⁾ h( Σ_i w_ji⁽¹⁾ x_i + w_j0⁽¹⁾ ) + w_k0⁽²⁾ )
where h is the hidden-unit activation function and σ is the sigmoid.
The bias parameters can be absorbed into the weights by defining an additional input x_0 = 1 (and hidden output z_0 = 1):
    y_k(x, w) = σ( Σ_j w_kj⁽²⁾ h( Σ_i w_ji⁽¹⁾ x_i ) )
Sigmoid Function: σ(x) = 1 / (1 + e^(-x))
Range: (0, 1)
Derivative: σ'(x) = σ(x)(1 − σ(x))
Hyperbolic Tangent: tanh(x) = (e^x − e^(-x)) / (e^x + e^(-x))
Range: (−1, 1)
Derivative: tanh'(x) = 1 − tanh²(x)
ReLU (Rectified Linear Unit): ReLU(x) = max(0, x)
Range: [0, ∞)
Derivative: 0 for x < 0 and 1 for x > 0 (undefined at x = 0)
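As a concrete reference, here is a minimal NumPy sketch of these three activation functions and their derivatives (the function names and test points are just illustrative):

    import numpy as np

    def sigmoid(x):
        # σ(x) = 1 / (1 + e^(-x)); output in (0, 1)
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_deriv(x):
        s = sigmoid(x)
        return s * (1.0 - s)

    def tanh_deriv(x):
        # d/dx tanh(x) = 1 - tanh(x)^2; tanh output in (-1, 1)
        return 1.0 - np.tanh(x) ** 2

    def relu(x):
        # ReLU(x) = max(0, x); output in [0, ∞)
        return np.maximum(0.0, x)

    def relu_deriv(x):
        # 0 for x < 0, 1 for x > 0 (we choose 0 at x = 0)
        return (x > 0).astype(float)

    x = np.linspace(-3.0, 3.0, 7)
    print(sigmoid(x), sigmoid_deriv(x))
    print(np.tanh(x), tanh_deriv(x))
    print(relu(x), relu_deriv(x))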
Training Process:
Given a training set of N example input-output pairs (x_1, y_1), (x_2, y_2), ..., (x_N, y_N),
where each pair was generated by an unknown function y = f(x),
we want to find a hypothesis h that approximates the true function f.
We choose any starting point, compute an estimate of the gradient, and move a small amount in the steepest downhill direction, repeating until we converge on a point in weight space with (locally) minimal loss.
Gradient Descent Algorithm:
    Initialise w
    repeat
        for each w[i] in w:
            Compute gradient: g ← ∂Loss(w) / ∂w[i]
            Update weight: w[i] ← w[i] − α * g
    until convergence
The size of the step is given by the parameter α, which regulates the behaviour of the gradient descent algorithm. It is a hyperparameter of the model we are training, usually called the learning rate.
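As a minimal illustration, the following NumPy sketch applies this procedure to a squared-error loss for a toy linear model (the data, learning rate, and number of steps are made up for the example):

    import numpy as np

    # Toy data: y ≈ 3x + 1 plus a little noise
    rng = np.random.default_rng(0)
    x = rng.uniform(-1.0, 1.0, size=100)
    y = 3.0 * x + 1.0 + 0.1 * rng.normal(size=100)

    w = np.zeros(2)   # w[0] = slope, w[1] = intercept
    alpha = 0.1       # learning rate

    for step in range(500):
        error = (w[0] * x + w[1]) - y
        # Partial derivatives of the (halved) mean squared error loss
        g = np.array([np.mean(error * x), np.mean(error)])
        w = w - alpha * g   # small step in the steepest downhill direction

    print(w)  # should approach [3, 1]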
Error Backpropagation Algorithm:
1. Forward Pass: Compute all activations and outputs for an input vector.
2. Error Evaluation: Evaluate the error for each output unit using: δ_k = y_k − t_k
3. Backward Pass: Backpropagate the errors to each hidden unit in the network using: δ_j = h'(a_j) Σ_k w_kj δ_k
4. Derivatives Evaluation: Evaluate the derivative for each parameter using: ∂E/∂w_ji = δ_j z_i
Gradient Descent Update Rule: w_ji ← w_ji − α ∂E/∂w_ji
where α is the learning rate and z_i is the activation sent along the connection with weight w_ji.
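To make the four steps concrete, here is a small NumPy sketch of a single backpropagation update for a network with one hidden layer and sigmoid activations (the shapes, data, and learning rate are illustrative, and biases are omitted for brevity):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(0)
    x = rng.normal(size=3)           # input vector
    t = np.array([1.0])              # target output
    W1 = rng.normal(size=(4, 3))     # input -> hidden weights
    W2 = rng.normal(size=(1, 4))     # hidden -> output weights
    alpha = 0.1

    # 1. Forward pass: compute all activations and outputs
    z = sigmoid(W1 @ x)              # hidden unit outputs
    y = sigmoid(W2 @ z)              # network output

    # 2. Error evaluation at the output units: δ_k = y_k - t_k
    delta_out = y - t

    # 3. Backward pass: δ_j = h'(a_j) Σ_k w_kj δ_k, with h'(a_j) = z_j (1 - z_j) for the sigmoid
    delta_hidden = z * (1.0 - z) * (W2.T @ delta_out)

    # 4. Derivatives for each parameter: ∂E/∂w_ji = δ_j z_i
    dW2 = np.outer(delta_out, z)
    dW1 = np.outer(delta_hidden, x)

    # Gradient descent update
    W2 -= alpha * dW2
    W1 -= alpha * dW1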
Hyperparameter tuning is the process of optimising the parameters that govern the training process of a machine learning model, such as the learning rate, number of layers, batch size, and number of epochs, to improve its performance and accuracy.
There is no single, straightforward answer for determining the "right" hyperparameter values, but experts tend to follow similar steps.
See more in Karpathy's recipe and Tobin's lecture.
In the process, you should combine the methods discussed previously, such as preprocessing, feature engineering, input normalisation, parameter initialisation, etc.
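As a minimal sketch of how such a search can be automated, the loop below performs a random search over a few hyperparameters; train_and_validate is a hypothetical placeholder standing in for your own training and validation code:

    import random

    search_space = {
        "learning_rate": [1e-1, 1e-2, 1e-3, 1e-4],
        "batch_size": [16, 32, 64],
        "num_layers": [1, 2, 3],
        "num_epochs": [10, 20, 50],
    }

    def train_and_validate(config):
        # Hypothetical placeholder: train a model with `config` and return
        # its validation accuracy. Replace with your own training code.
        return random.random()

    best_config, best_score = None, float("-inf")
    for trial in range(20):
        config = {name: random.choice(values) for name, values in search_space.items()}
        score = train_and_validate(config)
        if score > best_score:
            best_config, best_score = config, score

    print(best_config, best_score)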
Reinforcement learning is a computational approach to learning from interaction with an environment to achieve a goal through trial and error. An agent learns to make decisions by receiving rewards or penalties (i.e., reinforcements) for its actions.
The agent aims to maximise the rewards from its actions. Rewards can be immediate or sparse.
Providing a reward signal is often easier than providing labelled examples (as in supervised learning).
The RL framework is composed of: an agent, an environment, a set of states, a set of actions, and rewards.
The environment is stochastic, meaning that the outcomes of actions taken by the agent in each state are not deterministic.
Markov Decision Process (MDP):
A mathematical framework for modelling sequential decision problems in fully observable, stochastic environments. The outcomes are partly random and partly under the control of a decision maker.
An MDP is a 4-tuple (S, A, P, R), where:
S is the set of states, A is the set of actions, P(s' | s, a) is the transition model, and R(s, a, s') is the reward function.
The transition model describes the outcome of each action in each state. Since the outcome is stochastic, we write P(s' | s, a) for the probability of reaching state s' if action a is taken in state s.
Transitions are Markovian: the probability of reaching s' depends only on the current state s and the action a taken, not on the history of earlier states.
Uncertainty once again brings MDPs closer to reality compared with deterministic approaches.
From every transition the agent receives a reward: R(s, a, s').
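For illustration, a small MDP can be written down directly as Python dictionaries; the two states, two actions, transition probabilities, and rewards below are made up for the example (this representation is reused in the sketches later in this section):

    # A hypothetical 2-state, 2-action MDP
    states = ["s1", "s2"]
    actions = ["stay", "go"]

    # Transition model P(s' | s, a)
    P = {
        ("s1", "stay"): {"s1": 0.9, "s2": 0.1},
        ("s1", "go"):   {"s1": 0.2, "s2": 0.8},
        ("s2", "stay"): {"s1": 0.1, "s2": 0.9},
        ("s2", "go"):   {"s1": 0.7, "s2": 0.3},
    }

    # Reward function R(s, a, s')
    R = {
        ("s1", "stay", "s1"): 0.0, ("s1", "stay", "s2"): 1.0,
        ("s1", "go", "s1"):   0.0, ("s1", "go", "s2"):   1.0,
        ("s2", "stay", "s1"): 0.0, ("s2", "stay", "s2"): 2.0,
        ("s2", "go", "s1"):   0.0, ("s2", "go", "s2"):   2.0,
    }

    gamma = 0.9  # discount factor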
The agent wants to maximise the sum of the rewards it receives (i.e., its utility).
The utility function is defined over sequences of states and actions (environment histories) rather than over individual states.
The solution to this problem is called a policy, which specifies what the agent should do in any state it might reach. A policy is a mapping from states to actions that tells the agent what to do in each state:
Deterministic Policy: π(s) = a, which returns a single action a for state s.
Stochastic Policy: π(a | s), which gives the probability of taking action a in state s.
The quality of a policy in a given state is measured by the expected utility of the possible environment histories generated by that policy. We can compute the utility of a state sequence using additive (discounted) rewards as follows:
    U([s_0, s_1, s_2, ...]) = R(s_0, a_0, s_1) + γ R(s_1, a_1, s_2) + γ² R(s_2, a_2, s_3) + ...
where 0 ≤ γ ≤ 1 is the discount factor.
The expected utility of executing the policy π starting in state s is:
    U^π(s) = E[ Σ_t γ^t R(s_t, a_t, s_{t+1}) ]
where the expectation is taken over the state sequences generated by following π from s.
We can compare policies at a given state using their expected utilities: π is at least as good as π' in state s if U^π(s) ≥ U^π'(s).
The goal is to select the policy π*_s with the highest expected utility: π*_s = argmax_π U^π(s).
The policy π*_s is an optimal policy for state s; with discounted additive rewards the optimal policy turns out to be independent of the starting state, so we simply write π*.
The utility function allows the agent to select actions by using the principle of maximum expected utility. The agent chooses the action that maximises the reward for the next step plus the expected discounted utility of the subsequent state:
    π*(s) = argmax_a Σ_s' P(s' | s, a) [R(s, a, s') + γ U(s')]
The utility of a state is the expected reward for the next transition plus the discounted utility of the next state, assuming that the agent chooses the optimal action. The utility of a state s is given by:
    U(s) = max_a Σ_s' P(s' | s, a) [R(s, a, s') + γ U(s')]
This is called the Bellman Equation, after Richard Bellman (1957).
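Using the small MDP sketched earlier (the states, actions, P, R, and gamma dictionaries), the Bellman equation can be applied repeatedly as an update rule (a Bellman backup) until the utilities stop changing; this is only a minimal illustration of that fixed-point view:

    # Repeatedly apply the Bellman equation as an update, starting from U(s) = 0
    U = {s: 0.0 for s in states}
    for _ in range(100):
        U = {
            s: max(
                sum(p * (R[(s, a, s2)] + gamma * U[s2]) for s2, p in P[(s, a)].items())
                for a in actions
            )
            for s in states
        }
    print(U)  # approximate utilities U(s) for each state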
Another important quantity is the action-utility function or Q-function, which is the expected utility of taking a given action in a given state:
    Q(s, a) = Σ_s' P(s' | s, a) [R(s, a, s') + γ U(s')]
Then we have:
    U(s) = max_a Q(s, a)
The Q-function tells us how good it is to take action a in state s.
The optimal policy can be extracted from the Q-function directly:
    π*(s) = argmax_a Q(s, a)
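As a tiny illustration, given a table of Q-values (the numbers below are made up), both U(s) and the greedy policy π*(s) fall out directly:

    states, actions = ["s1", "s2"], ["stay", "go"]

    # Hypothetical Q-values for two states and two actions
    Q = {
        ("s1", "stay"): 4.9, ("s1", "go"): 5.3,
        ("s2", "stay"): 6.1, ("s2", "go"): 5.8,
    }

    U = {s: max(Q[(s, a)] for a in actions) for s in states}           # U(s) = max_a Q(s, a)
    pi = {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}    # π*(s) = argmax_a Q(s, a)
    print(U)   # {'s1': 5.3, 's2': 6.1}
    print(pi)  # {'s1': 'go', 's2': 'stay'}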
Model-Based RL Agent: uses a model of the environment (the transition and reward functions) to compute a policy, e.g., Policy Iteration.
Model-Free RL Agent: learns a value function or policy directly from experience, without learning a model of the environment, e.g., Q-learning.
Policy Iteration is a model-based reinforcement learning algorithm that alternates between policy evaluation and policy improvement to find the optimal policy through iterative refinement.
Policy Iteration consists of two main phases that alternate until convergence:
1. Policy Evaluation: Compute the value function for the current policy:
    V^π(s) = Σ_s' P(s' | s, π(s)) [R(s, π(s), s') + γ V^π(s')]
2. Policy Improvement: Update the policy to be greedy with respect to the current value function:
    π(s) ← argmax_a Σ_s' P(s' | s, a) [R(s, a, s') + γ V^π(s')]
The algorithm iteratively improves the policy by alternating between evaluation and improvement steps until it converges to the optimal policy.
The optimal policy maximises the expected cumulative reward by selecting actions that lead to the highest Q-values.
Policy Iteration guarantees convergence to the optimal policy through the principle of policy improvement.
Policy Iteration Algorithm:
    Initialise policy π randomly
    Initialise V(s) ← 0 for all s
    Repeat until convergence:
        // Policy Evaluation
        Repeat until convergence:
            For each state s:
                V(s) ← Σ_s' P(s'|s,π(s)) [R(s,π(s),s') + γ V(s')]
        // Policy Improvement
        policy_stable ← true
        For each state s:
            old_action ← π(s)
            π(s) ← argmax_a Σ_s' P(s'|s,a) [R(s,a,s') + γ V(s')]
            If old_action ≠ π(s):
                policy_stable ← false
        If policy_stable:
            break
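The pseudocode above maps fairly directly onto Python. The sketch below reuses the states, actions, P, R, and gamma dictionaries from the MDP example earlier in this section, and approximates the inner "repeat until convergence" of policy evaluation with a fixed number of sweeps:

    def policy_evaluation(pi, V, n_sweeps=50):
        # Repeatedly apply the Bellman equation for the fixed policy pi
        for _ in range(n_sweeps):
            V = {
                s: sum(p * (R[(s, pi[s], s2)] + gamma * V[s2])
                       for s2, p in P[(s, pi[s])].items())
                for s in states
            }
        return V

    def policy_iteration():
        pi = {s: actions[0] for s in states}   # arbitrary initial policy
        V = {s: 0.0 for s in states}
        while True:
            V = policy_evaluation(pi, V)
            policy_stable = True
            for s in states:
                old_action = pi[s]
                # Greedy improvement with respect to the current value function
                pi[s] = max(
                    actions,
                    key=lambda a: sum(p * (R[(s, a, s2)] + gamma * V[s2])
                                      for s2, p in P[(s, a)].items()),
                )
                if pi[s] != old_action:
                    policy_stable = False
            if policy_stable:
                return pi, V

    pi_star, V_star = policy_iteration()
    print(pi_star, V_star)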
Convergence Properties: for a finite MDP, Policy Iteration converges to the optimal policy in a finite number of iterations, since there are only finitely many policies and each improvement step yields a policy that is at least as good as the previous one.
Policy Iteration Limitations: it requires a known transition model and reward function, and every iteration sweeps over the entire state space, which makes it impractical for large or continuous problems.
Q-learning is a model-free reinforcement learning algorithm that learns the optimal action-value function directly from experience by interacting with the environment.
It finds the optimal policy by learning the Q-function: π*(s) = argmax_a Q(s, a).
At each step, the agent updates its estimate of the Q-function for a state and action using the observed reward and the maximum estimated value of the next state. The update rule is defined as follows:
    Q(s, a) ← Q(s, a) + α [R(s, a, s') + γ max_a' Q(s', a') − Q(s, a)]
where:
α is the learning rate, γ is the discount factor, s' is the observed next state, and R(s, a, s') is the observed reward.
Q-Learning Algorithm:
    Initialise Q(s, a) arbitrarily for all s, a
    Repeat (for each episode):
        Initialise state s
        Repeat (for each step of episode):
            a ← π(s)                        // e.g., ε-greedy with respect to Q
            Take action a, observe next state s'
            r ← R(s, a, s')
            Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)]
            s ← s'
        until s is terminal
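A minimal tabular implementation of this loop is sketched below; it assumes a simple environment object with reset() and step(action) methods (a hypothetical interface used only for illustration) and uses the ε-greedy policy discussed just after:

    import random
    from collections import defaultdict

    def epsilon_greedy(Q, s, actions, epsilon):
        # Explore with probability epsilon, otherwise exploit the best known action
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
        Q = defaultdict(float)                  # Q(s, a), initialised to 0
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                a = epsilon_greedy(Q, s, actions, epsilon)
                s_next, r, done = env.step(a)   # assumed environment interface
                if done:
                    target = r                  # no future value from a terminal state
                else:
                    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                s = s_next
        return Q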
The agent must balance exploring new actions to discover their value and exploiting known actions to maximize reward.
ε-Greedy Policy: with probability ε the agent selects a random action (exploration), and with probability 1 − ε it selects the action with the highest estimated Q-value, argmax_a Q(s, a) (exploitation).
Q-Learning Limitations: it stores a separate Q-value for every state-action pair, so it does not scale to large or continuous state spaces, and it can require many interactions with the environment to converge.
Deep Q-Networks (DQN) extend Q-learning to environments with large or continuous state spaces. They replace the Q-table with neural networks that approximate the Q-function:
A main network approximates Q(s, a; w), while a target network with weights w⁻ is used to compute the training targets and is only updated periodically, which stabilises learning.
DQNs store past experiences (s, a, r, s') in a replay buffer and train on random mini-batches sampled from it, which breaks the correlations between consecutive updates.
DQN Algorithm:
    Initialise replay buffer D
    Initialise Q-network with random weights w
    Initialise target Q-network with weights w⁻ = w
    Repeat (for each episode):
        Initialise state s
        Repeat (for each step of episode):
            a ← π(s)                        // e.g., ε-greedy with respect to Q(s, ·; w)
            Take action a, observe next state s'
            r ← R(s, a, s')
            Store (s, a, r, s') in D
            For each (s_j, a_j, r_j, s'_j) in mini_batch(D):
                y_j ← r_j + γ max_a' Q(s'_j, a'; w⁻)    // y_j ← r_j if s'_j is terminal
                w ← gradient_descent((y_j − Q(s_j, a_j; w))²)
            Every C steps, update w⁻ ← w
            s ← s'
        until s is terminal
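The sketch below shows the core of this update in PyTorch: a small Q-network, a replay buffer, and one training step that computes the targets with the target network. The state and action dimensions, network size, and hyperparameters are placeholders, and the interaction loop that fills the buffer with (s, a, r, s', done) tuples is omitted:

    import random
    from collections import deque

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class QNetwork(nn.Module):
        def __init__(self, state_dim, n_actions):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, 64), nn.ReLU(),
                nn.Linear(64, n_actions),
            )

        def forward(self, s):
            return self.net(s)   # Q(s, ·; w) for all actions

    state_dim, n_actions, gamma = 4, 2, 0.99
    q_net = QNetwork(state_dim, n_actions)
    target_net = QNetwork(state_dim, n_actions)
    target_net.load_state_dict(q_net.state_dict())             # w⁻ = w
    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
    replay_buffer = deque(maxlen=10_000)                        # (s, a, r, s', done) tuples

    def train_step(batch_size=32):
        if len(replay_buffer) < batch_size:
            return
        batch = random.sample(replay_buffer, batch_size)
        s, a, r, s2, done = map(torch.tensor, zip(*batch))
        # Q(s, a; w) for the actions actually taken
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            # y = r + γ max_a' Q(s', a'; w⁻), with y = r at terminal states
            max_next = target_net(s2).max(dim=1).values
            y = r + gamma * max_next * (1.0 - done.float())
        loss = F.mse_loss(q_sa, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Every C environment steps: target_net.load_state_dict(q_net.state_dict())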
Summary:
Policy Iteration: learns a value function; model-based with guaranteed convergence for finite and discrete problems.
Q-Learning: learns a Q-function; model-free and simple for small and discrete problems.
DQN: learns a neural network approximation of the Q-function; model-free and complex for large and continuous problems.
AlphaGo Zero combines key RL concepts: self-play, deep neural networks that estimate the policy and value, and Monte Carlo Tree Search.