Reinforcement Learning
Your complete interactive study guide covering all lectures, MDPs, Dynamic Programming, Monte Carlo, Temporal Difference Learning, and Deep RL.
Intro to RL
Core concepts, four pillars, reward hypothesis, agent components and categories.
Lecture 1MDPs & Bellman
Markov Decision Processes, value functions, Bellman equations and optimality.
Lecture 2DP & Monte Carlo
Policy iteration, value iteration, and model-free MC prediction and control.
Lecture 3Model-Free Methods
TD Learning, TD(n) multi-step, SARSA, GLIE convergence, MC vs TD trade-offs.
Lecture 4Model-Free Control
Q-Learning, Double Q-Learning, overestimation bias, value function approximation.
Lecture 5VFA & Planning
Gradient descent VFA, DQN full algorithm, Model-Based RL, Dyna-Q, MCTS.
Lecture 6Policy-Based RL
Policy gradients, score function, Actor-Critic, PPO, SAC, TD3 and modern deep RL algorithms.
Lecture 7Quiz
25 questions across all topics. Test your knowledge and prepare for the exam.
PracticeYour Progress
Introduction to Reinforcement Learning
Lecture 1 · Inês Castelhano · April 2026
What is Reinforcement Learning?
RL vs Other Machine Learning Paradigms
| Supervised Learning | Unsupervised Learning | Reinforcement Learning | |
|---|---|---|---|
| Data | Labelled examples | Unlabelled data | Experience / interaction |
| Feedback | Immediate, per sample | None (structure) | Delayed reward signal |
| Goal | Predict / classify | Find patterns | Maximise cumulative reward |
| Time | i.i.d. data | i.i.d. data | Sequential, non i.i.d. |
The Agent–Environment Interaction Loop
Reward Hypothesis
Reward Rt
Immediate feedback at each time step. Scalar signal indicating how well the agent is doing right now.
Return Gt
Total discounted future reward from time t. The agent's goal is to maximise Gt.
Discount Factor γ (Gamma)
γ ∈ [0, 1] controls how much future rewards are valued relative to immediate ones.
The Four Pillars of RL
1. Optimisation
Find an optimal way to make decisions, yielding best or very good outcomes.
2. Delayed Consequences
Decisions now have long-term impact. Early rewards can hinder the future. Credit must be assigned to past actions.
3. Exploration
Learning about the world by making decisions. Only reward is received for the decision made — this shapes what is learned.
4. Generalisation
Policy maps experience to action. The agent must generalise from seen situations to unseen ones.
What Makes RL Different?
Core Agent Components
The agent's core behaviour function — maps states to actions.
Prediction of future reward from a given state. Used to evaluate states.
Agent's internal representation of the environment.
Prediction vs. Control
Prediction
Evaluate the future for a given policy. Estimate the value function to measure how much reward the agent will accumulate.
Control
Optimise the future — find the best policy that maximises expected cumulative reward.
Learning vs. Planning
Learning
Environment is initially unknown. Agent interacts to discover optimal behaviour.
Planning
A model of the environment is given or learned. Agent plans inside the model without further real interaction.
Observability, State & Policy Types
Environment State vs. Agent State
🌍 Environment State
The environment's internal representation — everything about the world. Usually invisible to the agent. Even when visible, it may contain large amounts of irrelevant information.
🤖 Agent State
The agent's own representation of what it needs to make decisions. A function of the history Ht — compact summary of all past observations, actions, and rewards.
History: The Full Record
Contains all past information. Problem: grows unboundedly with time — too large to use directly. The agent state is a compact function of this history.
Fully Observable vs. Partially Observable
👁️ Fully Observable (MDP)
Agent directly observes the environment state: Sta = Ste. The agent has complete information. Formally modelled as a Markov Decision Process (MDP).
- Optimal deterministic policy always exists
- Standard RL algorithms apply directly
🙈 Partially Observable (POMDP)
Agent only receives partial observations Ot ≠ Ste. Must maintain a belief state (probability distribution over possible environment states). Formally a Partially Observable MDP (POMDP).
- Optimal policy may be stochastic
- Requires memory (RNN, belief state)
Policy Types
Deterministic Policy
Maps each state to a single specific action. Simple and easy to analyse. Optimal in fully observable MDPs.
- Used in: DQN (greedy argmax), TD3, DDPG
- Pros: predictable, no variance in action selection
- Cons: requires separate exploration mechanism
Stochastic Policy
Maps each state to a probability distribution over actions. Necessary for POMDPs and game-theoretic settings.
- Used in: PPO, SAC, REINFORCE, A3C
- Pros: built-in exploration, handles partial observability
- Cons: harder to analyse and optimise
Exploration vs. Exploitation
Exploration
Try new actions to discover potentially better strategies. Find more information about the environment.
Exploitation
Use what you already know to collect high rewards right now. Exploit known information to maximise reward.
Interactive: Explore vs. Exploit
Slide to change the exploration rate ε (probability of exploring):
Agent Categories
By Internal Components
| Category | Policy | Value Function | Description |
|---|---|---|---|
| Value-Based | Implicit | ✅ Yes | Learn value function; derive policy greedily (e.g. Q-Learning) |
| Policy-Based | ✅ Yes | ❌ No | Learn policy directly; no explicit value function (e.g. REINFORCE) |
| Actor-Critic | ✅ Yes | ✅ Yes | Learn both — actor updates policy, critic evaluates it |
By Use of a Model
Model-Free
No explicit model of the environment. Learns from direct interaction. Examples: Q-Learning, SARSA, Policy Gradient.
Model-Based
Builds or uses a model of the environment. Can plan ahead. Examples: Dynamic Programming, AlphaGo (MCTS).
Markov Decision Processes
Lecture 2 · Tabular Value-Based RL
The Markov Property
Once St is known, the history Ht is no longer needed. The current state is a sufficient statistic of the future.
Full Observability (MDP)
Agent sees the full environment state. Observation = Environment State = Agent State.
Partial Observability (POMDP)
Agent cannot directly observe environment state. Must construct a Markov state from history.
MDP: The Formal Definition
Delivery Driver Mapping
| MDP Component | Driver Problem |
|---|---|
| S — States | Location, active deliveries, time of day |
| A — Actions | Route choices, next destination |
| P — Transitions | Traffic uncertainty (same route → different outcomes) |
| R — Reward | Earnings from delivery minus delay penalty |
| γ — Discount | How much current delivery matters vs future opportunities |
Episodic vs Continuing Tasks
Episodic Tasks
Break into independent episodes. Each episode has a terminal state. Return is a finite sum.
Continuing Tasks
No natural endpoint. Return is an infinite sum — discount factor is essential to ensure convergence.
Value Functions
State Value Function Vπ(s)
How good is it to be in state s following policy π?
= Expected return starting from state s, following π forever.
Action Value Function Qπ(s,a)
How good is it to take action a in state s following policy π?
= Expected return starting from s, taking a, then following π.
Optimal Value Functions
v*(s)
Maximum value achievable from state s, over all possible policies.
Q*(s,a)
Maximum value of taking action a in state s, over all policies.
Bellman Equations
Bellman Expectation Equations
For Vπ
For Qπ
Bellman Optimality Equations
Optimal V*
Optimal Q*
Breaking Down the Bellman Equation
Frozen Lake — MDP Example
A classic grid-world problem: navigate from Start (S) to Goal (G) without falling into Holes (H).
MDP Components
Dynamic Programming & Monte Carlo
Lecture 3 · Planning by Dynamic Programming
Dynamic Programming
Key assumption: Full knowledge of the environment model (P and R).
Optimal Substructure
The optimal solution can be composed from optimal solutions to sub-problems.
Overlapping Subproblems
Sub-problems recur many times — their cached solutions can be reused.
DP Algorithm Summary
| Algorithm | Bellman Equation Used | Problem Type |
|---|---|---|
| Iterative Policy Evaluation | Expectation Equations | Prediction |
| Policy Iteration | Expectation + Greedy Improvement | Control |
| Value Iteration | Optimality Equations | Control |
Policy Iteration
Policy Evaluation Formula
At each iteration, every state's value is updated using the current estimates of neighbouring states. Repeating this until convergence gives Vπ.
Value Iteration
Policy Iteration
- Full policy evaluation at each step
- Guaranteed convergence to π*
- Slower per iteration, fewer iterations
Value Iteration
- One backup per iteration
- Also converges to optimal policy
- Faster per iteration, more iterations
Asynchronous DP
Standard DP updates all states in parallel (synchronous). Asynchronous DP updates states individually in any order — more efficient in practice.
In-Place DP
Only store one copy of value function — use immediately updated values.
Prioritised Sweeping
Update states with the largest Bellman error first. Maintain a priority queue.
Real-Time DP
Only update states relevant to the current agent trajectory.
DP Limitations
Asynchronous Dynamic Programming
Three Asynchronous DP Ideas
1. In-Place DP
Synchronous value iteration stores two copies of V (current and next). In-place DP stores only one copy and uses newly updated values immediately in the same sweep.
Can converge faster because updated estimates propagate within the same iteration.
2. Prioritised Sweeping
Use the magnitude of the Bellman error to guide which state to update next. Back up the state with the largest remaining error — focus compute where it matters most.
- Requires knowledge of reverse dynamics (predecessor states)
- Implemented efficiently with a priority queue
- Update Bellman error of affected states after each backup
3. Real-Time DP
Only update states that are relevant to the agent. If the agent is in state S, update V(S) or states it expects to visit soon. Ignores irrelevant distant states entirely.
- Interleaves planning with real agent interaction
- Focuses compute on states the agent actually visits
- Natural fit for online RL settings
Full-Width Backups vs Sample Backups
Full-Width Backups (Standard DP)
For each backup: every successor state and action is considered. Uses the true model of transitions and rewards.
- Effective for medium-sized problems (millions of states)
- Exact value updates — no approximation
- Curse of dimensionality: even one backup can be too expensive
- Requires complete knowledge of MDP dynamics
Sample Backups (Model-Free)
Use sampled rewards and sampled transitions instead of the full distribution. Leads naturally to Monte Carlo and TD methods.
- Model-free: no advance knowledge of MDP required
- Breaks curse of dimensionality through sampling
- Cost of backup is constant (one sample)
- Foundation of all model-free RL
DP Limitations
Requires a Perfect Model
DP needs full knowledge of P(s'|s,a) and R(s,a). In real-world problems the model is unknown or only partially known.
Curse of Dimensionality
The state space grows exponentially with the number of state variables. For n variables with k values each: kn states. Quickly becomes intractable.
Full-Width Backups
Each sweep considers every state. For large MDPs a single sweep is prohibitively expensive. Async DP and sample backups are the remedies.
Monte Carlo Methods
Key requirement: MC methods are applied only to episodic tasks — must wait until episode ends.
First-Visit MC
Average returns only from the first occurrence of each state per episode.
S(St) ← S(St) + Gt
V(St) = S(St) / N(St)
Every-Visit MC
Average returns every time the state is visited in an episode.
Both converge to vπ(s) as N(s) → ∞ by the law of large numbers.
Incremental MC Updates
Instead of storing all returns and averaging at the end, update V(s) incrementally after each episode using the running mean:
For non-stationary problems, use a fixed learning rate α instead of 1/N — this gives more weight to recent episodes and "forgets" old ones:
α can be: 1/N(St) (exact mean), a fixed small value (exponential moving average), or even α=1 (pure online update, discard history completely). The error term (Gt − V(St)) is the MC prediction error.
MC Limitations
DP vs Monte Carlo — Comparison
| Property | Dynamic Programming | Monte Carlo |
|---|---|---|
| Model Required? | ✅ Yes (full model) | ❌ No (model-free) |
| Bootstrapping? | ✅ Yes | ❌ No (uses actual returns) |
| Works on Continuing Tasks? | ✅ Yes | ❌ Episodic only |
| Update timing | Each step | End of episode only |
| Bias | Low (exact model) | Zero bias |
| Variance | Low | High |
| Curse of dimensionality | Severe | Less severe (sampling) |
Model-Free Methods
Lecture 4 · Temporal Difference Learning · Inês Castelhano · May 2026
Temporal Difference Learning
TD is model-free (no knowledge of MDP transitions/rewards), learns directly from experience at each step, and uses bootstrapping — updating estimates using other estimates.
TD(0) Update Rule
Observed reward + bootstrapped future
Drives the update
Trading Example Mapping
| TD Concept | Stock Trading Interpretation |
|---|---|
| State St | Current market conditions (price, volatility, holdings) |
| Action At | Buy / Hold / Sell |
| Reward Rt+1 | Immediate profit or loss from the trade |
| TD Target | Observed short-term profit + estimated future portfolio value |
| TD Error δt | Difference between expected and observed market outcome |
| Update timing | After every market movement, not at market close |
• Updates at every step — online learning
• Works in continuing (non-episodic) settings
• Lower variance (only one step of randomness)
• Can learn before episode ends
• No model required (P and R unknown)
• Learns from sampled experience
• Scales to large problems without full sweeps
Multi-Step TD: TD(n) and TD(∞)
TD(n) Update Formula
Trading Example — TD(3)
Observe 3 days of market returns, then use your current estimate of future portfolio value. No need to wait for the full trading period.
Trading Example — TD(∞)
Wait until the end of the full trading episode and use the complete realized return — equivalent to Monte Carlo.
SARSA — TD for Action Values
GLIE — Greedy in the Limit with Infinite Exploration
∀s,a: limt→∞ N(s,a) = ∞
Every (state, action) pair must be tried infinitely often — no pair is ignored forever.
limt→∞ πt(a|s) = 𝟙[a = argmaxa' q(s,a')]
Eventually exploit more and explore less — ε must decay to 0.
Tabular SARSA converges to q* if the policy is GLIE.
Trading Example
Early in training, try all combinations of market conditions and actions (buy/hold/sell) many times — even bad ones — so no strategy is left unexplored. Over time, reduce ε so the agent exploits more of what it learned.
Monte Carlo vs TD Learning
| Property | Monte Carlo | TD Learning |
|---|---|---|
| Update timing | End of episode only | After every step (online) |
| Episodic tasks only? | ✅ Yes | ❌ No — works for continuing too |
| Bootstrapping | ❌ No — uses actual return | ✅ Yes — uses estimated V(S') |
| Bias | Zero (unbiased) | Some bias (bootstrapping) |
| Variance | High — full episode randomness | Low — only 1-step randomness |
| Markov property needed? | ❌ No | ✅ Yes — exploits Markov property |
| Memory needed | Store full episode | Only current transition |
Monte Carlo — Pros & Cons
TD Learning — Pros & Cons
On-Policy vs Off-Policy Learning
On-Policy Learning
The agent learns the value of the same policy it is currently following. The behaviour policy (generating actions) equals the target policy (being improved).
Off-Policy Learning
The agent learns the value of a different (target) policy from the behaviour policy generating data. Can learn from other agents or old data.
Why Off-Policy Matters
DP vs Monte Carlo vs TD — Unified View
| Property | Dynamic Programming | Monte Carlo | TD Learning |
|---|---|---|---|
| Model required? | ✅ Yes | ❌ No | ❌ No |
| Bootstrapping? | ✅ Yes | ❌ No | ✅ Yes |
| Sampling? | ❌ No (expectation) | ✅ Yes | ✅ Yes |
| Update timing | Every step (full sweep) | End of episode | Every step |
| Episodic only? | ❌ No | ✅ Yes | ❌ No |
| Bias | Low | Zero | Some |
| Variance | Low | High | Lower than MC |
Backup Diagrams — Visual Summary
DP — Full width, shallow
All actions & states. Full model. 1-step expectation.
MC — Narrow, deep
Sampled path to terminal state. Full return.
TD(0) — Narrow, shallow
Single sampled step. Bootstraps from V(s').
Model-Free Control
Lecture 5 · Q-Learning, Double Q-Learning, VFA Intro · Inês Castelhano · May 2026
Q-Learning — Off-Policy TD Control
Autonomous Driving Example
A self-driving car learns the fastest route to a destination by always estimating future decisions as optimal, even while still exploring different (suboptimal) driving behaviours. It learns from the actions it wishes it would take.
Interactive Q-Table Demo
Watch Q-values update step by step in a simple 3-state world (use Left/Right to navigate toward Goal):
Q-Learning Overestimation Problem
It uses the same values to both select and evaluate actions. With noisy approximations, overestimated values are selected more often, propagating the bias.
Driving Example
A self-driving car incorrectly estimates that a narrow, high-speed route is safer and faster than it really is — because a few successful experiences produced overly optimistic Q-values.
Double Q-Learning — Decoupling Selection from Evaluation
Update for Q (when Q is selected)
Update for Q* (when Q* is selected)
SARSA vs Q-Learning — Complete Comparison
| Property | SARSA | Q-Learning |
|---|---|---|
| Policy type | On-policy | Off-policy |
| Target action | A' sampled from behaviour policy (ε-greedy) | maxa'Q(S',a') — greedy target |
| Update | Q(S,A) + α[R+γQ(S',A')−Q(S,A)] | Q(S,A) + α[R+γmax Q(S',a')−Q(S,A)] |
| Safety | More conservative — safer in dangerous environments | Can be more aggressive — ignores current risk |
| Convergence | Converges to q* if GLIE (ε→0) | Converges to q* as long as each (s,a) visited ∞ often |
| Overestimation | Less prone (stochastic target) | Prone — uses greedy max |
| Best for | Safety-critical tasks, continuous exploration | Finding optimal policy efficiently |
Why Tabular Methods Fail — Motivation for VFA
Memory
Too many states/actions to store in memory
Speed
Too slow to learn each state individually
Observability
Individual states often not fully observable
Generalisation
Each state learned independently — no transfer between similar states
Function Approximation — The Solution
Approximate value function using a parameterised function with weights w: v̂(s;w) ≈ vπ(s)
Adjust weights w using MC or TD learning to minimise prediction error
Updating w for one state automatically improves predictions for similar states
Classes of Function Approximators
| Type | Advantages | Disadvantages |
|---|---|---|
| Tabular | Strong theory, exact values | Does not scale or generalise |
| Linear | Reasonable theory, efficient, stable | Requires good hand-crafted features |
| Non-Linear | Scales well, learns features automatically | Less well-understood theory, harder to train |
| Deep Neural Networks | Often best performance in practice | Less theory, harder to train, needs much data |
Challenges of Function Approximation in RL
Unlike supervised learning, RL has specific properties that make function approximation harder:
Experience is not i.i.d.
Successive time-steps are correlated. Supervised learning assumes independent, identically distributed samples — that assumption fails in RL.
Policy affects data distribution
The policy determines which data is collected, which affects what is learned — creating a tight feedback loop between learning and experience.
The Deadly Triad
Bootstrapping
Using TD targets (estimates of estimates). Introduces bias and non-stationarity in targets.
Off-Policy Learning
Learning from data generated by a different policy. Mismatched distributions between experience and target policy.
Function Approximation
Using parameterised functions instead of tables. Small errors in approximation can compound.
Value Function Approximation & Planning
Lecture 6 · VFA, DQN, Model-Based RL, Dyna-Q, MCTS · Inês Castelhano · May 2026
Why Do We Need Value Function Approximation?
Memory
Too many states/actions to store in a table. Backgammon has ~10²⁰ states; Go has ~10¹⁷⁰.
Speed
Too slow to visit and learn each state individually. Most states may never be seen during training.
Observability
Individual environment states often not fully observable. Partial observability means we can't index a table by state.
Generalisation
Each state is learned independently — no transfer between similar states. VFA automatically generalises across states.
Function Approximation: Three-Step Process
Estimate
Approximate value function using a parameterised function v̂(s;w) with weights w. The function can be linear, neural network, or any differentiable model.
Update
Adjust weights w using MC or TD learning to minimise prediction error. Gradient descent moves w in the direction of lower loss.
Generalise
Updating w for one state automatically improves predictions for similar states. This is the key advantage over tabular methods.
Classes of Function Approximators
| Approximator | Advantages | Disadvantages |
|---|---|---|
| Tabular | Strong theory, exact values, guaranteed convergence | Does not scale or generalise — one value per state |
| Linear | Reasonable theory, efficient updates, stable convergence | Requires good hand-crafted features; limited expressiveness |
| Non-Linear | Scales well, can represent complex functions | Less well-understood theory, harder to train |
| (Deep) Neural Networks | Often performs best; learns features from raw inputs (pixels) | Less well-understood theory; requires careful training tricks |
Gradient-Based Learning for Value Functions
Incremental Gradient Descent
- Update weights after each experience sample
- Small step in direction of reduced error
- Works online during episode
- Can follow non-stationary targets
Stochastic Gradient Descent (SGD)
- Randomly samples from experience
- Estimates the true gradient
- Converges to local minimum with decaying α
- Unbiased estimate of full gradient
Specific Challenges in RL (vs Supervised Learning)
🔄 Experience is Not i.i.d.
Successive timesteps are correlated. Standard supervised learning assumes independent, identically distributed samples — that assumption fails in RL. Consecutive (s,a,r,s') tuples share context.
🎭 Policy Affects Data
The agent's policy determines which data is collected. This creates a tight feedback loop: what we learn changes the policy, which changes what data we see next.
🌊 Changing Policy
Policy changes alter both the target values and the data distribution simultaneously — a double non-stationarity problem.
🎯 Bootstrapping
TD methods use current estimates as targets, which themselves keep changing. We're chasing a moving target — and the target moves because of our own updates.
🌍 Non-Stationary Targets
Other learning agents in multi-agent settings shift the effective dynamics. Even solo agents face non-stationarity due to policy improvement.
♾️ Large State Space
In continuous state spaces, we never visit the exact same state twice. No stable anchor point for estimates — states seen only once or never during training.
Update Targets: MC vs TD
| Method | Update Target Ut | Gradient Type | Properties |
|---|---|---|---|
| MC + VFA | Actual return Gt | True gradient (Gt fixed) | Unbiased, high variance, no bootstrapping |
| TD(0) + VFA | R + γv̂(S';w) | Semi-gradient (ignore ∇v̂(S')) | Biased, low variance, fast, can diverge |
| TD(n) + VFA | n-step return G(n) | Semi-gradient | Intermediate bias/variance trade-off |
Linear Function Approximation
Feature Vector x(s)
A fixed feature map converting each state s into an n-dimensional real-valued vector. Features encode relevant state information — must be chosen by the designer.
The gradient of v̂(s,w) w.r.t. w is simply x(s) — making updates very efficient:
Weight Vector w
The learnable parameters. Each weight wj corresponds to feature j. Updating w affects the value of every state that has that feature active — this is generalisation.
Key: Tabular and state aggregation methods are special cases of linear FA.
State Aggregation Methods (Feature Encoding Strategies)
Coarse Coding
Circles/regions in continuous space. Feature xj(s) = 1 if state s lies inside region j, else 0. Multiple overlapping regions for one state.
Large circles → broad generalisation
Asymmetric shapes → directional generalisation
Tile Coding
Overlay multiple regular grids (tilings) offset from each other. State is represented by the set of tiles it falls in across all tilings.
Control generalisation by tile width
Binary features — very fast computation
Multiple tilings → fine resolution
Radial Basis Functions (RBF)
Each feature is a Gaussian centred on a prototype state cj. Smooth, continuous generalisation — the closer s is to cj, the higher xj(s).
Smooth generalisation
More flexible than binary coding
Sensitive to prototype placement and σ
Kanerva Coding
Prototype-based: features measure Hamming distance to stored prototype states. Designed for large binary/discrete state spaces.
Feature = similarity to prototype
Prototype selection matters greatly
Control with VFA: Extending to Q-Functions
Approximate Q-Function
Represent q̂(s,a;w) ≈ qπ(s,a) using a parameterised function. Input: state s (or state-action pair). Output: estimated action value.
Policy Improvement
Greedy or ε-greedy action selection based on estimated Q-values. Ensures exploration while gradually improving the policy.
On vs Off-Policy
On-policy (SARSA): use same policy for data and learning. Off-policy (Q-Learning): learn greedy policy while following exploratory behaviour.
Batch Methods: Least Squares Solutions
Instead of updating w one sample at a time, batch methods find the best fitting w over all experience at once.
LSTD — Least Squares TD
- Closed-form solution — no step-size α needed
- Converges directly to the TD fixed point
- Much more sample-efficient than online TD
- Extended to multi-step: LSTD(λ)
- Extended to action values: LSTDQ
LSPI — Least Squares Policy Iteration
- Interleaves LSTDQ with policy improvement (GPI)
- Converges faster than incremental methods
- Practical for moderate-sized problems
- Replaces Q-table with LSTDQ at each iteration
- Combines best of batch learning + policy iteration
| Property | Incremental (Online) | Batch Methods (LSTD) |
|---|---|---|
| Implementation | Simple | More complex |
| Data efficiency | Each transition used once | Re-uses all experience |
| Memory | Constant (no storage) | Must store experience D |
| Step-size α | Required, sensitive | Not needed (closed-form) |
| Best for | Streaming data, online learning | Fixed dataset, sample efficiency critical |
Deep Reinforcement Learning & DQN
Why Neural Networks?
Universal Approximator
Can represent any continuous function given enough capacity. No need to hand-craft features.
Distributed Representations
Exponentially fewer nodes than shallow networks for the same function. Deep = efficient.
Learned Features
No hand-crafted features — features emerge from raw inputs (pixels, sensor readings). Learn directly from observations.
Gradient-Based Training
Learnable via stochastic gradient descent and backpropagation. Same machinery used in all of deep learning.
Convolutional Neural Networks (CNNs) in RL
Key architecture for processing spatial inputs (game screens, sensor grids, camera images).
📍 Local Structure
Not fully connected — each neuron sees only a local receptive field. Dramatically reduces parameters. Considers local spatial relationships.
🔁 Weight Sharing
All neurons in a feature map share the same weights. The same filter detects the same feature (e.g., edge) at different locations in the image.
🗺️ Feature Maps
A convolutional filter defines a feature map: all nodes detect the same feature across the input. Multiple filter banks learn diverse features simultaneously.
🗜️ Pooling
Sub-sampling operations compress feature maps, extracting salient information and favouring generalisation to new inputs.
What Transfers from Tabular RL — and What Doesn't
✅ What Transfers
- TD and MC learning update rules
- Double learning (Double Q-Learning)
- Experience replay concept
- ε-greedy exploration
- SARSA / Q-learning structure
❌ What Doesn't Transfer Easily
- Least squares TD/MC (closed-form becomes intractable)
- Exact convergence guarantees
- Tabular policy iteration (too many states)
Why Plain Q-Learning + NN is Unstable
Problem 1: Correlated Samples
Sequential transitions are highly correlated — violates the i.i.d. assumption of SGD. Causes oscillation and divergence during training.
Problem 2: Non-Stationary Targets
As w updates, the TD target r + γ max Q(s',a';w) also shifts. The agent chases a moving target — nothing to anchor learning.
DQN's Two Solutions
🔀 Experience Replay
Store transitions (s,a,r,s') in a replay buffer D. Sample random mini-batches to train. Breaks correlations, improves sample efficiency.
🧊 Fixed Q-Targets (Target Network)
Use a separate frozen network w⁻ to compute TD targets. Update w⁻ ← w only every C steps. Stabilises the target during updates.
The target y doesn't move during learning — only w changes. Restart target sync every C steps.
DQN Architecture
(pixels / sensors)
Q(s,a₂;w)
Q(s,a₃;w)
= r + γ max Q(s',a';w⁻)
Model-Based Reinforcement Learning
Three Approaches to Solving MDPs
Dynamic Programming
Assumes a complete, known model. No interaction needed. Requires full knowledge of P(s'|s,a) and R(s,a).
Model-Free RL
No model required. Interact with the environment directly. SARSA, Q-learning, MC. Works even when dynamics are unknown.
Model-Based RL
Learn a model from experience, then plan using it. Dyna-Q combines both approaches. Model errors can compound over planning horizons.
What is a Model?
A model Mη is an approximate representation of the MDP. States and actions are the same as the real problem. The dynamics are parameterised by weights η. Learning a model is a supervised learning problem: choose functional form → pick a loss function (e.g. MSE) → find η that minimises empirical loss.
Types of Models
| Model Type | What it Predicts | Properties |
|---|---|---|
| Expectation Model | Expected next state: 𝔼[st+1|st,at] | Simple, tractable; ignores stochasticity; mostly linear |
| Stochastic / Generative | Full distribution: s' ~ P(·|s,a) | Captures uncertainty; supports risk-sensitive planning |
| Full Model | Both P(s'|s,a) and R(s,a) jointly | Most expressive; hardest to learn accurately |
| Decomposed Dynamics | Separate ηT for transitions, ηR for rewards | Flexible; allows Table Lookup / Linear / NN sub-models |
Learning vs Planning
Learning
The environment is initially unknown. The agent interacts with the environment to gather data and learn from real experience. Direct RL updates improve value / policy.
Planning
A model of the environment is given or has been learnt. The agent plans inside the model without further real interaction — simulates experience cheaply.
Sample-Based Planning
Instead of solving the full MDP analytically, sample experience from the model and treat it as real experience. Any model-free RL algorithm can be applied to simulated data.
MC Control
Simulate full episodes from model. Compute returns. Update Q-values using MC estimates. Works well when model is accurate.
SARSA (On-Policy)
Simulate step-by-step transitions. Apply on-policy TD update. Works with partial episodes — more data-efficient than full MC.
Q-Learning (Off-Policy)
Sample transitions from model. Apply off-policy TD with max operator. Flexible — behaviour and target policies can differ.
Advantages of Sample-Based Planning:
- Avoids expensive full DP sweeps — no need to enumerate all states
- Can use any model-free algorithm unchanged on simulated data
- Naturally handles large state spaces via sampling
- Computationally flexible — run as many samples as budget allows
- Combines smoothly with real experience (Dyna-Q architecture)
Handling Model Inaccuracy
Model-based RL is only as good as the estimated model. Performance is bounded by the optimal policy for the approximate MDP, not the true one.
1. Model-Free Fallback
When the model is wrong, fall back to model-free RL using real experience. Auto-detects divergence between model predictions and real outcomes.
2. Bayesian Uncertainty
Reason about model uncertainty using Bayesian methods (distribution over η). Plan under model uncertainty — balances exploration and exploitation at the model level.
3. Combined Approaches
Combine model-based and model-free in a single algorithm (e.g. Dyna-Q). Real experience for model updates + safety; model for planning.
Dyna-Q & Forward Search Planning
Dyna-Q Algorithm — Step by Step
Take action at in the real environment. Observe reward rt+1 and next state st+1.
Apply Q-learning directly from real experience: Q(s,a) ← Q(s,a) + α[r + γ max Q(s',a') − Q(s,a)]
Update model Mη with observed transition (st, at, rt+1, st+1). Supervised learning on dynamics.
Repeat n times: sample random (s,a) from memory, simulate s' ~ Mη, apply Q-learning update on simulated transition.
Integrating Model-Free and Model-Based
Model-Free RL (in Dyna-Q)
- No model required for this part
- Learn value function from real experience
- More robust to model errors
- Higher variance, less sample efficient
- Algorithms: Q-learning, SARSA, A3C, PPO
Model-Based Planning (in Dyna-Q)
- Learns model from real experience
- Plans value function using simulated experience
- More sample efficient
- Requires accurate model — errors compound
- Real experience is used for both model update and direct RL
Planning for Action Selection
Rather than improving a global value function, an agent can plan specifically to select the next action. The key insight: the distribution of states reachable from the current state differs from the global training distribution.
🎯 Local Accuracy
Focus all planning compute on states that will actually be encountered in the near future. Build a more accurate local value function for nearby states rather than the full state space.
🔍 Handles Inaccurate Models
Inaccuracies in the model may lead to interesting exploration of the local neighbourhood rather than propagating errors into the global value function.
⏱️ Anytime Planning
The agent can use however much compute is available before needing to act. More compute → better action selection — without retraining from scratch.
Forward Search
Select the best action by building a search tree rooted at the current state and looking ahead using the model. No need to solve the whole MDP — just the sub-MDP starting from now.
Root = Current State
The search tree starts at current state st. All branches represent possible future trajectories from here.
Model-Based Expansion
Use the learned or given model to expand tree nodes. At each node, simulate outcomes of each possible action.
Local Sub-MDP
Only solve the portion of the MDP reachable from now. Ignores irrelevant distant states — focused planning.
Best Action at Root
After building the tree, select the action at root with highest estimated value. Repeat each timestep.
Examples: Minimax search (chess), Monte Carlo Tree Search (MCTS, AlphaGo), Model Predictive Control (MPC) — all instances of forward search with different evaluation strategies.
Simulation-Based Search & MCTS
Simulation-Based Search
How It Works
- Simulate K episodes starting from current state st
- Use the model Mη as a simulator
- Apply model-free RL to simulated episodes
- Estimate Q(st, a) for each action a
- Select action with highest estimated Q-value
Why It Works
- Avoids enumerating all states explicitly
- Sampling handles large/continuous action spaces
- Model-free RL algorithms are reused unchanged
- Can allocate compute where it matters most
- No need for global value function — just local estimates
Monte Carlo Tree Evaluation
Given model Mη and simulation policy π, simulate K episodes from current state st. Evaluate each action by mean return. Select action with maximum value.
📊 Unbiased Estimates
MC returns are unbiased estimates of the true value function — no bootstrapping errors, just variance. More simulations → lower variance.
📦 Black-Box Models
Only requires the ability to sample from the model — no need to know transition probabilities explicitly. Any simulator counts.
⚖️ Variance vs K
More simulations K → lower variance but higher compute. Trade-off between accuracy and efficiency. Choose K based on compute budget.
🎯 Simulation Policy Matters
A better simulation policy reduces variance and focuses simulations on relevant parts of the state space. Learned rollout policies accelerate convergence.
MCTS: 4-Step Algorithm
1. SELECT
Traverse the existing tree from root to a leaf node. At each step, pick actions according to Q(s,a) — balancing exploration and exploitation within the tree (e.g. UCB).
2. EXPAND
Add one new node to the tree by expanding the leaf state. Choose an untried action and create a child node. Grows the tree incrementally — one node per simulation.
3. ROLLOUT
From the new node, simulate a complete episode using the fixed rollout policy (often random) until termination. Provides an unbiased return estimate.
4. BACKUP
Propagate the return G back up the entire path from new node to root. Update Q(s,a) for all visited state-action pairs. Tree gets better with each simulation.
Why MCTS is One of the Most Powerful Planning Algorithms
Best-First Search
Concentrates simulations on the most promising branches. Unlike exhaustive methods, it doesn't waste compute on bad moves. The tree grows asymmetrically toward high-value regions.
Anytime Algorithm
Can be stopped at any point and will return the best action found so far. More compute = better decisions (a valid answer is always available). Ideal for real-time or resource-constrained settings.
Black-Box Compatible
MCTS only requires the ability to sample transitions — it does not need explicit P(s'|s,a). Any simulator counts as a valid model, making MCTS broadly applicable to real-world systems.
Breaks Curse of Dimensionality
Evaluates states dynamically through sampling rather than full DP sweeps. Scales to enormous state spaces (e.g. Go: 10¹⁷⁰ states). Only visited nodes are stored — memory proportional to tree, not state space.
Search Tree + Value Function Approximation: The Power Combination
| Approach | Generalisation | Large State Spaces | Example |
|---|---|---|---|
| Model-Free RL (table lookup) | ❌ None | ❌ Cannot store all states | Q-table |
| Simulation-Based Search (table lookup) | ❌ None | ⚠️ Only reachable states | Basic MCTS |
| Simulation-Based Search + VFA | ✅ Similar states get similar values | ✅ Compact parameterised estimates | AlphaGo, AlphaZero |
Glossary
Key terms and definitions from the RL course
Quiz
Test your knowledge across all topics · 25 Questions
Cheat Sheet
Key formulas and algorithm summaries for quick reference
Return & Discount
γ=0: only immediate reward | γ=1: equal weight to all future rewards
MDP Tuple
S States | A Actions | P Transitions
R Rewards | γ Discount
Value Functions
Bellman Expectation
Bellman Optimality
Policy Iteration
Eval: Vk+1(s) ← Σaπ Σs'P[R+γVk(s')]
Improve: π'(s) = argmaxaQπ(s,a)
Repeat until stable.
Value Iteration
No separate eval step — directly uses optimality equation.
Monte Carlo
Model-free. Episodic only. Zero bias, high variance.
TD(0)
SARSA (On-Policy)
Uses (S,A,R,S',A') — next action from same policy.
Q-Learning (Off-Policy)
Uses greedy max — independent of behaviour policy.
Double Q-Learning
Decouples selection from evaluation — reduces overestimation bias in Q-Learning.
Linear VFA
x(s) = feature vector; w = learnable weights. Tabular is a special case.
DQN Loss
w = online network; w⁻ = frozen target network. Replay buffer + target net = stable training.
TD(n) Return
TD(0)=1 step; TD(∞)=Monte Carlo. n controls bias-variance trade-off.
Dyna-Q
Real step: Q-update from real (s,a,r,s')
Model update: Mη ← (s,a,r,s')
Planning: n×Q-update from Mη samples
MCTS Steps
1. Select — traverse tree by Q(s,a)
2. Expand — add new leaf node
3. Rollout — simulate to terminal
4. Backup — propagate return G up
Algorithm Selection Guide
Policy-Based Reinforcement Learning
Lecture 7 · Policy Gradients, Actor-Critic, PPO & Modern Algorithms · Inês Castelhano · May 2026
Three Approaches to Reinforcement Learning
| Approach | How it works | Advantages | Disadvantages |
|---|---|---|---|
| Model-Based RL Dyna-Q, AlphaZero, MBPO |
Learn a model of the environment, then plan with it | Easy to learn a model (supervised learning). Learns 'all there is to know' from the data. | Uses compute on irrelevant details. Planning (computing policy from model) is non-trivial and expensive. |
| Value-Based RL Q-Learning, DQN, SARSA |
Learn V(s) or Q(s,a), derive policy greedily | Easy to generate policy (just take argmax). Close to true objective. Well-understood algorithms exist. | Still not the true objective. Small value error can lead to larger policy error. Struggles with continuous actions. |
| Policy-Based RL REINFORCE, PPO, SAC, DDPG |
Directly parametrise and optimise πθ(a|s) | Directly optimises the true objective. Easily extended to continuous action spaces. Can learn stochastic policies. | Could get stuck in local optima. Specific knowledge may not generalise well. |
Actor-Critic: The Best of Both Worlds
Actor-Critic methods combine value-based (critic) and policy-based (actor) components. The critic estimates the value function; the actor directly optimises the policy using the critic's signal. This reduces the high variance of pure policy gradient while keeping the direct policy optimisation.
Policy-Based Reinforcement Learning
✅ Advantages
- True objective: directly maximises expected cumulative return — no intermediary value function
- Continuous actions: trivially extends to infinite/continuous action spaces
- Stochastic policies: can learn genuinely random policies (optimal in partially observable settings)
- Simpler in some domains: sometimes policies are simple while values are complex
❌ Disadvantages
- Local optima: gradient ascent can converge to a suboptimal policy
- Poor generalisation: obtained knowledge can be specific, doesn't always transfer
- Doesn't extract all info: when used in isolation, doesn't leverage value function information as efficiently
- High variance: policy gradient estimates have high variance — needs many samples
Why Continuous Action Spaces are Hard for Value-Based Methods
🗂️ Cannot Tabulate Q-Values
The action space is infinite — we cannot store Q(s,a) for every a. A function approximator must map (state, action) pairs to values, requiring a network that takes both as input.
🔍 Greedy Max is Intractable
Computing argmaxa Q(s,a) requires solving an inner optimisation problem at every single step. Expensive and often intractable for high-dimensional action spaces.
🎲 Exploration is Hard
Exploration in high-dimensional continuous spaces is challenging. Gaussian policies provide natural exploration via variance σ² without explicit ε-greedy schemes.
Policy Objective Functions
Given policy πθ(a|s), find the best parameters θ. We need a scalar measure of policy quality:
Episodic Return Objective
In episodic environments: maximise average total return per episode.
Used when episodes have a clear start and end (Atari, MuJoCo tasks).
Average Reward Objective
In continuing environments: maximise average reward per timestep.
Used for continuous control without natural episode boundaries.
Stochastic Policies
Deterministic Optimal Exists (Full Observability)
In fully observable MDPs, there is always an optimal deterministic policy π*(s) = a*. Stochasticity is not needed — but it may still help during training via exploration.
Stochastic Optimal (Partial Observability)
Most real-world problems are not fully observable. Two different true states may look identical to the agent. The optimal policy must then be stochastic — mixing actions covers the ambiguity.
Smoother Optimisation Landscape
The search space is smoother for stochastic policies. Deterministic policies can have sharp discontinuities — small parameter changes cause sudden policy jumps. Stochastic policies provide gradients everywhere.
Built-in Exploration
Stochastic policies provide automatic exploration during learning without requiring separate ε-greedy logic. The policy entropy naturally encourages trying different actions.
Stochastic vs Deterministic Policies
| Property | Deterministic π(s) = a | Stochastic π(a|s) |
|---|---|---|
| Optimal for fully observable MDP | ✓ Yes | ✓ Also works |
| Optimal for partially observable | ✗ May not be | ✓ Yes (mixing) |
| Gradient-based optimisation | ⚠ Discontinuities | ✓ Smooth everywhere |
| Built-in exploration | ✗ Needs ε-greedy | ✓ Via entropy |
| Used in DPG / TD3 | ✓ Yes (DDPG, TD3) | ✓ Yes (PPO, SAC) |
Computing the Policy Gradient
Step 1: The Policy Gradient Problem
We want to compute:
Problem: ∇θ P(τ;θ) is hard — it requires knowing the environment dynamics.
Step 2: The Log-Derivative Trick
Using the identity: ∇θ P(τ;θ) = P(τ;θ) · ∇θ log P(τ;θ)
Now it's an expectation of gradients — we can estimate it by sampling trajectories!
🎯 The Score Function
The score function is the derivative of the log policy:
Using the log-derivative trick, this gives us the policy gradient:
Intuitively: increase the log-probability of actions that led to good returns, decrease it for bad returns.
REINFORCE Algorithm (Monte Carlo Policy Gradient)
Why This Works
- Gt acts as a scalar weight — positive returns reinforce actions, negative penalise
- Unbiased estimate of the true gradient
- No model of the environment needed
- Works for any differentiable policy class
The Variance Problem
- Gt can vary hugely across episodes — high variance
- Learning is slow and noisy
- Baseline trick: subtract a baseline b(s) from Gt to reduce variance without bias: θ ← θ + α · ∇θ log πθ(At|St) · (Gt − b(St))
- Best baseline: the value function V(s) → leads to Actor-Critic
Differentiable Policy Classes
🔢 Softmax Policy — Discrete Actions
Weight actions using a linear combination of features φ(s,a). Probability proportional to exponentiated weight:
Score Function:
The score is: the feature of the chosen action minus the expected feature under the current policy. Pushes θ toward features of rewarding actions.
📈 Gaussian Policy — Continuous Actions
Mean is a linear combination of state features. Variance σ² controls exploration:
Score Function:
Updating θ shifts the mean of the Gaussian toward rewarding actions. Variance σ² controls built-in exploration — larger σ = more exploration.
Comparison
| Property | Softmax (Discrete) | Gaussian (Continuous) |
|---|---|---|
| Action space | Finite set {a₁, a₂, ..., aₙ} | ℝ or bounded interval |
| Output distribution | Categorical (sums to 1) | Normal distribution 𝒩(μ, σ²) |
| Exploration control | Temperature / entropy regularisation | Variance σ² parameter |
| Score function closed-form | ✓ Yes | ✓ Yes |
| Greedy policy | argmax probability = action with highest weight | Take mean μ(s) deterministically |
| Example algorithms | REINFORCE, A2C, PPO (discrete) | PPO, SAC, DDPG, TD3 |
Advantage Function: Reducing Variance
Instead of weighting by raw return Gt, use the advantage function A(s,a) = Q(s,a) − V(s). This tells us how much better action a is compared to the average:
Actor-Critic Methods
Actor
Parameters θ
Policy πθ(a|s)
Selects actions
Critic
Parameters w
Value function Qw(s,a)
Evaluates actions
The Training Loop
Critic Options: What the Critic Estimates
| Critic Target | Formula | Properties |
|---|---|---|
| MC Return | Gt | Unbiased, high variance, must wait for episode end |
| TD(0) V-function | R + γV(S') | Low variance, some bias, online updates |
| Q-function Q(s,a) | R + γQ(S',A') | Per-action signal, used in Q-actor-critic |
| Advantage A(s,a) | Q(s,a) − V(s) | Centred around 0, lower variance than raw return |
| GAE (Generalised Adv.) | Σ (γλ)ᵏδt+k | Interpolates TD and MC, used in PPO |
PPO & Modern Actor-Critic Algorithms
PPO — Proximal Policy Optimisation
Core Problem
Vanilla policy gradient: a large update step can dramatically change the policy and collapse performance. Once in a bad region, it's hard to recover.
This ratio r(θ) measures how much the policy has changed from the last update.
PPO's Solution: Clipped Objective
Clip r(θ) within [1−ε, 1+ε] (typically ε = 0.2). This limits how far the new policy can deviate from the old one:
Where  is the advantage estimate. The clip prevents exploiting large policy ratio updates.
Entropy Bonus
PPO adds an entropy bonus to the objective to prevent premature convergence to a deterministic policy:
Higher entropy = more exploration. c is a tunable coefficient.
Why PPO is the Industry Default
- Simple to implement (no second-order KL constraint)
- Works for both discrete and continuous actions
- Stable enough for RLHF fine-tuning of LLMs (ChatGPT)
- Better sample efficiency than REINFORCE
- Strong baseline across diverse environments
Modern Actor-Critic Algorithm Family
A2C / A3C
Asynchronous Advantage Actor-Critic. A3C runs parallel workers asynchronously for faster training. A2C is the synchronous version — simpler and often just as effective.
- On-policy: data from current policy only
- Multiple parallel environments → diversity
- Advantage function reduces variance
PPO
Proximal Policy Optimisation. Clips the probability ratio to prevent destructive updates. Industry default for continuous control and RLHF fine-tuning of LLMs.
- On-policy with clipped surrogate loss
- Entropy bonus for exploration
- Stable and scalable
SAC
Soft Actor-Critic. Off-policy method that balances expected reward with a policy entropy term — encouraging exploration and preventing premature convergence.
- Off-policy: replay buffer → sample efficient
- Maximum entropy framework: J(π) = Σ 𝔼[r + αH(π)]
- Automatic entropy tuning
TD3
Twin Delayed DDPG. Uses two critic networks and takes the minimum Q-value to reduce overestimation bias. Actor updates less frequently than critics for stability.
- Two Q-networks → min Q → less overestimation
- Delayed actor updates (every 2 critic steps)
- Target policy smoothing (adds noise to actions)
Algorithm Comparison: When to Use What
| Algorithm | On/Off Policy | Actions | Key Trick | Best For |
|---|---|---|---|---|
| REINFORCE | On-policy | Both | MC returns | Simple baselines, teaching |
| A2C / A3C | On-policy | Both | Parallel workers | Fast training, Atari |
| PPO | On-policy | Both | Clipped ratio | Most tasks — strong default |
| DDPG | Off-policy | Continuous | Deterministic actor | Simple continuous control |
| TD3 | Off-policy | Continuous | Twin critics + delay | Continuous control |
| SAC | Off-policy | Continuous | Max entropy | Sample-efficient continuous control |
📖 Deep Dives
Book-derived scientific depth — theory, proofs, and formal frameworks from Deep Reinforcement Learning with Python (Sanghi)
Convergence Properties of RL Algorithms
Not all RL algorithms are guaranteed to converge. The convergence behaviour depends on three key factors: whether bootstrapping is used, whether learning is on-policy or off-policy, and whether the function approximator is linear or nonlinear. This table (adapted from Sutton & Barto and Sanghi Chapter 5) summarises the landscape:
| Algorithm | Bootstrapping | On/Off Policy | Tabular | Linear FA | Nonlinear FA |
|---|---|---|---|---|---|
| Monte Carlo | No | On-Policy | ✓ Yes | ✓ Yes | ✓ Yes |
| TD(0) | Yes | On-Policy | ✓ Yes | ✓ Yes | ⚠ No |
| TD(0) | Yes | Off-Policy | ✓ Yes | ⚠ No | ✗ No |
| Q-Learning | Yes | Off-Policy | ✓ Yes | ⚠ No | ✗ No |
| SARSA | Yes | On-Policy | ✓ Yes | ✓ Yes | ⚠ No |
| DQN | Yes | Off-Policy | — | — | ⚠ Empirical |
The Deadly Triad
Convergence failures occur when all three of these combine simultaneously:
- Bootstrapping — using estimates to update estimates
- Off-policy — behaviour policy ≠ target policy
- Function approximation — generalising across states
Each individually is fine. Two together is often fine. All three → potential divergence.
Why Monte Carlo Always Converges
MC uses no bootstrapping — targets are actual returns Gt, not estimated values. This means:
- The update target is stationary (doesn't move as weights change)
- The update is a true gradient of the MSE loss
- Stochastic gradient descent convergence theorems apply
The cost: high variance, no online learning, must wait for episode end.
Linear FA + On-Policy TD
This combination converges to the TD fixed point — a point near (but not at) the true MSE minimum. The error bound is:
The TD fixed point lies within a bounded factor of the best linear approximation possible.
Semi-Gradient Methods: The Formal Justification
When we use bootstrapped targets with function approximation, we face a fundamental problem: the update target itself depends on the weights w. This makes true gradient descent impossible.
The True Gradient Objective
We want to minimise the Mean Squared Value Error:
Taking the true gradient:
The Problem with TD Targets
In TD(0), we replace vπ(s) with the bootstrap target:
Notice: the target Ut depends on w! So the true gradient would be:
This requires differentiating through both v̂(St, w) and v̂(St+1, w).
🎯 The Semi-Gradient Approximation
Instead, we ignore the derivative of the next-state value. We treat Ut as if it were a fixed target that doesn't depend on w:
Only the current state's gradient is computed. The next-state term is dropped.
Why Drop the Next-State Gradient?
- The target Ut is used as a "label" — differentiating through it would create a circular dependency
- This mirrors how supervised learning treats targets as fixed
- For linear FA on-policy, the resulting update still converges (to the TD fixed point)
- This is exactly what deep learning frameworks do:
target_net.detach()in PyTorch stops gradient flow through the target network
Monte Carlo = True Gradient
MC is the only method that performs true gradient descent on J(w), because the return Gt does not depend on w:
Here Gt is a fixed observed return. No approximation is made — this is true SGD. That's why MC converges everywhere.
Semi-Gradient TD vs Full-Gradient (Residual Gradient)
| Property | Semi-Gradient TD | Full-Gradient (Residual) |
|---|---|---|
| Gradient computation | ∇v̂(St, w) only | ∇v̂(St, w) − γ·∇v̂(St+1, w) |
| Convergence (linear, on-policy) | ✓ Yes (TD fixed point) | ✓ Yes (true minimum) |
| Convergence (nonlinear) | ✗ Not guaranteed | ⚠ Slow, unstable |
| Computational cost | One forward pass | Two forward passes |
| Used in practice | ✓ Always (DQN, etc.) | ✗ Rarely |
Backup Diagrams: Formal Theory
Backup diagrams are a compact visual language for describing how value information flows in different RL algorithms. Each diagram encodes exactly which transitions are considered and how deep the lookahead goes.
- Full-width: considers ALL actions
- Full-width: considers ALL next states
- Depth 1: only one step lookahead
- Uses model (transition probabilities)
- Bootstraps from V(s')
- Sample-width: only ONE sampled trajectory
- Full depth: all the way to episode end
- No bootstrapping — actual return used
- No model needed (model-free)
- High variance, zero bias
- Sample-width: ONE sampled transition
- Depth 1: stops at next state
- Bootstraps from V(St+1)
- Low variance, some bias
- Online learning possible
- Sample-width: ONE trajectory
- Depth n: n-step lookahead
- Bootstraps at step n
- Interpolates MC↔TD(0) via n
- n=1: TD(0); n=∞: MC
Off-Policy Monte Carlo Control
Off-policy learning separates the policy used to generate experience (behaviour policy b) from the policy being optimised (target policy π). This enables learning about the optimal policy while still exploring.
Two Policies
| Property | Behaviour Policy b | Target Policy π |
|---|---|---|
| Role | Generates episodes | Being improved |
| Must cover π? | Yes (coverage condition) | — |
| Typical form | ε-soft (explores) | Greedy (deterministic) |
| Used at test time? | No | Yes |
Coverage Condition
For off-policy MC to work, the behaviour policy must cover the target policy:
Every action the target might take must have non-zero probability under the behaviour policy. Otherwise returns from some trajectories would be missing from the estimate.
Importance Sampling
Episodes are generated under b, but we want to evaluate π. Returns Gt from b must be reweighted by the probability ratio:
This ratio ρ is the importance sampling ratio — how much more (or less) likely the trajectory was under π compared to b.
Ordinary IS
Pros: Unbiased estimate of Vπ(s)
Cons: High variance (ρ can be very large or very small)
Weighted IS
Pros: Much lower variance; consistent estimator
Cons: Biased (but bias → 0 as samples → ∞)
Preferred in practice — variance reduction is usually worth the small bias.
Incremental Off-Policy MC Update (Weighted IS)
Using an incremental weighted average (avoids storing all returns):
Early exit: if the behaviour action diverges from the greedy target action, W → 0 and the episode no longer contributes. This stops learning on trajectories irrelevant to π.
Algorithm Landscape
The RL algorithm space can be mapped along multiple axes. Understanding where each algorithm sits helps you pick the right tool for a given problem.
Axis 1: Model-Free vs Model-Based
Axis 2: Value-Based vs Policy-Based
Axis 3: On-Policy vs Off-Policy
Axis 4: Bootstrapping Depth
TD(λ) provides an elegant weighted combination of all n-step returns simultaneously using eligibility traces.
Algorithm Map
Value-Based
Policy-Based
Actor-Critic
Generalised Policy Iteration with Function Approximation
Generalised Policy Iteration (GPI) is the universal framework underlying almost all RL algorithms. It describes the interplay between policy evaluation and policy improvement.
Policy Evaluation
Given π, estimate Vπ(s) or Qπ(s,a)
Policy Improvement
Given Q(s,a), derive a better policy
GPI + Function Approximation
With tabular methods, GPI is well-understood and converges to the optimal policy. With function approximation, things become more complex:
What Changes with VFA
- Evaluation is approximate: V̂(s, w) ≠ Vπ(s) exactly
- Improvement causes distribution shift: changing π changes which states are visited (μ changes)
- Generalisation cuts both ways: updating one state affects others via shared weights
- No guarantee of convergence to exact optimum — GPI finds a near-optimal policy
Tabular Lookup as Special Case of Linear FA
Tabular RL is mathematically a special case of linear function approximation where the feature vector is a one-hot encoding:
Each weight ws is exactly the value of state s. No generalisation occurs — each state is independent. This proves tabular methods are a special case of the general VFA framework.
Semi-Gradient SARSA (GPI in Practice)
This is GPI in action: the TD target serves as the "evaluation" signal, while the ε-greedy policy derived from Q̂ is the "improvement" step. Both happen simultaneously every timestep.
DQN as GPI
DQN implements GPI at scale:
- Evaluation: minimise loss (Q̂(S,A,w) − y)² where y = R + γ·maxaQ̂(S', a, w⁻)
- Improvement: ε-greedy w.r.t. Q̂(·, ·, w)
- Target network w⁻: stabilises the evaluation target
- Replay buffer: decorrelates samples for stable SGD