Reinforcement Learning
Your complete interactive study guide covering all lectures, MDPs, Dynamic Programming, Monte Carlo, Temporal Difference Learning, and Deep RL.
Intro to RL
Core concepts, four pillars, reward hypothesis, agent components and categories.
Lecture 1MDPs & Bellman
Markov Decision Processes, value functions, Bellman equations and optimality.
Lecture 2DP & Monte Carlo
Policy iteration, value iteration, and model-free MC prediction and control.
Lecture 3Model-Free Methods
TD Learning, TD(n) multi-step, SARSA, GLIE convergence, MC vs TD trade-offs.
Lecture 4Model-Free Control
Q-Learning, Double Q-Learning, overestimation bias, value function approximation.
Lecture 5VFA & Planning
Gradient descent VFA, DQN full algorithm, Model-Based RL, Dyna-Q, MCTS.
Lecture 6Quiz
25 questions across all topics. Test your knowledge and prepare for the exam.
PracticeYour Progress
Introduction to Reinforcement Learning
Lecture 1 · Inês Castelhano · April 2026
What is Reinforcement Learning?
RL vs Other Machine Learning Paradigms
| Supervised Learning | Unsupervised Learning | Reinforcement Learning | |
|---|---|---|---|
| Data | Labelled examples | Unlabelled data | Experience / interaction |
| Feedback | Immediate, per sample | None (structure) | Delayed reward signal |
| Goal | Predict / classify | Find patterns | Maximise cumulative reward |
| Time | i.i.d. data | i.i.d. data | Sequential, non i.i.d. |
The Agent–Environment Interaction Loop
Reward Hypothesis
Reward Rt
Immediate feedback at each time step. Scalar signal indicating how well the agent is doing right now.
Return Gt
Total discounted future reward from time t. The agent's goal is to maximise Gt.
Discount Factor γ (Gamma)
γ ∈ [0, 1] controls how much future rewards are valued relative to immediate ones.
The Four Pillars of RL
1. Optimisation
Find an optimal way to make decisions, yielding best or very good outcomes.
2. Delayed Consequences
Decisions now have long-term impact. Early rewards can hinder the future. Credit must be assigned to past actions.
3. Exploration
Learning about the world by making decisions. Only reward is received for the decision made — this shapes what is learned.
4. Generalisation
Policy maps experience to action. The agent must generalise from seen situations to unseen ones.
What Makes RL Different?
Core Agent Components
The agent's core behaviour function — maps states to actions.
Prediction of future reward from a given state. Used to evaluate states.
Agent's internal representation of the environment.
Prediction vs. Control
Prediction
Evaluate the future for a given policy. Estimate the value function to measure how much reward the agent will accumulate.
Control
Optimise the future — find the best policy that maximises expected cumulative reward.
Learning vs. Planning
Learning
Environment is initially unknown. Agent interacts to discover optimal behaviour.
Planning
A model of the environment is given or learned. Agent plans inside the model without further real interaction.
Exploration vs. Exploitation
Exploration
Try new actions to discover potentially better strategies. Find more information about the environment.
Exploitation
Use what you already know to collect high rewards right now. Exploit known information to maximise reward.
Interactive: Explore vs. Exploit
Slide to change the exploration rate ε (probability of exploring):
Agent Categories
By Internal Components
| Category | Policy | Value Function | Description |
|---|---|---|---|
| Value-Based | Implicit | ✅ Yes | Learn value function; derive policy greedily (e.g. Q-Learning) |
| Policy-Based | ✅ Yes | ❌ No | Learn policy directly; no explicit value function (e.g. REINFORCE) |
| Actor-Critic | ✅ Yes | ✅ Yes | Learn both — actor updates policy, critic evaluates it |
By Use of a Model
Model-Free
No explicit model of the environment. Learns from direct interaction. Examples: Q-Learning, SARSA, Policy Gradient.
Model-Based
Builds or uses a model of the environment. Can plan ahead. Examples: Dynamic Programming, AlphaGo (MCTS).
Markov Decision Processes
Lecture 2 · Tabular Value-Based RL
The Markov Property
Once St is known, the history Ht is no longer needed. The current state is a sufficient statistic of the future.
Full Observability (MDP)
Agent sees the full environment state. Observation = Environment State = Agent State.
Partial Observability (POMDP)
Agent cannot directly observe environment state. Must construct a Markov state from history.
MDP: The Formal Definition
Delivery Driver Mapping
| MDP Component | Driver Problem |
|---|---|
| S — States | Location, active deliveries, time of day |
| A — Actions | Route choices, next destination |
| P — Transitions | Traffic uncertainty (same route → different outcomes) |
| R — Reward | Earnings from delivery minus delay penalty |
| γ — Discount | How much current delivery matters vs future opportunities |
Episodic vs Continuing Tasks
Episodic Tasks
Break into independent episodes. Each episode has a terminal state. Return is a finite sum.
Continuing Tasks
No natural endpoint. Return is an infinite sum — discount factor is essential to ensure convergence.
Value Functions
State Value Function Vπ(s)
How good is it to be in state s following policy π?
= Expected return starting from state s, following π forever.
Action Value Function Qπ(s,a)
How good is it to take action a in state s following policy π?
= Expected return starting from s, taking a, then following π.
Optimal Value Functions
v*(s)
Maximum value achievable from state s, over all possible policies.
Q*(s,a)
Maximum value of taking action a in state s, over all policies.
Bellman Equations
Bellman Expectation Equations
For Vπ
For Qπ
Bellman Optimality Equations
Optimal V*
Optimal Q*
Breaking Down the Bellman Equation
Frozen Lake — MDP Example
A classic grid-world problem: navigate from Start (S) to Goal (G) without falling into Holes (H).
MDP Components
Dynamic Programming & Monte Carlo
Lecture 3 · Planning by Dynamic Programming
Dynamic Programming
Key assumption: Full knowledge of the environment model (P and R).
Optimal Substructure
The optimal solution can be composed from optimal solutions to sub-problems.
Overlapping Subproblems
Sub-problems recur many times — their cached solutions can be reused.
DP Algorithm Summary
| Algorithm | Bellman Equation Used | Problem Type |
|---|---|---|
| Iterative Policy Evaluation | Expectation Equations | Prediction |
| Policy Iteration | Expectation + Greedy Improvement | Control |
| Value Iteration | Optimality Equations | Control |
Policy Iteration
Policy Evaluation Formula
At each iteration, every state's value is updated using the current estimates of neighbouring states. Repeating this until convergence gives Vπ.
Value Iteration
Policy Iteration
- Full policy evaluation at each step
- Guaranteed convergence to π*
- Slower per iteration, fewer iterations
Value Iteration
- One backup per iteration
- Also converges to optimal policy
- Faster per iteration, more iterations
Asynchronous DP
Standard DP updates all states in parallel (synchronous). Asynchronous DP updates states individually in any order — more efficient in practice.
In-Place DP
Only store one copy of value function — use immediately updated values.
Prioritised Sweeping
Update states with the largest Bellman error first. Maintain a priority queue.
Real-Time DP
Only update states relevant to the current agent trajectory.
DP Limitations
Monte Carlo Methods
Key requirement: MC methods are applied only to episodic tasks — must wait until episode ends.
First-Visit MC
Average returns only from the first occurrence of each state per episode.
S(St) ← S(St) + Gt
V(St) = S(St) / N(St)
Every-Visit MC
Average returns every time the state is visited in an episode.
Both converge to vπ(s) as N(s) → ∞ by the law of large numbers.
Incremental MC Updates
Update V(s) incrementally after each episode:
For non-stationary problems, use a fixed α (learning rate) instead of 1/N to give more weight to recent episodes.
MC Limitations
DP vs Monte Carlo — Comparison
| Property | Dynamic Programming | Monte Carlo |
|---|---|---|
| Model Required? | ✅ Yes (full model) | ❌ No (model-free) |
| Bootstrapping? | ✅ Yes | ❌ No (uses actual returns) |
| Works on Continuing Tasks? | ✅ Yes | ❌ Episodic only |
| Update timing | Each step | End of episode only |
| Bias | Low (exact model) | Zero bias |
| Variance | Low | High |
| Curse of dimensionality | Severe | Less severe (sampling) |
Model-Free Methods
Lecture 4 · Temporal Difference Learning · Inês Castelhano · May 2026
Temporal Difference Learning
TD is model-free (no knowledge of MDP transitions/rewards), learns directly from experience at each step, and uses bootstrapping — updating estimates using other estimates.
TD(0) Update Rule
Observed reward + bootstrapped future
Drives the update
Trading Example Mapping
| TD Concept | Stock Trading Interpretation |
|---|---|
| State St | Current market conditions (price, volatility, holdings) |
| Action At | Buy / Hold / Sell |
| Reward Rt+1 | Immediate profit or loss from the trade |
| TD Target | Observed short-term profit + estimated future portfolio value |
| TD Error δt | Difference between expected and observed market outcome |
| Update timing | After every market movement, not at market close |
• Updates at every step — online learning
• Works in continuing (non-episodic) settings
• Lower variance (only one step of randomness)
• Can learn before episode ends
• No model required (P and R unknown)
• Learns from sampled experience
• Scales to large problems without full sweeps
Multi-Step TD: TD(n) and TD(∞)
TD(n) Update Formula
Trading Example — TD(3)
Observe 3 days of market returns, then use your current estimate of future portfolio value. No need to wait for the full trading period.
Trading Example — TD(∞)
Wait until the end of the full trading episode and use the complete realized return — equivalent to Monte Carlo.
SARSA — TD for Action Values
GLIE — Greedy in the Limit with Infinite Exploration
∀s,a: limt→∞ N(s,a) = ∞
Every (state, action) pair must be tried infinitely often — no pair is ignored forever.
limt→∞ πt(a|s) = 𝟙[a = argmaxa' q(s,a')]
Eventually exploit more and explore less — ε must decay to 0.
Tabular SARSA converges to q* if the policy is GLIE.
Trading Example
Early in training, try all combinations of market conditions and actions (buy/hold/sell) many times — even bad ones — so no strategy is left unexplored. Over time, reduce ε so the agent exploits more of what it learned.
Monte Carlo vs TD Learning
| Property | Monte Carlo | TD Learning |
|---|---|---|
| Update timing | End of episode only | After every step (online) |
| Episodic tasks only? | ✅ Yes | ❌ No — works for continuing too |
| Bootstrapping | ❌ No — uses actual return | ✅ Yes — uses estimated V(S') |
| Bias | Zero (unbiased) | Some bias (bootstrapping) |
| Variance | High — full episode randomness | Low — only 1-step randomness |
| Markov property needed? | ❌ No | ✅ Yes — exploits Markov property |
| Memory needed | Store full episode | Only current transition |
Monte Carlo — Pros & Cons
TD Learning — Pros & Cons
On-Policy vs Off-Policy Learning
On-Policy Learning
The agent learns the value of the same policy it is currently following. The behaviour policy (generating actions) equals the target policy (being improved).
Off-Policy Learning
The agent learns the value of a different (target) policy from the behaviour policy generating data. Can learn from other agents or old data.
Why Off-Policy Matters
DP vs Monte Carlo vs TD — Unified View
| Property | Dynamic Programming | Monte Carlo | TD Learning |
|---|---|---|---|
| Model required? | ✅ Yes | ❌ No | ❌ No |
| Bootstrapping? | ✅ Yes | ❌ No | ✅ Yes |
| Sampling? | ❌ No (expectation) | ✅ Yes | ✅ Yes |
| Update timing | Every step (full sweep) | End of episode | Every step |
| Episodic only? | ❌ No | ✅ Yes | ❌ No |
| Bias | Low | Zero | Some |
| Variance | Low | High | Lower than MC |
Backup Diagrams — Visual Summary
DP — Full width, shallow
All actions & states. Full model. 1-step expectation.
MC — Narrow, deep
Sampled path to terminal state. Full return.
TD(0) — Narrow, shallow
Single sampled step. Bootstraps from V(s').
Model-Free Control
Lecture 5 · Q-Learning, Double Q-Learning, VFA Intro · Inês Castelhano · May 2026
Q-Learning — Off-Policy TD Control
Autonomous Driving Example
A self-driving car learns the fastest route to a destination by always estimating future decisions as optimal, even while still exploring different (suboptimal) driving behaviours. It learns from the actions it wishes it would take.
Interactive Q-Table Demo
Watch Q-values update step by step in a simple 3-state world (use Left/Right to navigate toward Goal):
Q-Learning Overestimation Problem
It uses the same values to both select and evaluate actions. With noisy approximations, overestimated values are selected more often, propagating the bias.
Driving Example
A self-driving car incorrectly estimates that a narrow, high-speed route is safer and faster than it really is — because a few successful experiences produced overly optimistic Q-values.
Double Q-Learning — Decoupling Selection from Evaluation
Update for Q (when Q is selected)
Update for Q* (when Q* is selected)
SARSA vs Q-Learning — Complete Comparison
| Property | SARSA | Q-Learning |
|---|---|---|
| Policy type | On-policy | Off-policy |
| Target action | A' sampled from behaviour policy (ε-greedy) | maxa'Q(S',a') — greedy target |
| Update | Q(S,A) + α[R+γQ(S',A')−Q(S,A)] | Q(S,A) + α[R+γmax Q(S',a')−Q(S,A)] |
| Safety | More conservative — safer in dangerous environments | Can be more aggressive — ignores current risk |
| Convergence | Converges to q* if GLIE (ε→0) | Converges to q* as long as each (s,a) visited ∞ often |
| Overestimation | Less prone (stochastic target) | Prone — uses greedy max |
| Best for | Safety-critical tasks, continuous exploration | Finding optimal policy efficiently |
Why Tabular Methods Fail — Motivation for VFA
Memory
Too many states/actions to store in memory
Speed
Too slow to learn each state individually
Observability
Individual states often not fully observable
Generalisation
Each state learned independently — no transfer between similar states
Function Approximation — The Solution
Approximate value function using a parameterised function with weights w: v̂(s;w) ≈ vπ(s)
Adjust weights w using MC or TD learning to minimise prediction error
Updating w for one state automatically improves predictions for similar states
Classes of Function Approximators
| Type | Advantages | Disadvantages |
|---|---|---|
| Tabular | Strong theory, exact values | Does not scale or generalise |
| Linear | Reasonable theory, efficient, stable | Requires good hand-crafted features |
| Non-Linear | Scales well, learns features automatically | Less well-understood theory, harder to train |
| Deep Neural Networks | Often best performance in practice | Less theory, harder to train, needs much data |
Challenges of Function Approximation in RL
Unlike supervised learning, RL has specific properties that make function approximation harder:
Experience is not i.i.d.
Successive time-steps are correlated. Supervised learning assumes independent, identically distributed samples — that assumption fails in RL.
Policy affects data distribution
The policy determines which data is collected, which affects what is learned — creating a tight feedback loop between learning and experience.
The Deadly Triad
Bootstrapping
Using TD targets (estimates of estimates). Introduces bias and non-stationarity in targets.
Off-Policy Learning
Learning from data generated by a different policy. Mismatched distributions between experience and target policy.
Function Approximation
Using parameterised functions instead of tables. Small errors in approximation can compound.
Value Function Approximation & Planning
Lecture 6 · VFA, DQN, Model-Based RL, Dyna-Q, MCTS · Inês Castelhano · May 2026
Why Do We Need Value Function Approximation?
Memory
Too many states/actions to store in a table. Backgammon has ~10²⁰ states; Go has ~10¹⁷⁰.
Speed
Too slow to visit and learn each state individually. Most states may never be seen during training.
Observability
Individual environment states often not fully observable. Partial observability means we can't index a table by state.
Generalisation
Each state is learned independently — no transfer between similar states. VFA automatically generalises across states.
Function Approximation: Three-Step Process
Estimate
Approximate value function using a parameterised function v̂(s;w) with weights w. The function can be linear, neural network, or any differentiable model.
Update
Adjust weights w using MC or TD learning to minimise prediction error. Gradient descent moves w in the direction of lower loss.
Generalise
Updating w for one state automatically improves predictions for similar states. This is the key advantage over tabular methods.
Classes of Function Approximators
| Approximator | Advantages | Disadvantages |
|---|---|---|
| Tabular | Strong theory, exact values, guaranteed convergence | Does not scale or generalise — one value per state |
| Linear | Reasonable theory, efficient updates, stable convergence | Requires good hand-crafted features; limited expressiveness |
| Non-Linear | Scales well, can represent complex functions | Less well-understood theory, harder to train |
| (Deep) Neural Networks | Often performs best; learns features from raw inputs (pixels) | Less well-understood theory; requires careful training tricks |
Gradient-Based Learning for Value Functions
Incremental Gradient Descent
- Update weights after each experience sample
- Small step in direction of reduced error
- Works online during episode
- Can follow non-stationary targets
Stochastic Gradient Descent (SGD)
- Randomly samples from experience
- Estimates the true gradient
- Converges to local minimum with decaying α
- Unbiased estimate of full gradient
Specific Challenges in RL (vs Supervised Learning)
🔄 Experience is Not i.i.d.
Successive timesteps are correlated. Standard supervised learning assumes independent, identically distributed samples — that assumption fails in RL. Consecutive (s,a,r,s') tuples share context.
🎭 Policy Affects Data
The agent's policy determines which data is collected. This creates a tight feedback loop: what we learn changes the policy, which changes what data we see next.
🌊 Changing Policy
Policy changes alter both the target values and the data distribution simultaneously — a double non-stationarity problem.
🎯 Bootstrapping
TD methods use current estimates as targets, which themselves keep changing. We're chasing a moving target — and the target moves because of our own updates.
🌍 Non-Stationary Targets
Other learning agents in multi-agent settings shift the effective dynamics. Even solo agents face non-stationarity due to policy improvement.
♾️ Large State Space
In continuous state spaces, we never visit the exact same state twice. No stable anchor point for estimates — states seen only once or never during training.
Update Targets: MC vs TD
| Method | Update Target Ut | Gradient Type | Properties |
|---|---|---|---|
| MC + VFA | Actual return Gt | True gradient (Gt fixed) | Unbiased, high variance, no bootstrapping |
| TD(0) + VFA | R + γv̂(S';w) | Semi-gradient (ignore ∇v̂(S')) | Biased, low variance, fast, can diverge |
| TD(n) + VFA | n-step return G(n) | Semi-gradient | Intermediate bias/variance trade-off |
Linear Function Approximation
Feature Vector x(s)
A fixed feature map converting each state s into an n-dimensional real-valued vector. Features encode relevant state information — must be chosen by the designer.
The gradient of v̂(s,w) w.r.t. w is simply x(s) — making updates very efficient:
Weight Vector w
The learnable parameters. Each weight wj corresponds to feature j. Updating w affects the value of every state that has that feature active — this is generalisation.
Key: Tabular and state aggregation methods are special cases of linear FA.
State Aggregation Methods (Feature Encoding Strategies)
Coarse Coding
Circles/regions in continuous space. Feature xj(s) = 1 if state s lies inside region j, else 0. Multiple overlapping regions for one state.
Large circles → broad generalisation
Asymmetric shapes → directional generalisation
Tile Coding
Overlay multiple regular grids (tilings) offset from each other. State is represented by the set of tiles it falls in across all tilings.
Control generalisation by tile width
Binary features — very fast computation
Multiple tilings → fine resolution
Radial Basis Functions (RBF)
Each feature is a Gaussian centred on a prototype state cj. Smooth, continuous generalisation — the closer s is to cj, the higher xj(s).
Smooth generalisation
More flexible than binary coding
Sensitive to prototype placement and σ
Kanerva Coding
Prototype-based: features measure Hamming distance to stored prototype states. Designed for large binary/discrete state spaces.
Feature = similarity to prototype
Prototype selection matters greatly
Control with VFA: Extending to Q-Functions
Approximate Q-Function
Represent q̂(s,a;w) ≈ qπ(s,a) using a parameterised function. Input: state s (or state-action pair). Output: estimated action value.
Policy Improvement
Greedy or ε-greedy action selection based on estimated Q-values. Ensures exploration while gradually improving the policy.
On vs Off-Policy
On-policy (SARSA): use same policy for data and learning. Off-policy (Q-Learning): learn greedy policy while following exploratory behaviour.
Batch Methods: Least Squares Solutions
Instead of updating w one sample at a time, batch methods find the best fitting w over all experience at once.
LSTD — Least Squares TD
- Closed-form solution — no step-size α needed
- Converges directly to the TD fixed point
- Much more sample-efficient than online TD
- Extended to multi-step: LSTD(λ)
- Extended to action values: LSTDQ
LSPI — Least Squares Policy Iteration
- Interleaves LSTDQ with policy improvement (GPI)
- Converges faster than incremental methods
- Practical for moderate-sized problems
- Replaces Q-table with LSTDQ at each iteration
- Combines best of batch learning + policy iteration
| Property | Incremental (Online) | Batch Methods (LSTD) |
|---|---|---|
| Implementation | Simple | More complex |
| Data efficiency | Each transition used once | Re-uses all experience |
| Memory | Constant (no storage) | Must store experience D |
| Step-size α | Required, sensitive | Not needed (closed-form) |
| Best for | Streaming data, online learning | Fixed dataset, sample efficiency critical |
Deep Reinforcement Learning & DQN
Why Neural Networks?
Universal Approximator
Can represent any continuous function given enough capacity. No need to hand-craft features.
Distributed Representations
Exponentially fewer nodes than shallow networks for the same function. Deep = efficient.
Learned Features
No hand-crafted features — features emerge from raw inputs (pixels, sensor readings). Learn directly from observations.
Gradient-Based Training
Learnable via stochastic gradient descent and backpropagation. Same machinery used in all of deep learning.
Convolutional Neural Networks (CNNs) in RL
Key architecture for processing spatial inputs (game screens, sensor grids, camera images).
📍 Local Structure
Not fully connected — each neuron sees only a local receptive field. Dramatically reduces parameters. Considers local spatial relationships.
🔁 Weight Sharing
All neurons in a feature map share the same weights. The same filter detects the same feature (e.g., edge) at different locations in the image.
🗺️ Feature Maps
A convolutional filter defines a feature map: all nodes detect the same feature across the input. Multiple filter banks learn diverse features simultaneously.
🗜️ Pooling
Sub-sampling operations compress feature maps, extracting salient information and favouring generalisation to new inputs.
What Transfers from Tabular RL — and What Doesn't
✅ What Transfers
- TD and MC learning update rules
- Double learning (Double Q-Learning)
- Experience replay concept
- ε-greedy exploration
- SARSA / Q-learning structure
❌ What Doesn't Transfer Easily
- Least squares TD/MC (closed-form becomes intractable)
- Exact convergence guarantees
- Tabular policy iteration (too many states)
Why Plain Q-Learning + NN is Unstable
Problem 1: Correlated Samples
Sequential transitions are highly correlated — violates the i.i.d. assumption of SGD. Causes oscillation and divergence during training.
Problem 2: Non-Stationary Targets
As w updates, the TD target r + γ max Q(s',a';w) also shifts. The agent chases a moving target — nothing to anchor learning.
DQN's Two Solutions
🔀 Experience Replay
Store transitions (s,a,r,s') in a replay buffer D. Sample random mini-batches to train. Breaks correlations, improves sample efficiency.
🧊 Fixed Q-Targets (Target Network)
Use a separate frozen network w⁻ to compute TD targets. Update w⁻ ← w only every C steps. Stabilises the target during updates.
The target y doesn't move during learning — only w changes. Restart target sync every C steps.
DQN Architecture
(pixels / sensors)
Q(s,a₂;w)
Q(s,a₃;w)
= r + γ max Q(s',a';w⁻)
Model-Based Reinforcement Learning
Three Approaches to Solving MDPs
Dynamic Programming
Assumes a complete, known model. No interaction needed. Requires full knowledge of P(s'|s,a) and R(s,a).
Model-Free RL
No model required. Interact with the environment directly. SARSA, Q-learning, MC. Works even when dynamics are unknown.
Model-Based RL
Learn a model from experience, then plan using it. Dyna-Q combines both approaches. Model errors can compound over planning horizons.
What is a Model?
A model Mη is an approximate representation of the MDP. States and actions are the same as the real problem. The dynamics are parameterised by weights η. Learning a model is a supervised learning problem: choose functional form → pick a loss function (e.g. MSE) → find η that minimises empirical loss.
Types of Models
| Model Type | What it Predicts | Properties |
|---|---|---|
| Expectation Model | Expected next state: 𝔼[st+1|st,at] | Simple, tractable; ignores stochasticity; mostly linear |
| Stochastic / Generative | Full distribution: s' ~ P(·|s,a) | Captures uncertainty; supports risk-sensitive planning |
| Full Model | Both P(s'|s,a) and R(s,a) jointly | Most expressive; hardest to learn accurately |
| Decomposed Dynamics | Separate ηT for transitions, ηR for rewards | Flexible; allows Table Lookup / Linear / NN sub-models |
Learning vs Planning
Learning
The environment is initially unknown. The agent interacts with the environment to gather data and learn from real experience. Direct RL updates improve value / policy.
Planning
A model of the environment is given or has been learnt. The agent plans inside the model without further real interaction — simulates experience cheaply.
Sample-Based Planning
Instead of solving the full MDP analytically, sample experience from the model and treat it as real experience. Any model-free RL algorithm can be applied to simulated data.
MC Control
Simulate full episodes from model. Compute returns. Update Q-values using MC estimates. Works well when model is accurate.
SARSA (On-Policy)
Simulate step-by-step transitions. Apply on-policy TD update. Works with partial episodes — more data-efficient than full MC.
Q-Learning (Off-Policy)
Sample transitions from model. Apply off-policy TD with max operator. Flexible — behaviour and target policies can differ.
Advantages of Sample-Based Planning:
- Avoids expensive full DP sweeps — no need to enumerate all states
- Can use any model-free algorithm unchanged on simulated data
- Naturally handles large state spaces via sampling
- Computationally flexible — run as many samples as budget allows
- Combines smoothly with real experience (Dyna-Q architecture)
Handling Model Inaccuracy
Model-based RL is only as good as the estimated model. Performance is bounded by the optimal policy for the approximate MDP, not the true one.
1. Model-Free Fallback
When the model is wrong, fall back to model-free RL using real experience. Auto-detects divergence between model predictions and real outcomes.
2. Bayesian Uncertainty
Reason about model uncertainty using Bayesian methods (distribution over η). Plan under model uncertainty — balances exploration and exploitation at the model level.
3. Combined Approaches
Combine model-based and model-free in a single algorithm (e.g. Dyna-Q). Real experience for model updates + safety; model for planning.
Dyna-Q & Forward Search Planning
Dyna-Q Algorithm — Step by Step
Take action at in the real environment. Observe reward rt+1 and next state st+1.
Apply Q-learning directly from real experience: Q(s,a) ← Q(s,a) + α[r + γ max Q(s',a') − Q(s,a)]
Update model Mη with observed transition (st, at, rt+1, st+1). Supervised learning on dynamics.
Repeat n times: sample random (s,a) from memory, simulate s' ~ Mη, apply Q-learning update on simulated transition.
Integrating Model-Free and Model-Based
Model-Free RL (in Dyna-Q)
- No model required for this part
- Learn value function from real experience
- More robust to model errors
- Higher variance, less sample efficient
- Algorithms: Q-learning, SARSA, A3C, PPO
Model-Based Planning (in Dyna-Q)
- Learns model from real experience
- Plans value function using simulated experience
- More sample efficient
- Requires accurate model — errors compound
- Real experience is used for both model update and direct RL
Planning for Action Selection
Rather than improving a global value function, an agent can plan specifically to select the next action. The key insight: the distribution of states reachable from the current state differs from the global training distribution.
🎯 Local Accuracy
Focus all planning compute on states that will actually be encountered in the near future. Build a more accurate local value function for nearby states rather than the full state space.
🔍 Handles Inaccurate Models
Inaccuracies in the model may lead to interesting exploration of the local neighbourhood rather than propagating errors into the global value function.
⏱️ Anytime Planning
The agent can use however much compute is available before needing to act. More compute → better action selection — without retraining from scratch.
Forward Search
Select the best action by building a search tree rooted at the current state and looking ahead using the model. No need to solve the whole MDP — just the sub-MDP starting from now.
Root = Current State
The search tree starts at current state st. All branches represent possible future trajectories from here.
Model-Based Expansion
Use the learned or given model to expand tree nodes. At each node, simulate outcomes of each possible action.
Local Sub-MDP
Only solve the portion of the MDP reachable from now. Ignores irrelevant distant states — focused planning.
Best Action at Root
After building the tree, select the action at root with highest estimated value. Repeat each timestep.
Examples: Minimax search (chess), Monte Carlo Tree Search (MCTS, AlphaGo), Model Predictive Control (MPC) — all instances of forward search with different evaluation strategies.
Simulation-Based Search & MCTS
Simulation-Based Search
How It Works
- Simulate K episodes starting from current state st
- Use the model Mη as a simulator
- Apply model-free RL to simulated episodes
- Estimate Q(st, a) for each action a
- Select action with highest estimated Q-value
Why It Works
- Avoids enumerating all states explicitly
- Sampling handles large/continuous action spaces
- Model-free RL algorithms are reused unchanged
- Can allocate compute where it matters most
- No need for global value function — just local estimates
Monte Carlo Tree Evaluation
Given model Mη and simulation policy π, simulate K episodes from current state st. Evaluate each action by mean return. Select action with maximum value.
📊 Unbiased Estimates
MC returns are unbiased estimates of the true value function — no bootstrapping errors, just variance. More simulations → lower variance.
📦 Black-Box Models
Only requires the ability to sample from the model — no need to know transition probabilities explicitly. Any simulator counts.
⚖️ Variance vs K
More simulations K → lower variance but higher compute. Trade-off between accuracy and efficiency. Choose K based on compute budget.
🎯 Simulation Policy Matters
A better simulation policy reduces variance and focuses simulations on relevant parts of the state space. Learned rollout policies accelerate convergence.
MCTS: 4-Step Algorithm
1. SELECT
Traverse the existing tree from root to a leaf node. At each step, pick actions according to Q(s,a) — balancing exploration and exploitation within the tree (e.g. UCB).
2. EXPAND
Add one new node to the tree by expanding the leaf state. Choose an untried action and create a child node. Grows the tree incrementally — one node per simulation.
3. ROLLOUT
From the new node, simulate a complete episode using the fixed rollout policy (often random) until termination. Provides an unbiased return estimate.
4. BACKUP
Propagate the return G back up the entire path from new node to root. Update Q(s,a) for all visited state-action pairs. Tree gets better with each simulation.
Why MCTS is One of the Most Powerful Planning Algorithms
Best-First Search
Concentrates simulations on the most promising branches. Unlike exhaustive methods, it doesn't waste compute on bad moves. The tree grows asymmetrically toward high-value regions.
Anytime Algorithm
Can be stopped at any point and will return the best action found so far. More compute = better decisions (a valid answer is always available). Ideal for real-time or resource-constrained settings.
Black-Box Compatible
MCTS only requires the ability to sample transitions — it does not need explicit P(s'|s,a). Any simulator counts as a valid model, making MCTS broadly applicable to real-world systems.
Breaks Curse of Dimensionality
Evaluates states dynamically through sampling rather than full DP sweeps. Scales to enormous state spaces (e.g. Go: 10¹⁷⁰ states). Only visited nodes are stored — memory proportional to tree, not state space.
Search Tree + Value Function Approximation: The Power Combination
| Approach | Generalisation | Large State Spaces | Example |
|---|---|---|---|
| Model-Free RL (table lookup) | ❌ None | ❌ Cannot store all states | Q-table |
| Simulation-Based Search (table lookup) | ❌ None | ⚠️ Only reachable states | Basic MCTS |
| Simulation-Based Search + VFA | ✅ Similar states get similar values | ✅ Compact parameterised estimates | AlphaGo, AlphaZero |
Glossary
Key terms and definitions from the RL course
Quiz
Test your knowledge across all topics · 25 Questions
Cheat Sheet
Key formulas and algorithm summaries for quick reference
Return & Discount
γ=0: only immediate reward | γ=1: equal weight to all future rewards
MDP Tuple
S States | A Actions | P Transitions
R Rewards | γ Discount
Value Functions
Bellman Expectation
Bellman Optimality
Policy Iteration
Eval: Vk+1(s) ← Σaπ Σs'P[R+γVk(s')]
Improve: π'(s) = argmaxaQπ(s,a)
Repeat until stable.
Value Iteration
No separate eval step — directly uses optimality equation.
Monte Carlo
Model-free. Episodic only. Zero bias, high variance.
TD(0)
SARSA (On-Policy)
Uses (S,A,R,S',A') — next action from same policy.
Q-Learning (Off-Policy)
Uses greedy max — independent of behaviour policy.
Double Q-Learning
Decouples selection from evaluation — reduces overestimation bias in Q-Learning.
Linear VFA
x(s) = feature vector; w = learnable weights. Tabular is a special case.
DQN Loss
w = online network; w⁻ = frozen target network. Replay buffer + target net = stable training.
TD(n) Return
TD(0)=1 step; TD(∞)=Monte Carlo. n controls bias-variance trade-off.
Dyna-Q
Real step: Q-update from real (s,a,r,s')
Model update: Mη ← (s,a,r,s')
Planning: n×Q-update from Mη samples
MCTS Steps
1. Select — traverse tree by Q(s,a)
2. Expand — add new leaf node
3. Rollout — simulate to terminal
4. Backup — propagate return G up
Algorithm Selection Guide
📖 Deep Dives
Book-derived scientific depth — theory, proofs, and formal frameworks from Deep Reinforcement Learning with Python (Sanghi)
Convergence Properties of RL Algorithms
Not all RL algorithms are guaranteed to converge. The convergence behaviour depends on three key factors: whether bootstrapping is used, whether learning is on-policy or off-policy, and whether the function approximator is linear or nonlinear. This table (adapted from Sutton & Barto and Sanghi Chapter 5) summarises the landscape:
| Algorithm | Bootstrapping | On/Off Policy | Tabular | Linear FA | Nonlinear FA |
|---|---|---|---|---|---|
| Monte Carlo | No | On-Policy | ✓ Yes | ✓ Yes | ✓ Yes |
| TD(0) | Yes | On-Policy | ✓ Yes | ✓ Yes | ⚠ No |
| TD(0) | Yes | Off-Policy | ✓ Yes | ⚠ No | ✗ No |
| Q-Learning | Yes | Off-Policy | ✓ Yes | ⚠ No | ✗ No |
| SARSA | Yes | On-Policy | ✓ Yes | ✓ Yes | ⚠ No |
| DQN | Yes | Off-Policy | — | — | ⚠ Empirical |
The Deadly Triad
Convergence failures occur when all three of these combine simultaneously:
- Bootstrapping — using estimates to update estimates
- Off-policy — behaviour policy ≠ target policy
- Function approximation — generalising across states
Each individually is fine. Two together is often fine. All three → potential divergence.
Why Monte Carlo Always Converges
MC uses no bootstrapping — targets are actual returns Gt, not estimated values. This means:
- The update target is stationary (doesn't move as weights change)
- The update is a true gradient of the MSE loss
- Stochastic gradient descent convergence theorems apply
The cost: high variance, no online learning, must wait for episode end.
Linear FA + On-Policy TD
This combination converges to the TD fixed point — a point near (but not at) the true MSE minimum. The error bound is:
The TD fixed point lies within a bounded factor of the best linear approximation possible.
Semi-Gradient Methods: The Formal Justification
When we use bootstrapped targets with function approximation, we face a fundamental problem: the update target itself depends on the weights w. This makes true gradient descent impossible.
The True Gradient Objective
We want to minimise the Mean Squared Value Error:
Taking the true gradient:
The Problem with TD Targets
In TD(0), we replace vπ(s) with the bootstrap target:
Notice: the target Ut depends on w! So the true gradient would be:
This requires differentiating through both v̂(St, w) and v̂(St+1, w).
🎯 The Semi-Gradient Approximation
Instead, we ignore the derivative of the next-state value. We treat Ut as if it were a fixed target that doesn't depend on w:
Only the current state's gradient is computed. The next-state term is dropped.
Why Drop the Next-State Gradient?
- The target Ut is used as a "label" — differentiating through it would create a circular dependency
- This mirrors how supervised learning treats targets as fixed
- For linear FA on-policy, the resulting update still converges (to the TD fixed point)
- This is exactly what deep learning frameworks do:
target_net.detach()in PyTorch stops gradient flow through the target network
Monte Carlo = True Gradient
MC is the only method that performs true gradient descent on J(w), because the return Gt does not depend on w:
Here Gt is a fixed observed return. No approximation is made — this is true SGD. That's why MC converges everywhere.
Semi-Gradient TD vs Full-Gradient (Residual Gradient)
| Property | Semi-Gradient TD | Full-Gradient (Residual) |
|---|---|---|
| Gradient computation | ∇v̂(St, w) only | ∇v̂(St, w) − γ·∇v̂(St+1, w) |
| Convergence (linear, on-policy) | ✓ Yes (TD fixed point) | ✓ Yes (true minimum) |
| Convergence (nonlinear) | ✗ Not guaranteed | ⚠ Slow, unstable |
| Computational cost | One forward pass | Two forward passes |
| Used in practice | ✓ Always (DQN, etc.) | ✗ Rarely |
Backup Diagrams: Formal Theory
Backup diagrams are a compact visual language for describing how value information flows in different RL algorithms. Each diagram encodes exactly which transitions are considered and how deep the lookahead goes.
- Full-width: considers ALL actions
- Full-width: considers ALL next states
- Depth 1: only one step lookahead
- Uses model (transition probabilities)
- Bootstraps from V(s')
- Sample-width: only ONE sampled trajectory
- Full depth: all the way to episode end
- No bootstrapping — actual return used
- No model needed (model-free)
- High variance, zero bias
- Sample-width: ONE sampled transition
- Depth 1: stops at next state
- Bootstraps from V(St+1)
- Low variance, some bias
- Online learning possible
- Sample-width: ONE trajectory
- Depth n: n-step lookahead
- Bootstraps at step n
- Interpolates MC↔TD(0) via n
- n=1: TD(0); n=∞: MC
Off-Policy Monte Carlo Control
Off-policy learning separates the policy used to generate experience (behaviour policy b) from the policy being optimised (target policy π). This enables learning about the optimal policy while still exploring.
Two Policies
| Property | Behaviour Policy b | Target Policy π |
|---|---|---|
| Role | Generates episodes | Being improved |
| Must cover π? | Yes (coverage condition) | — |
| Typical form | ε-soft (explores) | Greedy (deterministic) |
| Used at test time? | No | Yes |
Coverage Condition
For off-policy MC to work, the behaviour policy must cover the target policy:
Every action the target might take must have non-zero probability under the behaviour policy. Otherwise returns from some trajectories would be missing from the estimate.
Importance Sampling
Episodes are generated under b, but we want to evaluate π. Returns Gt from b must be reweighted by the probability ratio:
This ratio ρ is the importance sampling ratio — how much more (or less) likely the trajectory was under π compared to b.
Ordinary IS
Pros: Unbiased estimate of Vπ(s)
Cons: High variance (ρ can be very large or very small)
Weighted IS
Pros: Much lower variance; consistent estimator
Cons: Biased (but bias → 0 as samples → ∞)
Preferred in practice — variance reduction is usually worth the small bias.
Incremental Off-Policy MC Update (Weighted IS)
Using an incremental weighted average (avoids storing all returns):
Early exit: if the behaviour action diverges from the greedy target action, W → 0 and the episode no longer contributes. This stops learning on trajectories irrelevant to π.
Algorithm Landscape
The RL algorithm space can be mapped along multiple axes. Understanding where each algorithm sits helps you pick the right tool for a given problem.
Axis 1: Model-Free vs Model-Based
Axis 2: Value-Based vs Policy-Based
Axis 3: On-Policy vs Off-Policy
Axis 4: Bootstrapping Depth
TD(λ) provides an elegant weighted combination of all n-step returns simultaneously using eligibility traces.
Algorithm Map
Value-Based
Policy-Based
Actor-Critic
Generalised Policy Iteration with Function Approximation
Generalised Policy Iteration (GPI) is the universal framework underlying almost all RL algorithms. It describes the interplay between policy evaluation and policy improvement.
Policy Evaluation
Given π, estimate Vπ(s) or Qπ(s,a)
Policy Improvement
Given Q(s,a), derive a better policy
GPI + Function Approximation
With tabular methods, GPI is well-understood and converges to the optimal policy. With function approximation, things become more complex:
What Changes with VFA
- Evaluation is approximate: V̂(s, w) ≠ Vπ(s) exactly
- Improvement causes distribution shift: changing π changes which states are visited (μ changes)
- Generalisation cuts both ways: updating one state affects others via shared weights
- No guarantee of convergence to exact optimum — GPI finds a near-optimal policy
Tabular Lookup as Special Case of Linear FA
Tabular RL is mathematically a special case of linear function approximation where the feature vector is a one-hot encoding:
Each weight ws is exactly the value of state s. No generalisation occurs — each state is independent. This proves tabular methods are a special case of the general VFA framework.
Semi-Gradient SARSA (GPI in Practice)
This is GPI in action: the TD target serves as the "evaluation" signal, while the ε-greedy policy derived from Q̂ is the "improvement" step. Both happen simultaneously every timestep.
DQN as GPI
DQN implements GPI at scale:
- Evaluation: minimise loss (Q̂(S,A,w) − y)² where y = R + γ·maxaQ̂(S', a, w⁻)
- Improvement: ε-greedy w.r.t. Q̂(·, ·, w)
- Target network w⁻: stabilises the evaluation target
- Replay buffer: decorrelates samples for stable SGD