NOVA IMS · 2nd Semester 2026

Reinforcement Learning

Your complete interactive study guide covering all lectures, MDPs, Dynamic Programming, Monte Carlo, Temporal Difference Learning, and Deep RL.

Intro to RL

Core concepts, four pillars, reward hypothesis, agent components and categories.

Lecture 1

MDPs & Bellman

Markov Decision Processes, value functions, Bellman equations and optimality.

Lecture 2

DP & Monte Carlo

Policy iteration, value iteration, and model-free MC prediction and control.

Lecture 3

Model-Free Methods

TD Learning, TD(n) multi-step, SARSA, GLIE convergence, MC vs TD trade-offs.

Lecture 4

Model-Free Control

Q-Learning, Double Q-Learning, overestimation bias, value function approximation.

Lecture 5

VFA & Planning

Gradient descent VFA, DQN full algorithm, Model-Based RL, Dyna-Q, MCTS.

Lecture 6


Introduction to Reinforcement Learning

Lecture 1 · Inês Castelhano · April 2026

What is Reinforcement Learning?

RL is a branch of machine learning where an agent learns to make decisions by interacting with an environment. It receives a numerical reward signal and gradually improves its behaviour through trial and error.

RL vs Other Machine Learning Paradigms

Aspect | Supervised Learning | Unsupervised Learning | Reinforcement Learning
Data | Labelled examples | Unlabelled data | Experience / interaction
Feedback | Immediate, per sample | None (structure) | Delayed reward signal
Goal | Predict / classify | Find patterns | Maximise cumulative reward
Time | i.i.d. data | i.i.d. data | Sequential, non-i.i.d.

The Agent–Environment Interaction Loop

Agent → Environment: action A_t
Environment → Agent: next state S_{t+1} and reward R_{t+1}

1. Observe state S_t
2. Select action A_t using policy π
3. Receive reward R_{t+1}
4. Transition to new state S_{t+1}

Reward Hypothesis

Reward Hypothesis: All goals can be described by the maximisation of expected cumulative reward.

Reward Rt

Immediate feedback at each time step. Scalar signal indicating how well the agent is doing right now.

Return Gt

Total discounted future reward from time t. The agent's goal is to maximise Gt.

G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + ...

Discount Factor γ (Gamma)

γ ∈ [0, 1] controls how much future rewards are valued relative to immediate ones.
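
To make the effect of γ concrete, here is a minimal sketch that computes the return of a finite reward sequence. The function name `discounted_return` and the reward list are illustrative choices, not part of the lecture material.

```python
def discounted_return(rewards, gamma):
    """Return G_t = R_{t+1} + gamma*R_{t+2} + ... for a finite list of rewards."""
    g = 0.0
    # Work backwards so each step applies G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 1.0, 1.0]              # illustrative rewards R_{t+1}, R_{t+2}, R_{t+3}
print(discounted_return(rewards, 0.0))  # gamma = 0: only the immediate reward counts
print(discounted_return(rewards, 0.5))  # gamma = 0.5: later rewards are down-weighted
print(discounted_return(rewards, 1.0))  # gamma = 1: plain sum of rewards
```

With γ = 0 the agent is myopic, and as γ approaches 1 it weighs distant rewards almost as much as immediate ones.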

The Four Pillars of RL

1. Optimisation

Find an optimal way to make decisions, yielding best or very good outcomes.

Restaurant example: Maximise total dinner quality over your stay.

2. Delayed Consequences

Decisions now have long-term impact. Early rewards can hinder the future. Credit must be assigned to past actions.

Restaurant example: Trying a bad place today may lead to a great discovery later.

3. Exploration

Learning about the world by making decisions. Reward is observed only for the decision actually made, and this shapes what can be learned.

Restaurant example: Trying new places may reveal hidden gems or disappointments.

4. Generalisation

Policy maps experience to action. The agent must generalise from seen situations to unseen ones.

Restaurant example: If you like Italian pizzerias, try similar restaurants.

What Makes RL Different?

No supervisor — only a reward signal
Feedback is delayed, not instantaneous
Time matters — sequential, non i.i.d. data
Learning does not require examples of optimal behaviour, much as humans can learn by trial and error

Core Agent Components

Policy π

The agent's core behaviour function — maps states to actions.

Deterministic: π(s) = a, always take action a in state s
Stochastic: π(a|s) = P(A_t = a | S_t = s), a probability distribution over actions

Value Function V(s)

Prediction of future reward from a given state. Used to evaluate states.

v_π(s) = 𝔼_π[G_t | S_t = s]
A state with low immediate reward might have high value if it reliably leads to high-reward states.

Model

Agent's internal representation of the environment.

Transition Model P: Predicts next state given s and a
Reward Model R: Predicts immediate reward for taking a in s
⚠️ A model does not immediately provide a good policy; planning is still needed.

Agent State & History

History H_t: Full sequence of observations, actions and rewards (O_t, A_t, R_t); too large to use directly
Agent State S_t: Compact summary of history: S_t = u(H_t)
Environment State: Usually invisible to the agent

Prediction vs. Control

Prediction

Evaluate the future for a given policy. Estimate the value function to measure how much reward the agent will accumulate.

Estimate vπ

Control

Optimise the future — find the best policy that maximises expected cumulative reward.

Find π*
These problems are strongly related. Many RL algorithms alternate between prediction (evaluate policy) and control (improve policy) — known as Generalised Policy Iteration (GPI).

Learning vs. Planning

Learning

Environment is initially unknown. Agent interacts to discover optimal behaviour.

Robot exploring a new building

Planning

A model of the environment is given or learned. Agent plans inside the model without further real interaction.

AlphaGo searching future game states

Exploration vs. Exploitation

Exploration

Try new actions to discover potentially better strategies. Find more information about the environment.

Exploitation

Use what you already know to collect high rewards right now. Exploit known information to maximise reward.

Dilemma: A purely exploitative agent gets stuck in local optima (never discovers better strategies). A purely exploratory agent collects experience but never leverages it to improve. Effective RL must balance both.
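
One common way to balance the two is ε-greedy action selection: explore with probability ε, otherwise exploit. The sketch below assumes action values are held in a plain list; the function name and the example values are made up for illustration.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index given a list of estimated action values."""
    if rng.random() < epsilon:
        # Explore: choose uniformly among all actions.
        return rng.randrange(len(q_values))
    # Exploit: choose a best-valued action, breaking ties at random.
    best = max(q_values)
    return rng.choice([a for a, q in enumerate(q_values) if q == best])

rng = random.Random(0)                  # seeded only to make the demo repeatable
q_values = [0.1, 0.9, 0.3]              # illustrative estimates for three actions
picks = [epsilon_greedy(q_values, 0.2, rng) for _ in range(1000)]
print(picks.count(1) / 1000)            # fraction of times the greedy action was chosen
```

Setting ε = 0 gives a purely exploitative agent and ε = 1 a purely exploratory one.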


Agent Categories

By Internal Components

Category | Policy | Value Function | Description
Value-Based | Implicit | ✅ Yes | Learn value function; derive policy greedily (e.g. Q-Learning)
Policy-Based | ✅ Yes | ❌ No | Learn policy directly; no explicit value function (e.g. REINFORCE)
Actor-Critic | ✅ Yes | ✅ Yes | Learn both: actor updates policy, critic evaluates it

By Use of a Model

Model-Free

No explicit model of the environment. Learns from direct interaction. Examples: Q-Learning, SARSA, Policy Gradient.

Model-Based

Builds or uses a model of the environment. Can plan ahead. Examples: Dynamic Programming, AlphaGo (MCTS).

Markov Decision Processes

Lecture 2 · Tabular Value-Based RL

The Markov Property

Markov Property: "The future is independent of the past, given the present." A state S_t is Markov if and only if:

P(S_{t+1} | S_t) = P(S_{t+1} | S_1, ..., S_t)

Once S_t is known, the history H_t is no longer needed. The current state is a sufficient statistic of the future.

Full Observability (MDP)

Agent sees the full environment state. Observation = Environment State = Agent State.

Partial Observability (POMDP)

Agent cannot directly observe environment state. Must construct a Markov state from history.

MDP: The Formal Definition

A finite MDP is defined by a 5-tuple (S, A, P, R, γ):
S: State Space, all possible states the agent can be in
A: Action Space, all possible actions the agent can take
P: Transition Probability, P(s' | s, a): probability of moving to s' from s with action a
R: Reward Function, R(s, a): expected immediate reward for taking action a in state s
γ: Discount Factor, γ ∈ [0,1]: how much future rewards are valued

Delivery Driver Mapping

MDP Component | Driver Problem
S: States | Location, active deliveries, time of day
A: Actions | Route choices, next destination
P: Transitions | Traffic uncertainty (same route → different outcomes)
R: Reward | Earnings from delivery minus delay penalty
γ: Discount | How much current delivery matters vs future opportunities

Episodic vs Continuing Tasks

Episodic Tasks

Break into independent episodes. Each episode has a terminal state. Return is a finite sum.

Driver shift ends at end of day

Continuing Tasks

No natural endpoint. Return is an infinite sum — discount factor is essential to ensure convergence.

Driver works continuously without stopping

Value Functions

State Value Function Vπ(s)

How good is it to be in state s following policy π?

v_π(s) = 𝔼_π[G_t | S_t = s]

= Expected return starting from state s, following π forever.

Action Value Function Qπ(s,a)

How good is it to take action a in state s following policy π?

q_π(s,a) = 𝔼_π[G_t | S_t = s, A_t = a]

= Expected return starting from s, taking a, then following π.

Relationship between V and Q:
v_π(s) = Σ_a π(a|s) · q_π(s, a)
State value = expected action value under policy π

Optimal Value Functions

v*(s)

v*(s) = max_π v_π(s)

Maximum value achievable from state s, over all possible policies.

q*(s,a)

q*(s,a) = max_π q_π(s,a)

Maximum value of taking action a in state s, over all policies.

Bellman Equations

Key Insight: The value of a state today equals the immediate reward plus the discounted value of what follows. This recursive structure is the foundation of all tabular RL algorithms.

Bellman Expectation Equations

For Vπ

v_π(s) = Σ_a π(a|s) Σ_{s'} P(s'|s,a) [R(s,a) + γ v_π(s')]

For Qπ

q_π(s,a) = R(s,a) + γ Σ_{s'} P(s'|s,a) Σ_{a'} π(a'|s') q_π(s',a')

Bellman Optimality Equations

Replace policy expectation with max — gives us the optimal value functions.

Optimal V*

v*(s) = max_a Σ_{s'} P(s'|s,a) [R(s,a) + γ v*(s')]

Optimal q*

q*(s,a) = R(s,a) + γ Σ_{s'} P(s'|s,a) max_{a'} q*(s',a')

Breaking Down the Bellman Equation

v(s) = R(s,a) + γ · v(s')

v(s): current value
R(s,a): immediate reward
γ: discount
v(s'): future value

Frozen Lake — MDP Example

A classic grid-world problem: navigate from Start (S) to Goal (G) without falling into Holes (H).

S: Start
F: Frozen (safe)
H: Hole (terminal)
G: Goal

MDP Components

States: 16 grid cells (4×4)
Actions: Left, Right, Up, Down
Reward: +1 at Goal, 0 elsewhere
Transitions: Slippery, so stochastic (the agent moves in the intended direction or in one of the two perpendicular directions, each with probability 1/3)

Dynamic Programming & Monte Carlo

Lecture 3 · Planning by Dynamic Programming

Dynamic Programming

DP solves complex problems by breaking them into simple sub-problems, computing and storing solutions, and reusing them when the same sub-problem occurs.
Key assumption: Full knowledge of the environment model (P and R).

Optimal Substructure

The optimal solution can be composed from optimal solutions to sub-problems.

Overlapping Subproblems

Sub-problems recur many times — their cached solutions can be reused.

Why MDPs fit DP: Bellman equations have recursive structure → value functions store sub-solutions.

DP Algorithm Summary

Algorithm | Bellman Equation Used | Problem Type
Iterative Policy Evaluation | Expectation Equations | Prediction
Policy Iteration | Expectation + Greedy Improvement | Control
Value Iteration | Optimality Equations | Control

Policy Iteration

1. Initialise an arbitrary policy π and value function V
2. Policy Evaluation (E): iterate the Bellman expectation equation until V_π converges
3. Policy Improvement (I): greedy update π'(s) = argmax_a Q_π(s,a)
4. Convergence check: if π' = π, stop. Otherwise go back to Policy Evaluation.

Generalised Policy Iteration (GPI): The general idea of letting policy evaluation and improvement interact, which is the backbone of virtually all RL algorithms.

Policy Evaluation Formula

V_{k+1}(s) ← Σ_a π(a|s) Σ_{s'} P(s'|s,a) [R(s,a) + γ V_k(s')]

At each iteration, every state's value is updated using the current estimates of neighbouring states. Repeating this until convergence gives Vπ.
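
As a rough illustration of this update, the sketch below runs synchronous iterative policy evaluation on a made-up two-state MDP under a uniform random policy. The MDP (states "A" and "B", actions "stay" and "go", and their rewards) is a hypothetical example, not one from the lecture.

```python
STATES = ["A", "B"]
ACTIONS = ["stay", "go"]
# P[s][a] is a list of (probability, next_state); R[s][a] is the expected reward.
P = {"A": {"stay": [(1.0, "A")], "go": [(1.0, "B")]},
     "B": {"stay": [(1.0, "B")], "go": [(1.0, "A")]}}
R = {"A": {"stay": 0.0, "go": 1.0},
     "B": {"stay": 2.0, "go": 0.0}}

def policy_evaluation(pi, gamma, tol=1e-10):
    """Repeat the Bellman expectation backup until values stop changing."""
    V = {s: 0.0 for s in STATES}
    while True:
        V_new = {
            s: sum(pi[s][a] * sum(p * (R[s][a] + gamma * V[s2]) for p, s2 in P[s][a])
                   for a in ACTIONS)
            for s in STATES
        }
        delta = max(abs(V_new[s] - V[s]) for s in STATES)
        V = V_new                      # synchronous update: swap in the new table
        if delta < tol:
            return V

uniform = {s: {a: 0.5 for a in ACTIONS} for s in STATES}
V = policy_evaluation(uniform, gamma=0.9)
print(V)
```

Solving the two Bellman expectation equations by hand for this example gives V(A) = 7.25 and V(B) = 7.75, which the iteration approaches.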

Value Iteration

Value Iteration combines evaluation and improvement in one step — no need to wait for full policy evaluation. Uses the Bellman optimality equation directly.
V_{k+1}(s) ← max_a Σ_{s'} P(s'|s,a) [R(s,a) + γ V_k(s')]
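
A minimal sketch of value iteration, assuming a small made-up MDP with two states and two actions (the names and numbers are illustrative only). After the values converge, a greedy policy is read off with a one-step lookahead.

```python
STATES = ["A", "B"]
ACTIONS = ["stay", "go"]
# P[s][a] is a list of (probability, next_state); R[s][a] is the expected reward.
P = {"A": {"stay": [(1.0, "A")], "go": [(1.0, "B")]},
     "B": {"stay": [(1.0, "B")], "go": [(1.0, "A")]}}
R = {"A": {"stay": 0.0, "go": 1.0},
     "B": {"stay": 2.0, "go": 0.0}}

def lookahead(V, s, a, gamma):
    """One-step expected value of taking action a in state s."""
    return sum(p * (R[s][a] + gamma * V[s2]) for p, s2 in P[s][a])

def value_iteration(gamma, tol=1e-10):
    V = {s: 0.0 for s in STATES}
    while True:
        # Bellman optimality backup: max over actions instead of a policy average.
        V_new = {s: max(lookahead(V, s, a, gamma) for a in ACTIONS) for s in STATES}
        delta = max(abs(V_new[s] - V[s]) for s in STATES)
        V = V_new
        if delta < tol:
            break
    policy = {s: max(ACTIONS, key=lambda a: lookahead(V, s, a, gamma)) for s in STATES}
    return V, policy

V, policy = value_iteration(gamma=0.9)
print(V, policy)
```

In this example staying in B earns 2 per step forever, so v*(B) = 2 / (1 − 0.9) = 20 and v*(A) = 1 + 0.9 · 20 = 19.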

Policy Iteration

  • Full policy evaluation at each step
  • Guaranteed convergence to π*
  • Slower per iteration, fewer iterations

Value Iteration

  • One sweep of Bellman optimality backups per iteration
  • Also converges to optimal policy
  • Faster per iteration, more iterations

Asynchronous DP

Standard DP updates all states in parallel (synchronous). Asynchronous DP updates states individually in any order — more efficient in practice.

In-Place DP

Only store one copy of value function — use immediately updated values.

Prioritised Sweeping

Update states with the largest Bellman error first. Maintain a priority queue.

Real-Time DP

Only update states relevant to the current agent trajectory.

DP Limitations

Requires a perfect model of the environment
Curse of dimensionality — state space grows exponentially
Full-width backups — expensive for large problems

Monte Carlo Methods

Monte Carlo: Model-free method that learns directly from complete episodes of experience. Instead of expected return (DP), MC uses the mean return across sampled episodes.

Key requirement: MC methods are applied only to episodic tasks — must wait until episode ends.

First-Visit MC

Average returns only from the first occurrence of each state per episode.

N(S_t) ← N(S_t) + 1
S(S_t) ← S(S_t) + G_t
V(S_t) = S(S_t) / N(S_t)

Every-Visit MC

Average returns every time the state is visited in an episode.

Both converge to vπ(s) as N(s) → ∞ by the law of large numbers.
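
The counting scheme above can be sketched as follows. The episode format, a list of (state, reward) pairs where the reward is the one received after leaving that state, is an assumption made for this example, as are the two sample episodes.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma):
    """Estimate V(s) as the mean return following the first visit to s in each episode."""
    N = defaultdict(int)      # visit counts N(s)
    S = defaultdict(float)    # summed returns S(s)
    for episode in episodes:
        # Compute the return G_t for every time step by working backwards.
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()
        seen = set()
        for state, G in returns:
            if state in seen:
                continue      # first-visit: ignore later occurrences in this episode
            seen.add(state)
            N[state] += 1
            S[state] += G
    return {s: S[s] / N[s] for s in N}

episodes = [
    [("A", 1.0), ("B", 0.0), ("A", 2.0)],   # illustrative episode 1
    [("B", 1.0), ("A", 0.0)],               # illustrative episode 2
]
V = first_visit_mc(episodes, gamma=1.0)
print(V)
```

Removing the `seen` check turns this into every-visit MC.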

Incremental MC Updates

Update V(s) incrementally after each episode:

V(S_t) ← V(S_t) + α(G_t − V(S_t))

For non-stationary problems, use a fixed α (learning rate) instead of 1/N to give more weight to recent episodes.

MC Limitations

Only works for episodic settings
High variance returns — needs lots of data
Cannot update until episode ends — slow for long episodes

DP vs Monte Carlo — Comparison

Property | Dynamic Programming | Monte Carlo
Model Required? | ✅ Yes (full model) | ❌ No (model-free)
Bootstrapping? | ✅ Yes | ❌ No (uses actual returns)
Works on Continuing Tasks? | ✅ Yes | ❌ Episodic only
Update timing | Each step | End of episode only
Bias | Low (exact model) | Zero bias
Variance | Low | High
Curse of dimensionality | Severe | Less severe (sampling)
Teaser: Temporal Difference (TD) learning combines the best of both. Like MC, it learns from sampled experience without needing a model; like DP, it bootstraps, so it can learn from incomplete episodes.

Model-Free Methods

Lecture 4 · Temporal Difference Learning · Inês Castelhano · May 2026

Temporal Difference Learning

"If one had to identify one idea as central and novel to reinforcement learning, it would undoubtedly be TD learning." — Sutton & Barto

TD is model-free (no knowledge of MDP transitions/rewards), learns directly from experience at each step, and uses bootstrapping — updating estimates using other estimates.

TD(0) Update Rule

V(S_t) ← V(S_t) + α · [R_{t+1} + γ·V(S_{t+1}) − V(S_t)]

TD Target: R_{t+1} + γV(S_{t+1}) (observed reward + bootstrapped future)
Current Estimate: V(S_t)
TD Error: δ_t = TD Target − Current Estimate = R_{t+1} + γV(S_{t+1}) − V(S_t) (drives the update)
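
A minimal sketch of a single TD(0) update on a tabular value function stored in a dictionary. The function name, the `terminal` flag, and the example numbers are assumptions for illustration.

```python
def td0_update(V, s, r, s_next, alpha, gamma, terminal=False):
    """Apply one TD(0) update to V[s] in place and return the TD error."""
    # The TD target bootstraps from the current estimate of the next state,
    # except at a terminal state, whose value is zero by definition.
    target = r + (0.0 if terminal else gamma * V[s_next])
    delta = target - V[s]          # TD error
    V[s] += alpha * delta
    return delta

V = {"A": 0.0, "B": 1.0}           # illustrative current estimates
delta = td0_update(V, "A", r=0.5, s_next="B", alpha=0.1, gamma=0.9)
print(delta, V["A"])
```

Here the target is 0.5 + 0.9 · 1.0 = 1.4, so the TD error is 1.4 and V(A) moves a tenth of the way towards the target.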

Trading Example Mapping

TD Concept | Stock Trading Interpretation
State S_t | Current market conditions (price, volatility, holdings)
Action A_t | Buy / Hold / Sell
Reward R_{t+1} | Immediate profit or loss from the trade
TD Target | Observed short-term profit + estimated future portfolio value
TD Error δ_t | Difference between expected and observed market outcome
Update timing | After every market movement, not at market close
Advantages over MC:
• Updates at every step — online learning
• Works in continuing (non-episodic) settings
• Lower variance (only one step of randomness)
• Can learn before episode ends
Advantages over DP:
• No model required (P and R unknown)
• Learns from sampled experience
• Scales to large problems without full sweeps

Multi-Step TD: TD(n) and TD(∞)

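TD(0) bootstraps after just 1 step. We can look further ahead before bootstrapping: this gives n-step TD, written TD(n) here, which interpolates between TD(0) and MC. Note that TD(0) is the one-step case n = 1 (the 0 comes from λ = 0 in TD(λ), not from n).
TD(0), n = 1: 1 real reward → bootstrap
TD(2), n = 2: 2 rewards → bootstrap
TD(n): n rewards → bootstrap
TD(∞) = MC: full episode → no bootstrap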

TD(n) Update Formula

G_t^{(n)} = R_{t+1} + γR_{t+2} + ... + γ^{n−1}R_{t+n} + γ^n V(S_{t+n})
V(S_t) ← V(S_t) + α[G_t^{(n)} − V(S_t)]
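
The n-step return can be computed from a recorded episode as in the sketch below. The indexing convention (`rewards[k]` holds R_{k+1} and `values[k]` holds V(S_k)) and the sample numbers are assumptions made for this example.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return from time t.

    rewards[k] is R_{k+1}; values[k] is the current estimate V(S_k).
    The episode has T = len(rewards) steps and terminates in S_T.
    """
    T = len(rewards)
    end = min(t + n, T)
    # Sum the real, discounted rewards for up to n steps.
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if end < T:
        # Bootstrap from V(S_{t+n}) only if the episode has not ended by then.
        G += gamma ** n * values[end]
    return G

rewards = [1.0, 2.0, 3.0]            # illustrative R_1, R_2, R_3
values = [10.0, 20.0, 30.0, 0.0]     # illustrative V(S_0..S_3); terminal value is 0
for n in (1, 2, 3):
    print(n, n_step_return(rewards, values, t=0, n=n, gamma=0.5))
```

With n = 1 this is the TD(0) target, and once n reaches the episode length it becomes the full Monte Carlo return with no bootstrapping.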

Trading Example — TD(3)

Observe 3 days of market returns, then use your current estimate of future portfolio value. No need to wait for the full trading period.

Trading Example — TD(∞)

Wait until the end of the full trading episode and use the complete realized return — equivalent to Monte Carlo.

Key insight: Larger n = less bias, more variance (more like MC). Smaller n = more bias, less variance (more like TD). Optimal n depends on the problem.

SARSA — TD for Action Values

Apply TD to the Q-function instead of V. SARSA is named after the tuple it uses: (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}).
Q(S_t,A_t) ← Q(S_t,A_t) + α[R_{t+1} + γ·Q(S_{t+1},A_{t+1}) − Q(S_t,A_t)]

SARSA Algorithm (On-Policy TD Control)
Initialize Q(s,a) for all s∈S, a∈A; set Q(terminal,·) = 0
For each episode:
    Initialize S; choose A from S using ε-greedy on Q
    Repeat for each step until S is terminal:
        Take action A → observe R, S'
        Choose A' from S' using ε-greedy on Q
        Q(S,A) ← Q(S,A) + α[R + γQ(S',A') − Q(S,A)]
        S ← S'; A ← A'
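
The algorithm above can be sketched in a few lines of Python. The environment here is a made-up four-state corridor (move left or right, reward +1 on reaching the rightmost state), and the hyperparameters are arbitrary illustrative choices.

```python
import random

N_STATES, GOAL = 4, 3          # states 0..3; state 3 is terminal
ACTIONS = (0, 1)               # 0 = left, 1 = right

def step(state, action):
    """Hypothetical corridor dynamics: deterministic moves, +1 on reaching the goal."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    return (1.0 if nxt == GOAL else 0.0), nxt

def epsilon_greedy(Q, state, epsilon, rng):
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[state])
    return rng.choice([a for a in ACTIONS if Q[state][a] == best])

def sarsa(episodes=2000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]    # Q(terminal, .) stays 0
    for _ in range(episodes):
        s = 0
        a = epsilon_greedy(Q, s, epsilon, rng)
        while s != GOAL:
            r, s2 = step(s, a)
            a2 = epsilon_greedy(Q, s2, epsilon, rng)   # next action from the same policy
            Q[s][a] += alpha * (r + gamma * Q[s2][a2] - Q[s][a])
            s, a = s2, a2
    return Q

Q = sarsa()
greedy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(GOAL)]
print(greedy)   # greedy action per non-terminal state
```

Because the target uses the action A' that the ε-greedy policy actually selects, the learned values describe that exploratory policy rather than the greedy one.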

GLIE — Greedy in the Limit with Infinite Exploration

GLIE is the formal condition for convergence to optimality in model-free control.
1. All state-action pairs are explored infinitely often:
   ∀s,a: lim_{t→∞} N_t(s,a) = ∞
   Every (state, action) pair must be tried infinitely often; no pair is ignored forever.
2. The policy converges to a greedy policy:
   lim_{t→∞} π_t(a|s) = 𝟙[a = argmax_{a'} q_t(s,a')]
   Eventually exploit more and explore less: ε must decay to 0, for example ε_k = 1/k.
Theorem: GLIE model-free control converges to the optimal action-value function: q → q*.
Tabular SARSA converges to q* if the policy is GLIE and the step sizes α_t satisfy the Robbins-Monro conditions (Σ α_t = ∞, Σ α_t² < ∞).

Trading Example

Early in training, try all combinations of market conditions and actions (buy/hold/sell) many times — even bad ones — so no strategy is left unexplored. Over time, reduce ε so the agent exploits more of what it learned.

Monte Carlo vs TD Learning

Property | Monte Carlo | TD Learning
Update timing | End of episode only | After every step (online)
Episodic tasks only? | ✅ Yes | ❌ No, works for continuing tasks too
Bootstrapping | ❌ No, uses actual return | ✅ Yes, uses estimated V(S')
Bias | Zero (unbiased) | Some bias (bootstrapping)
Variance | High: full-episode randomness | Low: only 1-step randomness
Markov property needed? | ❌ No | ✅ Yes, exploits Markov property
Memory needed | Store full episode | Only current transition

Monte Carlo — Pros & Cons

Good convergence properties
Zero bias — unbiased estimate of vπ
Works in partially-observable environments (no Markov needed)
High variance — many random steps
Requires complete episodes

TD Learning — Pros & Cons

Low variance — 1-step update
Usually more efficient than MC
Works in non-episodic settings
Some bias (bootstrapped estimate)
More sensitive to initial values
In the tabular case: Both MC and TD converge to the true value function vπ(s). The choice depends on the environment (episodic vs continuing) and the bias-variance trade-off needed.

On-Policy vs Off-Policy Learning

On-Policy Learning

The agent learns the value of the same policy it is currently following. The behaviour policy (generating actions) equals the target policy (being improved).

Learn about π using experience from π
Example: SARSA — explores with ε-greedy and improves the same ε-greedy policy

Off-Policy Learning

The agent learns the value of a different (target) policy from the behaviour policy generating data. Can learn from other agents or old data.

Learn about π using experience from μ
Example: Q-Learning — explores with ε-greedy but learns the greedy policy

Why Off-Policy Matters

Learn from demonstrations by an expert (imitation)
Re-use old experience (experience replay)
Learn multiple policies simultaneously
Learn from human or random behaviour

DP vs Monte Carlo vs TD — Unified View

These methods differ in how much they rely on sampling (model vs experience) and how far they look ahead before updating (depth of backup).
Property | Dynamic Programming | Monte Carlo | TD Learning
Model required? | ✅ Yes | ❌ No | ❌ No
Bootstrapping? | ✅ Yes | ❌ No | ✅ Yes
Sampling? | ❌ No (expectation) | ✅ Yes | ✅ Yes
Update timing | Every step (full sweep) | End of episode | Every step
Episodic only? | ❌ No | ✅ Yes | ❌ No
Bias | Low | Zero | Some
Variance | Low | High | Lower than MC

Backup Diagrams — Visual Summary

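DP: Full width, shallow

Backup diagram: from s, branch over all actions a and all successor states s'.

All actions & states. Full model. 1-step expectation.

MC: Narrow, deep

Backup diagram: one sampled path s → a → s' → ··· → T (terminal).

Sampled path to terminal state. Full return.

TD(0): Narrow, shallow

Backup diagram: one sampled step s → a → s'.

Single sampled step. Bootstraps from V(s').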

Model-Free Control

Lecture 5 · Q-Learning, Double Q-Learning, VFA Intro · Inês Castelhano · May 2026

Q-Learning — Off-Policy TD Control

Q-Learning estimates the value of the greedy policy regardless of which policy generates the data. It is off-policy: the behaviour policy explores, but Q-Learning targets the optimal action.
Q(S,A) ← Q(S,A) + α[R + γ · max_{a'} Q(S',a') − Q(S,A)]

Q-Learning Algorithm (Off-Policy TD Control)
Initialize Q(s,a) for all s∈S, a∈A arbitrarily; Q(terminal,·) = 0
For each episode:
    Initialize S
    Repeat for each step until S is terminal:
        Choose A from S using ε-greedy on Q (behaviour policy)
        Take action A → observe R, S'
        Q(S,A) ← Q(S,A) + α[R + γ max_{a'} Q(S',a') − Q(S,A)]
        S ← S'

Theorem: Q-Learning control converges to the optimal action-value function, q → q*, as long as each action is taken in each state infinitely often and the step sizes decay appropriately (Robbins-Monro conditions).
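
A minimal sketch of tabular Q-learning, assuming a made-up four-state corridor (move left or right, reward +1 on reaching the rightmost state) and arbitrary illustrative hyperparameters.

```python
import random

N_STATES, GOAL = 4, 3          # states 0..3; state 3 is terminal
ACTIONS = (0, 1)               # 0 = left, 1 = right

def step(state, action):
    """Hypothetical corridor dynamics: deterministic moves, +1 on reaching the goal."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    return (1.0 if nxt == GOAL else 0.0), nxt

def epsilon_greedy(Q, state, epsilon, rng):
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[state])
    return rng.choice([a for a in ACTIONS if Q[state][a] == best])

def q_learning(episodes=2000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]    # Q(terminal, .) stays 0
    for _ in range(episodes):
        s = 0
        while s != GOAL:
            a = epsilon_greedy(Q, s, epsilon, rng)     # behaviour policy explores
            r, s2 = step(s, a)
            # Target policy is greedy: bootstrap from the best next action.
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q

Q = q_learning()
greedy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(GOAL)]
print(greedy, [round(Q[s][1], 3) for s in range(GOAL)])
```

In this corridor the optimal values for moving right are γ², γ and 1 from states 0, 1 and 2, so with γ = 0.9 the learned values should approach 0.81, 0.9 and 1.0 even though the agent keeps exploring.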

Autonomous Driving Example

A self-driving car learns the fastest route to a destination by always estimating future decisions as optimal, even while still exploring different (suboptimal) driving behaviours. It learns from the actions it wishes it would take.


Q-Learning Overestimation Problem

Classical Q-learning has an upward bias problem:
It uses the same values to both select and evaluate actions. With noisy approximations, overestimated values are selected more often, propagating the bias.

Driving Example

A self-driving car incorrectly estimates that a narrow, high-speed route is safer and faster than it really is — because a few successful experiences produced overly optimistic Q-values.

Double Q-Learning — Decoupling Selection from Evaluation

Solution: Maintain two independent Q-functions, Q and Q' (neither is the optimal q*). Use one to select the action and the other to evaluate it.

Update for Q (when Q is selected)

Q(S,A) ← Q(S,A) + α[R + γQ'(S', argmax_{a'} Q(S',a')) − Q(S,A)]

Update for Q' (when Q' is selected)

Q'(S,A) ← Q'(S,A) + α[R + γQ(S', argmax_{a'} Q'(S',a')) − Q'(S,A)]

Double Q-Learning
Initialize Q(s,a) and Q'(s,a) for all s, a
For each step:
    Choose A using a combined policy (e.g. ε-greedy on the average of Q and Q')
    Observe R, S'
    With 50% probability update Q using Q' for evaluation; otherwise update Q' using Q for evaluation

Converges to the optimal policy under the same conditions as Q-learning
Significantly reduces overestimation bias
Can generalise to Double SARSA and other algorithms
Not necessary if target policy and value function are uncorrelated (e.g. pure prediction)
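
A sketch of one Double Q-learning update, assuming the two tables are nested lists indexed as `Q[state][action]`. The names `Q1`, `Q2`, `double_q_update` and the `FixedCoin` stub are chosen here for illustration only.

```python
import random

def double_q_update(Q1, Q2, s, a, r, s_next, alpha, gamma, rng=random, terminal=False):
    """Update one of the two tables in place, chosen by a fair coin flip."""
    if rng.random() < 0.5:
        learner, evaluator = Q1, Q2
    else:
        learner, evaluator = Q2, Q1
    if terminal:
        target = r
    else:
        # Select the action with the table being updated...
        n_actions = len(learner[s_next])
        a_star = max(range(n_actions), key=lambda b: learner[s_next][b])
        # ...but evaluate that action with the other table.
        target = r + gamma * evaluator[s_next][a_star]
    learner[s][a] += alpha * (target - learner[s][a])

class FixedCoin:
    """Stub random source so the demo always updates Q1."""
    def random(self):
        return 0.0

Q1 = [[0.0, 0.0], [1.0, 5.0]]   # illustrative estimates: Q1 is optimistic about action 1
Q2 = [[0.0, 0.0], [2.0, 3.0]]
double_q_update(Q1, Q2, s=0, a=0, r=1.0, s_next=1, alpha=0.5, gamma=1.0, rng=FixedCoin())
print(Q1[0][0])
```

In this demo Q1 selects action 1 in the next state but the target uses Q2's value of 3.0 for it, giving a target of 4.0; plain Q-learning on Q1 alone would have used its own optimistic 5.0.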

SARSA vs Q-Learning — Complete Comparison

Property | SARSA | Q-Learning
Policy type | On-policy | Off-policy
Target action | A' sampled from behaviour policy (ε-greedy) | max_{a'} Q(S',a'), greedy target
Update | Q(S,A) + α[R + γQ(S',A') − Q(S,A)] | Q(S,A) + α[R + γ max_{a'} Q(S',a') − Q(S,A)]
Safety | More conservative: safer in dangerous environments | Can be more aggressive: ignores current risk
Convergence | Converges to q* if GLIE (ε→0) | Converges to q* as long as each (s,a) is visited ∞ often
Overestimation | Less prone (stochastic target) | Prone: uses greedy max
Best for | Safety-critical tasks, continuous exploration | Finding optimal policy efficiently
Key difference in one sentence: Q-learning uses a greedy target policy; SARSA uses a stochastic sample from the behaviour policy as its target.

Why Tabular Methods Fail — Motivation for VFA

Lookup tables store one value per state (or state-action pair). This breaks down for large or continuous state spaces.

Memory

Too many states/actions to store in memory

Speed

Too slow to learn each state individually

Observability

Individual states often not fully observable

Generalisation

Each state learned independently — no transfer between similar states

Function Approximation — The Solution

1. ESTIMATE: Approximate the value function using a parameterised function with weights w: v̂(s;w) ≈ v_π(s)
2. UPDATE: Adjust weights w using MC or TD learning to minimise prediction error
3. GENERALISE: Updating w for one state automatically improves predictions for similar states

Classes of Function Approximators

Type | Advantages | Disadvantages
Tabular | Strong theory, exact values | Does not scale or generalise
Linear | Reasonable theory, efficient, stable | Requires good hand-crafted features
Non-Linear | Scales well, learns features automatically | Less well-understood theory, harder to train
Deep Neural Networks | Often best performance in practice | Less theory, harder to train, needs much data

Challenges of Function Approximation in RL

Unlike supervised learning, RL has specific properties that make function approximation harder:

Experience is not i.i.d.

Successive time-steps are correlated. Supervised learning assumes independent, identically distributed samples — that assumption fails in RL.

Policy affects data distribution

The policy determines which data is collected, which affects what is learned — creating a tight feedback loop between learning and experience.

The Deadly Triad

Algorithms combining all three of the following elements may diverge — this is known as the Deadly Triad.

Bootstrapping

Using TD targets (estimates of estimates). Introduces bias and non-stationarity in targets.

+

Off-Policy Learning

Learning from data generated by a different policy. Mismatched distributions between experience and target policy.

+

Function Approximation

Using parameterised functions instead of tables. Small errors in approximation can compound.

Non-stationarity sources: Changing policies alter target values and data distribution; TD bootstrapping creates moving targets; large state spaces prevent stable anchoring.

Value Function Approximation & Planning

Lecture 6 · VFA, DQN, Model-Based RL, Dyna-Q, MCTS · Inês Castelhano · May 2026

Why Do We Need Value Function Approximation?

Tabular RL stores one value per state. Real-world problems can have millions of states or infinitely many. Value function approximation (VFA) lets us generalise across similar states using a compact parameterised function v̂(s;w).
💾

Memory

Too many states/actions to store in a table. Backgammon has ~10²⁰ states; Go has ~10¹⁷⁰.

⏱️

Speed

Too slow to visit and learn each state individually. Most states may never be seen during training.

👁️

Observability

Individual environment states often not fully observable. Partial observability means we can't index a table by state.

🔗

Generalisation

Each state is learned independently — no transfer between similar states. VFA automatically generalises across states.

Function Approximation: Three-Step Process

1

Estimate

Approximate value function using a parameterised function v̂(s;w) with weights w. The function can be linear, neural network, or any differentiable model.

2

Update

Adjust weights w using MC or TD learning to minimise prediction error. Gradient descent moves w in the direction of lower loss.

3

Generalise

Updating w for one state automatically improves predictions for similar states. This is the key advantage over tabular methods.

Classes of Function Approximators

Approximator | Advantages | Disadvantages
Tabular | Strong theory, exact values, guaranteed convergence | Does not scale or generalise: one value per state
Linear | Reasonable theory, efficient updates, stable convergence | Requires good hand-crafted features; limited expressiveness
Non-Linear | Scales well, can represent complex functions | Less well-understood theory, harder to train
(Deep) Neural Networks | Often performs best; learns features from raw inputs (pixels) | Less well-understood theory; requires careful training tricks
Key Principle: The policy, value function, model, and agent state update are all functions. We want to learn all of them from experience. If there are too many states, we need to approximate. When using neural networks: this is called Deep RL.

Gradient-Based Learning for Value Functions

Gradient-descent methods are the most widely used function approximation methods in RL. The parameter vector w defines v̂(s;w) — a smooth differentiable function. We minimise prediction error by following the gradient.

Incremental Gradient Descent

w ← w + α · δ · ∇_w v̂(s;w)
  • Update weights after each experience sample
  • Small step in direction of reduced error
  • Works online during episode
  • Can follow non-stationary targets

Stochastic Gradient Descent (SGD)

w ← w − α ∇_w L(w)
  • Randomly samples from experience
  • Estimates the true gradient
  • Converges to local minimum with decaying α
  • Unbiased estimate of full gradient

Specific Challenges in RL (vs Supervised Learning)

🔄 Experience is Not i.i.d.

Successive timesteps are correlated. Standard supervised learning assumes independent, identically distributed samples — that assumption fails in RL. Consecutive (s,a,r,s') tuples share context.

🎭 Policy Affects Data

The agent's policy determines which data is collected. This creates a tight feedback loop: what we learn changes the policy, which changes what data we see next.

🌊 Changing Policy

Policy changes alter both the target values and the data distribution simultaneously — a double non-stationarity problem.

🎯 Bootstrapping

TD methods use current estimates as targets, which themselves keep changing. We're chasing a moving target — and the target moves because of our own updates.

🌍 Non-Stationary Targets

Other learning agents in multi-agent settings shift the effective dynamics. Even solo agents face non-stationarity due to policy improvement.

♾️ Large State Space

In continuous state spaces, we never visit the exact same state twice. No stable anchor point for estimates — states seen only once or never during training.

Update Targets: MC vs TD

Method | Update Target U_t | Gradient Type | Properties
MC + VFA | Actual return G_t | True gradient (G_t fixed) | Unbiased, high variance, no bootstrapping
TD(0) + VFA | R + γv̂(S';w) | Semi-gradient (ignores ∇_w v̂(S';w)) | Biased, low variance, fast, can diverge
TD(n) + VFA | n-step return G_t^{(n)} | Semi-gradient | Intermediate bias/variance trade-off

Semi-Gradient: TD with VFA doesn't differentiate through the bootstrap target v̂(S';w). We treat the target as fixed, so only ∇_w v̂(S_t;w) is computed. Simpler and faster, but not a true gradient step, and it can diverge when combined with off-policy learning (the Deadly Triad).

Linear Function Approximation

The simplest differentiable approximator. Values are a linear combination of hand-crafted features. Well-understood theory, stable convergence for on-policy TD, computationally efficient updates.
v̂(s;w) = wᵀx(s) = Σ_j w_j · x_j(s)

Feature Vector x(s)

A fixed feature map converting each state s into an n-dimensional real-valued vector. Features encode relevant state information — must be chosen by the designer.

🚗 Driving: [speed, lane_position, obstacle_distance]

The gradient of v̂(s;w) w.r.t. w is simply x(s), which makes updates very efficient:

w ← w + α[v_target − v̂(s;w)] · x(s)

Weight Vector w

The learnable parameters. Each weight wj corresponds to feature j. Updating w affects the value of every state that has that feature active — this is generalisation.

🧠 Linear FA: updating w_speed improves estimates for all states with similar speed

Key: Tabular and state aggregation methods are special cases of linear FA.
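
A minimal sketch of a semi-gradient TD(0) update with linear function approximation, using plain Python lists for the weight and feature vectors. The function names and the three-feature example are illustrative assumptions.

```python
def v_hat(w, x):
    """Linear value estimate: dot product of weights and features."""
    return sum(wj * xj for wj, xj in zip(w, x))

def semi_gradient_td0(w, x, r, x_next, alpha, gamma, terminal=False):
    """Return updated weights after one semi-gradient TD(0) step.

    The target r + gamma * v_hat(w, x_next) is treated as a constant,
    so the gradient of the estimate is just the feature vector x.
    """
    target = r + (0.0 if terminal else gamma * v_hat(w, x_next))
    delta = target - v_hat(w, x)
    return [wj + alpha * delta * xj for wj, xj in zip(w, x)]

w = [0.0, 0.0, 0.0]
x = [1.0, 0.5, 0.0]        # illustrative features of the current state
x_next = [0.0, 0.0, 1.0]   # illustrative features of the next state
w = semi_gradient_td0(w, x, r=1.0, x_next=x_next, alpha=0.1, gamma=0.9)
print(w)
```

Only the weights of active features change, and each changes in proportion to its feature value, so every state sharing those features has its estimate moved too. With one-hot features the same update reduces to tabular TD(0).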

State Aggregation Methods (Feature Encoding Strategies)

Coarse Coding

Circles/regions in continuous space. Feature x_j(s) = 1 if state s lies inside region j, else 0. One state can fall inside multiple overlapping regions.

Small circles → narrow generalisation
Large circles → broad generalisation
Asymmetric shapes → directional generalisation

Tile Coding

Overlay multiple regular grids (tilings) offset from each other. State is represented by the set of tiles it falls in across all tilings.

Efficient for continuous spaces
Control generalisation by tile width
Binary features — very fast computation
Multiple tilings → fine resolution

Radial Basis Functions (RBF)

Each feature is a Gaussian centred on a prototype state c_j. Smooth, continuous generalisation: the closer s is to c_j, the higher x_j(s).

x_j(s) = exp(−‖s − c_j‖² / (2σ²))
Smooth generalisation
More flexible than binary coding
Sensitive to prototype placement and σ
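
The RBF feature map can be sketched as below, assuming states and prototype centres are tuples of numbers and a single shared width σ. The centres and the query state are made up for illustration.

```python
import math

def rbf_features(s, centres, sigma):
    """Gaussian feature for each prototype centre: exp(-||s - c||^2 / (2 sigma^2))."""
    feats = []
    for c in centres:
        sq_dist = sum((si - ci) ** 2 for si, ci in zip(s, c))
        feats.append(math.exp(-sq_dist / (2 * sigma ** 2)))
    return feats

centres = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0)]   # illustrative prototype states
feats = rbf_features((0.0, 0.0), centres, sigma=1.0)
print(feats)   # largest for the nearest centre, close to 0 for distant ones
```

A larger σ spreads each feature over a wider region, giving broader generalisation at the cost of resolution.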

Kanerva Coding

Prototype-based: features measure Hamming distance to stored prototype states. Designed for large binary/discrete state spaces.

Good for high-dimensional binary inputs
Feature = similarity to prototype
Prototype selection matters greatly

Control with VFA: Extending to Q-Functions

1

Approximate Q-Function

Represent q̂(s,a;w) ≈ qπ(s,a) using a parameterised function. Input: state s (or state-action pair). Output: estimated action value.

2

Policy Improvement

Act greedily or ε-greedily with respect to the estimated Q-values. ε-greedy selection keeps exploring while the policy gradually improves.

3

On vs Off-Policy

On-policy (SARSA): use same policy for data and learning. Off-policy (Q-Learning): learn greedy policy while following exploratory behaviour.

Batch Methods: Least Squares Solutions

Instead of updating w one sample at a time, batch methods find the best fitting w over all experience at once.

LSTD — Least Squares TD

  • Closed-form solution — no step-size α needed
  • Converges directly to the TD fixed point
  • Much more sample-efficient than online TD
  • Extended to multi-step: LSTD(λ)
  • Extended to action values: LSTDQ
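
For reference, the closed-form solution for linear features is usually written as below. This is the standard textbook form rather than something spelled out in these notes; it assumes x(S_t) is the feature vector of the state visited at time t, the sums run over all stored transitions, and A is invertible (in practice a small multiple of the identity is often added to A).

```latex
A = \sum_{t} x(S_t)\,\bigl(x(S_t) - \gamma\, x(S_{t+1})\bigr)^{\top},
\qquad
b = \sum_{t} R_{t+1}\, x(S_t),
\qquad
w_{\mathrm{TD}} = A^{-1} b
```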

LSPI — Least Squares Policy Iteration

  • Interleaves LSTDQ with policy improvement (GPI)
  • Converges faster than incremental methods
  • Practical for moderate-sized problems
  • Evaluates each policy with LSTDQ instead of incremental Q-table updates
  • Combines best of batch learning + policy iteration
Property | Incremental (Online) | Batch Methods (LSTD)
Implementation | Simple | More complex
Data efficiency | Each transition used once | Re-uses all experience
Memory | Constant (no storage) | Must store experience D
Step-size α | Required, sensitive | Not needed (closed-form)
Best for | Streaming data, online learning | Fixed dataset, sample efficiency critical

Deep Reinforcement Learning & DQN

Deep neural networks replace hand-crafted features with learned representations. DQN was the breakthrough that made deep RL practical — by combining Q-learning with two key stability tricks.

Why Neural Networks?

🧮

Universal Approximator

Can represent any continuous function given enough capacity. No need to hand-craft features.

🧱

Distributed Representations

For some functions, a deep network needs exponentially fewer units than a shallow one. Depth buys efficiency.

👀

Learned Features

No hand-crafted features — features emerge from raw inputs (pixels, sensor readings). Learn directly from observations.

Gradient-Based Training

Learnable via stochastic gradient descent and backpropagation. Same machinery used in all of deep learning.

Convolutional Neural Networks (CNNs) in RL

Key architecture for processing spatial inputs (game screens, sensor grids, camera images).

📍 Local Structure

Not fully connected — each neuron sees only a local receptive field. Dramatically reduces parameters. Considers local spatial relationships.

🔁 Weight Sharing

All neurons in a feature map share the same weights. The same filter detects the same feature (e.g., edge) at different locations in the image.

🗺️ Feature Maps

A convolutional filter defines a feature map: all nodes detect the same feature across the input. Multiple filter banks learn diverse features simultaneously.

🗜️ Pooling

Sub-sampling operations compress feature maps, extracting salient information and favouring generalisation to new inputs.

What Transfers from Tabular RL — and What Doesn't

✅ What Transfers

  • TD and MC learning update rules
  • Double learning (Double Q-Learning)
  • Experience replay concept
  • ε-greedy exploration
  • SARSA / Q-learning structure

❌ What Doesn't Transfer Easily

  • Least squares TD/MC (closed-form becomes intractable)
  • Exact convergence guarantees
  • Tabular policy iteration (too many states)

Why Plain Q-Learning + NN is Unstable

Problem 1: Correlated Samples

Sequential transitions are highly correlated, which violates the i.i.d. assumption behind SGD. This can cause oscillation or divergence during training.

t, t+1, t+2 transitions all share similar context

Problem 2: Non-Stationary Targets

As w updates, the TD target r + γ max Q(s',a';w) also shifts. The agent chases a moving target — nothing to anchor learning.

target changes every time w changes

DQN's Two Solutions

🔀 Experience Replay

Store transitions (s,a,r,s') in a replay buffer D. Sample random mini-batches to train. Breaks correlations, improves sample efficiency.

1Collect — Store (s,a,r,s') in buffer D via ε-greedy
2Sample — Random mini-batch from D breaks correlations
3Learn — SGD on sampled batch; minimise L(w)
4Repeat — Each transition replayed many times = sample efficiency
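
A replay buffer along these lines can be sketched with the standard library. This is a minimal illustration: the class name, the extra `done` flag stored with each transition and the fixed capacity are assumptions, not details from the lecture.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions with uniform random sampling."""

    def __init__(self, capacity):
        # A bounded deque silently drops the oldest transition when full.
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal correlations.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

Sampling is only valid once the buffer holds at least `batch_size` transitions, so training loops typically wait for a short warm-up period before the first update.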

🧊 Fixed Q-Targets (Target Network)

Use a separate frozen network w⁻ to compute TD targets. Update w⁻ ← w only every C steps. Stabilises the target during updates.

y = r + γ · maxa' Q(s',a'; w⁻)
L(w) = 𝔼[(y − Q(s,a;w))²]

Between syncs the target network w⁻ is frozen, so the targets y do not shift while w is updated. Every C steps the target network is re-synced (w⁻ ← w).

DQN Architecture

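[Diagram: state s (pixels / sensors) feeds the Q-network with weights w, which outputs one value per action: Q(s,a₁;w), Q(s,a₂;w), Q(s,a₃;w). The TD target y = r + γ max Q(s',a';w⁻) is computed by the target network.]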
DQN Full Algorithm
Initialise replay buffer D, Q-network weights w (random), target network weights w⁻ ← w
For each episode:
    Reset environment, observe initial state s
    For each timestep t:
        With prob ε: choose a random action a (explore)
        Otherwise: a = argmaxa' Q(s,a';w) (exploit)
        Execute a, observe reward r and next state s'
        Store transition (s, a, r, s') in replay buffer D
        Sample a random mini-batch of transitions (si, ai, ri, si') from D
        Compute targets: yi = ri + γ · maxa' Q(si',a';w⁻) [or yi = ri if si' is terminal]
        Perform an SGD step on Σi (yi − Q(si,ai;w))² to update w
        Every C steps: w⁻ ← w (sync target network)
        Set s ← s'
Convergence note: DQN has no formal convergence guarantees because it combines all three elements of the deadly triad (bootstrapping + off-policy learning + function approximation, here nonlinear). In practice it works because replay and the target network dampen instability enough for empirical convergence on many tasks.
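
The target and loss computation at the heart of the algorithm can be sketched without any deep learning library. In this illustration the two networks are stand-ins: `q_online` and `q_target` are assumed to be plain functions that map a state to a list of action values, and each transition carries a `done` flag. The gradient step itself is left out.

```python
def dqn_targets(batch, q_target, gamma):
    """y_i = r_i for terminal transitions, else r_i + gamma * max_a' Q(s_i', a'; w-)."""
    targets = []
    for _s, _a, r, s_next, done in batch:
        if done:
            targets.append(r)
        else:
            targets.append(r + gamma * max(q_target(s_next)))
    return targets


def dqn_loss(batch, q_online, q_target, gamma):
    """Mean squared error between the frozen targets and the online estimates."""
    ys = dqn_targets(batch, q_target, gamma)
    errors = [(y - q_online(s)[a]) ** 2
              for y, (s, a, _r, _s2, _d) in zip(ys, batch)]
    return sum(errors) / len(errors)
```

Because the targets come only from `q_target`, changing the online function between syncs leaves them untouched, which is the stabilising effect of the target network.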

Model-Based Reinforcement Learning

A model Mη approximates the MDP dynamics (transitions + rewards) learned from experience. The agent then plans using the model — gaining sample efficiency by re-using experience without additional real-world interactions.

Three Approaches to Solving MDPs

Dynamic Programming

Assumes a complete, known model. No interaction needed. Requires full knowledge of P(s'|s,a) and R(s,a).

❌ Impractical for large or unknown environments

Model-Free RL

No model required. Interact with the environment directly. SARSA, Q-learning, MC. Works even when dynamics are unknown.

✅ Scales to complex real-world domains

Model-Based RL

Learn a model from experience, then plan using it. Dyna-Q combines both approaches. Model errors can compound over planning horizons.

⚖️ Balances data efficiency with adaptability

What is a Model?

A model Mη is an approximate representation of the MDP. States and actions are the same as the real problem. The dynamics are parameterised by weights η. Learning a model is a supervised learning problem: choose functional form → pick a loss function (e.g. MSE) → find η that minimises empirical loss.

Types of Models

Model Type | What it Predicts | Properties
Expectation Model | Expected next state: 𝔼[st+1|st,at] | Simple, tractable; ignores stochasticity; mostly linear
Stochastic / Generative | Full distribution: s' ~ P(·|s,a) | Captures uncertainty; supports risk-sensitive planning
Full Model | Both P(s'|s,a) and R(s,a) jointly | Most expressive; hardest to learn accurately
Decomposed Dynamics | Separate ηT for transitions, ηR for rewards | Flexible; allows Table Lookup / Linear / NN sub-models

Learning vs Planning

Learning

The environment is initially unknown. The agent interacts with the environment to gather data and learn from real experience. Direct RL updates improve value / policy.

Planning

A model of the environment is given or has been learnt. The agent plans inside the model without further real interaction — simulates experience cheaply.

An agent can learn a model from real experience and then plan within that model, combining the adaptability of learning with the efficiency of planning. This is the Dyna architecture.

Sample-Based Planning

Instead of solving the full MDP analytically, sample experience from the model and treat it as real experience. Any model-free RL algorithm can be applied to simulated data.

MC Control

Simulate full episodes from model. Compute returns. Update Q-values using MC estimates. Works well when model is accurate.

SARSA (On-Policy)

Simulate step-by-step transitions. Apply on-policy TD update. Works with partial episodes — more data-efficient than full MC.

Q-Learning (Off-Policy)

Sample transitions from model. Apply off-policy TD with max operator. Flexible — behaviour and target policies can differ.

Advantages of Sample-Based Planning:

  • Avoids expensive full DP sweeps — no need to enumerate all states
  • Can use any model-free algorithm unchanged on simulated data
  • Naturally handles large state spaces via sampling
  • Computationally flexible — run as many samples as budget allows
  • Combines smoothly with real experience (Dyna-Q architecture)

Handling Model Inaccuracy

Model-based RL is only as good as the estimated model. Performance is bounded by the optimal policy for the approximate MDP, not the true one.

1. Model-Free Fallback

When the model is wrong, fall back to model-free RL using real experience. Auto-detects divergence between model predictions and real outcomes.

2. Bayesian Uncertainty

Reason about model uncertainty using Bayesian methods (distribution over η). Plan under model uncertainty — balances exploration and exploitation at the model level.

3. Combined Approaches

Combine model-based and model-free in a single algorithm (e.g. Dyna-Q). Real experience for model updates + safety; model for planning.

Dyna-Q & Forward Search Planning

Dyna-Q combines real experience (model-free Q-learning) with simulated experience (planning from a learned model). Both happen simultaneously every step — learning and planning in a unified loop.

Dyna-Q Algorithm — Step by Step

1
Real Step
Take action at in the real environment. Observe reward rt+1 and next state st+1.
2
Direct RL Update
Apply Q-learning directly from real experience: Q(s,a) ← Q(s,a) + α[r + γ maxa' Q(s',a') − Q(s,a)]
3
Model Update
Update model Mη with observed transition (st, at, rt+1, st+1). Supervised learning on dynamics.
4
Planning (n steps)
Repeat n times: sample a previously observed (s,a) at random from memory, simulate r and s' from Mη, and apply the Q-learning update to the simulated transition.
Key advantage: More compute → more planning steps per real step → learns faster when real data is expensive, slow, or unsafe (robotics, medical trials, autonomous driving).
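
The four steps can be sketched as tabular Dyna-Q on a toy problem. Everything specific here is an assumption made for illustration: the 5-state chain environment, the table-lookup model (valid because the toy dynamics are deterministic), the hyperparameters and the step cap per episode.

```python
import random

N_STATES, GOAL = 5, 4   # hypothetical chain: states 0..4, state 4 is terminal
ACTIONS = (0, 1)        # 0 = left, 1 = right


def env_step(s, a):
    """Deterministic toy environment: reward 1 on reaching the goal."""
    s_next = max(0, s - 1) if a == 0 else s + 1
    return (1.0 if s_next == GOAL else 0.0), s_next, s_next == GOAL


def epsilon_greedy(Q, s, eps, rng):
    if rng.random() < eps:
        return rng.choice(ACTIONS)
    best = max(Q[(s, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if Q[(s, a)] == best])  # random tie-break


def q_update(Q, s, a, r, s_next, done, alpha, gamma):
    bootstrap = 0.0 if done else max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (r + gamma * bootstrap - Q[(s, a)])


def dyna_q(episodes=50, n_planning=10, alpha=0.5, gamma=0.95, eps=0.2, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    model = {}  # table-lookup model: (s, a) -> (r, s_next, done)
    for _episode in range(episodes):
        s, done, steps = 0, False, 0
        while not done and steps < 200:
            a = epsilon_greedy(Q, s, eps, rng)
            r, s_next, done = env_step(s, a)                   # 1. real step
            q_update(Q, s, a, r, s_next, done, alpha, gamma)   # 2. direct RL update
            model[(s, a)] = (r, s_next, done)                  # 3. model update
            for _plan in range(n_planning):                    # 4. planning
                ps, pa = rng.choice(list(model))
                pr, ps_next, pdone = model[(ps, pa)]
                q_update(Q, ps, pa, pr, ps_next, pdone, alpha, gamma)
            s, steps = s_next, steps + 1
    return Q
```

Raising `n_planning` trades extra computation for fewer real environment steps, which is the advantage highlighted above.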

Integrating Model-Free and Model-Based

Model-Free RL (in Dyna-Q)

  • No model required for this part
  • Learn value function from real experience
  • More robust to model errors
  • Higher variance, less sample efficient
  • Algorithms: Q-learning, SARSA, A3C, PPO

Model-Based Planning (in Dyna-Q)

  • Learns model from real experience
  • Plans value function using simulated experience
  • More sample efficient
  • Requires accurate model — errors compound
  • Real experience is used for both model update and direct RL

Planning for Action Selection

Rather than improving a global value function, an agent can plan specifically to select the next action. The key insight: the distribution of states reachable from the current state differs from the global training distribution.

🎯 Local Accuracy

Focus all planning compute on states that will actually be encountered in the near future. Build a more accurate local value function for nearby states rather than the full state space.

🔍 Handles Inaccurate Models

Inaccuracies in the model may lead to interesting exploration of the local neighbourhood rather than propagating errors into the global value function.

⏱️ Anytime Planning

The agent can use however much compute is available before needing to act. More compute → better action selection — without retraining from scratch.

Forward Search

Select the best action by building a search tree rooted at the current state and looking ahead using the model. No need to solve the whole MDP — just the sub-MDP starting from now.

Examples: Minimax search (chess), Monte Carlo Tree Search (MCTS, AlphaGo), Model Predictive Control (MPC) — all instances of forward search with different evaluation strategies.

Simulation-Based Search & MCTS

MCTS is a simulation-based forward search — it builds a search tree from the current state using a model, focuses compute on the most promising branches, and is the backbone of AlphaGo/AlphaZero.

Simulation-Based Search

How It Works

  • Simulate K episodes starting from current state st
  • Use the model Mη as a simulator
  • Apply model-free RL to simulated episodes
  • Estimate Q(st, a) for each action a
  • Select action with highest estimated Q-value

Why It Works

  • Avoids enumerating all states explicitly
  • Sampling handles large/continuous action spaces
  • Model-free RL algorithms are reused unchanged
  • Can allocate compute where it matters most
  • No need for global value function — just local estimates
Simulation policy flexibility: Can be random, learned, or hand-crafted. Even a random rollout policy gives unbiased MC estimates of its own Q-values under the model, without any learned value function. MCTS improves on this by focusing simulations on promising subtrees.

Monte Carlo Tree Evaluation

Given model Mη and simulation policy π, simulate K episodes from current state st. Evaluate each action by mean return. Select action with maximum value.
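
This procedure can be sketched as follows. The interfaces are assumptions for the example: `model_step(s, a, rng)` is taken to return a sampled `(reward, next_state, done)` triple, `policy(s, rng)` returns an action, and `toy_model` is a made-up one-step simulator used only to exercise the code.

```python
import random


def rollout_return(model_step, policy, s, a, gamma, rng, max_steps=100):
    """Simulate one episode from (s, a) with the model; return its discounted return."""
    g, discount = 0.0, 1.0
    for _ in range(max_steps):
        r, s, done = model_step(s, a, rng)
        g += discount * r
        discount *= gamma
        if done:
            break
        a = policy(s, rng)
    return g


def mc_action_values(model_step, policy, s, actions, k, gamma, rng):
    """Estimate Q(s, a) for each action as the mean of k simulated returns."""
    return {
        a: sum(rollout_return(model_step, policy, s, a, gamma, rng)
               for _ in range(k)) / k
        for a in actions
    }


def toy_model(s, a, rng):
    """Hypothetical simulator: action 1 pays about 1, action 0 about 0, then ends."""
    return a + rng.uniform(-0.5, 0.5), s, True
```

A typical call would be `q = mc_action_values(toy_model, lambda s, rng: 0, 0, [0, 1], 200, 1.0, random.Random(0))` followed by `max(q, key=q.get)` to pick the action.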

📊 Unbiased Estimates

MC returns are unbiased estimates of the simulation policy's value under the model: no bootstrapping errors, just variance. More simulations → lower variance.

📦 Black-Box Models

Only requires the ability to sample from the model — no need to know transition probabilities explicitly. Any simulator counts.

⚖️ Variance vs K

More simulations K → lower variance but higher compute. Trade-off between accuracy and efficiency. Choose K based on compute budget.

🎯 Simulation Policy Matters

A better simulation policy reduces variance and focuses simulations on relevant parts of the state space. Learned rollout policies accelerate convergence.

MCTS: 4-Step Algorithm

1. SELECT

Traverse the existing tree from root to a leaf node. At each step, pick actions according to Q(s,a) — balancing exploration and exploitation within the tree (e.g. UCB).

2. EXPAND

Add one new node to the tree by expanding the leaf state. Choose an untried action and create a child node. Grows the tree incrementally — one node per simulation.

3. ROLLOUT

From the new node, simulate a complete episode using the fixed rollout policy (often random) until termination. Provides an unbiased return estimate.

4. BACKUP

Propagate the return G back up the entire path from new node to root. Update Q(s,a) for all visited state-action pairs. Tree gets better with each simulation.

Two simulation policies: (1) Tree policy — navigates the built tree, improves during search using Q(s,a). (2) Rollout policy — held fixed (often random), used to evaluate new nodes from outside the tree.
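
The SELECT and BACKUP steps can be sketched for a single tree node. This is an illustrative fragment, not a full MCTS: it assumes UCB1 as the tree policy, an exploration constant `c=1.4` chosen arbitrarily, and a per-node dictionary `stats` mapping each action to `(visit_count, mean_return)`.

```python
import math


def ucb_select(stats, c=1.4):
    """Pick the action maximising Q + c * sqrt(ln(N) / n) at one tree node."""
    total = sum(n for n, _q in stats.values())

    def score(a):
        n, q = stats[a]
        if n == 0:
            return float("inf")  # try every action at least once
        return q + c * math.sqrt(math.log(total) / n)

    return max(stats, key=score)


def backup(path, g):
    """Update the running mean of every (node stats, action) pair on the visited path."""
    for stats, a in path:
        n, q = stats[a]
        stats[a] = (n + 1, q + (g - q) / (n + 1))
```

With `c=0` the rule reduces to greedy selection on Q(s,a); larger values push simulations toward rarely tried actions.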

Why MCTS is One of the Most Powerful Planning Algorithms

Best-First Search

Concentrates simulations on the most promising branches. Unlike exhaustive methods, it doesn't waste compute on bad moves. The tree grows asymmetrically toward high-value regions.

Anytime Algorithm

Can be stopped at any point and will return the best action found so far. More compute = better decisions (a valid answer is always available). Ideal for real-time or resource-constrained settings.

Black-Box Compatible

MCTS only requires the ability to sample transitions — it does not need explicit P(s'|s,a). Any simulator counts as a valid model, making MCTS broadly applicable to real-world systems.

Breaks Curse of Dimensionality

Evaluates states dynamically through sampling rather than full DP sweeps. Scales to enormous state spaces (e.g. Go: 10¹⁷⁰ states). Only visited nodes are stored — memory proportional to tree, not state space.

Search Tree + Value Function Approximation: The Power Combination

Approach | Generalisation | Large State Spaces | Example
Model-Free RL (table lookup) | ❌ None | ❌ Cannot store all states | Q-table
Simulation-Based Search (table lookup) | ❌ None | ⚠️ Only reachable states | Basic MCTS
Simulation-Based Search + VFA | ✅ Similar states get similar values | ✅ Compact parameterised estimates | AlphaGo, AlphaZero
AlphaGo / AlphaZero: MCTS + deep neural network value function + policy network. The NN generalises across similar board positions; MCTS focuses planning compute on the most promising lines of play. Together they defeated world-champion Go players, a feat long considered out of reach for AI.

Glossary

Key terms and definitions from the RL course

Quiz

Test your knowledge across all topics · 25 Questions


Cheat Sheet

Key formulas and algorithm summaries for quick reference

Return & Discount

Gt = Σk=0..∞ γᵏ·Rt+k+1
Gt = Rt+1 + γ·Gt+1

γ=0: only immediate reward | γ=1: equal weight to all future rewards
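
The recursive form Gt = Rt+1 + γ·Gt+1 gives a simple backward pass over an episode. A minimal sketch, assuming `rewards[t]` holds Rt+1 (the function name is illustrative):

```python
def discounted_returns(rewards, gamma):
    """Compute G_t for every t by sweeping backward: G_t = R_{t+1} + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns
```

For rewards [1, 1, 1] and γ = 0.5 this gives returns [1.75, 1.5, 1.0].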

MDP Tuple

(S, A, P, R, γ)

S States  |  A Actions  |  P Transitions
R Rewards  |  γ Discount

Value Functions

vπ(s) = 𝔼π[Gt|St=s]
qπ(s,a) = 𝔼π[Gt|St=s,At=a]
vπ(s) = Σaπ(a|s)·qπ(s,a)

Bellman Expectation

vπ(s) = Σaπ(a|s)Σs'P[R+γvπ(s')]

Bellman Optimality

v*(s) = maxaΣs'P[R+γv*(s')]
Q*(s,a) = R+γΣs'P·maxa'Q*(s',a')

Policy Iteration

Eval: Vk+1(s) ← Σaπ Σs'P[R+γVk(s')]

Improve: π'(s) = argmaxaQπ(s,a)

Repeat until stable.

Value Iteration

Vk+1(s) ← maxaΣs'P[R+γVk(s')]

No separate eval step — directly uses optimality equation.
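
A compact sketch of value iteration on a small MDP given as a dictionary. The data layout is an assumption for the example: `P[s][a]` is a list of `(probability, next_state, reward)` triples and a state with no actions is terminal. The sweep updates values in place, a common variant of the synchronous update written above.

```python
def value_iteration(P, gamma, tol=1e-8):
    """Repeat V(s) <- max_a sum_s' p * (r + gamma * V(s')) until values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s, actions in P.items():
            if not actions:          # terminal state keeps value 0
                continue
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in actions.values()
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V


# Made-up two-state MDP: "go" ends the episode with reward 1, "stay" loops for 0.
P = {0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]}, 1: {}}
V = value_iteration(P, gamma=0.9)
```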

Monte Carlo

V(St) ← V(St) + α(Gt−V(St))

Model-free. Episodic only. Zero bias, high variance.

TD(0)

V(St) ← V(St) + α[Rt+1+γV(St+1)−V(St)]
δt = Rt+1+γV(St+1)−V(St) (TD error)

SARSA (On-Policy)

Q(S,A) ← Q(S,A) + α[R+γQ(S',A')−Q(S,A)]

Uses (S,A,R,S',A') — next action from same policy.

Q-Learning (Off-Policy)

Q(S,A) ← Q(S,A) + α[R+γ maxa'Q(S',a')−Q(S,A)]

Uses greedy max — independent of behaviour policy.

Double Q-Learning

Q₁(S,A) ← Q₁(S,A) + α[R + γQ₂(S', argmaxa' Q₁(S',a')) − Q₁(S,A)]
(with probability 0.5 the roles of Q₁ and Q₂ are swapped)

Decouples selection from evaluation — reduces overestimation bias in Q-Learning.
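
A tabular sketch of one Double Q-Learning update, assuming two dictionaries `Q1` and `Q2` keyed by `(state, action)`. The `update_first` argument is a hypothetical convenience for making the coin flip explicit, and terminal-state handling is omitted for brevity.

```python
import random


def double_q_update(Q1, Q2, s, a, r, s_next, actions, alpha, gamma, update_first=None):
    """Update one table; select the next action with it, evaluate with the other."""
    if update_first is None:
        update_first = random.random() < 0.5   # fair coin picks which table learns
    learner, evaluator = (Q1, Q2) if update_first else (Q2, Q1)
    a_star = max(actions, key=lambda b: learner[(s_next, b)])   # selection
    target = r + gamma * evaluator[(s_next, a_star)]            # evaluation
    learner[(s, a)] += alpha * (target - learner[(s, a)])
```

Because the table that picks the maximising action is not the one that scores it, an action that looks good only through noise in one table is unlikely to be overvalued by the other.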

Linear VFA

v̂(s;w) = wT·x(s)
w ← w + α[vtarget−v̂(s,w)]·x(s)

x(s) = feature vector; w = learnable weights. Tabular is a special case.

DQN Loss

L(w) = 𝔼[(r+γ max Q(s',a';w⁻)−Q(s,a;w))²]

w = online network; w⁻ = frozen target network. Replay buffer + target net = stable training.

TD(n) Return

Gt(n) = Rt+1 + γRt+2 + ... + γⁿ⁻¹Rt+n + γⁿV(St+n)

TD(0)=1 step; TD(∞)=Monte Carlo. n controls bias-variance trade-off.
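
The n-step return can be computed directly from its definition. A minimal sketch, assuming `rewards` holds Rt+1 through Rt+n and `v_bootstrap` is the current estimate V(St+n):

```python
def n_step_return(rewards, v_bootstrap, gamma, n):
    """G_t^(n) = sum_{k=0}^{n-1} gamma^k * R_{t+k+1} + gamma^n * V(S_{t+n})."""
    g = sum((gamma ** k) * rewards[k] for k in range(n))
    return g + (gamma ** n) * v_bootstrap
```

With n = 1 this is the TD(0) target; as n grows toward the episode length, the bootstrap term shrinks and the value approaches the Monte Carlo return.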

Dyna-Q

Real step: Q-update from real (s,a,r,s')

Model update: Mη ← (s,a,r,s')

Planning: n×Q-update from Mη samples

MCTS Steps

1. Select — traverse tree by Q(s,a)

2. Expand — add new leaf node

3. Rollout — simulate to terminal

4. Backup — propagate return G up

Algorithm Selection Guide

Do you have a model?
YES → Dynamic Programming (Policy/Value Iteration)
NO → Model-Free Methods ↓
Episodic only?
YES → Monte Carlo
NO → TD Methods
Continuous actions?
YES → Actor-Critic (DDPG, SAC, PPO)
NO → Q-Learning / SARSA / DQN

📖 Deep Dives

Book-derived scientific depth — theory, proofs, and formal frameworks from Deep Reinforcement Learning with Python (Sanghi)

Convergence Properties of RL Algorithms

Not all RL algorithms are guaranteed to converge. The convergence behaviour depends on three key factors: whether bootstrapping is used, whether learning is on-policy or off-policy, and whether the function approximator is linear or nonlinear. This table (adapted from Sutton & Barto and Sanghi Chapter 5) summarises the landscape:

Algorithm | Bootstrapping | On/Off Policy | Tabular | Linear FA | Nonlinear FA
Monte Carlo | No | On-Policy | ✓ Yes | ✓ Yes | ✓ Yes
TD(0) | Yes | On-Policy | ✓ Yes | ✓ Yes | ⚠ No
TD(0) | Yes | Off-Policy | ✓ Yes | ⚠ No | ✗ No
Q-Learning | Yes | Off-Policy | ✓ Yes | ⚠ No | ✗ No
SARSA | Yes | On-Policy | ✓ Yes | ✓ Yes | ⚠ No
DQN | Yes | Off-Policy | (n/a) | (n/a) | ⚠ Empirical
⚠️

The Deadly Triad

Convergence failures occur when all three of these combine simultaneously:

  • Bootstrapping — using estimates to update estimates
  • Off-policy — behaviour policy ≠ target policy
  • Function approximation — generalising across states

Each individually is fine. Two together is often fine. All three → potential divergence.

Why Monte Carlo Always Converges

MC uses no bootstrapping — targets are actual returns Gt, not estimated values. This means:

  • The update target is stationary (doesn't move as weights change)
  • The update is a true gradient of the MSE loss
  • Stochastic gradient descent convergence theorems apply

The cost: high variance, no online learning, must wait for episode end.

📐

Linear FA + On-Policy TD

This combination converges to the TD fixed point — a point near (but not at) the true MSE minimum. The error bound is:

‖v̂wTD − vπ‖²μ ≤ [1 / (1−γ)] · minw ‖v̂w − vπ‖²μ, where wTD is the TD fixed point

The TD fixed point lies within a bounded factor of the best linear approximation possible.

Key Takeaway: DQN "works" empirically by taming the deadly triad with two tricks: experience replay (decorrelates the training samples) and a target network (slows the moving target). Neither eliminates the triad; they dampen its instability enough for practical learning.

Semi-Gradient Methods: The Formal Justification

When we use bootstrapped targets with function approximation, we face a fundamental problem: the update target itself depends on the weights w, so the standard TD update is not a true gradient step.

The True Gradient Objective

We want to minimise the Mean Squared Value Error:

J(w) = Σs μ(s) · [vπ(s) − v̂(s, w)]²

Taking the true gradient:

∇wJ(w) = −2 Σs μ(s) · [vπ(s) − v̂(s, w)] · ∇wv̂(s, w)

The Problem with TD Targets

In TD(0), we replace vπ(s) with the bootstrap target:

Ut = Rt+1 + γ · v̂(St+1, w)

Notice: the target Ut depends on w! So the true gradient would be:

∇w[Ut − v̂(St, w)]² = −2[Ut − v̂(St, w)] · [∇wv̂(St, w) − γ · ∇wv̂(St+1, w)]

This requires differentiating through both v̂(St, w) and v̂(St+1, w).

🎯 The Semi-Gradient Approximation

Instead, we ignore the derivative of the next-state value. We treat Ut as if it were a fixed target that doesn't depend on w:

w ← w + α · [Ut − v̂(St, w)] · ∇wv̂(St, w)

Only the current state's gradient is computed. The next-state term is dropped.

Why Drop the Next-State Gradient?

  • The target Ut is used as a "label" — differentiating through it would create a circular dependency
  • This mirrors how supervised learning treats targets as fixed
  • For linear FA on-policy, the resulting update still converges (to the TD fixed point)
  • This is what deep learning frameworks do in practice: in PyTorch, calling .detach() on the target tensor (or computing it under torch.no_grad()) stops gradient flow through the target network
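
For linear function approximation, where ∇wv̂(s, w) = x(s), the difference between the two updates fits in a few lines. This is a sketch with illustrative names; the constant factor 2 from the squared error is absorbed into α.

```python
def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def semi_gradient_step(w, x, r, x_next, alpha, gamma):
    """Treat the target as fixed: only the current state's gradient x is used."""
    delta = r + gamma * dot(w, x_next) - dot(w, x)
    return [wj + alpha * delta * xj for wj, xj in zip(w, x)]


def residual_gradient_step(w, x, r, x_next, alpha, gamma):
    """Differentiate through the target too: direction is x - gamma * x_next."""
    delta = r + gamma * dot(w, x_next) - dot(w, x)
    return [wj + alpha * delta * (xj - gamma * xnj)
            for wj, xj, xnj in zip(w, x, x_next)]
```

Starting from w = [0, 0] with x = [1, 0], x_next = [0, 1], r = 1, α = 0.5 and γ = 0.5, the semi-gradient step changes only the first weight, while the full-gradient step also pushes the next state's weight down.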
📊

Monte Carlo = True Gradient

MC is the only method that performs true gradient descent on J(w), because the return Gt does not depend on w:

w ← w + α · [Gt − v̂(St, w)] · ∇wv̂(St, w)

Here Gt is a fixed observed return. No approximation is made — this is true SGD. That's why MC converges everywhere.

Semi-Gradient TD vs Full-Gradient (Residual Gradient)

Property | Semi-Gradient TD | Full-Gradient (Residual)
Gradient computation | ∇v̂(St, w) only | ∇v̂(St, w) − γ·∇v̂(St+1, w)
Convergence (linear, on-policy) | ✓ Yes (TD fixed point) | ✓ Yes (minimum of the squared Bellman-residual objective, not of J(w))
Convergence (nonlinear) | ✗ Not guaranteed | ⚠ Slow, unstable
Computational cost | Gradient at the current state only | Gradients at both the current and the next state
Used in practice | ✓ Almost always (DQN, etc.) | ✗ Rarely

Backup Diagrams: Formal Theory

Backup diagrams are a compact visual language for describing how value information flows in different RL algorithms. Each diagram encodes exactly which transitions are considered and how deep the lookahead goes.

Dynamic Programming
[Backup diagram: root state S branches to every action (a₁, a₂); each action branches to every possible next state s'.]
V(s) = Σa π(a|s) Σs' p(s'|s,a)[r + γV(s')]
  • Full-width: considers ALL actions
  • Full-width: considers ALL next states
  • Depth 1: only one step lookahead
  • Uses model (transition probabilities)
  • Bootstraps from V(s')
Monte Carlo
[Backup diagram: a single sampled path from St through successive states s' down to the terminal state T.]
V(s) ← Gt = Rt+1 + γRt+2 + … + γᵀ⁻ᵗ⁻¹RT
  • Sample-width: only ONE sampled trajectory
  • Full depth: all the way to episode end
  • No bootstrapping — actual return used
  • No model needed (model-free)
  • High variance, zero bias
TD(0)
[Backup diagram: one sampled transition from St to St+1.]
V(St) ← Rt+1 + γ·V(St+1)
  • Sample-width: ONE sampled transition
  • Depth 1: stops at next state
  • Bootstraps from V(St+1)
  • Low variance, some bias
  • Online learning possible
TD(n)
[Backup diagram: a single sampled path of n steps from St to St+n.]
Gt(n) = Rt+1 + … + γn−1Rt+n + γnV(St+n)
  • Sample-width: ONE trajectory
  • Depth n: n-step lookahead
  • Bootstraps at step n
  • Interpolates MC↔TD(0) via n
  • n=1: TD(0); n=∞: MC
The Spectrum: All RL algorithms can be understood as operating along two axes — depth (1-step to full episode) and width (full-width expectation vs single-sample). DP is shallow+wide, MC is deep+narrow, TD is shallow+narrow. TD(n) moves along the depth axis. Planning algorithms like Dyna-Q mix both.

Off-Policy Monte Carlo Control

Off-policy learning separates the policy used to generate experience (behaviour policy b) from the policy being optimised (target policy π). This enables learning about the optimal policy while still exploring.

🎭

Two Policies

Property | Behaviour Policy b | Target Policy π
Role | Generates episodes | Being improved
Must cover π? | Yes (coverage condition) | (not applicable)
Typical form | ε-soft (explores) | Greedy (deterministic)
Used at test time? | No | Yes
📋

Coverage Condition

For off-policy MC to work, the behaviour policy must cover the target policy:

π(a|s) > 0 ⟹ b(a|s) > 0

Every action the target might take must have non-zero probability under the behaviour policy. Otherwise returns from some trajectories would be missing from the estimate.

Importance Sampling

Episodes are generated under b, but we want to evaluate π. Returns Gt from b must be reweighted by the probability ratio:

ρt:T = ∏k=t..T−1 π(Ak|Sk) / b(Ak|Sk)

This ratio ρ is the importance sampling ratio — how much more (or less) likely the trajectory was under π compared to b.

⚖️

Ordinary IS

V(s) = [Σt∈T(s) ρt:T · Gt] / |T(s)|

Pros: Unbiased estimate of Vπ(s)

Cons: High variance (ρ can be very large or very small)

⚖️

Weighted IS

V(s) = [Σt∈T(s) ρt:T · Gt] / [Σt∈T(s) ρt:T]

Pros: Much lower variance; consistent estimator

Cons: Biased (but bias → 0 as samples → ∞)

Preferred in practice — variance reduction is usually worth the small bias.
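
The two estimators differ only in their denominator, as this small sketch shows. It assumes each sample is a `(rho, G)` pair for one visit to the state, and that the ratio is built from per-step action probabilities; all names are illustrative.

```python
def importance_ratio(pi_probs, b_probs):
    """Product over the trajectory of pi(A_k|S_k) / b(A_k|S_k)."""
    rho = 1.0
    for p, q in zip(pi_probs, b_probs):
        rho *= p / q
    return rho


def ordinary_is(samples):
    """Divide by the number of returns: unbiased, but can have high variance."""
    return sum(rho * g for rho, g in samples) / len(samples)


def weighted_is(samples):
    """Divide by the sum of ratios: biased, but with much lower variance."""
    total = sum(rho for rho, _g in samples)
    return sum(rho * g for rho, g in samples) / total if total > 0 else 0.0
```

For the samples [(3.0, 1.0), (1.0, 5.0)] the ordinary estimate is 4.0 while the weighted estimate is 2.0. The weighted estimate is always a weighted average of observed returns, so it cannot leave their range.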

Why Q-Learning is Simpler: Q-Learning achieves off-policy learning without importance sampling by using a one-step bootstrap target: maxa Q(S', a). This directly "imagines" the greedy action regardless of what the behaviour policy did — but at the cost of introducing the deadly triad risk with function approximation.

Incremental Off-Policy MC Update (Weighted IS)

Using an incremental weighted average (avoids storing all returns):

Cn+1 = Cn + Wn
Qn+1(s,a) = Qn(s,a) + (Wn / Cn+1) · [Gn − Qn(s,a)]
Wn+1 = Wn · π(At|St) / b(At|St)

Early exit: if the behaviour action diverges from the greedy target action, W → 0 and the episode no longer contributes. This stops learning on trajectories irrelevant to π.
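
One episode of this incremental update, including the early exit, can be sketched as below. The interfaces are assumptions for the example: `episode` is a list of `(state, action, reward)` triples, `target_policy(s)` returns the greedy action (held fixed here rather than re-derived from Q), and `b_prob(a, s)` returns b(a|s).

```python
def off_policy_mc_update(Q, C, episode, target_policy, b_prob, gamma):
    """Process one episode backward with weighted importance sampling."""
    g, w = 0.0, 1.0
    for s, a, r in reversed(episode):
        g = gamma * g + r
        C[(s, a)] = C.get((s, a), 0.0) + w          # cumulative weight
        q = Q.get((s, a), 0.0)
        Q[(s, a)] = q + (w / C[(s, a)]) * (g - q)   # weighted running average
        if a != target_policy(s):
            break                                   # pi(a|s) = 0, so W would become 0
        w *= 1.0 / b_prob(a, s)                     # pi(a|s) = 1 for the greedy action
```

The loop runs backward from the end of the episode, so the early exit discards only the earlier part of a trajectory, where the behaviour policy had already deviated from the target policy.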

Algorithm Landscape

The RL algorithm space can be mapped along multiple axes. Understanding where each algorithm sits helps you pick the right tool for a given problem.

Axis 1: Model-Free vs Model-Based

Model-Free
  • No environment model
  • Learn directly from experience
  • Q-Learning, SARSA, DQN, PPO
  • More data hungry
  • Simpler to implement
←→
Model-Based
  • Learn/use a model of dynamics
  • Can plan with simulated rollouts
  • Dyna-Q, AlphaZero, World Models
  • More sample efficient
  • Model errors can compound

Axis 2: Value-Based vs Policy-Based

Value-Based
  • Learn V(s) or Q(s,a)
  • Policy is implicit (greedy)
  • Q-Learning, DQN, SARSA
  • Works best with discrete actions
←→
Policy-Based
  • Directly optimise π(a|s;θ)
  • Pure policy-gradient methods need no value function
  • REINFORCE (pure); PPO, SAC, DDPG (with a learned critic, see Actor-Critic)
  • Works with continuous actions
Actor-Critic (Both) — learns both V(s) and π(a|s;θ)

Axis 3: On-Policy vs Off-Policy

On-Policy
  • SARSA, Monte Carlo, PPO
  • Must use current policy's data
  • More stable learning
  • Less sample efficient
←→
Off-Policy
  • Q-Learning, DQN, SAC, DDPG
  • Can reuse old experience (replay buffer)
  • Deadly triad risk
  • More sample efficient

Axis 4: Bootstrapping Depth

TD(0)
1-step
TD(n)
n-step
Monte Carlo
∞-step

TD(λ) provides an elegant weighted combination of all n-step returns simultaneously using eligibility traces.

Algorithm Map

Category | Discrete Actions | Continuous Actions
Model-Free, Value-Based | Q-Learning, SARSA, DQN, Double DQN, Dueling DQN | (not directly applicable)
Model-Free, Policy-Based | REINFORCE, PPO (discrete) | REINFORCE, PPO, SAC, DDPG, TD3
Model-Free, Actor-Critic | A3C, PPO (discrete) | A3C, PPO, SAC, DDPG
Model-Based | Dyna-Q, MCTS (AlphaGo/Zero) | World Models, Dreamer, MuZero

Generalised Policy Iteration with Function Approximation

Generalised Policy Iteration (GPI) is the universal framework underlying almost all RL algorithms. It describes the interplay between policy evaluation and policy improvement.

📊

Policy Evaluation

Given π, estimate Vπ(s) or Qπ(s,a)

Methods: MC returns, TD(0), TD(n), LSTD
[Diagram: improvement makes π greedy w.r.t. V; evaluation then evaluates the improved π, and the cycle repeats.]
🎯

Policy Improvement

Given Q(s,a), derive a better policy

Options: Greedy, ε-greedy, Softmax, UCB

GPI + Function Approximation

With tabular methods, GPI is well-understood and converges to the optimal policy. With function approximation, things become more complex:

🔄

What Changes with VFA

  • Evaluation is approximate: V̂(s, w) ≠ Vπ(s) exactly
  • Improvement causes distribution shift: changing π changes which states are visited (μ changes)
  • Generalisation cuts both ways: updating one state affects others via shared weights
  • No guarantee of convergence to exact optimum — GPI finds a near-optimal policy
📐

Tabular Lookup as Special Case of Linear FA

Tabular RL is mathematically a special case of linear function approximation where the feature vector is a one-hot encoding:

x(s) = es = [0, …, 1, …, 0]
V̂(s, w) = x(s)ᵀw = ws

Each weight ws is exactly the value of state s. No generalisation occurs — each state is independent. This proves tabular methods are a special case of the general VFA framework.
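
The equivalence can be checked numerically: a linear TD(0) update with one-hot features changes exactly one weight, by exactly the tabular amount. A small sketch with illustrative names and made-up numbers:

```python
def one_hot(s, n_states):
    """Feature vector with a single 1 at the index of state s."""
    return [1.0 if i == s else 0.0 for i in range(n_states)]


def linear_td0_update(w, x, r, x_next, alpha, gamma):
    """Semi-gradient TD(0) for a linear value function."""
    v = sum(wj * xj for wj, xj in zip(w, x))
    v_next = sum(wj * xj for wj, xj in zip(w, x_next))
    delta = r + gamma * v_next - v
    return [wj + alpha * delta * xj for wj, xj in zip(w, x)]


def tabular_td0_update(V, s, r, s_next, alpha, gamma):
    """Classic table update V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]."""
    V = list(V)
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return V


values = [0.0, 2.0, 0.0]
linear = linear_td0_update(values, one_hot(0, 3), 1.0, one_hot(1, 3), 0.5, 0.5)
tabular = tabular_td0_update(values, 0, 1.0, 1, 0.5, 0.5)
```

Both calls return [1.0, 2.0, 0.0]: only the entry for the visited state moves, which is the absence of generalisation described above.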

Semi-Gradient SARSA (GPI in Practice)

w ← w + α[R + γQ̂(S', A', w) − Q̂(S, A, w)] · ∇wQ̂(S, A, w)

This is GPI in action: the TD target serves as the "evaluation" signal, while the ε-greedy policy derived from Q̂ is the "improvement" step. Both happen simultaneously every timestep.

🧠

DQN as GPI

DQN implements GPI at scale:

  • Evaluation: minimise loss (Q̂(S,A,w) − y)² where y = R + γ·maxaQ̂(S', a, w⁻)
  • Improvement: ε-greedy w.r.t. Q̂(·, ·, w)
  • Target network w⁻: stabilises the evaluation target
  • Replay buffer: decorrelates samples for stable SGD
The Big Picture: Every major RL algorithm — DP, MC, TD, Q-Learning, SARSA, DQN, PPO, SAC — is an instance of GPI. They differ only in how they do policy evaluation (depth, width, bootstrapping) and how they represent the policy (tabular, linear FA, neural net). Understanding GPI means understanding all of RL at once.