NOVA IMS · 2nd Semester 2026

Reinforcement Learning

Your complete interactive study guide covering all lectures, MDPs, Dynamic Programming, Monte Carlo, Temporal Difference Learning, and Deep RL.

Intro to RL

Core concepts, four pillars, reward hypothesis, agent components and categories.

Lecture 1

MDPs & Bellman

Markov Decision Processes, value functions, Bellman equations and optimality.

Lecture 2

DP & Monte Carlo

Policy iteration, value iteration, and model-free MC prediction and control.

Lecture 3

Model-Free Methods

TD Learning, TD(n) multi-step, SARSA, GLIE convergence, MC vs TD trade-offs.

Lecture 4

Model-Free Control

Q-Learning, Double Q-Learning, overestimation bias, value function approximation.

Lecture 5

VFA & Planning

Gradient descent VFA, DQN full algorithm, Model-Based RL, Dyna-Q, MCTS.

Lecture 6

Policy-Based RL

Policy gradients, score function, Actor-Critic, PPO, SAC, TD3 and modern deep RL algorithms.

Lecture 7

Quiz

25 questions across all topics. Test your knowledge and prepare for the exam.

Practice

Introduction to Reinforcement Learning

Lecture 1 · Inês Castelhano · April 2026

What is Reinforcement Learning?

        RL is a branch of machine learning where an agent learns to make decisions by interacting with an environment. It receives a numerical reward signal and gradually improves its behaviour through trial and error.
      

RL vs Other Machine Learning Paradigms

	Supervised Learning	Unsupervised Learning	Reinforcement Learning
Data	Labelled examples	Unlabelled data	Experience / interaction
Feedback	Immediate, per sample	None (structure)	Delayed reward signal
Goal	Predict / classify	Find patterns	Maximise cumulative reward
Time	i.i.d. data	i.i.d. data	Sequential, non i.i.d.

The Agent–Environment Interaction Loop

[Diagram: the Agent sends action A_t to the Environment; the Environment returns the next state S_t+1 and the reward R_t+1 to the Agent.]

1. Observe state S_t

2. Select action A_t using policy π

3. Receive reward R_t+1

4. Transition to new state S_t+1

Reward Hypothesis

        Reward Hypothesis: All goals can be described by the maximisation of expected cumulative reward.
      

Reward R_t

Immediate feedback at each time step. Scalar signal indicating how well the agent is doing right now.

Return G_t

Total discounted future reward from time t. The agent's goal is to maximise G_t.

G_t = R_t+1 + γR_t+2 + γ²R_t+3 + ...

Discount Factor γ (Gamma)

γ ∈ [0, 1] controls how much future rewards are valued relative to immediate ones.
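The return formula above can be checked numerically by working backwards through a reward sequence. This is a minimal sketch; the function name `discounted_return` and the sample rewards are illustrative assumptions, not part of the lecture material.

```python
def discounted_return(rewards, gamma):
    """Compute G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
    for a finite list of rewards [R_{t+1}, R_{t+2}, ...]."""
    g = 0.0
    # Work backwards so each step applies G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical rewards: G = 1 + 0.9*0 + 0.81*2, roughly 2.62.
example_return = discounted_return([1.0, 0.0, 2.0], gamma=0.9)

# With gamma = 0 only the immediate reward counts.
myopic_return = discounted_return([1.0, 0.0, 2.0], gamma=0.0)
```

With γ close to 1 the later rewards contribute almost fully; with γ = 0 the agent is completely short-sighted.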

The Four Pillars of RL

1. Optimisation

Find an optimal way to make decisions, yielding best or very good outcomes.

Restaurant example: Maximise total dinner quality over your stay.

2. Delayed Consequences

Decisions made now have long-term impact. A choice that pays off early can hurt later outcomes, so credit (or blame) must be assigned to past actions.

Restaurant example: Trying a bad place today may lead to a great discovery later.

3. Exploration

The agent learns about the world by making decisions. It only observes the reward for the decision it actually made, never for the alternatives, and this shapes what it can learn.

Restaurant example: Trying new places may reveal hidden gems or disappointments.

4. Generalisation

Policy maps experience to action. The agent must generalise from seen situations to unseen ones.

Restaurant example: If you like Italian pizzerias, try similar restaurants.

What Makes RL Different?

No supervisor — only a reward signal

Feedback is delayed, not instantaneous

Time matters — sequential, non i.i.d. data

Humans can learn without examples of optimal behaviour

Core Agent Components

Policy π

The agent's core behaviour function — maps states to actions.

Deterministic: π(s) = a — always take action a in state s

Stochastic: π(a|s) = P(A=a|S=s) — probability distribution over actions

Value Function V(s)

Prediction of future reward from a given state. Used to evaluate states.

v_π(s) = 𝔼_π[G_t | S_t = s]

A state with low immediate reward might have high value if it reliably leads to high-reward states.

Model

Agent's internal representation of the environment.

Transition Model P: Predicts next state given s and a

Reward Model R: Predicts immediate reward for taking a in s

⚠️ A model does not immediately provide a good policy — planning is still needed.

Agent State & History

History H_t: Full sequence of O_t, A_t, R_t — too large to use directly

Agent State S_t: Compact summary of history: S_t = u(H_t)

Environment State: Usually invisible to the agent

Prediction vs. Control

Prediction

Evaluate the future for a given policy. Estimate the value function to measure how much reward the agent will accumulate.

Estimate v_π

Control

Optimise the future — find the best policy that maximises expected cumulative reward.

Find π*

        These problems are strongly related. Many RL algorithms alternate between prediction (evaluate policy) and control (improve policy) — known as Generalised Policy Iteration (GPI).
      

Learning vs. Planning

Learning

Environment is initially unknown. Agent interacts to discover optimal behaviour.

Robot exploring a new building

Planning

A model of the environment is given or learned. Agent plans inside the model without further real interaction.

AlphaGo searching future game states

Observability, State & Policy Types

Environment State vs. Agent State

🌍 Environment State

The environment's internal representation — everything about the world. Usually invisible to the agent. Even when visible, it may contain large amounts of irrelevant information.

e.g. The full physics simulation state in a robotics environment — position, velocity, internal forces of every object

🤖 Agent State

The agent's own representation of what it needs to make decisions. A function of the history H_t — compact summary of all past observations, actions, and rewards.

S_t^a = f(H_t)

e.g. The last 4 game frames stacked together as input to DQN

History: The Full Record

H_t = O₁, A₁, R₁, O₂, A₂, R₂, ..., O_t, A_t, R_t

Contains all past information. Problem: grows unboundedly with time — too large to use directly. The agent state is a compact function of this history.

Fully Observable vs. Partially Observable

👁️ Fully Observable (MDP)

Agent directly observes the environment state: S_t^a = S_t^e. The agent has complete information. Formally modelled as a Markov Decision Process (MDP).

Optimal deterministic policy always exists
Standard RL algorithms apply directly

Chess (both players see the full board), Frozen Lake, CartPole

🙈 Partially Observable (POMDP)

Agent only receives partial observations O_t ≠ S_t^e. Must maintain a belief state (probability distribution over possible environment states). Formally a Partially Observable MDP (POMDP).

Optimal policy may be stochastic
Requires memory (RNN, belief state)

Poker (hidden cards), autonomous driving (occluded objects), most real-world robotics

Policy Types

Deterministic Policy

π(s) = a

Maps each state to a single specific action. Simple and easy to analyse. Optimal in fully observable MDPs.

Used in: DQN (greedy argmax), TD3, DDPG
Pros: predictable, no variance in action selection
Cons: requires separate exploration mechanism

Stochastic Policy

π(a|s) = P[A_t=a|S_t=s]

Maps each state to a probability distribution over actions. Necessary for POMDPs and game-theoretic settings.

Used in: PPO, SAC, REINFORCE, A3C
Pros: built-in exploration, handles partial observability
Cons: harder to analyse and optimise

Exploration vs. Exploitation

Exploration

Try new actions to discover potentially better strategies. Find more information about the environment.

Exploitation

Use what you already know to collect high rewards right now. Exploit known information to maximise reward.

        Dilemma: A purely exploitative agent gets stuck in local optima (never discovers better strategies). A purely exploratory agent collects experience but never leverages it to improve. Effective RL must balance both.
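One common way to balance the two is ε-greedy action selection: explore with probability ε, otherwise exploit. The sketch below is illustrative only; the function name, the tie-breaking rule and the sample values are assumptions.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of estimated action values.

    With probability epsilon choose uniformly at random (explore);
    otherwise choose the highest-valued action (exploit).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    best = max(q_values)
    # Ties are broken by taking the lowest index (an arbitrary choice).
    return q_values.index(best)


# Hypothetical value estimates for three actions.
action = epsilon_greedy([0.1, 0.5, 0.2], epsilon=0.3)
```

Setting ε = 0 gives the purely exploitative agent described above, and ε = 1 gives the purely exploratory one.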
      

Agent Categories

By Internal Components

Category	Policy	Value Function	Description
Value-Based	Implicit	✅ Yes	Learn value function; derive policy greedily (e.g. Q-Learning)
Policy-Based	✅ Yes	❌ No	Learn policy directly; no explicit value function (e.g. REINFORCE)
Actor-Critic	✅ Yes	✅ Yes	Learn both — actor updates policy, critic evaluates it

By Use of a Model

Model-Free

No explicit model of the environment. Learns from direct interaction. Examples: Q-Learning, SARSA, Policy Gradient.

Model-Based

Builds or uses a model of the environment. Can plan ahead. Examples: Dynamic Programming, AlphaGo (MCTS).

Markov Decision Processes

Lecture 2 · Tabular Value-Based RL

The Markov Property

Markov Property: "The future is independent of the past, given the present." A state S_t is Markov if and only if:

P(S_t+1 | S_t) = P(S_t+1 | S_1, ..., S_t)

Once S_t is known, the history H_t is no longer needed. The current state is a sufficient statistic of the future.

Full Observability (MDP)

Agent sees the full environment state. Observation = Environment State = Agent State.

Partial Observability (POMDP)

Agent cannot directly observe environment state. Must construct a Markov state from history.

MDP: The Formal Definition

A finite MDP is defined by a 5-tuple: (S, A, P, R, γ)

S

State Space — all possible states the agent can be in

A

Action Space — all possible actions the agent can take

P

Transition Probability — P(s' | s, a): probability of moving to s' from s with action a

R

Reward Function — R(s, a): expected immediate reward for taking action a in state s

γ

Discount Factor — γ ∈ [0,1]: how much future rewards are valued

Delivery Driver Mapping

MDP Component	Driver Problem
S — States	Location, active deliveries, time of day
A — Actions	Route choices, next destination
P — Transitions	Traffic uncertainty (same route → different outcomes)
R — Reward	Earnings from delivery minus delay penalty
γ — Discount	How much current delivery matters vs future opportunities

Episodic vs Continuing Tasks

Episodic Tasks

Break into independent episodes. Each episode has a terminal state. Return is a finite sum.

Driver shift ends at end of day

Continuing Tasks

No natural endpoint. The return is an infinite sum, so a discount factor γ < 1 is needed to keep it finite (for bounded rewards).

Driver works continuously without stopping

Value Functions

State Value Function V_π(s)

How good is it to be in state s following policy π?

v_π(s) = 𝔼_π[G_t | S_t = s]

= Expected return starting from state s, following π forever.

Action Value Function Q_π(s,a)

How good is it to take action a in state s following policy π?

q_π(s,a) = 𝔼_π[G_t | S_t = s, A_t = a]

= Expected return starting from s, taking a, then following π.

Relationship between V and Q: v_π(s) = Σ_a π(a|s) · q_π(s, a)

State value = expected action value under policy π

Optimal Value Functions

v*(s)

v*(s) = max_π v_π(s)

Maximum value achievable from state s, over all possible policies.

Q*(s,a)

Q*(s,a) = max_π q_π(s,a)

Maximum value of taking action a in state s, over all policies.

Bellman Equations

        Key Insight: The value of a state today equals the immediate reward plus the discounted value of what follows. This recursive structure is the foundation of all tabular RL algorithms.
      

Bellman Expectation Equations

For V_π

v_π(s) = Σ_aπ(a|s) Σ_s'P(s'|s,a)[R(s,a) + γv_π(s')]

For Q_π

q_π(s,a) = R(s,a) + γ Σ_s'P(s'|s,a) Σ_a'π(a'|s')q_π(s',a')

Bellman Optimality Equations

Replace policy expectation with max — gives us the optimal value functions.

Optimal V*

v*(s) = max_a Σ_s'P(s'|s,a)[R(s,a) + γv*(s')]

Optimal Q*

Q*(s,a) = R(s,a) + γ Σ_s'P(s'|s,a) max_a'Q*(s',a')

Breaking Down the Bellman Equation

v(s) = R(s,a) + γ · v(s')

Current value = Immediate reward + Discount × Future value

Frozen Lake — MDP Example

A classic grid-world problem: navigate from Start (S) to Goal (G) without falling into Holes (H).

S: Start
F: Frozen (safe)
H: Hole (terminal)
G: Goal

MDP Components

States: 16 grid cells (4×4)

Actions: Left, Right, Up, Down

Reward: +1 at Goal, 0 elsewhere

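Transitions: Slippery, hence stochastic. The agent moves in the intended direction with probability 1/3 and slips to each of the two perpendicular directions with probability 1/3.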

Dynamic Programming & Monte Carlo

Lecture 3 · Planning by Dynamic Programming

Dynamic Programming

        DP solves complex problems by breaking them into simple sub-problems, computing and storing solutions, and reusing them when the same sub-problem occurs.
        
Key assumption: Full knowledge of the environment model (P and R).

Optimal Substructure

The optimal solution can be composed from optimal solutions to sub-problems.

Overlapping Subproblems

Sub-problems recur many times — their cached solutions can be reused.

        Why MDPs fit DP: Bellman equations have recursive structure → value functions store sub-solutions.
      

DP Algorithm Summary

Algorithm	Bellman Equation Used	Problem Type
Iterative Policy Evaluation	Expectation Equations	Prediction
Policy Iteration	Expectation + Greedy Improvement	Control
Value Iteration	Optimality Equations	Control

Policy Iteration

1. Initialise: arbitrary policy π and value function V
2. Policy Evaluation (E): iterate the Bellman expectation equation until V_π converges
3. Policy Improvement (I): greedy update π'(s) = argmax_a Q_π(s,a)
4. Convergence check: if π' = π, stop; otherwise go back to Policy Evaluation

        Generalised Policy Iteration (GPI): The general idea of letting policy evaluation and improvement interact — the backbone of virtually all RL algorithms.
      

Policy Evaluation Formula

V_k+1(s) ← Σ_aπ(a|s) Σ_s'P(s'|s,a)[R(s,a) + γV_k(s')]

At each iteration, every state's value is updated using the current estimates of neighbouring states. Repeating this until convergence gives V_π.
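The evaluation sweep above can be sketched on a tiny MDP. Everything concrete here is an assumption made for illustration: the 3-state chain (state 2 is absorbing), the dictionary layout `P[s][a] = [(prob, next_state), ...]`, and the uniform random policy.

```python
# Hypothetical 3-state chain: actions 0 = left, 1 = right; state 2 is absorbing.
P = {
    0: {0: [(1.0, 0)], 1: [(1.0, 1)]},
    1: {0: [(1.0, 0)], 1: [(1.0, 2)]},
    2: {0: [(1.0, 2)], 1: [(1.0, 2)]},
}
# R[s][a]: expected immediate reward; only stepping right from state 1 pays +1.
R = {0: {0: 0.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}, 2: {0: 0.0, 1: 0.0}}
# Uniform random policy: policy[s][a] = pi(a|s).
policy = {s: {0: 0.5, 1: 0.5} for s in P}


def policy_evaluation(P, R, policy, gamma=0.9, tol=1e-10):
    """Synchronous iterative policy evaluation (Bellman expectation backup)."""
    V = {s: 0.0 for s in P}
    while True:
        # Build V_{k+1} from V_k only (two copies, i.e. a synchronous sweep).
        V_new = {
            s: sum(
                policy[s][a]
                * sum(p * (R[s][a] + gamma * V[s2]) for p, s2 in P[s][a])
                for a in P[s]
            )
            for s in P
        }
        delta = max(abs(V_new[s] - V[s]) for s in P)
        V = V_new
        if delta < tol:
            return V


V_pi = policy_evaluation(P, R, policy)
```

For this chain the fixed point satisfies v(1) = 0.45·v(0) + 0.5 and v(0) = 0.45·v(0) + 0.45·v(1), which the returned values should reproduce up to the tolerance.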

Value Iteration

        Value Iteration combines evaluation and improvement in one step — no need to wait for full policy evaluation. Uses the Bellman optimality equation directly.
      

V_k+1(s) ← max_a Σ_s'P(s'|s,a)[R(s,a) + γV_k(s')]
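A minimal sketch of this backup, followed by greedy policy extraction, is given below. The 3-state chain, the dictionary layout `P[s][a] = [(prob, next_state), ...]` and all names are assumptions made for illustration.

```python
# Hypothetical 3-state chain: actions 0 = left, 1 = right; state 2 is absorbing.
P = {
    0: {0: [(1.0, 0)], 1: [(1.0, 1)]},
    1: {0: [(1.0, 0)], 1: [(1.0, 2)]},
    2: {0: [(1.0, 2)], 1: [(1.0, 2)]},
}
R = {0: {0: 0.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}, 2: {0: 0.0, 1: 0.0}}


def value_iteration(P, R, gamma=0.9, tol=1e-10):
    """Repeat the Bellman optimality backup until values stop changing."""

    def q(s, a, V):
        # One-step lookahead: expected reward plus discounted next value.
        return sum(p * (R[s][a] + gamma * V[s2]) for p, s2 in P[s][a])

    V = {s: 0.0 for s in P}
    while True:
        V_new = {s: max(q(s, a, V) for a in P[s]) for s in P}
        delta = max(abs(V_new[s] - V[s]) for s in P)
        V = V_new
        if delta < tol:
            break
    # Extract a greedy policy from the converged values.
    greedy = {s: max(P[s], key=lambda a: q(s, a, V)) for s in P}
    return V, greedy


V_star, pi_star = value_iteration(P, R)
```

On this chain the optimal values are v*(1) = 1 and v*(0) = 0.9, and the greedy policy moves right in both non-terminal states.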

	Policy Iteration	Value Iteration
Evaluation	Full policy evaluation at each step	One backup per iteration
Convergence	Guaranteed convergence to π*	Also converges to the optimal policy
Cost	Slower per iteration, fewer iterations	Faster per iteration, more iterations

Asynchronous DP

Standard DP updates all states in parallel (synchronous). Asynchronous DP updates states individually in any order — more efficient in practice.

In-Place DP

Only store one copy of value function — use immediately updated values.

Prioritised Sweeping

Update states with the largest Bellman error first. Maintain a priority queue.

Real-Time DP

Only update states relevant to the current agent trajectory.

DP Limitations

Requires a perfect model of the environment

Curse of dimensionality — state space grows exponentially

Full-width backups — expensive for large problems

Asynchronous Dynamic Programming

      Standard (synchronous) DP updates all states on every sweep — expensive and unnecessary. Asynchronous DP backs up states individually in any order, and can significantly reduce computation while still guaranteeing convergence (if all states continue to be selected).
    

Three Asynchronous DP Ideas

1. In-Place DP

Synchronous value iteration stores two copies of V (current and next). In-place DP stores only one copy and uses newly updated values immediately in the same sweep.

V(s) ← max_a Σ_s' P(s'|s,a)[R(s,a) + γV(s')] (using the latest V)

Can converge faster because updated estimates propagate within the same iteration.

2. Prioritised Sweeping

Use the magnitude of the Bellman error to guide which state to update next. Back up the state with the largest remaining error — focus compute where it matters most.

priority(s) = |max_a Σ_s' P(s'|s,a)[R(s,a) + γV(s')] − V(s)|

Requires knowledge of reverse dynamics (predecessor states)
Implemented efficiently with a priority queue
Update Bellman error of affected states after each backup

3. Real-Time DP

Only update states that are relevant to the agent. If the agent is in state S, update V(S) or states it expects to visit soon. Ignores irrelevant distant states entirely.

Interleaves planning with real agent interaction
Focuses compute on states the agent actually visits
Natural fit for online RL settings

Uber example: focus updates on busy zones, not empty zones

Full-Width Backups vs Sample Backups

Full-Width Backups (Standard DP)

For each backup: every successor state and action is considered. Uses the true model of transitions and rewards.

Effective for medium-sized problems (millions of states)
Exact value updates — no approximation
Curse of dimensionality: even one backup can be too expensive
Requires complete knowledge of MDP dynamics

Sample Backups (Model-Free)

Use sampled rewards and sampled transitions instead of the full distribution. Leads naturally to Monte Carlo and TD methods.

Model-free: no advance knowledge of MDP required
Breaks curse of dimensionality through sampling
Cost of backup is constant (one sample)
Foundation of all model-free RL

DP Limitations

🔮

Requires a Perfect Model

DP needs full knowledge of P(s'|s,a) and R(s,a). In real-world problems the model is unknown or only partially known.

📈

Curse of Dimensionality

The state space grows exponentially with the number of state variables. For n variables with k values each: kⁿ states. Quickly becomes intractable.

⚙️

Full-Width Backups

Each sweep considers every state. For large MDPs a single sweep is prohibitively expensive. Async DP and sample backups are the remedies.

        The Transition: DP's limitations motivate Monte Carlo (remove model requirement) and TD Learning (remove episode requirement + break curse through sampling). All three follow the same GPI framework.
      

Monte Carlo Methods

        Monte Carlo: Model-free method that learns directly from complete episodes of experience. Instead of expected return (DP), MC uses the mean return across sampled episodes.
      

Key requirement: MC methods are applied only to episodic tasks — must wait until episode ends.

First-Visit MC

Average returns only from the first occurrence of each state per episode.

N(S_t) ← N(S_t) + 1
S(S_t) ← S(S_t) + G_t
V(S_t) = S(S_t) / N(S_t)

Every-Visit MC

Average returns every time the state is visited in an episode.

Both converge to v_π(s) as N(s) → ∞ by the law of large numbers.

Incremental MC Updates

Instead of storing all returns and averaging at the end, update V(s) incrementally after each episode using the running mean:

N(S_t) ← N(S_t) + 1

V(S_t) ← V(S_t) + (1 / N(S_t)) · (G_t − V(S_t))

For non-stationary problems, use a fixed learning rate α instead of 1/N — this gives more weight to recent episodes and "forgets" old ones:

V(S_t) ← V(S_t) + α · (G_t − V(S_t))

α can be: 1/N(S_t) (exact mean), a fixed small value (exponential moving average), or even α=1 (pure online update, discard history completely). The error term (G_t − V(S_t)) is the MC prediction error.
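First-visit MC with the incremental running-mean update can be sketched as follows. The episode format (a list of `(state, reward)` pairs, where the reward is the one received after leaving that state) and the sample episodes are assumptions chosen for illustration.

```python
def first_visit_mc(episodes, gamma=1.0):
    """Estimate V(s) by averaging first-visit returns with a running mean."""
    V, N = {}, {}
    for episode in episodes:
        # Compute the return G_t for every time step, working backwards.
        g = 0.0
        returns = []
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns.append((state, g))
        returns.reverse()

        seen = set()
        for state, g in returns:
            if state in seen:
                continue  # first-visit: ignore later visits in this episode
            seen.add(state)
            N[state] = N.get(state, 0) + 1
            v = V.get(state, 0.0)
            # Incremental mean: V <- V + (G - V) / N
            V[state] = v + (g - v) / N[state]
    return V


# Two hypothetical episodes; state "A" is visited twice in the first one.
episodes = [
    [("A", 1.0), ("B", 0.0), ("A", 2.0)],
    [("B", 3.0)],
]
V_mc = first_visit_mc(episodes)
```

Here the first visit to A yields G = 3 (the second visit is skipped), and B averages the returns 2 and 3. Replacing `1 / N[state]` with a fixed α would give the constant-step-size variant described above.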

MC Limitations

Only works for episodic settings

High variance returns — needs lots of data

Cannot update until episode ends — slow for long episodes

DP vs Monte Carlo — Comparison

Property	Dynamic Programming	Monte Carlo
Model Required?	✅ Yes (full model)	❌ No (model-free)
Bootstrapping?	✅ Yes	❌ No (uses actual returns)
Works on Continuing Tasks?	✅ Yes	❌ Episodic only
Update timing	Each step	End of episode only
Bias	Low (exact model)	Zero bias
Variance	Low	High
Curse of dimensionality	Severe	Less severe (sampling)

        Teaser: Temporal Difference (TD) learning combines the best of both: it bootstraps like DP, learns from sampled experience without a model like MC, and, unlike MC, can learn from incomplete episodes.
      

Model-Free Methods

Lecture 4 · Temporal Difference Learning · Inês Castelhano · May 2026

Temporal Difference Learning

        "If one had to identify one idea as central and novel to reinforcement learning, it would undoubtedly be TD learning." — Sutton & Barto

        TD is model-free (no knowledge of MDP transitions/rewards), learns directly from experience at each step, and uses bootstrapping — updating estimates using other estimates.

TD(0) Update Rule

V(S_t) ← V(S_t) + α · [R_t+1 + γ·V(S_t+1) − V(S_t)]

TD Target: R_t+1 + γV(S_t+1) (observed reward + bootstrapped future)
Current Estimate: V(S_t)
TD Error: δ_t = R_t+1 + γV(S_t+1) − V(S_t) (drives the update)
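A single TD(0) update can be written in a few lines. This is a sketch under assumptions: values are stored in a plain dictionary, and the state names, reward and step size below are made up for the example.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9, terminal=False):
    """Apply one TD(0) update to V[s] in place and return the TD error."""
    # Terminal states have value 0, so the target is just the reward.
    target = r + (0.0 if terminal else gamma * V[s_next])
    delta = target - V[s]          # TD error
    V[s] += alpha * delta          # move the estimate towards the target
    return delta


# Hypothetical estimates: target = 0.2 + 0.9 * 1.0 = 1.1, so delta = 0.6
# and V["A"] moves from 0.5 to about 0.56.
V = {"A": 0.5, "B": 1.0}
delta = td0_update(V, "A", r=0.2, s_next="B", alpha=0.1, gamma=0.9)
```

The update needs only the current transition (S_t, R_t+1, S_t+1), which is why TD can learn online after every step.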

Trading Example Mapping

TD Concept	Stock Trading Interpretation
State S_t	Current market conditions (price, volatility, holdings)
Action A_t	Buy / Hold / Sell
Reward R_t+1	Immediate profit or loss from the trade
TD Target	Observed short-term profit + estimated future portfolio value
TD Error δ_t	Difference between expected and observed market outcome
Update timing	After every market movement, not at market close

          Advantages over MC:

          • Updates at every step — online learning

          • Works in continuing (non-episodic) settings

          • Lower variance (only one step of randomness)

          • Can learn before episode ends

          Advantages over DP:

          • No model required (P and R unknown)

          • Learns from sampled experience

          • Scales to large problems without full sweeps

Multi-Step TD: TD(n) and TD(∞)

        TD(0) bootstraps after just 1 step. We can look further ahead before bootstrapping: this gives TD(n), which interpolates between TD(0) and MC. Here TD(n) means n-step TD (n real rewards, then bootstrap); TD(0) is the conventional name for the 1-step case.

TD(0): 1 real reward → bootstrap
TD(2): 2 rewards → bootstrap
TD(n): n rewards → bootstrap
TD(∞) = MC: full episode → no bootstrap

TD(n) Update Formula

G_t⁽ⁿ⁾ = R_t+1 + γR_t+2 + ... + γ^(n−1)·R_t+n + γ^n·V(S_t+n)

V(S_t) ← V(S_t) + α[G_t⁽ⁿ⁾ − V(S_t)]
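The n-step target can be computed directly from its definition. The function name and the numbers below are illustrative assumptions.

```python
def n_step_return(rewards, v_bootstrap, gamma, n):
    """n-step return: n discounted real rewards plus a discounted bootstrap.

    rewards     -- [R_{t+1}, ..., R_{t+n}] (at least n entries)
    v_bootstrap -- current estimate V(S_{t+n})
    """
    g = sum(gamma**k * rewards[k] for k in range(n))
    return g + gamma**n * v_bootstrap


# Hypothetical numbers: 1 + 0.5 + 0.25 + 0.125 * 10 = 3.0
g3 = n_step_return([1.0, 1.0, 1.0], v_bootstrap=10.0, gamma=0.5, n=3)

# n = 1 recovers the one-step TD target: 1 + 0.5 * 10 = 6.0
g1 = n_step_return([1.0, 1.0, 1.0], v_bootstrap=10.0, gamma=0.5, n=1)
```

As n grows, more of the target comes from observed rewards and less from the bootstrapped estimate.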

Trading Example — TD(3)

Observe 3 days of market returns, then use your current estimate of future portfolio value. No need to wait for the full trading period.

Trading Example — TD(∞)

Wait until the end of the full trading episode and use the complete realized return — equivalent to Monte Carlo.

        Key insight: Larger n = less bias, more variance (more like MC). Smaller n = more bias, less variance (more like TD(0)). Optimal n depends on the problem.
      

SARSA — TD for Action Values

        Apply TD to the Q-function instead of V. SARSA is named after the tuple it uses: (S_t, A_t, R_t+1, S_t+1, A_t+1).
      

Q(S_t,A_t) ← Q(S_t,A_t) + α[R_t+1 + γ·Q(S_t+1,A_t+1) − Q(S_t,A_t)]

SARSA Algorithm (On-Policy TD Control)

Initialize Q(s,a) for all s∈S, a∈A; set Q(terminal,·) = 0
For each episode:
    Initialize S; choose A from S using ε-greedy on Q
    Repeat for each step until S is terminal:
        Take action A → observe R, S'
        Choose A' from S' using ε-greedy on Q
        Q(S,A) ← Q(S,A) + α[R + γQ(S',A') − Q(S,A)]
        S ← S'; A ← A'
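The algorithm above can be sketched in runnable form on a toy problem. The environment is an assumption made purely for illustration: a 4-state corridor where the agent starts at state 0, state 3 is terminal, and reaching it yields reward +1. Hyperparameters and names are likewise hypothetical.

```python
import random

N_STATES, GOAL = 4, 3      # states 0..3, state 3 is terminal
ACTIONS = (0, 1)           # 0 = left, 1 = right


def step(state, action):
    """Deterministic corridor dynamics: returns (reward, next_state)."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    return (1.0 if nxt == GOAL else 0.0), nxt


def epsilon_greedy(Q, state, epsilon, rng):
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[state])
    # Break ties randomly so the agent does not get stuck early on.
    return rng.choice([a for a in ACTIONS if Q[state][a] == best])


def sarsa(episodes=2000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]   # Q[terminal] stays 0
    for _ in range(episodes):
        s = 0
        a = epsilon_greedy(Q, s, epsilon, rng)
        while s != GOAL:
            r, s2 = step(s, a)
            a2 = epsilon_greedy(Q, s2, epsilon, rng)
            # On-policy target: uses the action A' actually selected next.
            Q[s][a] += alpha * (r + gamma * Q[s2][a2] - Q[s][a])
            s, a = s2, a2
    return Q


Q_sarsa = sarsa()
```

Because ε stays fixed at 0.1 here, the learned values describe the ε-greedy policy itself rather than the fully greedy one; decaying ε would be needed for the values to approach q*.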

GLIE — Greedy in the Limit with Infinite Exploration

          GLIE is the formal condition for convergence to optimality in model-free control.
        

1. All state-action pairs are explored infinitely often:
∀s,a: lim_{t→∞} N_t(s,a) = ∞
Every (state, action) pair must be tried infinitely often; no pair is ignored forever.

2. The policy converges to a greedy policy:
lim_{t→∞} π_t(a|s) = 𝟙[a = argmax_a' q_t(s,a')]
Eventually exploit more and explore less; with ε-greedy, ε must decay to 0 (e.g. ε_t = 1/t).

          Theorem: GLIE model-free control converges to the optimal action-value function: q → q*.

          Tabular SARSA converges to q* if the policy is GLIE.

Trading Example

Early in training, try all combinations of market conditions and actions (buy/hold/sell) many times — even bad ones — so no strategy is left unexplored. Over time, reduce ε so the agent exploits more of what it learned.

Monte Carlo vs TD Learning

Property	Monte Carlo	TD Learning
Update timing	End of episode only	After every step (online)
Episodic tasks only?	✅ Yes	❌ No — works for continuing too
Bootstrapping	❌ No — uses actual return	✅ Yes — uses estimated V(S')
Bias	Zero (unbiased)	Some bias (bootstrapping)
Variance	High — full episode randomness	Low — only 1-step randomness
Markov property needed?	❌ No	✅ Yes — exploits Markov property
Memory needed	Store full episode	Only current transition

Monte Carlo — Pros & Cons

Pros:
Good convergence properties
Zero bias: unbiased estimate of v_π
Works in partially observable environments (Markov property not needed)

Cons:
High variance: the return depends on many random steps
Requires complete episodes

TD Learning — Pros & Cons

Pros:
Low variance: 1-step update
Usually more efficient than MC
Works in non-episodic settings

Cons:
Some bias (bootstrapped estimate)
More sensitive to initial values

        In the tabular case: Both MC and TD converge to the true value function v_π(s). The choice depends on the environment (episodic vs continuing) and the bias-variance trade-off needed.
      

On-Policy vs Off-Policy Learning

On-Policy Learning

The agent learns the value of the same policy it is currently following. The behaviour policy (generating actions) equals the target policy (being improved).

Learn about π using experience from π

Example: SARSA — explores with ε-greedy and improves the same ε-greedy policy

Off-Policy Learning

The agent learns the value of a different (target) policy from the behaviour policy generating data. Can learn from other agents or old data.

Learn about π using experience from μ

Example: Q-Learning — explores with ε-greedy but learns the greedy policy

Why Off-Policy Matters

Learn from demonstrations by an expert (imitation)

Re-use old experience (experience replay)

Learn multiple policies simultaneously

Learn from human or random behaviour

DP vs Monte Carlo vs TD — Unified View

        These methods differ in how much they rely on sampling (model vs experience) and how far they look ahead before updating (depth of backup).
      

Property	Dynamic Programming	Monte Carlo	TD Learning
Model required?	✅ Yes	❌ No	❌ No
Bootstrapping?	✅ Yes	❌ No	✅ Yes
Sampling?	❌ No (expectation)	✅ Yes	✅ Yes
Update timing	Every step (full sweep)	End of episode	Every step
Episodic only?	❌ No	✅ Yes	❌ No
Bias	Low	Zero	Some
Variance	Low	High	Lower than MC

Backup Diagrams — Visual Summary

DP: full width, shallow. All actions and successor states, full model, 1-step expectation.

MC: narrow, deep. One sampled path to a terminal state, full return.

TD(0): narrow, shallow. A single sampled step, bootstraps from V(s').

Model-Free Control

Lecture 5 · Q-Learning, Double Q-Learning, VFA Intro · Inês Castelhano · May 2026

Q-Learning — Off-Policy TD Control

        Q-Learning estimates the value of the greedy policy regardless of which policy generates the data. It is off-policy: the behaviour policy explores, but Q-Learning targets the optimal action.
      

Q(S,A) ← Q(S,A) + α[R + γ · max_a'Q(S',a') − Q(S,A)]

Q-Learning Algorithm (Off-Policy TD Control)

Initialize Q(s,a) for all s∈S, a∈A arbitrarily; Q(terminal,·) = 0
For each episode:
    Initialize S
    Repeat for each step until S is terminal:
        Choose A from S using ε-greedy on Q (behaviour policy)
        Take action A → observe R, S'
        Q(S,A) ← Q(S,A) + α[R + γ max_a'Q(S',a') − Q(S,A)]
        S ← S'
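A runnable sketch of the algorithm on a toy problem follows. The 4-state corridor (start at state 0, terminal state 3, reward +1 on reaching it), the hyperparameters and all names are assumptions chosen for illustration.

```python
import random

N_STATES, GOAL = 4, 3      # states 0..3, state 3 is terminal
ACTIONS = (0, 1)           # 0 = left, 1 = right


def step(state, action):
    """Deterministic corridor dynamics: returns (reward, next_state)."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    return (1.0 if nxt == GOAL else 0.0), nxt


def q_learning(episodes=2000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]   # Q[terminal] stays 0
    for _ in range(episodes):
        s = 0
        while s != GOAL:
            # Behaviour policy: epsilon-greedy with random tie-breaking.
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                best = max(Q[s])
                a = rng.choice([b for b in ACTIONS if Q[s][b] == best])
            r, s2 = step(s, a)
            # Off-policy target: greedy max over next actions,
            # regardless of what the behaviour policy does next.
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q


Q_ql = q_learning()
```

In this deterministic corridor with γ = 0.9, the optimal values for moving right are 0.81, 0.9 and 1.0 in states 0, 1 and 2, and the learned table should approach them even though the agent keeps exploring with ε = 0.1.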

        Theorem: Q-Learning control converges to the optimal action-value function q → q*, as long as we take each action in each state infinitely often.
      

Autonomous Driving Example

A self-driving car learns the fastest route to a destination by always assuming its future decisions will be optimal, even while it is still exploring different (suboptimal) driving behaviours. It learns about the actions it would ideally take, not the ones it actually takes.

Q-Learning Overestimation Problem

        Classical Q-learning has an upward bias problem:

        It uses the same values to both select and evaluate actions. With noisy approximations, overestimated values are selected more often, propagating the bias.
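The upward bias can be seen in a small simulation. The setup is an assumption for illustration: five actions whose true values are all 0, with estimates corrupted by unit Gaussian noise. Taking the max of one noisy estimate is compared with selecting on one estimate and evaluating on an independent one.

```python
import random


def max_bias_demo(n_actions=5, trials=20000, seed=0):
    """Return (mean of max estimate, mean of independently evaluated choice)."""
    rng = random.Random(seed)
    single_total = 0.0
    double_total = 0.0
    for _ in range(trials):
        # Two independent noisy estimates of the same true values (all 0).
        q_a = [rng.gauss(0.0, 1.0) for _ in range(n_actions)]
        q_b = [rng.gauss(0.0, 1.0) for _ in range(n_actions)]
        # Same estimate used to select and evaluate: biased upwards.
        single_total += max(q_a)
        # Select with q_a, evaluate with the independent q_b.
        best = q_a.index(max(q_a))
        double_total += q_b[best]
    return single_total / trials, double_total / trials


single_estimate, decoupled_estimate = max_bias_demo()
```

Although every true value is 0, the first average comes out clearly positive, while the second stays close to 0 because the noise that drove the selection is independent of the noise in the evaluation.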

Driving Example

A self-driving car incorrectly estimates that a narrow, high-speed route is safer and faster than it really is — because a few successful experiences produced overly optimistic Q-values.

Double Q-Learning — Decoupling Selection from Evaluation

          Solution: Maintain two independent Q-functions, Q_1 and Q_2. Use one to select the action and the other to evaluate it.

Update for Q_1 (when Q_1 is selected)

Q_1(S,A) ← Q_1(S,A) + α[R + γQ_2(S', argmax_a' Q_1(S',a')) − Q_1(S,A)]

Update for Q_2 (when Q_2 is selected)

Q_2(S,A) ← Q_2(S,A) + α[R + γQ_1(S', argmax_a' Q_2(S',a')) − Q_2(S,A)]

Double Q-Learning

Initialize Q_1(s,a) and Q_2(s,a) for all s, a
For each step:
    Choose A using a combined policy (e.g. ε-greedy on the average of Q_1 and Q_2)
    Observe R, S'
    With 50% probability update Q_1 using Q_2 for evaluation; otherwise update Q_2 using Q_1

Converges to the optimal policy under the same conditions as Q-learning

Significantly reduces overestimation bias

Can generalise to Double SARSA and other algorithms

Not necessary if target policy and value function are uncorrelated (e.g. pure prediction)
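The decoupled update can be sketched as a pair of small functions. The two tables are stored as nested lists `q[state][action]`; that layout, the function names and the sample numbers are assumptions made for illustration.

```python
import random


def double_q_update(q_update, q_other, s, a, r, s_next, alpha, gamma):
    """Update q_update[s][a]: select the next action with q_update,
    but evaluate that action with the other table."""
    n_actions = len(q_update[s_next])
    best = max(range(n_actions), key=lambda b: q_update[s_next][b])
    target = r + gamma * q_other[s_next][best]
    q_update[s][a] += alpha * (target - q_update[s][a])


def double_q_step(q1, q2, s, a, r, s_next, alpha, gamma, rng=random):
    """Flip a fair coin to decide which table is updated on this step."""
    if rng.random() < 0.5:
        double_q_update(q1, q2, s, a, r, s_next, alpha, gamma)
    else:
        double_q_update(q2, q1, s, a, r, s_next, alpha, gamma)


# Hypothetical tables that disagree about the best action in state 1.
q1 = [[0.0, 0.0], [1.0, 5.0]]
q2 = [[0.0, 0.0], [4.0, 3.0]]
# q1 selects action 1 in state 1; q2 evaluates it as 3.0,
# so the target is 1 + 0.9 * 3.0 = 3.7 and q1[0][0] becomes about 1.85.
double_q_update(q1, q2, s=0, a=0, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
```

Plain Q-learning on `q1` alone would have used the optimistic value 5.0 in the target; the second table tempers it.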

SARSA vs Q-Learning — Complete Comparison

Property	SARSA	Q-Learning
Policy type	On-policy	Off-policy
Target action	A' sampled from behaviour policy (ε-greedy)	max_a'Q(S',a') — greedy target
Update	Q(S,A) + α[R+γQ(S',A')−Q(S,A)]	Q(S,A) + α[R+γmax Q(S',a')−Q(S,A)]
Safety	More conservative — safer in dangerous environments	Can be more aggressive — ignores current risk
Convergence	Converges to q* if GLIE (ε→0)	Converges to q* as long as each (s,a) visited ∞ often
Overestimation	Less prone (stochastic target)	Prone — uses greedy max
Best for	Safety-critical tasks, continuous exploration	Finding optimal policy efficiently

        Key difference in one sentence: Q-learning uses a greedy target policy; SARSA uses a stochastic sample from the behaviour policy as its target.
      

Why Tabular Methods Fail — Motivation for VFA

        Lookup tables store one value per state (or state-action pair). This breaks down for large or continuous state spaces.
      

Memory

Too many states/actions to store in memory

Speed

Too slow to learn each state individually

Observability

Individual states often not fully observable

Generalisation

Each state learned independently — no transfer between similar states

Function Approximation — The Solution

1. ESTIMATE: Approximate the value function using a parameterised function with weights w: v̂(s;w) ≈ v_π(s)
2. UPDATE: Adjust the weights w using MC or TD learning to minimise prediction error
3. GENERALISE: Updating w for one state automatically improves predictions for similar states

Classes of Function Approximators

Type	Advantages	Disadvantages
Tabular	Strong theory, exact values	Does not scale or generalise
Linear	Reasonable theory, efficient, stable	Requires good hand-crafted features
Non-Linear	Scales well, learns features automatically	Less well-understood theory, harder to train
Deep Neural Networks	Often best performance in practice	Less theory, harder to train, needs much data

Challenges of Function Approximation in RL

Unlike supervised learning, RL has specific properties that make function approximation harder:

Experience is not i.i.d.

Successive time-steps are correlated. Supervised learning assumes independent, identically distributed samples — that assumption fails in RL.

Policy affects data distribution

The policy determines which data is collected, which affects what is learned — creating a tight feedback loop between learning and experience.

The Deadly Triad

          Algorithms combining all three of the following elements may diverge — this is known as the Deadly Triad.
        

Bootstrapping

Using TD targets (estimates of estimates). Introduces bias and non-stationarity in targets.

+

Off-Policy Learning

Learning from data generated by a different policy. Mismatched distributions between experience and target policy.

+

Function Approximation

Using parameterised functions instead of tables. Small errors in approximation can compound.

          Non-stationarity sources: Changing policies alter target values and data distribution; TD bootstrapping creates moving targets; large state spaces prevent stable anchoring.
        

Value Function Approximation & Planning

Lecture 6 · VFA, DQN, Model-Based RL, Dyna-Q, MCTS · Inês Castelhano · May 2026

Why Do We Need Value Function Approximation?

      Tabular RL stores one value per state. Real-world problems have millions or infinite states. Function approximation (VFA) lets us generalise across similar states using a compact parameterised function v̂(s;w).
    

💾

Memory

Too many states/actions to store in a table. Backgammon has ~10²⁰ states; Go has ~10¹⁷⁰.

⏱️

Speed

Too slow to visit and learn each state individually. Most states may never be seen during training.

👁️

Observability

Individual environment states often not fully observable. Partial observability means we can't index a table by state.

🔗

Generalisation

In a table, each state is learned independently, with no transfer between similar states. VFA generalises across states automatically.

Function Approximation: Three-Step Process

1

Estimate

Approximate value function using a parameterised function v̂(s;w) with weights w. The function can be linear, neural network, or any differentiable model.

→

2

Update

Adjust weights w using MC or TD learning to minimise prediction error. Gradient descent moves w in the direction of lower loss.

→

3

Generalise

Updating w for one state automatically improves predictions for similar states. This is the key advantage over tabular methods.

Classes of Function Approximators

Approximator	Advantages	Disadvantages
Tabular	Strong theory, exact values, guaranteed convergence	Does not scale or generalise — one value per state
Linear	Reasonable theory, efficient updates, stable convergence	Requires good hand-crafted features; limited expressiveness
Non-Linear	Scales well, can represent complex functions	Less well-understood theory, harder to train
(Deep) Neural Networks	Often performs best; learns features from raw inputs (pixels)	Less well-understood theory; requires careful training tricks

      Key Principle: The policy, value function, model, and agent state update are all functions. We want to learn all of them from experience. If there are too many states, we need to approximate. When using neural networks: this is called Deep RL.
    

Gradient-Based Learning for Value Functions

      Gradient-descent methods are the most widely used function approximation methods in RL. The parameter vector w defines v̂(s;w) — a smooth differentiable function. We minimise prediction error by following the gradient.
    

Incremental Gradient Descent

w ← w + α · δ · ∇_w v̂(s;w), with error δ = U_t − v̂(s;w) for update target U_t

Update weights after each experience sample
Small step in direction of reduced error
Works online during episode
Can follow non-stationary targets

Stochastic Gradient Descent (SGD)

w ← w − α ∇_wL(w)

Randomly samples from experience
Estimates the true gradient
Converges to local minimum with decaying α
Unbiased estimate of full gradient

Specific Challenges in RL (vs Supervised Learning)

🔄 Experience is Not i.i.d.

Successive timesteps are correlated. Standard supervised learning assumes independent, identically distributed samples — that assumption fails in RL. Consecutive (s,a,r,s') tuples share context.

🎭 Policy Affects Data

The agent's policy determines which data is collected. This creates a tight feedback loop: what we learn changes the policy, which changes what data we see next.

🌊 Changing Policy

Policy changes alter both the target values and the data distribution simultaneously — a double non-stationarity problem.

🎯 Bootstrapping

TD methods use current estimates as targets, which themselves keep changing. We're chasing a moving target — and the target moves because of our own updates.

🌍 Non-Stationary Targets

Other learning agents in multi-agent settings shift the effective dynamics. Even solo agents face non-stationarity due to policy improvement.

♾️ Large State Space

In continuous state spaces, we never visit the exact same state twice. No stable anchor point for estimates — states seen only once or never during training.

Update Targets: MC vs TD

Method	Update Target U_t	Gradient Type	Properties
MC + VFA	Actual return G_t	True gradient (G_t fixed)	Unbiased, high variance, no bootstrapping
TD(0) + VFA	R + γv̂(S';w)	Semi-gradient (ignore ∇v̂(S'))	Biased, low variance, fast, can diverge
TD(n) + VFA	n-step return G⁽ⁿ⁾	Semi-gradient	Intermediate bias/variance trade-off

        Semi-Gradient: TD with VFA doesn't differentiate through the bootstrap target v̂(S';w). We treat the target as fixed, so only ∇_w v̂(S_t;w) is computed. Simpler and faster, but not a true gradient step, and it can diverge when combined with off-policy learning (the Deadly Triad).
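The difference between the two update targets in the table above can be sketched for a linear approximator. This is a minimal illustration, assuming v̂(s;w) = w·x(s); the function names and the toy feature vectors are hypothetical, not taken from the lecture.

```python
import numpy as np

def semi_gradient_td0_update(w, x_s, reward, x_next, done, alpha=0.1, gamma=0.99):
    """One semi-gradient TD(0) step for a linear v_hat(s; w) = w @ x(s)."""
    v_s = w @ x_s
    # Bootstrap target: treated as a constant, so no gradient flows through it.
    v_next = 0.0 if done else w @ x_next
    target = reward + gamma * v_next
    delta = target - v_s  # TD error
    # For a linear approximator the gradient of v_hat(s; w) w.r.t. w is x(s).
    return w + alpha * delta * x_s

def mc_update(w, x_s, G, alpha=0.1):
    """One MC step: the target is the observed return G, which does not depend on w."""
    return w + alpha * (G - w @ x_s) * x_s

# Toy example (assumed values): two features, weights start at zero.
w0 = np.zeros(2)
x_s, x_next = np.array([1.0, 0.0]), np.array([0.0, 1.0])
w_td = semi_gradient_td0_update(w0, x_s, reward=1.0, x_next=x_next, done=False)
w_mc = mc_update(w0, x_s, G=2.0)
```

Both updates have the same form; only the target differs, which is where the bias/variance trade-off comes from.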
      

Linear Function Approximation

      The simplest differentiable approximator. Values are a linear combination of hand-crafted features.
      Well-understood theory, stable convergence for on-policy TD, computationally efficient updates.
    

v̂(s;w) = w^T · x(s) = Σ_j w_j · x_j(s)

Feature Vector x(s)

A fixed feature map converting each state s into an n-dimensional real-valued vector. Features encode relevant state information — must be chosen by the designer.

🚗 Driving: [speed, lane_position, obstacle_distance]

The gradient of v̂(s,w) w.r.t. w is simply x(s) — making updates very efficient:

w ← w + α[v_target − v̂(s,w)] · x(s)

Weight Vector w

The learnable parameters. Each weight w_j corresponds to feature j. Updating w affects the value of every state that has that feature active — this is generalisation.

🧠 Linear FA: updating w_speed improves estimates for all states with similar speed

Key: Tabular and state aggregation methods are special cases of linear FA.
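To see why tabular methods are a special case, here is a small sketch assuming one-hot features (one feature per state). With that choice the linear update touches exactly one weight, so w behaves like a value table. The state count and numbers are illustrative assumptions.

```python
import numpy as np

n_states = 4

def one_hot(s):
    """Feature vector with a single 1 at the index of state s."""
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

alpha, target, s = 0.5, 2.0, 1

# Linear FA update with one-hot features.
w = np.zeros(n_states)
w = w + alpha * (target - w @ one_hot(s)) * one_hot(s)

# Ordinary tabular update on the same state.
V = np.zeros(n_states)
V[s] = V[s] + alpha * (target - V[s])
```

After the update, w and V hold the same numbers: no other state's estimate has moved, so there is no generalisation.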

State Aggregation Methods (Feature Encoding Strategies)

Coarse Coding

Circles/regions in continuous space. Feature x_j(s) = 1 if state s lies inside region j, else 0. Multiple overlapping regions for one state.

Small circles → narrow generalisation
Large circles → broad generalisation
Asymmetric shapes → directional generalisation

Tile Coding

Overlay multiple regular grids (tilings) offset from each other. State is represented by the set of tiles it falls in across all tilings.

Efficient for continuous spaces
Control generalisation by tile width
Binary features — very fast computation
Multiple tilings → fine resolution
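A minimal one-dimensional tile coder might look like the following. It is a sketch under simple assumptions (scalar state in [low, high], evenly offset tilings); the function name and default parameters are hypothetical.

```python
import numpy as np

def tile_features(s, low=0.0, high=1.0, n_tilings=4, tiles_per_tiling=5):
    """Binary feature vector for a scalar state s in [low, high]."""
    tile_width = (high - low) / tiles_per_tiling
    # One extra tile per tiling so that shifted grids still cover the range.
    n_tiles = tiles_per_tiling + 1
    x = np.zeros(n_tilings * n_tiles)
    for k in range(n_tilings):
        offset = k * tile_width / n_tilings  # shift tiling k by a fraction of a tile
        idx = min(int((s - low + offset) / tile_width), n_tiles - 1)
        x[k * n_tiles + idx] = 1.0  # exactly one active tile per tiling
    return x

near_overlap = tile_features(0.50) @ tile_features(0.51)  # shared tiles for close states
far_overlap = tile_features(0.10) @ tile_features(0.90)   # shared tiles for distant states
```

The dot product counts shared tiles, which is a direct measure of how much an update at one state generalises to the other.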

Radial Basis Functions (RBF)

Each feature is a Gaussian centred on a prototype state c_j. Smooth, continuous generalisation — the closer s is to c_j, the higher x_j(s).

x_j(s) = exp(−‖s − c_j‖² / (2σ²))
Smooth generalisation
More flexible than binary coding
Sensitive to prototype placement and σ
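The RBF feature formula can be computed directly with NumPy. This is a small sketch; the prototype centres and σ below are arbitrary illustrative choices.

```python
import numpy as np

def rbf_features(s, centres, sigma=0.5):
    """Gaussian features x_j(s) = exp(-||s - c_j||^2 / (2 sigma^2))."""
    s = np.atleast_1d(np.asarray(s, dtype=float))
    centres = np.asarray(centres, dtype=float).reshape(len(centres), -1)
    sq_dist = np.sum((centres - s) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

centres = [0.0, 0.5, 1.0]          # hypothetical prototype states
x = rbf_features(0.5, centres)     # state sits exactly on the middle prototype
```

The feature for the matching prototype is 1, and the two equidistant prototypes get the same smaller activation.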

Kanerva Coding

Prototype-based: features measure Hamming distance to stored prototype states. Designed for large binary/discrete state spaces.

Good for high-dimensional binary inputs
Feature = similarity to prototype
Prototype selection matters greatly

Control with VFA: Extending to Q-Functions

1

Approximate Q-Function

Represent q̂(s,a;w) ≈ q^π(s,a) using a parameterised function. Input: state s (or state-action pair). Output: estimated action value.

2

Policy Improvement

Greedy or ε-greedy action selection based on estimated Q-values. Ensures exploration while gradually improving the policy.

3

On vs Off-Policy

On-policy (SARSA): use same policy for data and learning. Off-policy (Q-Learning): learn greedy policy while following exploratory behaviour.

Batch Methods: Least Squares Solutions

Instead of updating w one sample at a time, batch methods find the best fitting w over all experience at once.

LSTD — Least Squares TD

Closed-form solution — no step-size α needed
Converges directly to the TD fixed point
Much more sample-efficient than online TD
Extended to multi-step: LSTD(λ)
Extended to action values: LSTDQ

LSPI — Least Squares Policy Iteration

Interleaves LSTDQ with policy improvement (GPI)
Converges faster than incremental methods
Practical for moderate-sized problems
Replaces Q-table with LSTDQ at each iteration
Combines best of batch learning + policy iteration

Property	Incremental (Online)	Batch Methods (LSTD)
Implementation	Simple	More complex
Data efficiency	Each transition used once	Re-uses all experience
Memory	Constant (no storage)	Must store experience D
Step-size α	Required, sensitive	Not needed (closed-form)
Best for	Streaming data, online learning	Fixed dataset, sample efficiency critical
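For linear features, the LSTD fixed point can be written as the solution of A·w = b with A = Σ x(s)(x(s) − γ·x(s'))ᵀ and b = Σ r·x(s). The sketch below assumes that standard form; the small ridge term and the two-state example are illustrative assumptions added here.

```python
import numpy as np

def lstd(transitions, n_features, gamma=0.9, reg=1e-6):
    """Closed-form LSTD for linear features.

    transitions: iterable of (x, r, x_next, done) with feature vectors x, x_next.
    """
    A = reg * np.eye(n_features)  # small ridge term (assumption) keeps A invertible
    b = np.zeros(n_features)
    for x, r, x_next, done in transitions:
        if done:
            x_next = np.zeros(n_features)  # terminal states have value 0
        A += np.outer(x, x - gamma * x_next)
        b += r * x
    return np.linalg.solve(A, b)

# Hypothetical two-state chain with one-hot features:
# s0 -> s1 (reward 0), s1 -> terminal (reward 1).
e0, e1 = np.eye(2)
data = [(e0, 0.0, e1, False), (e1, 1.0, np.zeros(2), True)]
w_lstd = lstd(data, n_features=2)
```

No step size appears anywhere: the whole dataset is summarised in A and b and solved once.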

Deep Reinforcement Learning & DQN

      Deep neural networks replace hand-crafted features with learned representations. DQN was the breakthrough that made deep RL practical — by combining Q-learning with two key stability tricks.
    

Why Neural Networks?

🧮

Universal Approximator

Can approximate any continuous function on a bounded domain to arbitrary accuracy, given enough hidden units. No need to hand-craft features.

🧱

Distributed Representations

Some functions need exponentially fewer units in a deep network than in a shallow one. Depth buys representational efficiency.

👀

Learned Features

No hand-crafted features — features emerge from raw inputs (pixels, sensor readings). Learn directly from observations.

⚡

Gradient-Based Training

Learnable via stochastic gradient descent and backpropagation. Same machinery used in all of deep learning.

Convolutional Neural Networks (CNNs) in RL

Key architecture for processing spatial inputs (game screens, sensor grids, camera images).

📍 Local Structure

Not fully connected — each neuron sees only a local receptive field. Dramatically reduces parameters. Considers local spatial relationships.

🔁 Weight Sharing

All neurons in a feature map share the same weights. The same filter detects the same feature (e.g., edge) at different locations in the image.

🗺️ Feature Maps

A convolutional filter defines a feature map: all nodes detect the same feature across the input. Multiple filter banks learn diverse features simultaneously.

🗜️ Pooling

Sub-sampling operations compress feature maps, extracting salient information and favouring generalisation to new inputs.

What Transfers from Tabular RL — and What Doesn't

✅ What Transfers

TD and MC learning update rules
Double learning (Double Q-Learning)
Experience replay concept
ε-greedy exploration
SARSA / Q-learning structure

❌ What Doesn't Transfer Easily

Least squares TD/MC (closed-form becomes intractable)
Exact convergence guarantees
Tabular policy iteration (too many states)

Why Plain Q-Learning + NN is Unstable

Problem 1: Correlated Samples

Sequential transitions are highly correlated — violates the i.i.d. assumption of SGD. Causes oscillation and divergence during training.

t, t+1, t+2 transitions all share similar context

Problem 2: Non-Stationary Targets

As w updates, the TD target r + γ max Q(s',a';w) also shifts. The agent chases a moving target — nothing to anchor learning.

target changes every time w changes

DQN's Two Solutions

🔀 Experience Replay

Store transitions (s,a,r,s') in a replay buffer D. Sample random mini-batches to train. Breaks correlations, improves sample efficiency.

1Collect — Store (s,a,r,s') in buffer D via ε-greedy

2Sample — Random mini-batch from D breaks correlations

3Learn — SGD on sampled batch; minimise L(w)

4Repeat — Each transition replayed many times = sample efficiency
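The four replay steps above map onto a very small data structure. The class below is a sketch using only the standard library; the class name, capacity and seed are assumptions for illustration.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of transitions with uniform random sampling."""

    def __init__(self, capacity=10_000, seed=0):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation between transitions.
        batch = self.rng.sample(list(self.buffer), batch_size)
        # Regroup into (states, actions, rewards, next_states, dones).
        return tuple(zip(*batch))

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=3)
for i in range(5):
    buf.push(i, 0, 0.0, i + 1, False)
states, actions, rewards, next_states, dones = buf.sample(2)
```

With capacity 3 and five pushes, only the three most recent transitions remain available for sampling.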

🧊 Fixed Q-Targets (Target Network)

Use a separate frozen network w⁻ to compute TD targets. Update w⁻ ← w only every C steps. Stabilises the target during updates.

y = r + γ · max_a' Q(s',a'; w⁻)

L(w) = 𝔼[(y − Q(s,a;w))²]

Between syncs the target network is frozen, so the target y stays fixed while only w changes. Copy w into w⁻ every C steps.
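The target and loss formulas can be checked on a tiny batch. This sketch assumes the Q-values have already been computed by the online and target networks and are passed in as arrays; the function names and numbers are hypothetical.

```python
import numpy as np

def dqn_targets(rewards, next_q_target, dones, gamma=0.99):
    """y_i = r_i + gamma * max_a' Q(s_i', a'; w_minus), or r_i if s_i' is terminal."""
    rewards = np.asarray(rewards, dtype=float)
    dones = np.asarray(dones, dtype=float)
    max_next = np.max(np.asarray(next_q_target, dtype=float), axis=1)
    return rewards + gamma * (1.0 - dones) * max_next

def dqn_loss(q_online, actions, targets):
    """Mean squared error between targets and Q(s_i, a_i; w) for the actions taken."""
    q_online = np.asarray(q_online, dtype=float)
    q_taken = q_online[np.arange(len(actions)), actions]
    return np.mean((targets - q_taken) ** 2)

# Batch of two transitions; the second one ends the episode.
y = dqn_targets(rewards=[1.0, 0.0],
                next_q_target=[[1.0, 2.0], [3.0, 4.0]],  # from the frozen network w_minus
                dones=[False, True],
                gamma=0.5)
loss = dqn_loss(q_online=[[0.0, 1.0], [2.0, 0.0]], actions=[1, 0], targets=y)
```

In a real implementation the targets would be held constant during backpropagation, so gradients only flow through the online network.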

DQN Architecture

Online network: state s (pixels / sensors) → Q(s,a₁;w), Q(s,a₂;w), Q(s,a₃;w)

Target network: next state s' → TD target y = r + γ max_a' Q(s',a';w⁻)

DQN Full Algorithm

Initialize replay buffer D, Q-network w (random), target network w⁻ ← w

For each episode:

Reset environment, observe initial state S

For each timestep t:

With prob ε: choose random action A (explore)

Otherwise: A = argmax_a Q(S,a;w) (exploit)

Execute A, observe reward R and next state S'

Store transition (S, A, R, S') in replay buffer D

Sample random mini-batch of transitions (s_i, a_i, r_i, s_i') from D

Compute targets: y_i = r_i + γ · max_a' Q(s_i',a';w⁻) [or y_i = r_i if s_i' is terminal]

Perform SGD step: minimise Σ_i (y_i − Q(s_i,a_i;w))² → update w

Every C steps: w⁻ ← w (sync target network)

Set S ← S' and continue until S' is terminal

      Convergence note: DQN has no formal convergence guarantees (deadly triad: bootstrapping + off-policy + nonlinear FA). In practice it works because replay + target network dampen instability enough for empirical convergence on many tasks.
    

Model-Based Reinforcement Learning

      A model M_η approximates the MDP dynamics (transitions + rewards) and is learned from experience. The agent then plans using the model, gaining sample efficiency by re-using experience without additional real-world interactions.
    

Three Approaches to Solving MDPs

Dynamic Programming

Assumes a complete, known model. No interaction needed. Requires full knowledge of P(s'|s,a) and R(s,a).

❌ Impractical for large or unknown environments

Model-Free RL

No model required. Interact with the environment directly. SARSA, Q-learning, MC. Works even when dynamics are unknown.

✅ Scales to complex real-world domains

Model-Based RL

Learn a model from experience, then plan using it. Dyna-Q combines both approaches. Model errors can compound over planning horizons.

⚖️ Balances data efficiency with adaptability

What is a Model?

A model M_η is an approximate representation of the MDP. States and actions are the same as the real problem. The dynamics are parameterised by weights η. Learning a model is a supervised learning problem: choose functional form → pick a loss function (e.g. MSE) → find η that minimises empirical loss.

Types of Models

Model Type	What it Predicts	Properties
Expectation Model	Expected next state: 𝔼[s_{t+1} | s_t, a_t]	Simple, tractable; ignores stochasticity; mostly linear
Stochastic / Generative	Full distribution: s' ~ P(· | s,a)	Captures uncertainty; supports risk-sensitive planning
Full Model	Both P(s' | s,a) and R(s,a) jointly	Most expressive; hardest to learn accurately
Decomposed Dynamics	Separate η_T for transitions, η_R for rewards	Flexible; allows Table Lookup / Linear / NN sub-models
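The simplest sub-model is a table lookup: count observed transitions and average observed rewards. The sketch below is one way to write it; the class and method names are hypothetical.

```python
from collections import defaultdict

class TableLookupModel:
    """Empirical model: P(s'|s,a) from counts, R(s,a) from the mean observed reward."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
        self.reward_sum = defaultdict(float)
        self.visits = defaultdict(int)

    def update(self, s, a, r, s_next):
        self.counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r
        self.visits[(s, a)] += 1

    def transition_probs(self, s, a):
        # Only meaningful for (s, a) pairs that have been visited at least once.
        n = self.visits[(s, a)]
        return {s2: c / n for s2, c in self.counts[(s, a)].items()}

    def expected_reward(self, s, a):
        return self.reward_sum[(s, a)] / self.visits[(s, a)]

model = TableLookupModel()
for _ in range(3):
    model.update(0, "right", 1.0, 1)
model.update(0, "right", 0.0, 0)
probs = model.transition_probs(0, "right")
```

Learning here is just supervised estimation from (s, a) → (r, s') pairs, which is the same recipe a linear or neural sub-model would follow with a different function class.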

Learning vs Planning

Learning

The environment is initially unknown. The agent interacts with the environment to gather data and learn from real experience. Direct RL updates improve value / policy.

Planning

A model of the environment is given or has been learnt. The agent plans inside the model without further real interaction — simulates experience cheaply.

        An agent can learn a model from real experience and then plan within that model, combining the adaptability of learning with the efficiency of planning. This is the Dyna architecture.
      

Sample-Based Planning

Instead of solving the full MDP analytically, sample experience from the model and treat it as real experience. Any model-free RL algorithm can be applied to simulated data.

MC Control

Simulate full episodes from model. Compute returns. Update Q-values using MC estimates. Works well when model is accurate.

SARSA (On-Policy)

Simulate step-by-step transitions. Apply on-policy TD update. Works with partial episodes — more data-efficient than full MC.

Q-Learning (Off-Policy)

Sample transitions from model. Apply off-policy TD with max operator. Flexible — behaviour and target policies can differ.

Advantages of Sample-Based Planning:

Avoids expensive full DP sweeps — no need to enumerate all states
Can use any model-free algorithm unchanged on simulated data
Naturally handles large state spaces via sampling
Computationally flexible — run as many samples as budget allows
Combines smoothly with real experience (Dyna-Q architecture)

Handling Model Inaccuracy

Model-based RL is only as good as the estimated model. Performance is bounded by the optimal policy for the approximate MDP, not the true one.

1. Model-Free Fallback

When the model is wrong, fall back to model-free RL using real experience. Auto-detects divergence between model predictions and real outcomes.

2. Bayesian Uncertainty

Reason about model uncertainty using Bayesian methods (distribution over η). Plan under model uncertainty — balances exploration and exploitation at the model level.

3. Combined Approaches

Combine model-based and model-free in a single algorithm (e.g. Dyna-Q). Real experience for model updates + safety; model for planning.

Dyna-Q & Forward Search Planning

      Dyna-Q combines real experience (model-free Q-learning) with simulated experience (planning from a learned model). Both happen simultaneously every step — learning and planning in a unified loop.
    

Dyna-Q Algorithm — Step by Step

1

Real Step
Take action a_t in the real environment. Observe reward r_t+1 and next state s_t+1.

↓

2

Direct RL Update
Apply Q-learning directly from real experience: Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') − Q(s,a)]

↓

3

Model Update
Update model M_η with observed transition (s_t, a_t, r_t+1, s_t+1). Supervised learning on dynamics.

↓

4

Planning (n steps)
Repeat n times: sample random (s,a) from memory, simulate s' ~ M_η, apply Q-learning update on simulated transition.

        Key advantage: More compute → more planning steps per real step → learns faster when real data is expensive, slow, or unsafe (robotics, medical trials, autonomous driving).
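The four steps can be put together in a short tabular sketch. The environment is a hypothetical five-state deterministic chain invented for illustration (reward 1 on reaching the right end), and the model simply memorises the last observed outcome for each (s, a), which is only adequate for deterministic dynamics.

```python
import random

def step(s, a, n_states=5):
    """Hypothetical deterministic chain: action 1 moves right, action 0 moves left."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    done = s_next == n_states - 1
    return (1.0 if done else 0.0), s_next, done

def dyna_q(episodes=30, n_planning=10, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    n_states, actions = 5, (0, 1)
    Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
    model = {}  # (s, a) -> (r, s', done), learned from real experience

    def q_update(s, a, r, s_next, done):
        best_next = 0.0 if done else max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Epsilon-greedy action selection with random tie-breaking.
            if rng.random() < eps:
                a = rng.choice(actions)
            else:
                best = max(Q[(s, b)] for b in actions)
                a = rng.choice([b for b in actions if Q[(s, b)] == best])
            r, s_next, done = step(s, a)        # 1. real step
            q_update(s, a, r, s_next, done)     # 2. direct RL update
            model[(s, a)] = (r, s_next, done)   # 3. model update
            for _ in range(n_planning):         # 4. planning on simulated transitions
                (ps, pa), (pr, ps_next, pdone) = rng.choice(list(model.items()))
                q_update(ps, pa, pr, ps_next, pdone)
            s = s_next
    return Q

Q = dyna_q()
```

Raising n_planning spends more compute per real step; setting it to 0 recovers plain Q-learning.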
      

Integrating Model-Free and Model-Based

Model-Free RL (in Dyna-Q)

No model required for this part
Learn value function from real experience
More robust to model errors
Higher variance, less sample efficient
Algorithms: Q-learning, SARSA, A3C, PPO

Model-Based Planning (in Dyna-Q)

Learns model from real experience
Plans value function using simulated experience
More sample efficient
Requires accurate model — errors compound
Real experience is used for both model update and direct RL

Planning for Action Selection

Rather than improving a global value function, an agent can plan specifically to select the next action. The key insight: the distribution of states reachable from the current state differs from the global training distribution.

🎯 Local Accuracy

Focus all planning compute on states that will actually be encountered in the near future. Build a more accurate local value function for nearby states rather than the full state space.

🔍 Handles Inaccurate Models

Inaccuracies in the model may lead to interesting exploration of the local neighbourhood rather than propagating errors into the global value function.

⏱️ Anytime Planning

The agent can use however much compute is available before needing to act. More compute → better action selection — without retraining from scratch.

Forward Search

Select the best action by building a search tree rooted at the current state and looking ahead using the model. No need to solve the whole MDP — just the sub-MDP starting from now.

Root = Current State

The search tree starts at current state s_t. All branches represent possible future trajectories from here.

→

Model-Based Expansion

Use the learned or given model to expand tree nodes. At each node, simulate outcomes of each possible action.

→

Local Sub-MDP

Only solve the portion of the MDP reachable from now. Ignores irrelevant distant states — focused planning.

→

Best Action at Root

After building the tree, select the action at root with highest estimated value. Repeat each timestep.

Examples: Minimax search (chess), Monte Carlo Tree Search (MCTS, AlphaGo), Model Predictive Control (MPC) — all instances of forward search with different evaluation strategies.

Simulation-Based Search & MCTS

      MCTS is a simulation-based forward search — it builds a search tree from the current state using a model, focuses compute on the most promising branches, and is the backbone of AlphaGo/AlphaZero.
    

Simulation-Based Search

How It Works

Simulate K episodes starting from current state s_t
Use the model M_η as a simulator
Apply model-free RL to simulated episodes
Estimate Q(s_t, a) for each action a
Select action with highest estimated Q-value

Why It Works

Avoids enumerating all states explicitly
Sampling handles large/continuous action spaces
Model-free RL algorithms are reused unchanged
Can allocate compute where it matters most
No need for global value function — just local estimates

        Simulation policy flexibility: Can be random, learned, or hand-crafted. Returns from a random rollout policy are unbiased MC estimates of that policy's Q-values under the model, with no learned value function required. MCTS improves on this by focusing simulations on promising subtrees.
      

Monte Carlo Tree Evaluation

Given model M_η and simulation policy π, simulate K episodes from current state s_t. Evaluate each action by mean return. Select action with maximum value.

📊 Unbiased Estimates

MC returns are unbiased estimates of the simulation policy's value under the model: no bootstrapping error, only variance. More simulations → lower variance.

📦 Black-Box Models

Only requires the ability to sample from the model — no need to know transition probabilities explicitly. Any simulator counts.

⚖️ Variance vs K

More simulations K → lower variance but higher compute. Trade-off between accuracy and efficiency. Choose K based on compute budget.

🎯 Simulation Policy Matters

A better simulation policy reduces variance and focuses simulations on relevant parts of the state space. Learned rollout policies accelerate convergence.
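Monte Carlo evaluation of the root actions can be sketched as below. The simulator interface (a function returning reward, next state and a done flag) and the two-action toy model are assumptions made for illustration.

```python
import random

def mc_action_search(state, actions, simulate, rollout_policy, K=500, gamma=1.0, seed=0):
    """Estimate Q(state, a) for each action by the mean return of K simulated episodes."""
    rng = random.Random(seed)
    q = {}
    for a in actions:
        total = 0.0
        for _ in range(K):
            s, act, G, discount, done = state, a, 0.0, 1.0, False
            while not done:
                r, s, done = simulate(s, act, rng)  # sample from the model
                G += discount * r
                discount *= gamma
                if not done:
                    act = rollout_policy(s, rng)    # simulation policy after the first step
            total += G
        q[a] = total / K
    return max(q, key=q.get), q

def toy_simulate(s, a, rng):
    """Hypothetical one-step model: 'safe' pays 1; 'risky' pays 4 or 0 with equal chance."""
    if a == "safe":
        return 1.0, "end", True
    return (4.0 if rng.random() < 0.5 else 0.0), "end", True

actions = ("safe", "risky")
best, q = mc_action_search("start", actions, toy_simulate,
                           rollout_policy=lambda s, rng: rng.choice(actions))
```

Only sampling access to the model is needed, which is the black-box property described above.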

MCTS: 4-Step Algorithm

1. SELECT

Traverse the existing tree from root to a leaf node. At each step, pick actions according to Q(s,a) — balancing exploration and exploitation within the tree (e.g. UCB).

2. EXPAND

Add one new node to the tree by expanding the leaf state. Choose an untried action and create a child node. Grows the tree incrementally — one node per simulation.

3. ROLLOUT

From the new node, simulate a complete episode using the fixed rollout policy (often random) until termination. Provides an unbiased return estimate.

4. BACKUP

Propagate the return G back up the entire path from new node to root. Update Q(s,a) for all visited state-action pairs. Tree gets better with each simulation.

        Two simulation policies: (1) Tree policy — navigates the built tree, improves during search using Q(s,a). (2) Rollout policy — held fixed (often random), used to evaluate new nodes from outside the tree.
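The tree policy and the backup step can be sketched in a few lines. The version below assumes the UCB1 rule as the exploration bonus and stores per-action statistics as (visit count, mean return) pairs; the function names, the constant c and the example numbers are illustrative assumptions.

```python
import math

def ucb_select(stats, c=1.4):
    """Pick an action at one tree node. stats: {action: (visit_count, mean_return)}."""
    total = sum(n for n, _ in stats.values())
    best_action, best_score = None, -math.inf
    for a, (n, q) in stats.items():
        if n == 0:
            return a  # try every action once before comparing scores
        # Exploitation term q plus an exploration bonus that shrinks with visits.
        score = q + c * math.sqrt(math.log(total) / n)
        if score > best_score:
            best_action, best_score = a, score
    return best_action

def backup(stat, G):
    """Incremental mean update of one (visit_count, mean_return) pair with return G."""
    n, q = stat
    return n + 1, q + (G - q) / (n + 1)

stats = {"a": (10, 0.5), "b": (1, 0.4)}
explore_choice = ucb_select(stats)        # bonus favours the rarely tried action
greedy_choice = ucb_select(stats, c=0.0)  # no bonus: pure exploitation
updated = backup(stats["b"], G=1.0)
```

In a full MCTS loop, backup would be applied to every (state, action) pair on the path from the new node to the root.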
      

Why MCTS is One of the Most Powerful Planning Algorithms

Best-First Search

Concentrates simulations on the most promising branches. Unlike exhaustive methods, it doesn't waste compute on bad moves. The tree grows asymmetrically toward high-value regions.

Anytime Algorithm

Can be stopped at any point and will return the best action found so far. More compute = better decisions (a valid answer is always available). Ideal for real-time or resource-constrained settings.

Black-Box Compatible

MCTS only requires the ability to sample transitions — it does not need explicit P(s'|s,a). Any simulator counts as a valid model, making MCTS broadly applicable to real-world systems.

Breaks Curse of Dimensionality

Evaluates states dynamically through sampling rather than full DP sweeps. Scales to enormous state spaces (e.g. Go: 10¹⁷⁰ states). Only visited nodes are stored — memory proportional to tree, not state space.

Search Tree + Value Function Approximation: The Power Combination

Approach	Generalisation	Large State Spaces	Example
Model-Free RL (table lookup)	❌ None	❌ Cannot store all states	Q-table
Simulation-Based Search (table lookup)	❌ None	⚠️ Only reachable states	Basic MCTS
Simulation-Based Search + VFA	✅ Similar states get similar values	✅ Compact parameterised estimates	AlphaGo, AlphaZero

        AlphaGo / AlphaZero: MCTS + deep neural network value function + policy network. The NN generalises across similar board positions; MCTS focuses planning compute on the most promising lines of play. Together they defeated world Go champions — previously considered impossible for AI.
      

Glossary

Key terms and definitions from the RL course

Quiz

Test your knowledge across all topics · 25 Questions

Self-paced

Cheat Sheet

Key formulas and algorithm summaries for quick reference

Return & Discount

G_t = Σ_{k=0}^{∞} γ^k · R_{t+k+1}

G_t = R_{t+1} + γ · G_{t+1}

γ=0: only immediate reward | γ=1: equal weight to all future rewards
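The recursive form gives a simple way to compute every return in an episode with one backward pass. A small sketch, assuming rewards[t] stores R_{t+1}:

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, G_1, ...] using G_t = R_{t+1} + gamma * G_{t+1}."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return out[::-1]

returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
```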

MDP Tuple

(S, A, P, R, γ)

S States | A Actions | P Transitions
R Rewards | γ Discount

Value Functions

v_π(s) = 𝔼_π[G_t|S_t=s]

q_π(s,a) = 𝔼_π[G_t|S_t=s,A_t=a]

v_π(s) = Σ_aπ(a|s)·q_π(s,a)

Bellman Expectation

v_π(s) = Σ_aπ(a|s)Σ_s'P[R+γv_π(s')]

Bellman Optimality

v*(s) = max_aΣ_s'P[R+γv*(s')]

Q*(s,a) = R+γΣ_s'P·max_a'Q*(s',a')

Policy Iteration

Eval: V_k+1(s) ← Σ_aπ Σ_s'P[R+γV_k(s')]

Improve: π'(s) = argmax_aQ_π(s,a)

Repeat until stable.

Value Iteration

V_k+1(s) ← max_aΣ_s'P[R+γV_k(s')]

No separate eval step — directly uses optimality equation.
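A compact NumPy sketch of value iteration, assuming the model is given as arrays P[a, s, s'] and R[s, a]. The two-state MDP is a made-up example.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """P[a, s, t]: probability of s -> t under action a. R[s, a]: expected reward."""
    V = np.zeros(R.shape[0])
    while True:
        # Q[s, a] = R[s, a] + gamma * sum_t P[a, s, t] * V[t]
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

# Hypothetical MDP: state 1 is absorbing with zero reward.
# In state 0, action 0 stays (reward 0) and action 1 moves to state 1 (reward 1).
P = np.array([[[1.0, 0.0], [0.0, 1.0]],   # action 0
              [[0.0, 1.0], [0.0, 1.0]]])  # action 1
R = np.array([[0.0, 1.0],
              [0.0, 0.0]])
V_star, policy = value_iteration(P, R)
```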

Monte Carlo

V(S_t) ← V(S_t) + α(G_t−V(S_t))

Model-free. Episodic only. Zero bias, high variance.

TD(0)

V(S_t) ← V(S_t) + α[R_t+1+γV(S_t+1)−V(S_t)]

δ_t = R_t+1+γV(S_t+1)−V(S_t) (TD error)

SARSA (On-Policy)

Q(S,A) ← Q(S,A) + α[R+γQ(S',A')−Q(S,A)]

Uses (S,A,R,S',A') — next action from same policy.

Q-Learning (Off-Policy)

Q(S,A) ← Q(S,A) + α[R+γ max_a'Q(S',a')−Q(S,A)]

Uses greedy max — independent of behaviour policy.

Double Q-Learning

Q₁(S,A) ← Q₁(S,A) + α[R + γ·Q₂(S', argmax_a' Q₁(S',a')) − Q₁(S,A)]

Q₁ selects the action and Q₂ evaluates it (the roles swap at random). Decoupling selection from evaluation reduces the overestimation bias of Q-Learning.
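A tabular sketch of the double update with two value tables. Which table gets updated is chosen by a coin flip; the dictionary layout and the function name are assumptions for illustration.

```python
import random

def double_q_update(Q1, Q2, s, a, r, s_next, done, actions,
                    alpha=0.1, gamma=0.99, rng=random):
    """Update one of two Q-tables, using the other table to evaluate the chosen action."""
    # Coin flip: the selected table picks the argmax action and receives the update.
    if rng.random() < 0.5:
        Q_sel, Q_eval = Q1, Q2
    else:
        Q_sel, Q_eval = Q2, Q1
    if done:
        target = r
    else:
        a_star = max(actions, key=lambda b: Q_sel[(s_next, b)])  # selection
        target = r + gamma * Q_eval[(s_next, a_star)]            # evaluation
    Q_sel[(s, a)] += alpha * (target - Q_sel[(s, a)])

# Tables keyed by (state, action); the values are arbitrary illustrative numbers.
Q1 = {(0, 0): 0.0, (1, 0): 1.0, (1, 1): 2.0}
Q2 = {(0, 0): 0.0, (1, 0): 5.0, (1, 1): 0.5}
double_q_update(Q1, Q2, s=0, a=0, r=0.0, s_next=1, done=False, actions=(0, 1))
```

Because the evaluating table did not choose the action, a value that is too high in one table is less likely to be both selected and trusted.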

Linear VFA

v̂(s;w) = w^T·x(s)

w ← w + α[v_target−v̂(s,w)]·x(s)

x(s) = feature vector; w = learnable weights. Tabular is a special case.

DQN Loss

L(w) = 𝔼[(r+γ max Q(s',a';w⁻)−Q(s,a;w))²]

w = online network; w⁻ = frozen target network. Replay buffer + target net = stable training.

TD(n) Return

G_t⁽ⁿ⁾ = R_{t+1} + γ·R_{t+2} + ... + γ^{n−1}·R_{t+n} + γⁿ·V(S_{t+n})

TD(0)=1 step; TD(∞)=Monte Carlo. n controls bias-variance trade-off.
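The n-step return can be computed from a stored episode as follows. A minimal sketch, assuming rewards[k] holds R_{k+1} and values[k] holds V(S_k); if the episode ends before n steps, the return is truncated without bootstrapping.

```python
def n_step_return(rewards, values, t, n, gamma=0.99):
    """n-step return from time t: n discounted rewards plus a bootstrapped tail."""
    T = len(rewards)  # episode length
    G, discount = 0.0, 1.0
    for k in range(t, min(t + n, T)):
        G += discount * rewards[k]
        discount *= gamma
    if t + n < T:
        G += discount * values[t + n]  # bootstrap from V(S_{t+n})
    return G

rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5, 0.0]  # last entry is the terminal state
g1 = n_step_return(rewards, values, t=0, n=1, gamma=0.5)   # TD(0) target
g2 = n_step_return(rewards, values, t=0, n=2, gamma=0.5)
g_mc = n_step_return(rewards, values, t=0, n=10, gamma=0.5)  # full Monte Carlo return
```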

Dyna-Q

Real step: Q-update from real (s,a,r,s')

Model update: M_η ← (s,a,r,s')

Planning: n×Q-update from M_η samples

MCTS Steps

1. Select — traverse tree by Q(s,a)

2. Expand — add new leaf node

3. Rollout — simulate to terminal

4. Backup — propagate return G up

Algorithm Selection Guide

Do you have a model?

YES → Dynamic Programming (Policy/Value Iteration)

NO → Model-Free Methods ↓

Episodic only?

YES → Monte Carlo

NO → TD Methods ↓

Continuous actions?

YES → Actor-Critic (DDPG, SAC, PPO)

NO → Q-Learning / SARSA / DQN

Policy-Based Reinforcement Learning

Lecture 7 · Policy Gradients, Actor-Critic, PPO & Modern Algorithms · Inês Castelhano · May 2026

Three Approaches to Reinforcement Learning

      All RL algorithms fall into one of three paradigms — or combine them. Understanding the trade-offs helps you choose the right algorithm for any problem.
    

Approach	How it works	Advantages	Disadvantages
Model-Based RL (Dyna-Q, AlphaZero, MBPO)	Learn a model of the environment, then plan with it	Easy to learn a model (supervised learning). Learns 'all there is to know' from the data.	Uses compute on irrelevant details. Planning (computing policy from model) is non-trivial and expensive.
Value-Based RL (Q-Learning, DQN, SARSA)	Learn V(s) or Q(s,a), derive policy greedily	Easy to generate policy (just take argmax). Close to true objective. Well-understood algorithms exist.	Still not the true objective. Small value error can lead to larger policy error. Struggles with continuous actions.
Policy-Based RL (REINFORCE, PPO, SAC, DDPG)	Directly parametrise and optimise π_θ(a|s)	Directly optimises the true objective. Easily extended to continuous action spaces. Can learn stochastic policies.	Could get stuck in local optima. Specific knowledge may not generalise well.

Actor-Critic: The Best of Both Worlds

Actor-Critic methods combine value-based (critic) and policy-based (actor) components. The critic estimates the value function; the actor directly optimises the policy using the critic's signal. This reduces the high variance of pure policy gradient while keeping the direct policy optimisation.

Δθ = α · ∇_θ log π_θ(a|s) · Q_w(s,a)

Policy-Based Reinforcement Learning

      Instead of learning a value function and deriving a policy from it, policy-based RL directly parametrises the policy π_θ(a|s), a probability distribution over actions given a state, and optimises θ by gradient ascent.
    

✅ Advantages

True objective: directly maximises expected cumulative return — no intermediary value function
Continuous actions: trivially extends to infinite/continuous action spaces
Stochastic policies: can learn genuinely random policies (which can be optimal in partially observable settings)
Simpler in some domains: sometimes policies are simple while values are complex

❌ Disadvantages

Local optima: gradient ascent can converge to a suboptimal policy
Poor generalisation: obtained knowledge can be specific, doesn't always transfer
Doesn't extract all info: when used in isolation, doesn't leverage value function information as efficiently
High variance: policy gradient estimates have high variance — needs many samples

Why Continuous Action Spaces are Hard for Value-Based Methods

🗂️ Cannot Tabulate Q-Values

The action space is infinite — we cannot store Q(s,a) for every a. A function approximator must map (state, action) pairs to values, requiring a network that takes both as input.

🔍 Greedy Max is Intractable

Computing argmax_a Q(s,a) requires solving an inner optimisation problem at every single step. Expensive and often intractable for high-dimensional action spaces.

🎲 Exploration is Hard

Exploration in high-dimensional continuous spaces is challenging. Gaussian policies provide natural exploration via variance σ² without explicit ε-greedy schemes.

        Policy-based solution: When directly updating θ, continuous actions are easy. For a Gaussian policy with mean μ(s) = θᵀφ(s), feature vector φ(s), and fixed variance σ², the score function ∇_θ log π_θ(a|s) = (a − μ(s))·φ(s)/σ² is available in closed form.
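The closed-form Gaussian score can be checked against a finite-difference gradient of the log-density. This sketch assumes a linear mean μ(s) = θ·φ(s) and a fixed σ; the numbers are arbitrary.

```python
import numpy as np

def gaussian_log_prob(theta, phi, a, sigma):
    """log N(a; mu, sigma^2) with linear mean mu = theta @ phi."""
    mu = theta @ phi
    return -0.5 * np.log(2 * np.pi * sigma ** 2) - (a - mu) ** 2 / (2 * sigma ** 2)

def gaussian_score(theta, phi, a, sigma):
    """Closed-form gradient of the log-density w.r.t. theta."""
    return (a - theta @ phi) * phi / sigma ** 2

theta = np.array([0.5, -1.0])
phi = np.array([1.0, 2.0])  # feature vector of the current state
a, sigma = 0.3, 0.8

analytic = gaussian_score(theta, phi, a, sigma)

# Central finite differences, one coordinate of theta at a time.
eps = 1e-6
numeric = np.array([
    (gaussian_log_prob(theta + eps * e, phi, a, sigma)
     - gaussian_log_prob(theta - eps * e, phi, a, sigma)) / (2 * eps)
    for e in np.eye(2)
])
```

The score points along φ(s), scaled by how far the sampled action was from the mean: actions above the mean push μ(s) up, actions below push it down.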
      

Policy Objective Functions

Given policy π_θ(a|s), find the best parameters θ. We need a scalar measure of policy quality:

Episodic Return Objective

In episodic environments: maximise average total return per episode.

J_G(θ) = 𝔼_{π_θ}[Σ_t γ^t · R_{t+1}]

Used when episodes have a clear start and end (Atari, MuJoCo tasks).

Average Reward Objective

In continuing environments: maximise average reward per timestep.

J_R(θ) = lim_{T→∞} (1/T) · 𝔼_{π_θ}[Σ_{t=1}^{T} R_t]

Used for continuous control without natural episode boundaries.

        Policy Optimisation: Policy-based RL is an optimisation problem in which we find the θ that maximises J(θ). We use stochastic gradient ascent: θ ← θ + α · ∇_θ J(θ). The challenge: computing ∇_θ J(θ) efficiently from samples.
      

Stochastic Policies

      A stochastic policy π_θ(a|s) outputs a probability distribution over actions rather than a single deterministic action. This is often advantageous or even necessary.
    

1

Deterministic Optimal Exists (Full Observability)

In fully observable MDPs, there is always an optimal deterministic policy π*(s) = a*. Stochasticity is not needed — but it may still help during training via exploration.

Example: Frozen Lake (fully observable grid) has an optimal deterministic path

2

Stochastic Optimal (Partial Observability)

Most real-world problems are not fully observable. Two different true states may look identical to the agent. A memoryless optimal policy may then need to be stochastic, because mixing actions hedges against the ambiguity.

Example: Poker — same hand can be played differently to prevent opponents from reading your strategy

3

Smoother Optimisation Landscape

The search space is smoother for stochastic policies. Deterministic policies can have sharp discontinuities — small parameter changes cause sudden policy jumps. Stochastic policies provide gradients everywhere.

Crucial for gradient-based optimisation — soft policies are differentiable everywhere

4

Built-in Exploration

Stochastic policies provide automatic exploration during learning without requiring separate ε-greedy logic. The policy entropy naturally encourages trying different actions.

SAC explicitly maximises entropy as part of its objective: J(π) = Σ_t 𝔼[r(s_t,a_t) + α·H(π(·|s_t))]

Stochastic vs Deterministic Policies

Property	Deterministic π(s) = a	Stochastic π(a|s)
Optimal for fully observable MDP	✓ Yes	✓ Also works
Optimal for partially observable	✗ May not be	✓ Yes (mixing)
Gradient-based optimisation	⚠ Discontinuities	✓ Smooth everywhere
Built-in exploration	✗ Needs ε-greedy	✓ Via entropy
Used in DPG / TD3	✓ Yes (DDPG, TD3)	✓ Yes (PPO, SAC)

Computing the Policy Gradient

      We want ∇θJ(θ) so we can do gradient ascent. The challenge: J(θ) is an expectation over trajectories — we can't differentiate through the environment. The log-derivative trick solves this.
    

Step 1: The Policy Gradient Problem

We want to compute:

∇_θJ(θ) = ∇_θ 𝔼_{τ∼π_θ}[G(τ)] = ∇_θ Σ_τ P(τ;θ) · G(τ)

Problem: the sum runs over every possible trajectory, and P(τ;θ) contains the unknown environment dynamics, so this gradient cannot be computed directly.

Step 2: The Log-Derivative Trick

Using the identity: ∇_θ P(τ;θ) = P(τ;θ) · ∇_θ log P(τ;θ)

∇_θJ(θ) = 𝔼_{π_θ}[∇_θ log P(τ;θ) · G(τ)]

Now it's an expectation of gradients — we can estimate it by sampling trajectories!

🎯 The Score Function

The score function is the derivative of the log policy:

∇_θ log π_θ(a|s)

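The trajectory probability factorises as P(τ;θ) = p(S_0) · ∏_t π_θ(A_t|S_t) · p(S_t+1|S_t,A_t). Only the policy terms depend on θ, so ∇_θ log P(τ;θ) = Σ_t ∇_θ log π_θ(A_t|S_t) and the environment dynamics drop out. Combined with the log-derivative trick, this gives the policy gradient: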

∇_θJ(θ) = 𝔼_{π_θ}[∇_θ log π_θ(A_t|S_t) · G_t]

Intuitively: increase the log-probability of actions that led to good returns, decrease it for bad returns.

REINFORCE Algorithm (Monte Carlo Policy Gradient)

REINFORCE

Initialise policy parameters θ randomly

For each episode:

Generate full episode: S₀,A₀,R₁,S₁,A₁,R₂,...,S_T following π_θ

For each timestep t = 0,1,...,T−1:

Compute return G_t = Σ_{k=t}^{T−1} γ^{k−t} R_{k+1}

Update: θ ← θ + α · ∇_θ log π_θ(A_t|S_t) · G_t
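
The loop above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: it assumes a tabular softmax policy with one row of action preferences per state (`theta[s]`), and the helper names `discounted_returns` and `reinforce_update` are hypothetical.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """G_t for every t, where rewards[t] stores R_{t+1}."""
    returns = np.zeros(len(rewards))
    running = 0.0
    # Sweep backwards: G_t = R_{t+1} + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def softmax(prefs):
    e = np.exp(prefs - prefs.max())  # shift for numerical stability
    return e / e.sum()

def reinforce_update(theta, states, actions, rewards, alpha, gamma):
    """Apply one REINFORCE pass over a finished episode; returns a new theta."""
    theta = theta.copy()
    returns = discounted_returns(rewards, gamma)
    for t, (s, a) in enumerate(zip(states, actions)):
        # Score of a tabular softmax policy (assumed here): one_hot(a) - pi(.|s)
        grad_log_pi = -softmax(theta[s])
        grad_log_pi[a] += 1.0
        theta[s] += alpha * grad_log_pi * returns[t]
    return theta

# Illustrative numbers only
returns = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)
theta = reinforce_update(np.zeros((1, 2)), [0], [1], [1.0], alpha=0.1, gamma=1.0)
```

Computing the returns in a single backward sweep avoids recomputing the discounted sum from scratch for every timestep.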

Why This Works

G_t acts as a scalar weight — positive returns reinforce actions, negative penalise
Unbiased estimate of the true gradient
No model of the environment needed
Works for any differentiable policy class

The Variance Problem

G_t can vary hugely across episodes — high variance
Learning is slow and noisy
Baseline trick: subtract a baseline b(s) from G_t to reduce variance without bias: θ ← θ + α · ∇_θ log π_θ(A_t|S_t) · (G_t − b(S_t))
Common baseline: the state-value function V(s), which turns the weight into an advantage estimate → leads to Actor-Critic

Differentiable Policy Classes

      Two canonical differentiable policy parametrisations — one for discrete actions (Softmax) and one for continuous actions (Gaussian). Both have closed-form score functions.
    

🔢 Softmax Policy — Discrete Actions

Weight actions using a linear combination of features φ(s,a). Probability proportional to exponentiated weight:

π_θ(a|s) = exp(φ(s,a)ᵀθ) / Σ_{a'} exp(φ(s,a')ᵀθ)

Score Function:

∇_θ log π_θ(a|s) = φ(s,a) − 𝔼_{π_θ}[φ(s,·)]

The score is: the feature of the chosen action minus the expected feature under the current policy. Pushes θ toward features of rewarding actions.

✅ Discrete: card games, Atari, NLP token selection
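
As a rough check of the closed-form score, the sketch below computes the softmax policy and its score for one state and compares the result with a finite-difference gradient of log π. The feature matrix `phi`, the weights `theta` and the function names are made-up illustrative values, not part of the lecture material.

```python
import numpy as np

def softmax_policy(phi, theta):
    """phi: (n_actions, d) array whose rows are phi(s, a) for one state s."""
    logits = phi @ theta
    e = np.exp(logits - logits.max())  # shift for numerical stability
    return e / e.sum()

def softmax_score(phi, theta, action):
    """grad_theta log pi(action|s) = phi(s, action) - E_pi[phi(s, .)]."""
    probs = softmax_policy(phi, theta)
    return phi[action] - probs @ phi

# Illustrative numbers only: 3 actions, 2 features
phi = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
theta = np.array([0.2, -0.1])
score = softmax_score(phi, theta, action=2)

# Central finite differences of log pi(a=2|s) as an independent check
eps = 1e-6
numeric = np.zeros_like(theta)
for i in range(len(theta)):
    step = np.zeros_like(theta)
    step[i] = eps
    numeric[i] = (np.log(softmax_policy(phi, theta + step)[2])
                  - np.log(softmax_policy(phi, theta - step)[2])) / (2 * eps)
```

A useful sanity property: the score averaged over actions drawn from the policy is zero, which is why subtracting a baseline does not bias the gradient.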

📈 Gaussian Policy — Continuous Actions

Mean is a linear combination of state features. Variance σ² controls exploration:

π_θ(a|s) = 𝒩(μ(s), σ²)

μ(s) = φ(s)^Tθ

Score Function:

∇_θ log π_θ(a|s) = (a − μ(s)) · φ(s) / σ²

Updating θ shifts the mean of the Gaussian toward rewarding actions. Variance σ² controls built-in exploration — larger σ = more exploration.

✅ Continuous: robotics, driving, joint torques
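
The Gaussian score can be sketched the same way. The example assumes a scalar action, a fixed σ, and a linear mean μ(s) = φ(s)ᵀθ as in the text; the numbers and function names are illustrative only.

```python
import numpy as np

def gaussian_log_prob(phi_s, theta, action, sigma):
    """log N(action; mu(s), sigma^2) with mu(s) = phi(s)^T theta."""
    mu = phi_s @ theta
    return -0.5 * np.log(2 * np.pi * sigma**2) - (action - mu) ** 2 / (2 * sigma**2)

def gaussian_score(phi_s, theta, action, sigma):
    """grad_theta log pi(action|s) = (action - mu(s)) * phi(s) / sigma^2."""
    mu = phi_s @ theta
    return (action - mu) * phi_s / sigma**2

phi_s = np.array([1.0, 2.0])
theta = np.array([0.5, -0.25])      # gives mu(s) = 0.5 - 0.5 = 0.0
score = gaussian_score(phi_s, theta, action=1.0, sigma=0.5)
```

Here the sampled action lies above the mean, so the score points in the direction that raises μ(s); a smaller σ scales the same step up by 1/σ².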

Comparison

Property	Softmax (Discrete)	Gaussian (Continuous)
Action space	Finite set {a₁, a₂, ..., aₙ}	ℝ or bounded interval
Output distribution	Categorical (sums to 1)	Normal distribution 𝒩(μ, σ²)
Exploration control	Temperature / entropy regularisation	Variance σ² parameter
Score function closed-form	✓ Yes	✓ Yes
Greedy policy	argmax probability = action with highest weight	Take mean μ(s) deterministically
Example algorithms	REINFORCE, A2C, PPO (discrete)	PPO, SAC (DDPG and TD3 instead use a deterministic actor with added noise)

Advantage Function: Reducing Variance

Instead of weighting by raw return G_t, use the advantage function A(s,a) = Q(s,a) − V(s). This tells us how much better action a is compared to the average:

∇_θJ(θ) = 𝔼_{π_θ}[∇_θ log π_θ(A_t|S_t) · A(S_t,A_t)]

        Using A(s,a) instead of G_t substantially reduces variance. The advantage is centred around zero: it is positive for better-than-average actions and negative for worse-than-average ones. This is the foundation of Actor-Critic methods.
      

Actor-Critic Methods

      Actor-Critic combines a policy (actor) with a value function (critic). The critic evaluates the current policy; the actor improves the policy using the critic's signal. Replacing the Monte Carlo return with the critic's estimate lowers variance, at the cost of some bias whenever the critic is inaccurate.
    

🎭 Actor (parameters θ)

Policy π_θ(a|s). Selects actions.

θ ← θ + α · ∇_θ log π_θ(a|s) · Q_w(s,a)

Actor → Critic: state s and the action a chosen by the Actor

Critic → Actor: Q_w(s,a) feedback

📊 Critic (parameters w)

Value function Q_w(s,a). Evaluates actions.

w ← w + β · δ · ∇_wQ_w(s,a)

The Training Loop

1. Actor acts: Observe state S, Actor selects action A ~ π_θ(·|S)

2. Environment responds: Receive reward R, observe next state S'

3. Critic evaluates: Compute TD error δ = R + γ·V_w(S') − V_w(S). Update Critic: w ← w + β·δ·∇_wV_w(S)

4. Actor improves: Update Actor using Critic's signal: θ ← θ + α·∇_θ log π_θ(A|S) · δ
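
The four steps can be sketched as a single update function. This is a minimal tabular version under stated assumptions: the critic is a table `v` (so ∇_wV_w(S) is a one-hot vector), the actor is a softmax over per-state preferences `theta[s]`, and the `done` flag for terminal transitions is an addition not spelled out in the loop above.

```python
import numpy as np

def softmax(prefs):
    e = np.exp(prefs - prefs.max())  # shift for numerical stability
    return e / e.sum()

def actor_critic_step(theta, v, s, a, r, s_next, done, alpha, beta, gamma):
    """One online update of a tabular actor (theta) and critic (v), in place."""
    # Critic evaluates: TD error, with no bootstrap from a terminal state
    target = r if done else r + gamma * v[s_next]
    delta = target - v[s]
    v[s] += beta * delta
    # Actor improves: grad log pi for tabular softmax is one_hot(a) - pi(.|s)
    grad_log_pi = -softmax(theta[s])
    grad_log_pi[a] += 1.0
    theta[s] += alpha * delta * grad_log_pi
    return delta

# Illustrative transition: state 0, action 1, reward 1, next state 1
theta = np.zeros((2, 2))
v = np.zeros(2)
delta = actor_critic_step(theta, v, s=0, a=1, r=1.0, s_next=1, done=False,
                          alpha=0.1, beta=0.5, gamma=0.9)
```

A positive TD error raises both the value of the visited state and the probability of the action just taken; a negative one does the opposite.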

Critic Options: What the Critic Estimates

Critic Target	Formula	Properties
MC Return	G_t	Unbiased, high variance, must wait for episode end
TD(0) V-function	R + γV(S')	Low variance, some bias, online updates
Q-function Q(s,a)	R + γQ(S',A')	Per-action signal, used in Q-actor-critic
Advantage A(s,a)	Q(s,a) − V(s)	Centred around 0, lower variance than raw return
GAE (Generalised Adv.)	Σ_{k≥0} (γλ)ᵏ δ_{t+k}	Interpolates TD and MC, used in PPO
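
The GAE row can be made concrete with a short backward recursion. The sketch assumes a finite rollout in which `values` has one more entry than `rewards` (the last entry is the bootstrap value of the final state, 0 if it is terminal); the function name `gae` is hypothetical.

```python
import numpy as np

def gae(rewards, values, gamma, lam):
    """Advantage estimates A_t = sum_k (gamma*lam)^k * delta_{t+k} over one rollout."""
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        # One-step TD error delta_t = R_{t+1} + gamma * V(S_{t+1}) - V(S_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Illustrative numbers only
adv = gae([1.0, 1.0], [0.5, 0.5, 0.0], gamma=0.9, lam=0.5)
adv_td = gae([1.0, 1.0], [0.5, 0.5, 0.0], gamma=0.9, lam=0.0)
```

With λ = 0 the estimate reduces to the one-step TD error; with λ = 1 it becomes the discounted return minus V(S_t), which is the Monte Carlo end of the spectrum.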

PPO & Modern Actor-Critic Algorithms

      Vanilla policy gradient can take update steps that are too large, collapsing performance. Modern algorithms add stability constraints and entropy regularisation to make training robust.
    

PPO — Proximal Policy Optimisation

Core Problem

Vanilla policy gradient: a large update step can dramatically change the policy and collapse performance. Once in a bad region, it's hard to recover.

r(θ) = π_θ(a|s) / π_old(a|s)

This ratio r(θ) measures how much the policy has changed from the last update.

PPO's Solution: Clipped Objective

Clip r(θ) within [1−ε, 1+ε] (typically ε = 0.2). This limits how far the new policy can deviate from the old one:

L^CLIP(θ) = 𝔼[min(r(θ)Â, clip(r(θ), 1−ε, 1+ε)Â)]

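Where Â is the advantage estimate. Because of the min, once r(θ) leaves [1−ε, 1+ε] in the direction that would raise the objective, the clipped term takes over and the gradient vanishes, so there is no incentive to push the ratio further.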
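
A small numerical sketch of the clipped objective, assuming log-probabilities under the new and old policies are already available as arrays (the argument names are illustrative):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over a batch."""
    ratio = np.exp(logp_new - logp_old)          # r(theta) = pi_new / pi_old
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))

# Four cases: ratio above or below the clip range, advantage positive or negative
logp_new = np.log(np.array([1.5, 1.5, 0.5, 0.5]))
logp_old = np.zeros(4)
advantages = np.array([1.0, -1.0, 1.0, -1.0])
per_sample = np.minimum(
    np.exp(logp_new - logp_old) * advantages,
    np.clip(np.exp(logp_new - logp_old), 0.8, 1.2) * advantages,
)
objective = ppo_clip_objective(logp_new, logp_old, advantages)
```

The per-sample values show the asymmetry: gains from moving the ratio outside the range are capped (1.2 instead of 1.5, −0.8 instead of −0.5), while losses are kept in full (−1.5 and 0.5), so the objective is a pessimistic bound.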

Entropy Bonus

PPO adds an entropy bonus to the objective to prevent premature convergence to a deterministic policy:

L(θ) = L^CLIP(θ) + c · H(π_θ(·|s))

Higher entropy = more exploration. c is a tunable coefficient.

Why PPO is the Industry Default

Simple to implement (no second-order KL constraint)
Works for both discrete and continuous actions
Stable enough for RLHF fine-tuning of LLMs (ChatGPT)
Better sample efficiency than REINFORCE
Strong baseline across diverse environments

Modern Actor-Critic Algorithm Family

A2C / A3C

A3C (Asynchronous Advantage Actor-Critic) runs parallel workers asynchronously for faster training. A2C is the synchronous version, simpler and often just as effective.

On-policy: data from current policy only
Multiple parallel environments → diversity
Advantage function reduces variance

Atari, continuous control

PPO

Proximal Policy Optimisation. Clips the probability ratio to prevent destructive updates. Industry default for continuous control and RLHF fine-tuning of LLMs.

On-policy with clipped surrogate loss
Entropy bonus for exploration
Stable and scalable

Robotics, OpenAI Five, ChatGPT RLHF

SAC

Soft Actor-Critic. Off-policy method that balances expected reward with a policy entropy term — encouraging exploration and preventing premature convergence.

Off-policy: replay buffer → sample efficient
Maximum entropy framework: J(π) = Σ_t 𝔼[r(s_t,a_t) + α·H(π(·|s_t))]
Automatic entropy tuning

Continuous control, dexterous manipulation

TD3

Twin Delayed DDPG. Uses two critic networks and takes the minimum Q-value to reduce overestimation bias. Actor updates less frequently than critics for stability.

Two Q-networks → min Q → less overestimation
Delayed actor updates (every 2 critic steps)
Target policy smoothing (adds clipped noise to the target action)

MuJoCo locomotion, continuous control

Algorithm Comparison: When to Use What

Algorithm	On/Off Policy	Actions	Key Trick	Best For
REINFORCE	On-policy	Both	MC returns	Simple baselines, teaching
A2C / A3C	On-policy	Both	Parallel workers	Fast training, Atari
PPO	On-policy	Both	Clipped ratio	Most tasks — strong default
DDPG	Off-policy	Continuous	Deterministic actor	Simple continuous control
TD3	Off-policy	Continuous	Twin critics + delay	Continuous control
SAC	Off-policy	Continuous	Max entropy	Sample-efficient continuous control

        Rule of thumb: Discrete actions → PPO. Continuous actions + sample efficiency matters → SAC. Continuous actions + simplicity → TD3. Need reproducible stable baseline → PPO. Fine-tuning an LLM → PPO.
      

📖 Deep Dives

Book-derived scientific depth — theory, proofs, and formal frameworks from Deep Reinforcement Learning with Python (Sanghi)

Convergence Properties of RL Algorithms

Not all RL algorithms are guaranteed to converge. The convergence behaviour depends on three key factors: whether bootstrapping is used, whether learning is on-policy or off-policy, and whether the function approximator is linear or nonlinear. This table (adapted from Sutton & Barto and Sanghi Chapter 5) summarises the landscape:

Algorithm	Bootstrapping	On/Off Policy	Tabular	Linear FA	Nonlinear FA
Monte Carlo	No	On-Policy	✓ Yes	✓ Yes	✓ Yes
TD(0)	Yes	On-Policy	✓ Yes	✓ Yes	⚠ No
TD(0)	Yes	Off-Policy	✓ Yes	⚠ No	✗ No
Q-Learning	Yes	Off-Policy	✓ Yes	⚠ No	✗ No
SARSA	Yes	On-Policy	✓ Yes	✓ Yes	⚠ No
DQN	Yes	Off-Policy	—	—	⚠ Empirical

⚠️

The Deadly Triad

Convergence failures occur when all three of these combine simultaneously:

Bootstrapping — using estimates to update estimates
Off-policy — behaviour policy ≠ target policy
Function approximation — generalising across states

Each individually is fine. Two together is often fine. All three → potential divergence.

✅

Why Monte Carlo Always Converges

MC uses no bootstrapping — targets are actual returns G_t, not estimated values. This means:

The update target is stationary (doesn't move as weights change)
The update is a true gradient of the MSE loss
Stochastic gradient descent convergence theorems apply

The cost: high variance, no online learning, must wait for episode end.

📐

Linear FA + On-Policy TD

This combination converges to the TD fixed point — a point near (but not at) the true MSE minimum. The error bound is:

‖v̂_{w_TD} − v_π‖²_μ ≤ (1 / (1−γ)) · min_w ‖v̂_w − v_π‖²_μ

The TD fixed point lies within a bounded factor of the best linear approximation possible.

      Key Takeaway: DQN "works" empirically by taming the deadly triad with two tricks: experience replay (decorrelates consecutive samples) and a target network (slows the moving target). Neither eliminates the triad; they dampen its instability enough for practical learning.
    

Semi-Gradient Methods: The Formal Justification

When we use bootstrapped targets with function approximation, we face a fundamental problem: the update target itself depends on the weights w. As a result, the standard TD update is not a true gradient descent step on J(w).

The True Gradient Objective

We want to minimise the Mean Squared Value Error:

J(w) = Σ_s μ(s) · [v_π(s) − v̂(s, w)]²

Taking the true gradient:

∇_wJ(w) = −2 Σ_s μ(s) · [v_π(s) − v̂(s, w)] · ∇_wv̂(s, w)

The Problem with TD Targets

In TD(0), we replace v_π(s) with the bootstrap target:

U_t = R_t+1 + γ · v̂(S_t+1, w)

Notice: the target U_t depends on w! So the true gradient would be:

∇_w[U_t − v̂(S_t, w)]² = −2[U_t − v̂(S_t, w)] · [∇_wv̂(S_t, w) − γ · ∇_wv̂(S_t+1, w)]

This requires differentiating through both v̂(S_t, w) and v̂(S_t+1, w).

🎯 The Semi-Gradient Approximation

Instead, we ignore the derivative of the next-state value. We treat U_t as if it were a fixed target that doesn't depend on w:

w ← w + α · [U_t − v̂(S_t, w)] · ∇_wv̂(S_t, w)

Only the current state's gradient is computed. The next-state term is dropped.
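
For a linear approximator v̂(s, w) = x(s)ᵀw the semi-gradient update is a one-liner. The sketch below is illustrative: the feature vectors are made up, and the `done` flag for terminal transitions is an added assumption.

```python
import numpy as np

def semi_gradient_td0_update(w, x_s, r, x_next, alpha, gamma, done=False):
    """One semi-gradient TD(0) step for a linear value function; returns new w."""
    v_s = x_s @ w
    # The bootstrap value is used as a plain number: no gradient is taken through it
    v_next = 0.0 if done else x_next @ w
    target = r + gamma * v_next
    # For a linear v_hat, the gradient w.r.t. w at S_t is just the feature vector x_s
    return w + alpha * (target - v_s) * x_s

w = np.array([0.5, 2.0])
x_s = np.array([1.0, 0.0])       # features of S_t
x_next = np.array([0.0, 1.0])    # features of S_{t+1}
w_new = semi_gradient_td0_update(w, x_s, r=0.0, x_next=x_next, alpha=0.1, gamma=0.5)
```

Only the weight tied to the current state's features moves. A full (residual) gradient would also adjust the weight behind the next state's value, which is exactly the term the semi-gradient drops.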

❓

Why Drop the Next-State Gradient?

The target U_t is used as a "label" — differentiating through it would create a circular dependency
This mirrors how supervised learning treats targets as fixed
For linear FA on-policy, the resulting update still converges (to the TD fixed point)
Deep RL code does the same thing: in PyTorch the target value is computed under torch.no_grad() or passed through .detach(), which stops gradient flow through the target network

📊

Monte Carlo = True Gradient

MC performs true stochastic gradient descent on J(w), because the return G_t does not depend on w:

w ← w + α · [G_t − v̂(S_t, w)] · ∇_wv̂(S_t, w)

Here G_t is a fixed observed return and an unbiased sample of v_π(S_t). No approximation is made, so standard SGD convergence results apply (to a local optimum when the approximator is nonlinear).

Semi-Gradient TD vs Full-Gradient (Residual Gradient)

Property	Semi-Gradient TD	Full-Gradient (Residual)
Gradient computation	∇v̂(S_t, w) only	∇v̂(S_t, w) − γ·∇v̂(S_t+1, w)
Convergence (linear, on-policy)	✓ Yes (TD fixed point)	✓ Yes (minimum of the squared TD error, which generally differs from the minimum of J(w))
Convergence (nonlinear)	✗ Not guaranteed	⚠ Slow, unstable
Computational cost	Gradient of one value estimate	Gradients of two value estimates
Used in practice	✓ Always (DQN, etc.)	✗ Rarely

Backup Diagrams: Formal Theory

Backup diagrams are a compact visual language for describing how value information flows in different RL algorithms. Each diagram encodes exactly which transitions are considered and how deep the lookahead goes.

Dynamic Programming

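[Backup diagram: root state S branches to every action (a₁, a₂, …); each action branches to its possible next states s']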

V(s) = Σ_a π(a|s) Σ_s' p(s'|s,a)[r + γV(s')]

Full-width: considers ALL actions
Full-width: considers ALL next states
Depth 1: only one step lookahead
Uses model (transition probabilities)
Bootstraps from V(s')

Monte Carlo

[Backup diagram: a single sampled chain S_t → a → s' → a → s' → … → T (terminal)]

V(S_t) ← G_t = R_t+1 + γR_t+2 + … + γ^{T−t−1}R_T

Sample-width: only ONE sampled trajectory
Full depth: all the way to episode end
No bootstrapping — actual return used
No model needed (model-free)
High variance, zero bias

TD(0)

[Backup diagram: one sampled transition S_t → a → S_t+1]

V(S_t) ← R_t+1 + γ·V(S_t+1)

Sample-width: ONE sampled transition
Depth 1: stops at next state
Bootstraps from V(S_t+1)
Low variance, some bias
Online learning possible

TD(n)

[Backup diagram: a single sampled chain S_t → a → s' → a → s' → … (n steps) → S_t+n]

G_t⁽ⁿ⁾ = R_t+1 + … + γⁿ⁻¹R_t+n + γⁿV(S_t+n)

Sample-width: ONE trajectory
Depth n: n-step lookahead
Bootstraps at step n
Interpolates MC↔TD(0) via n
n=1: TD(0); n=∞: MC
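
The n-step target can be sketched directly from its definition. The indexing convention is an assumption of this example: `rewards[k]` stores R_k+1, `values[k]` stores V(S_k), and the episode terminates after `len(rewards)` steps.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return G_t^(n); falls back to the Monte Carlo return past the episode end."""
    T = len(rewards)
    end = min(t + n, T)
    # Discounted sum of the (at most n) rewards that actually occurred
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    # Bootstrap from V(S_{t+n}) only if that state is reached before termination
    if t + n < T:
        g += gamma ** n * values[t + n]
    return g

rewards = [1.0, 2.0, 3.0]
values = [10.0, 20.0, 30.0]
g1 = n_step_return(rewards, values, t=0, n=1, gamma=0.5)   # TD(0) target
g2 = n_step_return(rewards, values, t=0, n=2, gamma=0.5)
g_mc = n_step_return(rewards, values, t=0, n=10, gamma=0.5)  # n beyond episode end
```

Sweeping n from 1 upwards moves the target from the TD(0) target to the full Monte Carlo return, trading bias for variance.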

      The Spectrum: All RL algorithms can be understood as operating along two axes — depth (1-step to full episode) and width (full-width expectation vs single-sample). DP is shallow+wide, MC is deep+narrow, TD is shallow+narrow. TD(n) moves along the depth axis. Planning algorithms like Dyna-Q mix both.
    

Off-Policy Monte Carlo Control

Off-policy learning separates the policy used to generate experience (behaviour policy b) from the policy being optimised (target policy π). This enables learning about the optimal policy while still exploring.

🎭

Two Policies

Property	Behaviour Policy b	Target Policy π
Role	Generates episodes	Being improved
Must cover π?	Yes (coverage condition)	—
Typical form	ε-soft (explores)	Greedy (deterministic)
Used at test time?	No	Yes

📋

Coverage Condition

For off-policy MC to work, the behaviour policy must cover the target policy:

π(a|s) > 0 ⟹ b(a|s) > 0

Every action the target might take must have non-zero probability under the behaviour policy. Otherwise returns from some trajectories would be missing from the estimate.

Importance Sampling

Episodes are generated under b, but we want to evaluate π. Returns G_t from b must be reweighted by the probability ratio:

ρ_{t:T} = ∏_{k=t}^{T−1} π(A_k|S_k) / b(A_k|S_k)

This ratio ρ is the importance sampling ratio — how much more (or less) likely the trajectory was under π compared to b.

⚖️

Ordinary IS

V(s) = Σ_{t∈T(s)} ρ_{t:T} · G_t / |T(s)|

Pros: Unbiased estimate of V^π(s)

Cons: High variance (ρ can be very large or very small)

⚖️

Weighted IS

V(s) = Σ_{t∈T(s)} ρ_{t:T} · G_t / Σ_{t∈T(s)} ρ_{t:T}

Pros: Much lower variance; consistent estimator

Cons: Biased (but bias → 0 as samples → ∞)

Preferred in practice — variance reduction is usually worth the small bias.
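
The difference between the two estimators shows up clearly on a handful of made-up samples. In this sketch `rhos` and `returns` hold the importance ratio and the return for each visit to a state; the function name and the zero-weight fallback are assumptions.

```python
import numpy as np

def is_estimates(rhos, returns):
    """Ordinary and weighted importance-sampling estimates of V(s)."""
    rhos = np.asarray(rhos, dtype=float)
    returns = np.asarray(returns, dtype=float)
    weighted_sum = np.sum(rhos * returns)
    ordinary = weighted_sum / len(returns)          # divide by number of visits
    total = rhos.sum()
    weighted = weighted_sum / total if total > 0 else 0.0  # divide by total weight
    return ordinary, weighted

# One trajectory is 4x more likely under pi; the other two are impossible under pi
ordinary, weighted = is_estimates(rhos=[4.0, 0.0, 0.0], returns=[1.0, 5.0, 4.0])
```

The weighted estimate is a convex combination of observed returns, so it can never leave their range, whereas the ordinary estimate can overshoot or undershoot when a single ratio is large. That is the source of its higher variance.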

      Why Q-Learning is Simpler: Q-Learning achieves off-policy learning without importance sampling by using a one-step bootstrap target: max_a Q(S', a). This directly "imagines" the greedy action regardless of what the behaviour policy did, at the cost of introducing the deadly triad risk with function approximation.
    

Incremental Off-Policy MC Update (Weighted IS)

Using an incremental weighted average (avoids storing all returns):

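C_n+1 = C_n + W_n, where C_n is the cumulative sum of the weights seen so far for (s,a)

Q_n+1(s,a) = Q_n(s,a) + (W_n / C_n+1) · [G_n − Q_n(s,a)]

W ← W · π(A_t|S_t) / b(A_t|S_t), applied at each step t while sweeping the episode backwards

Early exit: with a greedy (deterministic) target policy, as soon as the behaviour action differs from the greedy action, π(A_t|S_t) = 0, so W becomes 0 and earlier steps of the episode no longer contribute. The backward sweep can stop there, which avoids learning from trajectories irrelevant to π.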
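
The incremental update and the early exit can be sketched as one backward sweep over a finished episode. The example assumes a greedy target policy with respect to a tabular `Q`, so π(A_t|S_t) is 1 for the greedy action and 0 otherwise; the data layout (`episode` as a list of (state, action, reward) tuples, `b_probs[t]` = b(A_t|S_t)) is hypothetical.

```python
import numpy as np

def off_policy_mc_update(Q, C, episode, b_probs, gamma):
    """Weighted-IS Monte Carlo control update for one episode (in place)."""
    G, W = 0.0, 1.0
    for t in reversed(range(len(episode))):
        s, a, r = episode[t]
        G = gamma * G + r
        C[s, a] += W                                # cumulative weight for (s, a)
        Q[s, a] += (W / C[s, a]) * (G - Q[s, a])    # incremental weighted average
        if a != np.argmax(Q[s]):
            break                                   # pi(a|s) = 0, so W would become 0
        W *= 1.0 / b_probs[t]                       # pi(a|s) = 1 for the greedy action
    return Q, C

Q, C = np.zeros((2, 2)), np.zeros((2, 2))
episode = [(0, 1, 0.0), (1, 0, 1.0)]
off_policy_mc_update(Q, C, episode, b_probs=[0.5, 0.5], gamma=1.0)

# Second episode on fresh tables: the last action is not greedy after its update
Q2, C2 = np.zeros((2, 2)), np.zeros((2, 2))
off_policy_mc_update(Q2, C2, [(0, 0, 0.0), (1, 1, -1.0)], b_probs=[0.5, 0.5], gamma=1.0)
```

In the second episode the sweep stops after the final step, so the first transition leaves `Q2[0]` untouched.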

Algorithm Landscape

The RL algorithm space can be mapped along multiple axes. Understanding where each algorithm sits helps you pick the right tool for a given problem.

Axis 1: Model-Free vs Model-Based

Model-Free

No environment model
Learn directly from experience
Q-Learning, SARSA, DQN, PPO
More data hungry
Simpler to implement

←→

Model-Based

Learn/use a model of dynamics
Can plan with simulated rollouts
Dyna-Q, AlphaZero, World Models
More sample efficient
Model errors can compound

Axis 2: Value-Based vs Policy-Based

Value-Based

Learn V(s) or Q(s,a)
Policy is implicit (greedy)
Q-Learning, DQN, SARSA
Works best with discrete actions

←→

Policy-Based

Directly optimise π(a|s;θ)
Value function is optional (REINFORCE uses none; actor-critic variants add one)
REINFORCE, PPO, SAC, DDPG
Works with continuous actions

Actor-Critic (Both) — learns both V(s) and π(a|s;θ)

Axis 3: On-Policy vs Off-Policy

On-Policy

SARSA, Monte Carlo, PPO
Must use current policy's data
More stable learning
Less sample efficient

←→

Off-Policy

Q-Learning, DQN, SAC, DDPG
Can reuse old experience (replay buffer)
Deadly triad risk
More sample efficient

Axis 4: Bootstrapping Depth

TD(0): 1-step ←→ TD(n): n-step ←→ Monte Carlo: ∞-step (full episode)

TD(λ) provides an elegant weighted combination of all n-step returns simultaneously using eligibility traces.

Algorithm Map

Category	Discrete Actions	Continuous Actions
Model-Free, Value-Based	Q-Learning, SARSA, DQN, Double DQN, Dueling DQN	Not directly applicable
Model-Free, Policy-Based	REINFORCE, PPO (discrete)	REINFORCE, PPO, SAC, DDPG, TD3
Model-Free, Actor-Critic	A3C, PPO (discrete)	A3C, PPO, SAC, DDPG
Model-Based	Dyna-Q, MCTS (AlphaGo/Zero)	World Models, Dreamer, MuZero

Generalised Policy Iteration with Function Approximation

Generalised Policy Iteration (GPI) is the universal framework underlying almost all RL algorithms. It describes the interplay between policy evaluation and policy improvement.

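📊 Policy Evaluation

Given π, estimate V^π(s) or Q^π(s,a)

Methods: MC returns, TD(0), TD(n), LSTD

Evaluation → Improvement: make the policy greedy w.r.t. the current value estimate

Improvement → Evaluation: evaluate the improved π

🎯 Policy Improvement

Given Q(s,a), derive a better policy

Methods: greedy, ε-greedy, softmax, UCB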

GPI + Function Approximation

With tabular methods, GPI is well-understood and converges to the optimal policy. With function approximation, things become more complex:

🔄

What Changes with VFA

Evaluation is approximate: V̂(s, w) ≠ V^π(s) exactly
Improvement causes distribution shift: changing π changes which states are visited (μ changes)
Generalisation cuts both ways: updating one state affects others via shared weights
No guarantee of convergence to exact optimum — GPI finds a near-optimal policy

📐

Tabular Lookup as Special Case of Linear FA

Tabular RL is mathematically a special case of linear function approximation where the feature vector is a one-hot encoding:

x(s) = e_s = [0, …, 1, …, 0]

V̂(s, w) = x(s)ᵀw = w_s

Each weight w_s is exactly the value of state s. No generalisation occurs — each state is independent. This proves tabular methods are a special case of the general VFA framework.
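
The equivalence is easy to check numerically. The sketch below runs linear semi-gradient TD(0) with one-hot features side by side with the tabular TD(0) update on a few made-up transitions; the state count, step size and transitions are illustrative only.

```python
import numpy as np

n_states = 4
alpha, gamma = 0.1, 0.9

def one_hot(s):
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

w = np.zeros(n_states)       # weights of the linear approximator
table = np.zeros(n_states)   # tabular value estimates

# (state, reward, next_state) transitions, made up for illustration
for s, r, s_next in [(0, 1.0, 1), (1, 0.0, 2), (0, 1.0, 1)]:
    x, x_next = one_hot(s), one_hot(s_next)
    # Linear semi-gradient TD(0): the gradient of x(s)^T w is x(s)
    w = w + alpha * (r + gamma * (x_next @ w) - x @ w) * x
    # Tabular TD(0)
    table[s] += alpha * (r + gamma * table[s_next] - table[s])
```

After every transition `w` and `table` hold the same numbers, because the one-hot gradient touches exactly one weight.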

⚡

Semi-Gradient SARSA (GPI in Practice)

w ← w + α[R + γQ̂(S', A', w) − Q̂(S, A, w)] · ∇_wQ̂(S, A, w)

This is GPI in action: the TD target serves as the "evaluation" signal, while the ε-greedy policy derived from Q̂ is the "improvement" step. Both happen simultaneously every timestep.

🧠

DQN as GPI

DQN implements GPI at scale:

Evaluation: minimise loss (Q̂(S,A,w) − y)² where y = R + γ·max_aQ̂(S', a, w⁻)
Improvement: ε-greedy w.r.t. Q̂(·, ·, w)
Target network w⁻: stabilises the evaluation target
Replay buffer: decorrelates samples for stable SGD
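
The evaluation step can be sketched with plain arrays standing in for network outputs. In this example `next_q_target` plays the role of Q̂(S', ·, w⁻) for a sampled minibatch and `q_pred` of Q̂(S, A, w); masking terminal transitions with `dones` is a common convention assumed here, not something stated in the list above.

```python
import numpy as np

def dqn_targets(rewards, next_q_target, dones, gamma):
    """y = R + gamma * max_a Q_hat(S', a, w_minus), with no bootstrap at terminals."""
    max_next = next_q_target.max(axis=1)
    return rewards + gamma * (1.0 - dones) * max_next

def dqn_loss(q_pred, targets):
    """Mean squared error between Q_hat(S, A, w) and the fixed targets y."""
    return np.mean((q_pred - targets) ** 2)

# Minibatch of two transitions; the second one ends its episode
rewards = np.array([1.0, 0.0])
next_q_target = np.array([[1.0, 2.0], [3.0, 0.0]])
dones = np.array([0.0, 1.0])
targets = dqn_targets(rewards, next_q_target, dones, gamma=0.5)
loss = dqn_loss(np.array([1.0, 1.0]), targets)
```

In a real implementation the targets are held constant during the gradient step, which is the semi-gradient idea applied to a neural network.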

      The Big Picture: Every major RL algorithm — DP, MC, TD, Q-Learning, SARSA, DQN, PPO, SAC — is an instance of GPI. They differ only in how they do policy evaluation (depth, width, bootstrapping) and how they represent the policy (tabular, linear FA, neural net). Understanding GPI means understanding all of RL at once.
    
