Tabular Q-Learning & SARSA
Q-learning is the canonical model-free RL algorithm. It estimates the optimal action-value function $Q^*(s, a)$ directly from sampled transitions, without ever building a model of the environment's dynamics.
The Q-learning update
Recall the Bellman optimality equation

$$Q^*(s, a) = \mathbb{E}\left[\, r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \,\middle|\, s_t = s,\; a_t = a \,\right].$$

Q-learning turns this fixed-point condition into a stochastic update applied after every observed transition $(s_t, a_t, r_{t+1}, s_{t+1})$:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[\, r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \,\right].$$
The bracketed quantity is the temporal-difference (TD) error. Watkins (1989) proved convergence to $Q^*$ with probability 1 (the complete proof appears in Watkins & Dayan, 1992), provided every state-action pair is updated infinitely often and the step sizes satisfy $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$.
The crucial point: the update does not require knowing the policy that produced the transitions. Q-learning is off-policy — you can learn the optimal policy from data generated by a different (e.g., random, exploratory) policy.
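To make the loop concrete, here is a minimal tabular Q-learning sketch in NumPy. It assumes a Gym-style discrete environment (the classic API where `env.reset()` returns a state and `env.step(a)` returns `(state, reward, done, info)`); the hyper-parameters are illustrative defaults, not tuned values.

```python
import numpy as np

def q_learning(env, n_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    # One value per (state, action) pair -- the "table" in tabular.
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behaviour policy generates the data ...
            if np.random.rand() < epsilon:
                a = env.action_space.sample()
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done, _ = env.step(a)
            # ... but the target bootstraps off the greedy next action,
            # regardless of what the behaviour policy will do: off-policy.
            target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```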
SARSA — on-policy TD control
SARSA uses the actually-taken next action $a_{t+1}$ in place of the greedy maximum:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[\, r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \,\right].$$

This estimates $Q^\pi$, the value of the policy actually being followed (exploration included), rather than $Q^*$.
The practical difference: SARSA is more conservative in stochastic or risky environments. The classical example is the cliff-walk grid: with ε-greedy exploration, Q-learning learns the optimal path along the cliff edge but keeps stepping off during training, while SARSA learns the longer, safer route, because its targets account for the exploratory actions it will actually take.
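A sketch of the corresponding SARSA loop, under the same assumed Gym-style API as the Q-learning example above; the only substantive change is that the target uses the next action the behaviour policy actually commits to.

```python
import numpy as np

def sarsa(env, n_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((env.observation_space.n, env.action_space.n))

    def eps_greedy(s):
        if np.random.rand() < epsilon:
            return env.action_space.sample()
        return int(np.argmax(Q[s]))

    for _ in range(n_episodes):
        s, done = env.reset(), False
        a = eps_greedy(s)
        while not done:
            s_next, r, done, _ = env.step(a)
            a_next = eps_greedy(s_next)  # the action we will actually take
            # On-policy target: Q(s', a'), not max_a' Q(s', a').
            target = r + gamma * Q[s_next, a_next] * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
    return Q
```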
Exploration: ε-greedy and beyond
A learner that always picks $\arg\max_a Q(s, a)$ never tries actions whose values are currently underestimated, so it can get stuck on a suboptimal policy. The simplest remedy is ε-greedy: with probability ε take a uniformly random action, otherwise act greedily.
Better exploration strategies — Boltzmann (softmax over Q-values), optimistic initial values, count-based exploration bonuses — trade simplicity for more directed exploration, but ε-greedy remains the workhorse in practice.
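For concreteness, here are minimal sketches of the two simplest action-selection rules; `q_values` is assumed to be the Q-table row for the current state, and the temperature default is an illustrative choice.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    # With probability epsilon explore uniformly, otherwise exploit.
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature=1.0):
    # Softmax over Q-values: temperature -> 0 recovers greedy action
    # selection, temperature -> infinity recovers uniform random.
    prefs = np.asarray(q_values, dtype=float) / temperature
    prefs -= prefs.max()  # shift for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(np.random.choice(len(q_values), p=probs))
```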
Bootstrapping and the deadly triad
Q-learning's update is bootstrapped — the target $r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a')$ plugs in the current estimate of $Q$ rather than waiting for a complete Monte-Carlo return. Bootstrapping is safe in the tabular setting, but it is one ingredient of Sutton and Barto's deadly triad, three components that are individually benign yet can cause divergence in combination:
- Function approximation — $Q$ is parameterised by a network rather than stored in a table.
- Bootstrapping — the TD target uses the current estimate.
- Off-policy learning — data does not come from the policy being evaluated.
In tabular Q-learning the first ingredient is absent, so the convergence guarantee stands. In DQN all three combine, and stability requires concrete fixes (target networks, experience replay, gradient clipping).
Maximisation bias and Double Q-Learning
The $\max$ operator in the Q-learning target uses the same noisy estimates both to select the next action and to evaluate it, and the expected maximum of noisy estimates over-states the maximum of their true values. Double Q-learning (van Hasselt, 2010) maintains two tables, $Q_1$ and $Q_2$, and decouples selection from evaluation:

$$Q_1(s, a) \leftarrow Q_1(s, a) + \alpha \left[\, r + \gamma\, Q_2\!\left(s', \arg\max_{a'} Q_1(s', a')\right) - Q_1(s, a) \,\right],$$

with the roles of $Q_1$ and $Q_2$ swapped at random on each step.
This removes the systematic over-estimation. The trick lifts directly to Double DQN (van Hasselt et al., AAAI 2016) — the current default in deep value-based RL.
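A sketch of a single tabular Double Q-learning step, following the decoupling described above; `Q1` and `Q2` are the two NumPy tables, and the caller is assumed to swap their roles with probability 0.5 on each transition.

```python
import numpy as np

def double_q_update(Q1, Q2, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    # Q1 selects the next action; Q2 evaluates it.
    a_star = int(np.argmax(Q1[s_next]))
    target = r + gamma * Q2[s_next, a_star] * (not done)
    Q1[s, a] += alpha * (target - Q1[s, a])

# Per transition, flip a coin to decide which table gets updated:
# if np.random.rand() < 0.5:
#     double_q_update(Q1, Q2, s, a, r, s_next, done)
# else:
#     double_q_update(Q2, Q1, s, a, r, s_next, done)
```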
What to read next
- DQN — Q-learning with a deep network and the engineering tricks that made it stable.
- Policy Gradient — the alternative algorithmic family.
- MDPs & Bellman Equations — what Q-learning approximates.