
Tabular Q-Learning & SARSA

Q-learning is the canonical model-free RL algorithm. It estimates the optimal action-value function $Q^*(s,a)$ from sampled transitions, without ever needing to know the transition probabilities $P$ or the reward function $r$ explicitly. Together with its on-policy cousin SARSA, it is the algorithmic prototype for DQN and most of modern value-based RL.

The Q-learning update

Recall the Bellman optimality equation $Q^*(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s'}\big[\max_{a'} Q^*(s',a')\big]$ from MDPs. Q-learning replaces the expectation with a single sample $(s,a,r,s')$ and applies a stochastic-approximation update:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\big[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\big].$$

The bracketed quantity is the temporal-difference (TD) error. Watkins (1989) proved convergence to $Q^*$ under standard stochastic-approximation conditions: every $(s,a)$ pair visited infinitely often, a learning rate satisfying $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$, and bounded rewards. This was the first general convergence guarantee for model-free RL.

The crucial point: the update does not require knowing the policy that produced the transitions. Q-learning is off-policy — you can learn the optimal policy from data generated by a different (e.g., random, exploratory) policy.
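As a concrete sketch (the array-based Q-table and function signature are illustrative assumptions, not a specific library's API), one Q-learning update step looks like this:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step on a (n_states, n_actions) array Q."""
    td_target = r + gamma * np.max(Q[s_next])   # greedy bootstrap from the next state
    td_error = td_target - Q[s, a]              # temporal-difference error
    Q[s, a] += alpha * td_error
    return td_error
```

Because the target takes the max over next actions, the transition can come from any behaviour policy — the update itself never asks which policy produced it.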

SARSA — on-policy TD control

SARSA uses the actually-taken next action $a'$ instead of the greedy one:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\big[r + \gamma\, Q(s',a') - Q(s,a)\big].$$

This estimates $Q^\pi$ for the current behaviour policy $\pi$. The acronym is the five-tuple $(s,a,r,s',a')$. SARSA is on-policy — it learns the value of the policy actually being followed.
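A matching sketch of the SARSA step, under the same assumed Q-table layout, makes the one-line difference explicit: the target queries the next action the behaviour policy actually took, not the greedy maximum.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One tabular SARSA step: the target uses the behaviour policy's next action."""
    td_target = r + gamma * Q[s_next, a_next]   # no max: on-policy bootstrap
    Q[s, a] += alpha * (td_target - Q[s, a])
```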

The practical difference: SARSA is more conservative in stochastic or risky environments. The classical example is the cliff-walk grid: with ϵ-greedy exploration, Q-learning learns the optimal near-cliff path (high return when behaving greedily, occasional disasters when exploring), while SARSA learns a safer further-from-cliff path (lower return greedily, but rarely falls when exploring). Both are correct — they just optimise different objectives.

Exploration: ϵ-greedy and beyond

A learner that always picks $\arg\max_a Q(s,a)$ never visits actions that look bad early — it cannot recover from initial value-function errors. The simplest fix is ϵ-greedy: with probability ϵ choose a uniform random action, otherwise be greedy. Decay ϵ from 1 to ~0.05 over training.
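A minimal ϵ-greedy selector with a linear decay schedule might look like the following sketch (the decay endpoints mirror the rough numbers above; they are tuning choices, not fixed constants):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, s, epsilon):
    """With probability epsilon take a uniform random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def linear_epsilon(step, total_steps, eps_start=1.0, eps_end=0.05):
    """Linearly decay epsilon from eps_start to eps_end over total_steps."""
    frac = min(step / total_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```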

Better exploration strategies — Boltzmann (softmax over Q), UCB (upper confidence bound), Thompson sampling, intrinsic-motivation rewards — matter more in deep RL, where the value function is parameterised and can produce systematically biased estimates. In tabular Q-learning with small state spaces, ϵ-greedy with a slow decay is usually fine.

Bootstrapping and the deadly triad

Q-learning's update is bootstrapped — the target $r + \gamma \max_{a'} Q(s',a')$ uses the current estimate $Q$, not a reference truth. This is the algorithmic source of much of RL's instability. Sutton's deadly triad names the three ingredients that combine into divergence in deep RL:

  1. Function approximation — $Q$ parameterised by a network.
  2. Bootstrapping — TD target uses current estimate.
  3. Off-policy learning — data does not come from the policy being evaluated.

In tabular Q-learning the first ingredient is missing, so convergence is fine. In DQN, all three combine and require concrete fixes (target networks, experience replay, gradient clipping).

Maximisation bias and Double Q-Learning

The max in Q-learning's target is biased upward when $Q$ has noise: $\mathbb{E}[\max_a \hat{Q}(s,a)] > \max_a \mathbb{E}[\hat{Q}(s,a)]$. Double Q-learning (van Hasselt, NIPS 2010) maintains two estimators $Q_A, Q_B$ and decouples action selection from action evaluation:

$$Q_A(s,a) \leftarrow Q_A(s,a) + \alpha\Big[r + \gamma\, Q_B\big(s', \arg\max_{a'} Q_A(s',a')\big) - Q_A(s,a)\Big].$$

This removes the systematic over-estimation. The trick lifts directly to Double DQN (van Hasselt et al., AAAI 2016) — the current default in deep value-based RL.
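At the tabular level the decoupling can be sketched as below; flipping a coin over which table to update each step is one common arrangement and an assumption here, as is the array layout.

```python
import numpy as np

rng = np.random.default_rng(0)

def double_q_update(QA, QB, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Double Q-learning step: select the action with one table, evaluate it with the other."""
    if rng.random() < 0.5:
        QA, QB = QB, QA                          # swap roles so both tables get updated over time
    a_star = int(np.argmax(QA[s_next]))          # action selection with QA
    td_target = r + gamma * QB[s_next, a_star]   # action evaluation with QB
    QA[s, a] += alpha * (td_target - QA[s, a])
```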
