
Tabular Q-Learning & SARSA

Q-learning is the canonical model-free RL algorithm. It estimates the optimal action-value function $Q^*(s,a)$ from sampled transitions, without ever needing to know the transition probabilities $P$ or the reward function $r$ explicitly. Together with its on-policy cousin SARSA, it is the algorithmic prototype for DQN and most of modern value-based RL.

The Q-learning update

Recall the Bellman optimality equation $Q^*(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s'}\big[\max_{a'} Q^*(s',a')\big]$ from MDPs. Q-learning replaces the expectation with a single sample $(s,a,r,s')$ and applies a stochastic-approximation update:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\big[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\big].$$

The bracketed quantity is the temporal-difference (TD) error. Watkins (1989) proved convergence to $Q^*$ under standard stochastic-approximation conditions: every $(s,a)$ pair visited infinitely often, a learning rate satisfying $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$, and bounded rewards. This was the first general convergence guarantee for model-free RL.

The crucial point: the update does not require knowing the policy that produced the transitions. Q-learning is off-policy — you can learn the optimal policy from data generated by a different (e.g., random, exploratory) policy.
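As a concrete sketch (the array-based Q-table and function signature are illustrative assumptions, not a specific library's API), one Q-learning update step looks like this:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step on a (n_states, n_actions) array Q."""
    td_target = r + gamma * np.max(Q[s_next])   # greedy bootstrap from the next state
    td_error = td_target - Q[s, a]              # temporal-difference error
    Q[s, a] += alpha * td_error
    return td_error
```

Because the target takes the max over next actions, the transition can come from any behaviour policy — the update itself never asks which policy produced it.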

SARSA — on-policy TD control

SARSA uses the actually-taken next action $a'$ instead of the greedy one:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\big[r + \gamma\, Q(s',a') - Q(s,a)\big].$$

This estimates $Q^\pi$ for the current behaviour policy $\pi$. The acronym is the five-tuple $(s,a,r,s',a')$. SARSA is on-policy — it learns the value of the policy actually being followed.
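A matching sketch of the SARSA step, under the same assumed Q-table layout, makes the one-line difference explicit: the target queries the next action the behaviour policy actually took, not the greedy maximum.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One tabular SARSA step: the target uses the behaviour policy's next action."""
    td_target = r + gamma * Q[s_next, a_next]   # no max: on-policy bootstrap
    Q[s, a] += alpha * (td_target - Q[s, a])
```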

The practical difference: SARSA is more conservative in stochastic or risky environments. The classical example is the cliff-walk grid: with ϵ-greedy exploration, Q-learning learns the optimal near-cliff path (high return when behaving greedily, occasional disasters when exploring), while SARSA learns a safer further-from-cliff path (lower return greedily, but rarely falls when exploring). Both are correct — they just optimise different objectives.

Exploration: ϵ-greedy and beyond

A learner that always picks $\arg\max_a Q(s,a)$ never visits actions that look bad early — it cannot recover from initial value-function errors. The simplest fix is ϵ-greedy: with probability ϵ choose a uniform random action, otherwise be greedy. Decay ϵ from 1 to ~0.05 over training.
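A minimal ϵ-greedy selector with a linear decay schedule might look like the following sketch (the decay endpoints mirror the rough numbers above; they are tuning choices, not fixed constants):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, s, epsilon):
    """With probability epsilon take a uniform random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def linear_epsilon(step, total_steps, eps_start=1.0, eps_end=0.05):
    """Linearly decay epsilon from eps_start to eps_end over total_steps."""
    frac = min(step / total_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```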

Better exploration strategies — Boltzmann (softmax over Q), UCB (upper confidence bound), Thompson sampling, intrinsic-motivation rewards — matter more in deep RL, where the value function is parameterised and can produce systematically biased estimates. In tabular Q-learning with small state spaces, ϵ-greedy with a slow decay is usually fine.

Bootstrapping and the deadly triad

Q-learning's update is bootstrapped — the target $r + \gamma \max_{a'} Q(s',a')$ uses the current estimate $Q$, not a reference truth. This is the algorithmic source of much of RL's instability. Sutton's deadly triad names the three ingredients that combine into divergence in deep RL:

  1. Function approximation — $Q$ parameterised by a network.
  2. Bootstrapping — TD target uses current estimate.
  3. Off-policy learning — data does not come from the policy being evaluated.

In tabular Q-learning the first ingredient is missing, so convergence is fine. In DQN, all three combine and require concrete fixes (target networks, experience replay, gradient clipping).

Maximisation bias and Double Q-Learning

The max in Q-learning's target is biased upward when $Q$ has noise: $\mathbb{E}[\max_a \hat{Q}(s,a)] > \max_a \mathbb{E}[\hat{Q}(s,a)]$. Double Q-learning (van Hasselt, NIPS 2010) maintains two estimators $Q_A, Q_B$ and decouples action selection from action evaluation:

$$Q_A(s,a) \leftarrow Q_A(s,a) + \alpha\Big[r + \gamma\, Q_B\big(s', \arg\max_{a'} Q_A(s',a')\big) - Q_A(s,a)\Big].$$

This removes the systematic over-estimation. The trick lifts directly to Double DQN (van Hasselt et al., AAAI 2016) — the current default in deep value-based RL.
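At the tabular level the decoupling can be sketched as below; flipping a coin over which table to update each step is one common arrangement and an assumption here, as is the array layout.

```python
import numpy as np

rng = np.random.default_rng(0)

def double_q_update(QA, QB, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Double Q-learning step: select the action with one table, evaluate it with the other."""
    if rng.random() < 0.5:
        QA, QB = QB, QA                          # swap roles so both tables get updated over time
    a_star = int(np.argmax(QA[s_next]))          # action selection with QA
    td_target = r + gamma * QB[s_next, a_star]   # action evaluation with QB
    QA[s, a] += alpha * (td_target - QA[s, a])
```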
