Deep Q-Networks (DQN)
DQN is Q-learning with a deep network as the function approximator. It looks deceptively simple — replace the table with a CNN — but making it actually train required two fixes that have become standard: experience replay and a target network. The result was the first algorithm to learn a wide range of Atari games end-to-end from raw pixels, and the start of deep RL as a discipline.
The setup
Playing Atari with Deep Reinforcement Learning (Mnih, Kavukcuoglu, Silver et al., NIPS DLW 2013) and Human-Level Control through Deep Reinforcement Learning (Mnih et al., Nature 2015) trained a CNN to play Atari 2600 games from raw pixels (49 games in the Nature paper), using the same architecture and hyperparameters throughout but a separate network per game. The architecture: 4 stacked grayscale frames as input → CNN → fully-connected layer → one output per discrete action, representing $Q(s, a; \theta)$, the estimated return for taking that action in the current state.
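For concreteness, a minimal PyTorch sketch of a Nature-style Q-network is below. The layer shapes follow the 2015 paper; the class name and the assumption that inputs are 84×84 frames scaled to [0, 1] are implementation choices, not something fixed by this text.

```python
import torch
import torch.nn as nn

class DQNNetwork(nn.Module):
    """Nature-2015-style Q-network: 4 stacked 84x84 grayscale frames in, one Q-value per action out."""

    def __init__(self, n_actions: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 84x84 -> 20x20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # 20x20 -> 9x9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # 9x9  -> 7x7
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),  # one Q-value per discrete action
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 4, 84, 84), pixel values scaled to [0, 1]
        return self.head(self.conv(frames))
```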
The loss at each step is the squared TD error:

$$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\Big[\big(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\big)^2\Big]$$

where $\mathcal{D}$ is the replay buffer and $\theta^-$ are the target-network parameters, both explained below.
The two crucial algorithmic ingredients are how the training data is collected and sampled (experience replay) and how the bootstrap target is kept stable (a target network).
Experience replay
Naive online Q-learning trains on consecutive, strongly correlated samples (the agent's observation at time $t$ looks almost identical to its observation at time $t+1$), which violates the i.i.d. assumption behind stochastic gradient descent and makes training unstable.
Experience replay maintains a buffer of the most recent transitions $(s_t, a_t, r_t, s_{t+1})$ (one million in the Nature paper) and samples minibatches uniformly at random from it. This breaks the temporal correlations and lets every transition contribute to many gradient updates.
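A uniform replay buffer is only a few lines of Python. Below is a minimal sketch; the class name and the 100k default capacity are illustrative choices (the Nature paper stored one million transitions).

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall off the front

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # Uniform sampling breaks the temporal correlation between consecutive steps.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```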
Prioritized Experience Replay (Schaul et al., ICLR 2016) refines this by sampling transitions in proportion to their TD error, so high-error transitions are revisited more often. It gives modest but consistent improvements over uniform replay.
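A rough sketch of the proportional sampling rule follows; the function name is illustrative, the α and β defaults only approximate the paper's settings, and the paper additionally uses a sum-tree so that sampling costs O(log N) rather than the O(N) scan shown here.

```python
import numpy as np

def prioritized_sample(td_errors: np.ndarray, batch_size: int,
                       alpha: float = 0.6, beta: float = 0.4, eps: float = 1e-6):
    """Sample indices with probability proportional to |TD error|^alpha and
    return importance-sampling weights that correct the induced bias."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()
    idx = np.random.choice(len(td_errors), size=batch_size, p=probs)
    # Importance weights undo the non-uniform sampling; beta is annealed towards 1 in practice.
    weights = (len(td_errors) * probs[idx]) ** (-beta)
    weights /= weights.max()
    return idx, weights
```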
Target network
The TD target $r + \gamma \max_{a'} Q(s', a'; \theta)$ is computed with the very parameters $\theta$ that the gradient step is updating, so the regression target moves every time the network moves, and chasing a moving target can oscillate or diverge.
The fix: maintain a separate target network with parameters $\theta^-$ that are used only for computing targets and are synchronised with the online parameters every $C$ gradient steps. Between synchronisations the targets stay fixed, so each stretch of training looks like ordinary supervised regression.
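Putting the loss and the target network together, one update step might look like the minimal PyTorch sketch below. Function and variable names are illustrative; `batch` is assumed to hold float tensors for states, rewards and dones, and an int64 tensor for actions.

```python
import torch
import torch.nn.functional as F

def dqn_update(online_net, target_net, optimizer, batch, gamma: float = 0.99):
    """One gradient step on the squared TD error, with targets from the frozen target network."""
    states, actions, rewards, next_states, dones = batch

    # Q(s, a; theta) for the actions actually taken.
    q_pred = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        # max_a' Q(s', a'; theta^-) from the target network, held fixed between syncs.
        q_next = target_net(next_states).max(dim=1).values
        q_target = rewards + gamma * (1.0 - dones) * q_next

    # Huber loss; the Nature paper's clipping of the TD error is commonly implemented this way.
    loss = F.smooth_l1_loss(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Every C gradient steps, copy the online weights into the target network:
#   target_net.load_state_dict(online_net.state_dict())
```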
The Rainbow improvements
Rainbow: Combining Improvements in Deep Reinforcement Learning (Hessel et al., AAAI 2018) collected six independent DQN improvements and showed they compose cleanly:
- Double DQN (van Hasselt et al., AAAI 2016) — fixes the maximisation bias in the TD target by using the online network to select actions and the target network to evaluate them: $r + \gamma\, Q(s', \arg\max_{a'} Q(s', a'; \theta); \theta^-)$ (sketched in code after this list).
- Dueling networks (Wang et al., ICML 2016) — split the head into separate state-value $V(s)$ and advantage $A(s, a)$ branches, giving better learning when many actions are equally bad.
- Prioritised replay — see above.
- Multi-step returns — use $n$-step bootstrap targets ($\sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n \max_{a'} Q(s_{t+n}, a'; \theta^-)$) instead of one-step.
- Distributional RL (C51, Bellemare et al., ICML 2017) — model the distribution of returns rather than just the mean.
- Noisy networks — replace $\epsilon$-greedy with parameter-space noise injection for exploration.
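To make the first item concrete, here is a hedged sketch of the Double DQN target under the same conventions as the update sketch above (names are again illustrative):

```python
import torch

@torch.no_grad()
def double_dqn_targets(online_net, target_net, rewards, next_states, dones, gamma: float = 0.99):
    """Double DQN target: the online network chooses the action, the target network scores it."""
    best_actions = online_net(next_states).argmax(dim=1, keepdim=True)   # argmax_a' Q(s', a'; theta)
    q_eval = target_net(next_states).gather(1, best_actions).squeeze(1)  # Q(s', a*; theta^-)
    return rewards + gamma * (1.0 - dones) * q_eval
```

The only change from the vanilla target is which network picks the $\arg\max$; evaluating that action with the target network removes most of the overestimation.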
Rainbow is the standard reference for "modern DQN" and the practical baseline modern value-based RL papers compare against.
Limitations
DQN works for discrete action spaces — taking the $\max$ over actions in the target (and the $\arg\max$ when acting) means evaluating every action, which is only feasible when the actions can be enumerated. Continuous control needs a different way to perform that maximisation, which is the gap DDPG and its successors fill.
DQN is also notoriously sample-inefficient — Atari training uses 50–200M frames per game (roughly 10–40 days of real-time play at 60 frames per second) and days of GPU time per run. Modern improvements (model-based methods, parallel actors, better exploration) help, but model-free deep RL still trails human sample efficiency by orders of magnitude.
What to read next
- Q-Learning — the tabular precursor and convergence theory.
- Actor-Critic — the policy-gradient cousin with similar engineering tricks.
- DDPG, TD3, SAC — the Q-learning recipe extended to continuous actions via actor-critic machinery.