Multi-Agent RL
Multi-agent RL (MARL) extends single-agent RL to environments with multiple learners. The single hardest fact: each agent's environment includes the other agents, and as their policies change during training, the environment dynamics from any one agent's perspective become non-stationary. Almost every MARL difficulty traces back to this.
The setting
Generalise the MDP to a Markov game (stochastic game): a tuple $\langle N, S, \{A_i\}_{i=1}^N, P, \{r_i\}_{i=1}^N, \gamma \rangle$ with $N$ agents, a shared state space $S$, per-agent action spaces $A_i$, a transition function $P(s' \mid s, a_1, \dots, a_N)$ that conditions on the joint action, and per-agent reward functions $r_i(s, a_1, \dots, a_N)$.
Three regime classes:
- Cooperative — all agents share a single team reward, $r_1 = r_2 = \dots = r_N$. Examples: StarCraft micromanagement, multi-robot warehouse coordination.
- Competitive (zero-sum) — agents' rewards sum to zero, $\sum_i r_i = 0$. Examples: chess, Go, poker.
- Mixed — neither. Examples: traffic routing, multi-agent economics.
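To make the joint-action structure of the tuple above concrete, here is a minimal sketch of a Markov-game interface in the spirit of PettingZoo's parallel API. The class, the matching-pennies payoff, and every name are illustrative assumptions, not the API of any particular library.

```python
from typing import Dict, Tuple
import numpy as np

class MatrixMarkovGame:
    """Toy 2-agent Markov game: one state, joint-action rewards.

    Illustrates the structural point: the transition and every r_i take
    the *joint* action (a_1, ..., a_N), not one agent's action alone.
    """

    def __init__(self):
        self.agents = ["agent_0", "agent_1"]
        # Zero-sum payoff for agent_0 (agent_1 receives the negation),
        # indexed by (a_0, a_1) -- matching-pennies style.
        self.payoff = np.array([[+1.0, -1.0],
                                [-1.0, +1.0]])

    def reset(self) -> Dict[str, int]:
        # Single-state game: every agent observes the same dummy state.
        return {agent: 0 for agent in self.agents}

    def step(self, joint_action: Dict[str, int]
             ) -> Tuple[Dict[str, int], Dict[str, float]]:
        a0 = joint_action["agent_0"]
        a1 = joint_action["agent_1"]
        r0 = float(self.payoff[a0, a1])
        rewards = {"agent_0": r0, "agent_1": -r0}   # zero-sum regime
        observations = {agent: 0 for agent in self.agents}
        return observations, rewards
```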
The right algorithmic family depends on which regime you're in.
The non-stationarity problem
Naive independent learning — running Q-learning or PPO for each agent separately, treating other agents as part of the environment — fails for a structural reason. As agent $i$'s policy changes, the transition and reward dynamics experienced by every other agent $j$ shift with it, so the stationarity assumptions behind single-agent convergence guarantees no longer hold.
In practice, independent PPO often does work on cooperative tasks — the messy, non-stationary signal turns out to be acceptable when rewards are aligned. In competitive games it fails: opponents specifically learn to exploit the learner's stale assumptions.
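A minimal sketch of what independent learning looks like, assuming a stateless matching-pennies game with tabular, bandit-style Q-values standing in for full Q-learning; the payoff matrix and hyperparameters are arbitrary. Each agent updates only from its own reward, so its learning target drifts as the other's policy moves.

```python
import numpy as np

rng = np.random.default_rng(0)

# Matching-pennies payoff for agent 0; agent 1 receives the negation.
payoff = np.array([[+1.0, -1.0],
                   [-1.0, +1.0]])

n_actions = 2
alpha, eps = 0.1, 0.2                            # learning rate, exploration rate
q = [np.zeros(n_actions), np.zeros(n_actions)]   # one Q-table per agent

for t in range(50_000):
    # Each agent acts epsilon-greedily on its *own* Q-values only.
    acts = [
        rng.integers(n_actions) if rng.random() < eps else int(np.argmax(q[i]))
        for i in range(2)
    ]
    r0 = payoff[acts[0], acts[1]]
    rewards = [r0, -r0]
    # Independent updates: each agent treats the other as part of the
    # environment, so its effective reward distribution drifts as the
    # opponent's policy changes -- the non-stationarity described above.
    for i in range(2):
        q[i][acts[i]] += alpha * (rewards[i] - q[i][acts[i]])
```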
Self-play and AlphaZero
For two-player zero-sum games, self-play is the canonical solution. Train a single network playing both sides; each policy improvement makes the opponent harder, which generates harder training data. Combined with Monte Carlo Tree Search (MCTS) and a value network, this is the AlphaGo / AlphaZero recipe (Silver et al., Science 2018):
- Run MCTS using current policy and value networks to produce strong moves.
- Update networks to imitate MCTS choices and predict eventual game outcome.
- Repeat against the latest network as opponent.
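Step 2 is where the learning signal lives. Below is a small numpy sketch of an AlphaZero-style loss for a single position, assuming the MCTS visit counts and the game outcome are already available; the function and argument names are illustrative, and the weight-decay term used in the paper is omitted.

```python
import numpy as np

def alphazero_loss(policy_logits, value_pred, mcts_visits, outcome):
    """AlphaZero-style training objective for one position.

    policy_logits : raw network outputs over legal moves, shape (n_moves,)
    value_pred    : scalar network prediction of the outcome, in [-1, 1]
    mcts_visits   : MCTS visit counts for the same moves, shape (n_moves,)
    outcome       : eventual game result from this player's view (+1/0/-1)
    """
    # Policy target: normalised visit counts from the MCTS search.
    pi = mcts_visits / mcts_visits.sum()
    # Numerically stable log-softmax of the network's policy head.
    z = policy_logits - policy_logits.max()
    log_p = z - np.log(np.exp(z).sum())
    policy_loss = -np.dot(pi, log_p)          # imitate the MCTS move distribution
    value_loss = (outcome - value_pred) ** 2  # predict the eventual game outcome
    return policy_loss + value_loss
```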
Convergence to a Nash equilibrium can be guaranteed for two-player zero-sum games under fictitious self-play variants (Heinrich & Silver, 2016), and self-play training is empirically robust on Go, chess, shogi, and StarCraft.
Centralised training, decentralised execution
A pragmatic compromise for cooperative MARL: at training time, give the critic access to global state and all agents' actions (a centralised critic), but keep each actor restricted to its local observation (decentralised execution). Examples:
- MADDPG (Lowe et al., NIPS 2017) — a centralised critic $Q_i(s, a_1, \dots, a_N)$ per agent, conditioned on all agents' actions. Each critic is trained with full information; each actor uses only its local observation at deployment (a minimal critic sketch appears below).
- QMIX (Rashid et al., ICML 2018) — for cooperative tasks, factorise the joint Q-function as a monotonic mixing of per-agent Q-functions, $Q_{\text{tot}}(s, \mathbf{a}) = f_{\text{mix}}\big(Q_1(o_1, a_1), \dots, Q_N(o_N, a_N); s\big)$, with $f_{\text{mix}}$ monotone in each input, i.e. $\partial Q_{\text{tot}} / \partial Q_i \ge 0$. This guarantees that decentralised greedy action selection matches centralised joint maximisation.
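A minimal PyTorch sketch of the monotonic mixer just described, simplified relative to the paper (single-layer hypernetworks, no TD loss or target networks); class and variable names are illustrative. Monotonicity comes from taking the absolute value of the state-conditioned mixing weights.

```python
import torch
import torch.nn as nn

class MonotonicMixer(nn.Module):
    """QMIX-style mixer: combines per-agent Q-values into Q_tot.

    Taking the absolute value of the hypernetwork-generated weights makes
    dQ_tot/dQ_i >= 0, so per-agent greedy actions match the joint argmax.
    """

    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        # Hypernetworks: mixing weights/biases are functions of the global state.
        self.w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.b1 = nn.Linear(state_dim, embed_dim)
        self.w2 = nn.Linear(state_dim, embed_dim)
        self.b2 = nn.Linear(state_dim, 1)
        self.n_agents, self.embed_dim = n_agents, embed_dim

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        b = agent_qs.size(0)
        w1 = torch.abs(self.w1(state)).view(b, self.n_agents, self.embed_dim)
        b1 = self.b1(state).view(b, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.w2(state)).view(b, self.embed_dim, 1)
        b2 = self.b2(state).view(b, 1, 1)
        q_tot = torch.bmm(hidden, w2) + b2
        return q_tot.view(b, 1)
```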
Centralised-training/decentralised-execution is the dominant template for cooperative MARL benchmarks (StarCraft Multi-Agent Challenge, Hanabi, MPE).
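For the MADDPG side, here is a minimal sketch of a centralised critic and one TD update, assuming PyTorch and omitting target networks, replay details, and the actor update; all names and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CentralisedCritic(nn.Module):
    """Q_i(s, a_1, ..., a_N): sees the global state and every agent's action
    during training, even though each actor only sees its own observation."""

    def __init__(self, state_dim: int, joint_action_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + joint_action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, joint_action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, joint_action], dim=-1))


def critic_td_update(critic, optimiser, batch, gamma: float = 0.99):
    """One TD step for agent i's centralised critic.

    batch = (state, joint_action, reward_i, next_state, next_joint_action),
    where next_joint_action would come from all agents' (target) policies.
    The online critic is reused for the bootstrap target here for brevity.
    """
    s, a, r, s_next, a_next = batch
    with torch.no_grad():
        target = r + gamma * critic(s_next, a_next)
    loss = F.mse_loss(critic(s, a), target)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()
```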
Communication and emergent language
When agents share goals but partial observations, explicit communication channels can be added. Emergent communication research (Foerster et al., 2016; Lazaridou et al., 2017) trains differentiable channels and observes that agents develop their own protocols — sometimes interpretable, often not. This work feeds into modern multi-agent LLM research, where systems like multi-LLM debate or agent coordination are conceptually MARL with language as the action space.
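As a toy illustration of a differentiable channel in the spirit of this line of work, the sketch below has a speaker agent emit a continuous message that a listener consumes as extra input, so gradients from the listener's objective flow back into the speaker; every module name and dimension here is an assumption for illustration, not any paper's exact architecture.

```python
import torch
import torch.nn as nn

class SpeakerListener(nn.Module):
    """Two-agent differentiable channel: the speaker encodes its private
    observation into a message; the listener acts on its own observation
    plus the message. Gradients from the listener's loss flow back through
    the message into the speaker, which is what makes a protocol learnable."""

    def __init__(self, obs_dim: int, msg_dim: int, n_actions: int, hidden: int = 32):
        super().__init__()
        self.speaker = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, msg_dim)
        )
        self.listener = nn.Sequential(
            nn.Linear(obs_dim + msg_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions)
        )

    def forward(self, speaker_obs: torch.Tensor, listener_obs: torch.Tensor):
        message = torch.tanh(self.speaker(speaker_obs))      # continuous channel
        logits = self.listener(torch.cat([listener_obs, message], dim=-1))
        return logits, message
```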
Game-theoretic foundations
For competitive environments, MARL connects to game theory. Nash equilibrium, correlated equilibrium, and fictitious play all formalise solution concepts. Modern poker-bot work (Pluribus, DeepStack) blends counterfactual regret minimisation (CFR) with deep value networks to produce Nash-approximating policies in imperfect-information games.
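The subroutine underneath CFR is regret matching. The numpy sketch below runs regret matching in self-play on rock-paper-scissors, a one-state zero-sum game, so the average strategies approach the uniform Nash equilibrium; it is the single-decision special case, not full CFR over an imperfect-information game tree.

```python
import numpy as np

# Row player's payoff in rock-paper-scissors; the column player gets the negation.
payoff = np.array([
    [ 0, -1,  1],
    [ 1,  0, -1],
    [-1,  1,  0],
], dtype=float)

def regret_matching(cum_regret: np.ndarray) -> np.ndarray:
    """Play in proportion to positive cumulative regret (uniform if none)."""
    positive = np.maximum(cum_regret, 0.0)
    total = positive.sum()
    return positive / total if total > 0 else np.full(len(cum_regret), 1 / len(cum_regret))

n = 3
regret = [np.zeros(n), np.zeros(n)]        # cumulative regrets, one per player
strategy_sum = [np.zeros(n), np.zeros(n)]  # running sum for the average strategy

for t in range(20_000):
    strat = [regret_matching(regret[0]), regret_matching(regret[1])]
    for i in range(2):
        strategy_sum[i] += strat[i]
    # Expected payoff of each pure action against the opponent's current mix.
    u0 = payoff @ strat[1]          # row player's action values
    u1 = -(payoff.T @ strat[0])     # column player's action values
    regret[0] += u0 - strat[0] @ u0
    regret[1] += u1 - strat[1] @ u1

# Average strategies approximate the Nash equilibrium (~[1/3, 1/3, 1/3] here).
avg = [s / s.sum() for s in strategy_sum]
```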
What to read next
- PPO & TRPO — independent PPO is the cooperative-MARL baseline.
- World Models — opponent modelling is a model-based MARL strategy.
- LLM Agents — LLM-as-agent setups inherit MARL's coordination problems.