
Multi-Agent RL

Multi-agent RL (MARL) extends single-agent RL to environments with multiple learners. The single hardest fact: each agent's environment includes the other agents, and as their policies change during training, the environment dynamics from any one agent's perspective become non-stationary. Almost every MARL difficulty traces back to this.

The setting

Generalise the MDP to $N$ agents. The full game is a tuple $(S, \{A_i\}_{i=1}^N, P, \{r_i\}_{i=1}^N, \gamma)$ where each agent $i$ has its own action space $A_i$ and reward function $r_i$. The transition $P(s' \mid s, a_1, \dots, a_N)$ depends on all agents' actions.
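
As a concrete sketch of the interface this tuple implies (the class and method names below are illustrative, not from any particular library):

```python
from typing import Dict, Tuple

AgentID = str
Action = int
Obs = Tuple[float, ...]

class MarkovGame:
    """Minimal N-agent environment implied by (S, {A_i}, P, {r_i}, gamma).

    Unlike a single-agent env, step() consumes a joint action (one entry
    per agent) and returns per-agent rewards: the transition depends on
    what everyone did.
    """

    def reset(self) -> Dict[AgentID, Obs]:
        """Return each agent's initial observation of the state."""
        raise NotImplementedError

    def step(
        self, actions: Dict[AgentID, Action]
    ) -> Tuple[Dict[AgentID, Obs], Dict[AgentID, float], bool]:
        """Sample s' ~ P(. | s, a_1, ..., a_N); return obs, rewards {r_i}, done."""
        raise NotImplementedError
```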

Three regime classes:

  • Cooperative — all agents share a single team reward, $r_i = r$ for all $i$. Examples: StarCraft micromanagement, multi-robot warehouse coordination.
  • Competitive (zero-sum) — $\sum_i r_i = 0$. Examples: chess, Go, poker.
  • Mixed — neither. Examples: traffic routing, multi-agent economics.

The right algorithmic family depends on which regime you're in.

The non-stationarity problem

Naive independent learning — run Q-learning or PPO for each agent separately, treating other agents as part of the environment — fails for a structural reason. As agent j's policy πj changes, the observed transition probabilities from agent i's perspective change. Agent i's value estimates become outdated, agent i updates against stale targets, the dynamics shift again. The environment is no longer a stationary MDP and standard convergence guarantees evaporate.

In practice, independent PPO often does work on cooperative tasks — the messy, non-stationary signal turns out to be acceptable when rewards are aligned. In competitive games it fails: opponents specifically learn to exploit the learner's stale assumptions.
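
A minimal sketch of independent tabular Q-learning makes the failure concrete: each agent bootstraps against targets that silently shift as the other agents' tables change. The `env` interface and `num_actions` attribute are assumptions matching the sketch above.

```python
import random
from collections import defaultdict

def independent_q_learning(env, agent_ids, episodes=1000,
                           alpha=0.1, gamma=0.99, eps=0.1):
    """Independent Q-learning: each agent runs vanilla Q-learning on its own
    observations, treating everyone else as part of the environment. Each
    agent's bootstrap targets shift whenever the other agents' tables (and
    hence policies) change -- the non-stationarity described above.
    """
    n = env.num_actions  # assumed: same discrete action count for all agents
    Q = {i: defaultdict(lambda: [0.0] * n) for i in agent_ids}
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            actions = {}
            for i in agent_ids:  # epsilon-greedy w.r.t. agent i's own table
                if random.random() < eps:
                    actions[i] = random.randrange(n)
                else:
                    actions[i] = max(range(n), key=Q[i][obs[i]].__getitem__)
            next_obs, rewards, done = env.step(actions)
            for i in agent_ids:  # standard TD(0) update, per agent
                target = rewards[i] + (0.0 if done else gamma * max(Q[i][next_obs[i]]))
                Q[i][obs[i]][actions[i]] += alpha * (target - Q[i][obs[i]][actions[i]])
            obs = next_obs
    return Q
```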

Self-play and AlphaZero

For two-player zero-sum games, self-play is the canonical solution. Train a single network playing both sides; each policy improvement makes the opponent stronger, which in turn generates harder training data. Combined with Monte Carlo Tree Search (MCTS) and a value network, this is the AlphaGo / AlphaZero recipe (Silver et al., Science 2018):

  • Run MCTS using current policy and value networks to produce strong moves.
  • Update the networks to imitate the MCTS move choices and predict the eventual game outcome.
  • Repeat against the latest network as opponent.

Convergence to a Nash equilibrium is theoretically guaranteed for two-player zero-sum games under fictitious-play-style self-play (Heinrich & Silver, 2016, Fictitious Self-Play), and the approach is empirically robust on Go, chess, and shogi; StarCraft required league-based extensions of the same idea.
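
Schematically, the loop is short. In the sketch below, `sample_game` and `train_step` are assumed placeholders for the MCTS-driven game generation and the network update, not AlphaZero's actual API:

```python
def self_play_training(network, sample_game, train_step,
                       num_iterations=100, games_per_iter=32):
    """Schematic AlphaZero-style loop. `sample_game(network)` is assumed to
    play one self-play game with MCTS and return (states, mcts_policies,
    outcome), where outcome is +1/-1 from the first player's perspective.
    """
    replay_buffer = []
    for _ in range(num_iterations):
        # 1. Generate data: the latest network plays both sides of each game.
        for _ in range(games_per_iter):
            states, mcts_policies, outcome = sample_game(network)
            # Policy target: the MCTS visit distribution at each state.
            # Value target: the final outcome, sign-flipped each ply so it is
            # always from the perspective of the player to move.
            values = [outcome * (-1) ** t for t in range(len(states))]
            replay_buffer.extend(zip(states, mcts_policies, values))
        # 2. Update the network to imitate MCTS and predict the outcome.
        network = train_step(network, replay_buffer)
    return network
```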

Centralised training, decentralised execution

A pragmatic compromise for cooperative MARL: at training time, give the critic access to global state and all agents' actions (a centralised critic), but keep each actor restricted to its local observation (decentralised execution). Examples:

  • MADDPG (Lowe et al., NIPS 2017) — a centralised Q-function for each agent, conditioned on all agents' actions. Each Q is trained with full information; each actor conditions only on its local observation at deployment.
  • QMIX (Rashid et al., ICML 2018) — for cooperative tasks, factorise the joint Q-function as a monotonic mixing of per-agent Q-functions, $Q_{\text{tot}}(s, \mathbf{a}) = f(Q_1, \dots, Q_N)$ with $f$ monotone in each input. This guarantees that decentralised greedy action selection $\arg\max_{a_i} Q_i$ matches centralised joint maximisation; a sketch of the mixing network follows this list.
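
Below is a sketch of QMIX's monotonicity trick, with assumed tensor shapes rather than the authors' implementation: hypernetworks generate the mixing weights from the global state, and taking their absolute value keeps $\partial Q_{\text{tot}} / \partial Q_i \ge 0$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotonicMixer(nn.Module):
    """Sketch of a QMIX-style mixing network (assumed shapes, not the
    authors' code). Hypernetworks map the global state s to the weights of
    a small net combining per-agent values Q_i into Q_tot; abs() on the
    generated weights enforces dQ_tot/dQ_i >= 0, so each agent's local
    argmax over Q_i agrees with the joint argmax over Q_tot.
    """

    def __init__(self, n_agents: int, state_dim: int, hidden: int = 32):
        super().__init__()
        self.n_agents, self.hidden = n_agents, hidden
        self.w1 = nn.Linear(state_dim, n_agents * hidden)  # hypernet: layer-1 weights
        self.b1 = nn.Linear(state_dim, hidden)             # hypernet: layer-1 bias
        self.w2 = nn.Linear(state_dim, hidden)             # hypernet: layer-2 weights
        self.b2 = nn.Linear(state_dim, 1)                  # hypernet: layer-2 bias

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        w1 = self.w1(state).abs().view(-1, self.n_agents, self.hidden)
        b1 = self.b1(state).view(-1, 1, self.hidden)
        h = F.elu(agent_qs.unsqueeze(1) @ w1 + b1)         # (batch, 1, hidden)
        w2 = self.w2(state).abs().view(-1, self.hidden, 1)
        b2 = self.b2(state).view(-1, 1, 1)
        return (h @ w2 + b2).view(-1)                      # Q_tot: (batch,)
```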

Centralised-training/decentralised-execution is the dominant template for cooperative MARL benchmarks (StarCraft Multi-Agent Challenge, Hanabi, MPE).

Communication and emergent language

When agents share goals but partial observations, explicit communication channels can be added. Emergent communication research (Foerster et al., 2016; Lazaridou et al., 2017) trains differentiable channels and observes that agents develop their own protocols — sometimes interpretable, often not. This work feeds into modern multi-agent LLM research, where systems like multi-LLM debate or agent coordination are conceptually MARL with language as the action space.
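
A minimal sketch of a differentiable channel, in the spirit of (but not copied from) this line of work: each agent emits a continuous message tensor, so the team loss backpropagates through what was "said". All dimensions and names are illustrative.

```python
import torch
import torch.nn as nn

class CommAgent(nn.Module):
    """Each agent reads its observation plus the other agents' previous
    messages, and emits both action scores and a new real-valued message.
    Because the message is a continuous tensor, gradients of a shared team
    loss flow through the channel, letting a protocol emerge end-to-end.
    """

    def __init__(self, obs_dim: int, msg_dim: int, n_actions: int, n_others: int):
        super().__init__()
        in_dim = obs_dim + n_others * msg_dim
        self.body = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU())
        self.action_head = nn.Linear(64, n_actions)  # what to do
        self.msg_head = nn.Linear(64, msg_dim)       # what to say

    def forward(self, obs: torch.Tensor, inbox: torch.Tensor):
        # obs: (batch, obs_dim), inbox: (batch, n_others * msg_dim)
        h = self.body(torch.cat([obs, inbox], dim=-1))
        return self.action_head(h), torch.tanh(self.msg_head(h))
```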

Game-theoretic foundations

For competitive environments, MARL connects to game theory. Nash equilibrium, correlated equilibrium, and fictitious play all formalise solution concepts. Modern poker-bot work (Pluribus, DeepStack) blends counterfactual regret minimisation (CFR) with deep value networks to produce Nash-approximating policies in imperfect-information games.
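
CFR's core update, regret matching, fits in a few lines. The sketch below is a hand-rolled matrix-game example (not Pluribus or DeepStack code): average play converges to the uniform Nash equilibrium of rock-paper-scissors.

```python
import numpy as np

def regret_matching(payoff: np.ndarray, iters: int = 20000) -> np.ndarray:
    """Self-play regret matching on a two-player zero-sum matrix game.

    Each iteration, both players act according to their positive regrets,
    then accumulate the regret of every pure action against the opponent's
    current mix. The *average* strategy converges to a Nash equilibrium --
    the guarantee CFR extends to sequential imperfect-information games.
    """
    n = payoff.shape[0]
    regrets = [np.zeros(n), np.zeros(n)]
    strategy_sum = np.zeros(n)

    def current_strategy(r):
        pos = np.maximum(r, 0.0)
        return pos / pos.sum() if pos.sum() > 0 else np.full(n, 1.0 / n)

    for _ in range(iters):
        s0, s1 = current_strategy(regrets[0]), current_strategy(regrets[1])
        u0 = payoff @ s1        # row player's value of each pure action
        u1 = -(s0 @ payoff)     # column player's (zero-sum, so negated)
        regrets[0] += u0 - s0 @ u0
        regrets[1] += u1 - s1 @ u1
        strategy_sum += s0
    return strategy_sum / iters  # average strategy ~ Nash

# Rock-paper-scissors: average play converges to (1/3, 1/3, 1/3).
rps = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)
print(regret_matching(rps))
```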

Related

  • PPO & TRPO — independent PPO is the cooperative-MARL baseline.
  • World Models — opponent modelling is a model-based MARL strategy.
  • LLM Agents — LLM-as-agent setups inherit MARL's coordination problems.
