Offline RL
Offline RL — also called batch RL — learns a policy from a fixed, pre-collected dataset of transitions $(s, a, r, s')$ logged by some behaviour policy, with no further interaction with the environment.
The core problem: distribution shift
Online RL relies on collecting fresh data with the current policy. Offline RL forbids this — the dataset is all you get, and the learned policy will inevitably prefer actions the behaviour policy never took.
The TD target

$$
y = r + \gamma \max_{a'} Q_\theta(s', a')
$$

evaluates $Q$ at whatever action maximises it, including actions with no support in the dataset. At those out-of-distribution (OOD) actions $Q$ is pure extrapolation; the max preferentially selects its optimistic errors, and bootstrapping feeds them back into the targets.
This is the deadly triad (function approximation + bootstrapping + off-policy) under maximum stress. Standard fixes (replay buffer, target network) do not help — they rely on fresh interaction eventually correcting the value errors, and fresh interaction is exactly what offline RL forbids.
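A minimal sketch of where the naive update goes wrong, assuming a standard discrete-action Q-network in PyTorch; the function and variable names are illustrative, not taken from any particular codebase:

```python
import torch

def naive_offline_td_target(target_q_net, r, s_prime, done, gamma=0.99):
    """Naive offline TD target (illustrative sketch).

    The max over a' is taken by the learned network itself, so nothing restricts
    a' to actions the fixed dataset contains. At OOD actions Q is extrapolating,
    the max preferentially picks its over-estimates, and bootstrapping feeds the
    error back into the next round of targets.
    """
    with torch.no_grad():
        # max_{a'} Q(s', a') over *all* actions, in-distribution or not
        q_next = target_q_net(s_prime).max(dim=1).values
    return r + gamma * (1.0 - done) * q_next
```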
Behaviour Cloning as a baseline
The simplest offline approach: behaviour cloning (BC) — supervised learning that maps the dataset's states to its actions, ignoring rewards entirely. At best, BC matches the behaviour policy that collected the data.
Offline RL aims to do better than BC by using the reward signal — without falling into the distribution-shift trap.
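For reference, the BC baseline is a one-line supervised loss. A minimal sketch for continuous actions, assuming a deterministic policy network and MSE regression (both are illustrative choices, not mandated by the text):

```python
import torch
import torch.nn as nn

def bc_loss(policy: nn.Module, states: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
    # Plain supervised regression: predict the dataset action for each state.
    # No reward, no bootstrapping, hence no distribution-shift problem,
    # but also no way to outperform the policy that collected the data.
    return ((policy(states) - actions) ** 2).mean()
```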
Constraint-based methods
The dominant family addresses the OOD-action problem by constraining the learned policy to stay close to the dataset's action distribution.
- BCQ (Fujimoto, Meger, Precup, ICML 2019) — train a generative model of the dataset's action distribution (an estimate of the behaviour policy $\pi_\beta(a \mid s)$), then restrict the actor to small perturbations of that model's samples (action selection sketched after this list). Q evaluation never sees actions outside the data.
- CQL (Kumar, Zhou, Tucker, Levine, NeurIPS 2020) — Conservative Q-Learning. Add a regulariser to the Q-loss that pushes Q-values down for OOD actions and up for in-distribution ones (a sketch of the penalty also follows the list):

$$
\mathcal{L}_{\text{CQL}} \;=\; \alpha \left( \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi(\cdot \mid s)}\big[Q(s, a)\big] \;-\; \mathbb{E}_{(s, a) \sim \mathcal{D}}\big[Q(s, a)\big] \right) \;+\; \mathcal{L}_{\text{Bellman}}
$$
CQL gives a lower bound on the true Q-function, ensuring the policy never bootstraps from optimistic estimates. It's the most-cited modern offline RL baseline.
- IQL (Kostrikov, Nair, Levine, ICLR 2022) — Implicit Q-Learning. Estimate the value function $V(s)$ by expectile regression against $Q(s, a)$ over dataset actions (the expectile loss is sketched after the list), then update the policy via advantage-weighted behaviour cloning. Q is never queried at OOD actions — the entire OOD problem is sidestepped. Strong empirical performance with simpler tuning than CQL.
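A sketch of BCQ's action selection, assuming a pre-trained VAE decoder, perturbation network, and Q-network; all names and the candidate count are illustrative assumptions rather than the paper's code:

```python
import torch

def bcq_select_action(state, vae_decoder, perturb_net, q_net, n_candidates=10):
    """BCQ-style action selection (illustrative sketch).

    Candidate actions come from a generative model fit to the dataset's actions,
    each gets a small learned correction (bounded by a hyperparameter Phi in the
    paper), and the highest-Q candidate is executed. Q is therefore only ever
    evaluated near actions the data actually contains.
    """
    states = state.unsqueeze(0).repeat(n_candidates, 1)        # one copy of the state per candidate
    candidates = vae_decoder(states)                           # actions from the data's action model
    candidates = candidates + perturb_net(states, candidates)  # small, bounded perturbation
    best = q_net(states, candidates).argmax()
    return candidates[best]
```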
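A sketch of the CQL penalty under the same illustrative naming; in practice the OOD actions are sampled from the current policy (the paper's CQL(H) variant uses a log-sum-exp over actions instead):

```python
import torch

def cql_penalty(q_net, states, dataset_actions, policy, alpha=1.0):
    """Conservative Q-Learning penalty, added to the usual Bellman error (sketch).

    Q is pushed down on actions the current policy proposes (which may be OOD)
    and pushed up on the actions actually stored in the dataset.
    """
    policy_actions = policy(states)                      # candidate, possibly OOD, actions
    q_policy = q_net(states, policy_actions).mean()      # E_{s~D, a~pi}[Q(s, a)]
    q_data = q_net(states, dataset_actions).mean()       # E_{(s, a)~D}[Q(s, a)]
    return alpha * (q_policy - q_data)
```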
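And the expectile loss at the heart of IQL's value fit; the expectile $\tau = 0.7$ is a common choice, not a value taken from this text:

```python
import torch

def expectile_loss(q_values, v_values, tau=0.7):
    """Expectile regression loss used to fit V(s) in IQL (sketch).

    With tau > 0.5, under-estimates of Q are penalised more than over-estimates,
    so V(s) approaches an upper expectile of Q(s, a) over dataset actions: an
    implicit max that never evaluates Q at out-of-distribution actions.
    """
    diff = q_values - v_values
    weight = torch.where(diff > 0, tau * torch.ones_like(diff), (1.0 - tau) * torch.ones_like(diff))
    return (weight * diff.pow(2)).mean()
```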
Sequence modelling: Decision Transformer
Decision Transformer: Reinforcement Learning via Sequence Modeling (Chen et al., NeurIPS 2021) reframes offline RL as supervised sequence learning. Train a Transformer to predict the next action given a sequence of (return-to-go, state, action) tokens; at test time, condition on a high target return and let the model generate actions autoregressively.
The framing eliminates Bellman bootstrapping entirely — there is no Q function, no value bootstrap, no distribution shift in the algorithmic sense. It is a clean recipe and competitive with CQL/IQL on D4RL benchmarks. The conceptual link to LLMs is exact: this is the same architecture and same training objective as a language model, applied to (state, action, return) sequences.
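A compact sketch of the idea for continuous states and actions; the class name, layer sizes, and context length are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class DecisionTransformerSketch(nn.Module):
    """Decision-Transformer-style sequence model (illustrative sketch)."""

    def __init__(self, state_dim, act_dim, d_model=128, n_layers=3, n_heads=4, max_len=60):
        super().__init__()
        # One embedding per token type: return-to-go, state, action.
        self.embed_rtg = nn.Linear(1, d_model)
        self.embed_state = nn.Linear(state_dim, d_model)
        self.embed_action = nn.Linear(act_dim, d_model)
        self.embed_pos = nn.Embedding(max_len, d_model)   # max_len caps 3 * context length
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.predict_action = nn.Linear(d_model, act_dim)

    def forward(self, rtg, states, actions):
        # rtg: (B, T, 1), states: (B, T, state_dim), actions: (B, T, act_dim).
        # Interleave (return-to-go, state, action) into one causal sequence and
        # train with a plain supervised loss on the predicted actions.
        B, T, _ = states.shape
        tokens = torch.stack(
            [self.embed_rtg(rtg), self.embed_state(states), self.embed_action(actions)], dim=2
        ).reshape(B, 3 * T, -1)
        tokens = tokens + self.embed_pos(torch.arange(3 * T, device=tokens.device))
        # Causal mask: each token attends only to earlier tokens.
        mask = torch.triu(
            torch.full((3 * T, 3 * T), float("-inf"), device=tokens.device), diagonal=1
        )
        h = self.backbone(tokens, mask=mask)
        # The hidden state at each state token predicts the action taken there.
        return self.predict_action(h[:, 1::3])
```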
What offline RL is good for
- Healthcare — train treatment policies from observational records, where deploying an exploratory agent on real patients is unacceptable.
- Autonomous driving — millions of fleet hours of logged driving data, where on-policy exploration is dangerous.
- Recommender systems — production traffic logs, where exploring recommendations costs revenue.
- Robotics — combine with online fine-tuning: pretrain offline on collected demonstrations and autonomous logs, then fine-tune online with task-specific exploration.
What to read next
- Q-Learning — the source of the deadly-triad failure offline RL has to fix.
- DDPG, TD3, SAC — the off-policy algorithms offline RL builds conservative versions of.
- World Models — model-based offline RL can plan inside the learned model.