Reinforcement Learning — Overview
Reinforcement learning is the branch of machine learning in which an agent learns to act in an environment by observing rewards. Unlike supervised learning, the data the agent sees depends on the actions it takes — the model and the data distribution co-evolve. This page is a roadmap of the rest of the RL track and a quick statement of the framework.
What is the RL setting?
An RL problem is specified by a Markov Decision Process (see MDPs & Bellman Equations):
- a state space $\mathcal{S}$ and an action space $\mathcal{A}$,
- a transition kernel $P(s' \mid s, a)$,
- a reward function $r(s, a)$,
- a discount factor $\gamma \in [0, 1)$.

The agent's behaviour is a policy $\pi(a \mid s)$, a distribution over actions given the current state.
What makes RL hard is that learning is online and active: the agent must explore unseen actions, and bad early policies generate bad data, which is hard to recover from.
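The framework above can be made concrete with a minimal interaction loop. The two-state environment, the `step`/`rollout` names, and the hyperparameters below are all illustrative assumptions, not anything defined in this track; the sketch only shows how a policy generates its own data and how the discounted return $\sum_t \gamma^t r_t$ accumulates along a trajectory.

```python
import random

# Toy two-state MDP (an illustrative assumption, not from the track):
# the action chosen becomes the next state; taking action 1 while in
# state 1 yields reward 1, every other (state, action) pair yields 0.
def step(state, action):
    next_state = action
    reward = 1.0 if (state == 1 and action == 1) else 0.0
    return next_state, reward

def rollout(policy, gamma=0.99, horizon=20, seed=0):
    """Run one episode and return the discounted return sum_t gamma^t r_t."""
    rng = random.Random(seed)
    state, ret, discount = 0, 0.0, 1.0
    for _ in range(horizon):
        action = policy(state, rng)          # the agent acts...
        state, reward = step(state, action)  # ...and the data it sees depends on that action
        ret += discount * reward
        discount *= gamma
    return ret

# A uniformly random policy pi(a|s); a better policy would visit
# state 1 and pick action 1 more often, changing its own data stream.
uniform = lambda s, rng: rng.choice([0, 1])
print(rollout(uniform))
```

Note how the distribution of visited states is determined by the policy itself: this coupling between the learner and its data is exactly what the paragraph above flags as the core difficulty.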
The methodological taxonomy
The track decomposes into four families:
- Value-based methods — learn a value function $V(s)$ or $Q(s, a)$ and act greedily. Covered in Q-Learning (tabular) and DQN (deep).
- Policy-gradient methods — directly parameterise the policy $\pi_\theta$ and follow the gradient $\nabla_\theta J(\theta)$ of the expected return. Covered in Policy Gradient, with the modern actor-critic family in Actor-Critic, PPO/TRPO, and DDPG/SAC.
- Model-based methods — learn an explicit dynamics model and plan or train a policy inside it. Covered in World Models.
- Offline & multi-agent — RL with fixed datasets (Offline RL) and games with multiple learners (Multi-Agent RL).
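To give the first family a concrete shape before its own chapter, here is a sketch of the tabular Q-learning update and the greedy action rule it induces. The tiny table sizes, the learning rate `alpha`, and the single hand-fed transition are illustrative assumptions for this page only.

```python
# Illustrative sizes and hyperparameters (assumptions, not from the text).
n_states, n_actions, alpha, gamma = 2, 2, 0.1, 0.99

# Q-table, initialised to zero: Q[s][a] estimates the return of
# taking action a in state s and acting greedily thereafter.
Q = [[0.0] * n_actions for _ in range(n_states)]

def q_update(s, a, r, s_next):
    """One temporal-difference step: move Q(s,a) toward r + gamma * max_a' Q(s',a')."""
    target = r + gamma * max(Q[s_next])   # bootstrap from the best next action
    Q[s][a] += alpha * (target - Q[s][a])

def greedy(s):
    """Value-based acting: pick the action with the highest current estimate."""
    return max(range(n_actions), key=lambda a: Q[s][a])

# Feed one transition (s=0, a=1, r=1.0, s'=1) observed from some environment.
q_update(0, 1, 1.0, 1)
print(Q[0][1])   # -> 0.1: one step of size alpha toward the target of 1.0
```

Policy-gradient methods invert this design: instead of deriving actions from a learned value table, they adjust the policy's parameters directly, which is why the two families get separate chapters.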
Modern RL agents combine elements from several families. AlphaGo / AlphaZero are model-based + policy-gradient + value learning + Monte Carlo Tree Search; modern LLM RLHF combines policy gradient (PPO) with a learned reward model.
RL outside games
The RL framework reaches well past Atari and Go:
- Robotics — locomotion, manipulation, grasping. Sim-to-real domain transfer is the central challenge.
- LLM alignment — RLHF and RLVR are RL with human or verifiable rewards.
- Recommendation, ad ranking, energy management — sequential-decision settings where one-step supervised prediction falls short, because today's action shapes tomorrow's data.
What this track covers
The chapters below assume comfort with supervised deep learning, SGD, and basic probability. The progression is:
- MDPs and Bellman equations — the formalism.
- Tabular Q-learning — the algorithmic prototype, complete with proofs.
- Policy gradient — REINFORCE and the score-function estimator.
- DQN — deep value-based RL, Atari.
- Actor-critic, PPO/TRPO — the modern policy-gradient stack.
- DDPG / SAC — continuous-control RL.
- World models, offline RL, multi-agent RL — three frontier topics.
The point is to understand RL's framework well enough to follow modern RLHF and RLVR work in the LLM track — and to recognise when an RL formulation is and isn't the right tool.
What to read next
- MDPs & Bellman Equations — the formalism.
- Q-Learning — the simplest concrete algorithm.
- RLHF — RL applied to language model alignment.