RLHF — Reinforcement Learning from Human Feedback
The standard alignment pipeline from 2022–2024: after instruction tuning, fine-tune the model further with human preference data so it produces outputs people actually prefer.
The classical three-stage pipeline (InstructGPT)
1. Supervised fine-tuning (SFT) — train on demonstrations of desired behaviour. (See Instruction Tuning.)
2. Reward modelling (RM) — collect $(x, y_w, y_l)$ triples where $y_w$ is the human-preferred response and $y_l$ the rejected one. Train a reward model $r_\phi$ minimising the pairwise Bradley–Terry loss
   $$\mathcal{L}_{\mathrm{RM}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\bigl[\log \sigma\bigl(r_\phi(x, y_w) - r_\phi(x, y_l)\bigr)\bigr].$$
3. PPO against the reward model — fine-tune the LM policy $\pi_\theta$ to maximise $r_\phi$ while staying close to the SFT reference $\pi_{\mathrm{ref}}$:
   $$\max_\theta\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\bigl[r_\phi(x, y)\bigr] \;-\; \beta\,\mathbb{D}_{\mathrm{KL}}\bigl[\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\bigr].$$
This is what produced InstructGPT and, downstream, ChatGPT. PPO is the workhorse but it is finicky — it requires a value model, careful KL control, and a lot of GPU memory.
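A minimal PyTorch sketch of the two learned pieces above, assuming per-sequence RM scores and per-token log-probabilities are already computed; the log-ratio KL estimate, the end-of-sequence reward placement, and the β default are common implementation choices, not the InstructGPT code:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_scores: torch.Tensor,
                      rejected_scores: torch.Tensor) -> torch.Tensor:
    """Pairwise Bradley-Terry loss: -log sigma(r(x, y_w) - r(x, y_l))."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

def kl_shaped_reward(rm_score: torch.Tensor,        # (batch,) scalar RM score per response
                     policy_logprobs: torch.Tensor, # (batch, seq) log-probs under pi_theta
                     ref_logprobs: torch.Tensor,    # (batch, seq) log-probs under pi_ref
                     beta: float = 0.1) -> torch.Tensor:
    """Per-token reward handed to PPO: a KL penalty toward the SFT reference,
    with the RM score credited at the final token (a common, assumed convention)."""
    per_token_kl = policy_logprobs - ref_logprobs   # log-ratio estimate of the KL term
    rewards = -beta * per_token_kl
    rewards[..., -1] += rm_score
    return rewards
```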
Direct Preference Optimization (DPO)
Rafailov et al. (2023) noticed that the optimal policy of the constrained objective above admits a closed form
$$\pi^*(y \mid x) = \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\exp\!\Bigl(\tfrac{1}{\beta}\, r(x, y)\Bigr),$$
which can be inverted to express the implicit reward as
$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x).$$
Substituting this reward into the Bradley–Terry preference model cancels the intractable partition function $Z(x)$, leaving a simple logistic loss over preference pairs:
$$\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\Bigl[\log \sigma\Bigl(\beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Bigr)\Bigr].$$
DPO is one supervised pass — no on-policy sampling, no value head, no PPO. It is now the default for most open-source preference fine-tuning.
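A minimal sketch of that loss, assuming the per-response log-probabilities have already been summed over tokens; β = 0.1 is a commonly used default rather than anything mandated by the paper:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,    # sum log pi_theta(y_w | x)
             policy_rejected_logps: torch.Tensor,  # sum log pi_theta(y_l | x)
             ref_chosen_logps: torch.Tensor,       # sum log pi_ref(y_w | x)
             ref_rejected_logps: torch.Tensor,     # sum log pi_ref(y_l | x)
             beta: float = 0.1) -> torch.Tensor:
    """Logistic loss on the margin of implicit rewards beta * log(pi_theta / pi_ref)."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```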
SimPO
SimPO (Meng et al., 2024) drops the reference model from DPO entirely and uses a length-normalised log-probability margin. Same loss shape, half the memory; competitive or better on many benchmarks.
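A sketch under the same conventions as the DPO snippet, with length normalisation and a target reward margin γ in place of the reference model; the β and γ defaults below are illustrative, the paper tunes them per model:

```python
import torch
import torch.nn.functional as F

def simpo_loss(policy_chosen_logps: torch.Tensor,    # sum log pi_theta(y_w | x)
               policy_rejected_logps: torch.Tensor,  # sum log pi_theta(y_l | x)
               chosen_lengths: torch.Tensor,         # token counts of y_w
               rejected_lengths: torch.Tensor,       # token counts of y_l
               beta: float = 2.0,
               gamma: float = 0.5) -> torch.Tensor:
    """Reference-free loss on the length-normalised log-probability margin."""
    chosen_avg = policy_chosen_logps / chosen_lengths
    rejected_avg = policy_rejected_logps / rejected_lengths
    logits = beta * (chosen_avg - rejected_avg) - gamma
    return -F.logsigmoid(logits).mean()
```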
Fine-grained feedback
The classical setup attaches one preference label to a whole response. Fine-Grained Human Feedback Gives Better Rewards for Language Model Training (Wu et al., 2023) shows that span-level rewards along multiple axes (relevance, factuality, style) train materially better policies than monolithic preferences.
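As a rough illustration (an assumed simplification, not the paper's exact reward routing), each axis contributes a reward at the token that closes its span, and the axes are combined with a weighted sum before being fed to PPO:

```python
import torch

def combine_fine_grained_rewards(token_count: int,
                                 span_rewards: dict[str, list[tuple[int, float]]],
                                 weights: dict[str, float]) -> torch.Tensor:
    """Build a per-token reward vector from span-level feedback.

    span_rewards maps each axis (e.g. "relevance", "factuality") to a list of
    (span_end_token_index, reward) pairs; each span's reward is credited to the
    token that closes it, then axes are summed with per-axis weights.
    """
    total = torch.zeros(token_count)
    for axis, spans in span_rewards.items():
        w = weights.get(axis, 1.0)
        for end_idx, r in spans:
            total[end_idx] += w * r
    return total
```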
Reading list
- Training language models to follow instructions with human feedback — Ouyang et al., NeurIPS 2022 (InstructGPT).
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model — Rafailov et al., NeurIPS 2023.
- SimPO: Simple Preference Optimization with a Reference-Free Reward — Meng et al., NeurIPS 2024.
- Fine-Grained Human Feedback Gives Better Rewards for Language Model Training — Wu et al., NeurIPS 2023.
What to read next
- RLVR — verifiable-reward RL, the post-RLHF wave that powers reasoning models.
- Efficient RLVR — making RL fine-tuning cheap enough to iterate on.