
RLHF — Reinforcement Learning from Human Feedback

The standard alignment pipeline from 2022–2024: after instruction tuning, fine-tune the model further with human preference data so it produces outputs people actually prefer.

The classical three-stage pipeline (InstructGPT)

  1. Supervised fine-tuning (SFT) — train on demonstrations of desired behaviour. (See Instruction Tuning.)

  2. Reward modelling (RM) — collect triples $(x, y_w, y_l)$ where $y_w$ is the human-preferred response and $y_l$ the rejected one. Train a reward model $r_\phi$ by minimising the pairwise loss (a code sketch follows this list)

    $$\mathcal{L}_\phi = -\,\mathbb{E}_{(x, y_w, y_l)}\big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\big].$$
  3. PPO against the reward model — fine-tune the LM policy $\pi_\theta$ to maximise $r_\phi(x, y)$ while staying close to the SFT reference $\pi_{\mathrm{ref}}$:

    $$\max_\theta\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\, \mathrm{KL}\big[\pi_\theta(y \mid x)\,\big\|\,\pi_{\mathrm{ref}}(y \mid x)\big].$$
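A minimal PyTorch sketch of the step-2 pairwise loss, assuming the reward model has already produced scalar scores for the chosen and rejected responses (function and variable names are illustrative, not the InstructGPT code):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(score_chosen: torch.Tensor,
                      score_rejected: torch.Tensor) -> torch.Tensor:
    """-log sigma(r_phi(x, y_w) - r_phi(x, y_l)), averaged over the batch."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Dummy scalar scores for a batch of 4 preference pairs.
score_w = torch.randn(4, requires_grad=True)   # r_phi(x, y_w)
score_l = torch.randn(4)                       # r_phi(x, y_l)
reward_model_loss(score_w, score_l).backward()
```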

This is what produced InstructGPT and, downstream, ChatGPT. PPO is the workhorse but it is finicky — it requires a value model, careful KL control, and a lot of GPU memory.
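In practice the KL term is usually folded into the reward signal fed to PPO rather than optimised as a separate loss. The sketch below shows one common shaping, with the per-token KL penalty and the reward-model score paid out on the final token; it is an illustration under those assumptions, not any specific library's implementation.

```python
import torch

def shaped_rewards(rm_score: torch.Tensor,      # (batch,) scalar reward-model scores
                   logp_policy: torch.Tensor,   # (batch, T) per-token log pi_theta of sampled tokens
                   logp_ref: torch.Tensor,      # (batch, T) per-token log pi_ref of the same tokens
                   beta: float = 0.05) -> torch.Tensor:
    # Per-token KL penalty estimate, then add the scalar RM score at the last token.
    per_token = -beta * (logp_policy - logp_ref)
    per_token[:, -1] += rm_score
    return per_token                            # (batch, T) rewards handed to PPO

scores = torch.tensor([0.7, -0.2])
lp_pi = torch.randn(2, 16)    # log-probs under the current policy
lp_ref = torch.randn(2, 16)   # log-probs under the frozen reference
rewards = shaped_rewards(scores, lp_pi, lp_ref)   # shape (2, 16)
```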

Direct Preference Optimization (DPO)

Rafailov et al. (2023) noticed that the optimal policy of the constrained objective above admits a closed form

$$\pi(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big),$$

which can be inverted to express the implicit reward as $r(x, y) = \beta \log \frac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \text{const}$. Substituting this into the Bradley–Terry preference loss eliminates the explicit reward model entirely:

$$\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x, y_w, y_l)}\Big[\log \sigma\Big(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big)\Big].$$

DPO is one supervised pass — no on-policy sampling, no value head, no PPO. It is now the default for most open-source preference fine-tuning.
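A minimal sketch of the loss, assuming the summed per-sequence log-probabilities have already been computed under the policy and the frozen reference (names and the β value are illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w: torch.Tensor, logp_l: torch.Tensor,          # log pi_theta(y_w|x), log pi_theta(y_l|x)
             ref_logp_w: torch.Tensor, ref_logp_l: torch.Tensor,  # same under frozen pi_ref
             beta: float = 0.1) -> torch.Tensor:
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

# Dummy summed log-probs for a batch of 4 preference pairs.
lw = torch.randn(4, requires_grad=True)
ll = torch.randn(4, requires_grad=True)
dpo_loss(lw, ll, torch.randn(4), torch.randn(4)).backward()
```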

SimPO

SimPO (Meng et al., 2024) drops the reference model from DPO entirely and uses a length-normalised log-probability margin. Same loss shape, half the memory; competitive or better on many benchmarks.
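SimPO also adds a target reward margin γ inside the sigmoid. A sketch of the loss under those assumptions (the β and γ defaults below are plausible values, not prescribed ones):

```python
import torch
import torch.nn.functional as F

def simpo_loss(logp_w: torch.Tensor, len_w: torch.Tensor,   # summed log-prob and token length, chosen
               logp_l: torch.Tensor, len_l: torch.Tensor,   # same for the rejected response
               beta: float = 2.0, gamma: float = 0.5) -> torch.Tensor:
    reward_w = beta * logp_w / len_w    # length-normalised implicit reward, no reference model
    reward_l = beta * logp_l / len_l
    return -F.logsigmoid(reward_w - reward_l - gamma).mean()

# Dummy batch of 4 pairs; lengths are token counts.
lw, ll = torch.randn(4, requires_grad=True), torch.randn(4, requires_grad=True)
simpo_loss(lw, torch.tensor([120., 90., 60., 200.]),
           ll, torch.tensor([80., 150., 70., 40.])).backward()
```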

Fine-grained feedback

The classical setup attaches one preference label to a whole response. Fine-Grained Human Feedback Gives Better Rewards for Language Model Training (Wu et al., 2023) shows that span-level rewards along multiple axes (relevance, factuality, style) train materially better policies than monolithic preferences.
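As an illustration only (not the paper's exact formulation), span-level feedback can be turned into a dense reward by paying each axis's weighted score out at the end of its span; the axis names and weights below are assumptions:

```python
import torch

def fine_grained_rewards(num_tokens: int,
                         spans: list[tuple[int, int, str, float]],  # (start, end, axis, score)
                         weights: dict[str, float]) -> torch.Tensor:
    rewards = torch.zeros(num_tokens)
    for start, end, axis, score in spans:
        # Pay each axis's weighted score at the last token of its span.
        rewards[end - 1] += weights[axis] * score
    return rewards

spans = [(0, 12, "relevance", 1.0), (12, 30, "factuality", -1.0), (0, 30, "style", 0.5)]
weights = {"relevance": 0.3, "factuality": 0.5, "style": 0.2}
rewards = fine_grained_rewards(30, spans, weights)   # dense (30,) reward vector
```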

Reading list

  • Training language models to follow instructions with human feedback — Ouyang et al., NeurIPS 2022 (InstructGPT).
  • Direct Preference Optimization: Your Language Model is Secretly a Reward Model — Rafailov et al., NeurIPS 2023.
  • SimPO: Simple Preference Optimization with a Reference-Free Reward — Meng et al., 2024.
  • Fine-Grained Human Feedback Gives Better Rewards for Language Model Training — Wu et al., NeurIPS 2023.
  • RLVR — verifiable-reward RL, the post-RLHF wave that powers reasoning models.
  • Efficient RLVR — making RL fine-tuning cheap enough to iterate on.
