
RLHF — Reinforcement Learning from Human Feedback

The standard alignment pipeline from 2022–2024: after instruction tuning, fine-tune the model further with human preference data so it produces outputs people actually prefer.

The classical three-stage pipeline (InstructGPT)

  1. Supervised fine-tuning (SFT) — train on demonstrations of desired behaviour. (See Instruction Tuning.)

  2. Reward modelling (RM) — collect triples $(x, y_w, y_l)$ where $y_w$ is the human-preferred response and $y_l$ the rejected one. Train a reward model $r_\phi$ by minimising the pairwise loss (a code sketch follows this list)

    $$\mathcal{L}_\phi = -\,\mathbb{E}_{(x, y_w, y_l)}\big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\big].$$
  3. PPO against the reward model — fine-tune the LM policy $\pi_\theta$ to maximise $r_\phi(x, y)$ while staying close to the SFT reference $\pi_{\mathrm{ref}}$:

    $$\max_\theta\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\, \mathrm{KL}\big[\pi_\theta(y \mid x)\,\big\|\,\pi_{\mathrm{ref}}(y \mid x)\big].$$
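A minimal PyTorch sketch of the step-2 pairwise loss, assuming the reward model has already produced scalar scores for the chosen and rejected responses (function and variable names are illustrative, not the InstructGPT code):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(score_chosen: torch.Tensor,
                      score_rejected: torch.Tensor) -> torch.Tensor:
    """-log sigma(r_phi(x, y_w) - r_phi(x, y_l)), averaged over the batch."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Dummy scalar scores for a batch of 4 preference pairs.
score_w = torch.randn(4, requires_grad=True)   # r_phi(x, y_w)
score_l = torch.randn(4)                       # r_phi(x, y_l)
reward_model_loss(score_w, score_l).backward()
```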

This is what produced InstructGPT and, downstream, ChatGPT. PPO is the workhorse but it is finicky — it requires a value model, careful KL control, and a lot of GPU memory.
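In practice the KL term is usually folded into the reward signal fed to PPO rather than optimised as a separate loss. The sketch below shows one common shaping, with the per-token KL penalty and the reward-model score paid out on the final token; it is an illustration under those assumptions, not any specific library's implementation.

```python
import torch

def shaped_rewards(rm_score: torch.Tensor,      # (batch,) scalar reward-model scores
                   logp_policy: torch.Tensor,   # (batch, T) per-token log pi_theta of sampled tokens
                   logp_ref: torch.Tensor,      # (batch, T) per-token log pi_ref of the same tokens
                   beta: float = 0.05) -> torch.Tensor:
    # Per-token KL penalty estimate, then add the scalar RM score at the last token.
    per_token = -beta * (logp_policy - logp_ref)
    per_token[:, -1] += rm_score
    return per_token                            # (batch, T) rewards handed to PPO

scores = torch.tensor([0.7, -0.2])
lp_pi = torch.randn(2, 16)    # log-probs under the current policy
lp_ref = torch.randn(2, 16)   # log-probs under the frozen reference
rewards = shaped_rewards(scores, lp_pi, lp_ref)   # shape (2, 16)
```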

Direct Preference Optimization (DPO)

Rafailov et al. (2023) noticed that the optimal policy of the constrained objective above admits a closed form

$$\pi(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big),$$

which can be inverted to express the implicit reward as $r(x, y) = \beta \log \frac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \text{const}$. Substituting this into the Bradley–Terry preference loss eliminates the explicit reward model entirely:

$$\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x, y_w, y_l)}\Big[\log \sigma\Big(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big)\Big].$$

DPO is one supervised pass — no on-policy sampling, no value head, no PPO. It is now the default for most open-source preference fine-tuning.
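A minimal sketch of the loss, assuming the summed per-sequence log-probabilities have already been computed under the policy and the frozen reference (names and the β value are illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w: torch.Tensor, logp_l: torch.Tensor,          # log pi_theta(y_w|x), log pi_theta(y_l|x)
             ref_logp_w: torch.Tensor, ref_logp_l: torch.Tensor,  # same under frozen pi_ref
             beta: float = 0.1) -> torch.Tensor:
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

# Dummy summed log-probs for a batch of 4 preference pairs.
lw = torch.randn(4, requires_grad=True)
ll = torch.randn(4, requires_grad=True)
dpo_loss(lw, ll, torch.randn(4), torch.randn(4)).backward()
```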

SimPO

SimPO (Meng et al., 2024) drops the reference model from DPO entirely and uses a length-normalised log-probability margin. Same loss shape, half the memory; competitive or better on many benchmarks.
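SimPO also adds a target reward margin γ inside the sigmoid. A sketch of the loss under those assumptions (the β and γ defaults below are plausible values, not prescribed ones):

```python
import torch
import torch.nn.functional as F

def simpo_loss(logp_w: torch.Tensor, len_w: torch.Tensor,   # summed log-prob and token length, chosen
               logp_l: torch.Tensor, len_l: torch.Tensor,   # same for the rejected response
               beta: float = 2.0, gamma: float = 0.5) -> torch.Tensor:
    reward_w = beta * logp_w / len_w    # length-normalised implicit reward, no reference model
    reward_l = beta * logp_l / len_l
    return -F.logsigmoid(reward_w - reward_l - gamma).mean()

# Dummy batch of 4 pairs; lengths are token counts.
lw, ll = torch.randn(4, requires_grad=True), torch.randn(4, requires_grad=True)
simpo_loss(lw, torch.tensor([120., 90., 60., 200.]),
           ll, torch.tensor([80., 150., 70., 40.])).backward()
```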

Fine-grained feedback

The classical setup attaches one preference label to a whole response. Fine-Grained Human Feedback Gives Better Rewards for Language Model Training (Wu et al., 2023) shows that span-level rewards along multiple axes (relevance, factuality, style) train materially better policies than monolithic preferences.
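As an illustration only (not the paper's exact formulation), span-level feedback can be turned into a dense reward by paying each axis's weighted score out at the end of its span; the axis names and weights below are assumptions:

```python
import torch

def fine_grained_rewards(num_tokens: int,
                         spans: list[tuple[int, int, str, float]],  # (start, end, axis, score)
                         weights: dict[str, float]) -> torch.Tensor:
    rewards = torch.zeros(num_tokens)
    for start, end, axis, score in spans:
        # Pay each axis's weighted score at the last token of its span.
        rewards[end - 1] += weights[axis] * score
    return rewards

spans = [(0, 12, "relevance", 1.0), (12, 30, "factuality", -1.0), (0, 30, "style", 0.5)]
weights = {"relevance": 0.3, "factuality": 0.5, "style": 0.2}
rewards = fine_grained_rewards(30, spans, weights)   # dense (30,) reward vector
```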

Reading list

  • Training language models to follow instructions with human feedback — Ouyang et al., NeurIPS 2022 (InstructGPT).
  • Direct Preference Optimization: Your Language Model is Secretly a Reward Model — Rafailov et al., NeurIPS 2023.
  • SimPO: Simple Preference Optimization with a Reference-Free Reward — Meng et al., 2024.
  • Fine-Grained Human Feedback Gives Better Rewards for Language Model Training — Wu et al., NeurIPS 2023.
  • RLVR — verifiable-reward RL, the post-RLHF wave that powers reasoning models.
  • Efficient RLVR — making RL fine-tuning cheap enough to iterate on.
