
PPO & TRPO

PPO (Proximal Policy Optimization) is the dominant on-policy RL algorithm in 2025 — used in robotics, game AI, and the RLHF pipeline that aligns large language models. It is a simplified, more practical descendant of TRPO (Trust Region Policy Optimization), which gave the field its first principled answer to "how do we keep policy updates from destabilising training?".

The trust-region idea

Vanilla policy gradient updates can be catastrophically large: one bad batch can flip a working policy into a useless one. The fundamental insight is to measure update size not in parameter space ($\Delta\theta$, which is meaningless across reparameterisations) but in policy space ($\mathrm{KL}(\pi_{\theta_{\text{old}}} \,\|\, \pi_\theta)$).
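
A quick way to see the point: softmax policies are invariant to shifting every logit by the same constant, so a large parameter-space move can leave the policy unchanged. A minimal NumPy illustration (not from the original text; values are illustrative):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    return np.sum(p * np.log(p / q))

theta_old = np.array([1.0, 2.0, 3.0])
theta_new = theta_old + 5.0                      # large shift in parameter space

pi_old, pi_new = softmax(theta_old), softmax(theta_new)
print(np.linalg.norm(theta_new - theta_old))     # ||dtheta|| ~ 8.66
print(kl(pi_old, pi_new))                        # KL ~ 0 (identical policy)
```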

TRPO — the constrained problem

Trust Region Policy Optimization (Schulman, Levine, Abbeel et al., ICML 2015) maximises expected advantage subject to a KL constraint:

$$\max_\theta \; \mathbb{E}\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, \hat{A}(s, a)\right] \quad \text{s.t.} \quad \mathbb{E}\left[\mathrm{KL}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s)\,\|\,\pi_\theta(\cdot \mid s)\big)\right] \le \delta.$$

TRPO solves this with a constrained second-order method: linearise the objective, quadratically approximate the KL constraint via the Fisher information matrix (giving the natural-gradient direction), then apply a backtracking line search. The theory behind the constraint yields a monotonic policy-improvement bound, and a single $\delta \approx 0.01$ works across MuJoCo and Atari.
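
As a rough sketch of that machinery: the update direction is obtained with conjugate gradient on Fisher-vector products, then rescaled to sit on the KL boundary. The policy-gradient vector `g` and `fisher_vector_product` are assumed to be supplied by the caller (hypothetical names, not from a specific library), and a backtracking line search on the surrogate and the KL would follow this step:

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only Fisher-vector products."""
    x = np.zeros_like(g)
    r = g.copy()                     # residual
    p = r.copy()                     # search direction
    r_dot = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = r_dot / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        new_r_dot = r @ r
        if new_r_dot < tol:
            break
        p = r + (new_r_dot / r_dot) * p
        r_dot = new_r_dot
    return x

def trpo_step(g, fisher_vector_product, delta=0.01):
    """Natural-gradient step scaled so the quadratic KL estimate equals delta."""
    x = conjugate_gradient(fisher_vector_product, g)          # x ~ F^{-1} g
    step_size = np.sqrt(2 * delta / (x @ fisher_vector_product(x)))
    return step_size * x             # candidate update, then backtracking line search
```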

The drawback is engineering complexity — TRPO requires conjugate-gradient solves and is awkward to mix with parameter sharing or recurrent networks.

PPO — the practical simplification

Proximal Policy Optimization Algorithms (Schulman, Wolski, Dhariwal, Radford, Klimov, 2017) replaces TRPO's hard constraint with a clipped surrogate objective:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}\left[\min\big(r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\right],$$

where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ is the importance ratio. Two regimes, illustrated numerically after the list:

  • Positive advantage ($\hat{A}_t > 0$): clipping prevents the gradient from increasing $\pi_\theta(a_t \mid s_t)$ beyond a factor of $1+\epsilon$ of the old probability. The model can still learn, but only within a bounded "trust region" around $\pi_{\theta_{\text{old}}}$.
  • Negative advantage ($\hat{A}_t < 0$): clipping removes the incentive to push $\pi_\theta(a_t \mid s_t)$ below a factor of $1-\epsilon$ of the old probability.

The standard choice is $\epsilon = 0.2$. The full PPO loss adds a value-function loss and an entropy bonus:

$$L^{\text{PPO}}(\theta, \phi) = L^{\text{CLIP}}(\theta) - c_1\, L^{\text{VF}}(\phi) + c_2\, \mathcal{H}\big[\pi_\theta\big].$$
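
A hedged PyTorch sketch of this combined loss; the coefficient names follow the equation above, and the argument names are illustrative assumptions rather than a reference implementation:

```python
import torch

def ppo_loss(new_logp, old_logp, advantages, values, returns, entropy,
             clip_eps=0.2, c1=0.5, c2=0.01):
    ratio = torch.exp(new_logp - old_logp)                        # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    l_clip = torch.min(unclipped, clipped).mean()                 # clipped surrogate
    l_vf = torch.nn.functional.mse_loss(values, returns)          # value-function loss
    # Minimise the negative of the PPO objective (gradient descent on -L^PPO).
    return -(l_clip - c1 * l_vf + c2 * entropy.mean())
```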

PPO uses standard SGD/Adam, runs multiple epochs over each batch of rollouts (unlike TRPO's single update), and trivially supports any policy parameterisation including recurrent and attention-based ones. This combination — strong empirical performance plus low engineering tax — is why PPO became the default.
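
A sketch of that multi-epoch, minibatched update over one batch of rollouts, reusing the `ppo_loss` helper above; `policy`, `value_net`, and the batch keys are assumptions for illustration:

```python
import torch

def ppo_update(policy, value_net, optimizer, batch, epochs=4, minibatch_size=64):
    for _ in range(epochs):                                       # several passes, unlike TRPO's single step
        for idx in torch.randperm(len(batch["obs"])).split(minibatch_size):
            dist = policy(batch["obs"][idx])                      # e.g. a Categorical distribution
            loss = ppo_loss(
                new_logp=dist.log_prob(batch["actions"][idx]),
                old_logp=batch["old_logp"][idx],                  # stored at rollout time
                advantages=batch["advantages"][idx],
                values=value_net(batch["obs"][idx]).squeeze(-1),
                returns=batch["returns"][idx],
                entropy=dist.entropy(),
            )
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(                       # one of the "details that matter"
                list(policy.parameters()) + list(value_net.parameters()), 0.5)
            optimizer.step()
```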

What PPO is and isn't

PPO is not a theoretical improvement over TRPO: TRPO's monotonic-improvement bound was rigorous, while PPO's clipping is a heuristic. What PPO offers is something much easier to engineer correctly. The "37 implementation details" list (Huang et al., ICLR Blog Track 2022) catalogues the engineering choices that separate a working PPO from one that diverges silently: orthogonal initialisation, advantage normalisation, gradient clipping, learning-rate annealing, and several others all matter.

PPO in RLHF

The RLHF pipeline that produced ChatGPT runs PPO with a learned reward model. Specifics:

  • Policy = the language model.
  • Action = next token.
  • Reward = scalar from a separate reward model fitted on human preference labels.
  • KL penalty between the policy and a frozen SFT reference model, implemented as an additive per-token reward term $-\beta \log\big[\pi_\theta(y \mid x)/\pi_{\text{ref}}(y \mid x)\big]$ rather than a hard constraint (see the sketch after this list).
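
A minimal sketch of how this per-token KL-shaped reward is commonly assembled, assuming per-token log-probabilities of the sampled tokens under the policy and the frozen reference, and a hypothetical `reward_model_score` scalar:

```python
import torch

def rlhf_rewards(policy_logp, ref_logp, reward_model_score, beta=0.1):
    """policy_logp, ref_logp: shape [T] log-probs of the sampled response tokens."""
    kl_penalty = -beta * (policy_logp - ref_logp)   # -beta * log(pi_theta / pi_ref), per token
    rewards = kl_penalty.clone()
    rewards[-1] += reward_model_score               # RM score credited at the final token
    return rewards                                  # per-token rewards fed to PPO's advantage estimator
```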

The KL-to-reference penalty is the analogue of PPO's trust region in the language-model setting: it prevents the policy from drifting arbitrarily far from a known-good initialisation. DPO and its follow-ups have since removed the PPO machinery entirely for many cases, but PPO-RLHF was the recipe that first made aligned LLMs viable.

Related

  • Actor-Critic — the structural template both TRPO and PPO sit in.
  • DDPG, TD3, SAC — the off-policy continuous-control alternative.
  • RLHF — PPO applied to language models.
