
PPO & TRPO

PPO (Proximal Policy Optimization) is the dominant on-policy RL algorithm in 2025 — used in robotics, game AI, and the RLHF pipeline that aligns large language models. It is a simplified, more practical descendant of TRPO (Trust Region Policy Optimization), which gave the field its first principled answer to "how do we keep policy updates from destabilising training?".

The trust-region idea

Vanilla policy gradient updates can be catastrophically large: one bad batch can flip a working policy into a useless one. The fundamental insight is to measure update size not in parameter space ($\Delta\theta$, which is meaningless across reparameterisations) but in policy space ($\mathrm{KL}(\pi_{\theta_{\text{old}}} \,\|\, \pi_\theta)$).
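
A quick way to see the point: softmax policies are invariant to shifting every logit by the same constant, so a large parameter-space move can leave the policy unchanged. A minimal NumPy illustration (not from the original text; values are illustrative):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    return np.sum(p * np.log(p / q))

theta_old = np.array([1.0, 2.0, 3.0])
theta_new = theta_old + 5.0                      # large shift in parameter space

pi_old, pi_new = softmax(theta_old), softmax(theta_new)
print(np.linalg.norm(theta_new - theta_old))     # ||dtheta|| ~ 8.66
print(kl(pi_old, pi_new))                        # KL ~ 0 (identical policy)
```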

TRPO — the constrained problem

Trust Region Policy Optimization (Schulman, Levine, Abbeel et al., ICML 2015) maximises expected advantage subject to a KL constraint:

$$\max_\theta \; \mathbb{E}\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, \hat{A}(s, a)\right] \quad \text{s.t.} \quad \mathbb{E}\left[\mathrm{KL}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s)\,\|\,\pi_\theta(\cdot \mid s)\big)\right] \le \delta.$$

TRPO solves this with a constrained second-order method: linearise the objective, quadratically approximate the KL constraint via the Fisher information matrix (giving the natural-gradient direction), then apply a backtracking line search. The theory behind the constraint yields a monotonic policy-improvement bound, and a single $\delta \approx 0.01$ works across MuJoCo and Atari.
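
As a rough sketch of that machinery: the update direction is obtained with conjugate gradient on Fisher-vector products, then rescaled to sit on the KL boundary. The policy-gradient vector `g` and `fisher_vector_product` are assumed to be supplied by the caller (hypothetical names, not from a specific library), and a backtracking line search on the surrogate and the KL would follow this step:

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only Fisher-vector products."""
    x = np.zeros_like(g)
    r = g.copy()                     # residual
    p = r.copy()                     # search direction
    r_dot = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = r_dot / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        new_r_dot = r @ r
        if new_r_dot < tol:
            break
        p = r + (new_r_dot / r_dot) * p
        r_dot = new_r_dot
    return x

def trpo_step(g, fisher_vector_product, delta=0.01):
    """Natural-gradient step scaled so the quadratic KL estimate equals delta."""
    x = conjugate_gradient(fisher_vector_product, g)          # x ~ F^{-1} g
    step_size = np.sqrt(2 * delta / (x @ fisher_vector_product(x)))
    return step_size * x             # candidate update, then backtracking line search
```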

The drawback is engineering complexity — TRPO requires conjugate-gradient solves and is awkward to mix with parameter sharing or recurrent networks.

PPO — the practical simplification

Proximal Policy Optimization Algorithms (Schulman, Wolski, Dhariwal, Radford, Klimov, 2017) replaces TRPO's hard constraint with a clipped surrogate objective:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}\left[\min\big(r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\right],$$

where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ is the importance ratio. Two regimes, illustrated numerically after the list:

  • Positive advantage ($\hat{A}_t > 0$): clipping prevents the gradient from increasing $\pi_\theta(a_t \mid s_t)$ beyond a factor of $1+\epsilon$ of the old probability. The model can still learn, but only within a bounded "trust region" around $\pi_{\theta_{\text{old}}}$.
  • Negative advantage ($\hat{A}_t < 0$): clipping removes the incentive to push $\pi_\theta(a_t \mid s_t)$ below a factor of $1-\epsilon$ of the old probability.

The standard choice is $\epsilon = 0.2$. The full PPO loss adds a value-function loss and an entropy bonus:

$$L^{\text{PPO}}(\theta, \phi) = L^{\text{CLIP}}(\theta) - c_1\, L^{\text{VF}}(\phi) + c_2\, \mathcal{H}\big[\pi_\theta\big].$$
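
A hedged PyTorch sketch of this combined loss; the coefficient names follow the equation above, and the argument names are illustrative assumptions rather than a reference implementation:

```python
import torch

def ppo_loss(new_logp, old_logp, advantages, values, returns, entropy,
             clip_eps=0.2, c1=0.5, c2=0.01):
    ratio = torch.exp(new_logp - old_logp)                        # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    l_clip = torch.min(unclipped, clipped).mean()                 # clipped surrogate
    l_vf = torch.nn.functional.mse_loss(values, returns)          # value-function loss
    # Minimise the negative of the PPO objective (gradient descent on -L^PPO).
    return -(l_clip - c1 * l_vf + c2 * entropy.mean())
```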

PPO uses standard SGD/Adam, runs multiple epochs over each batch of rollouts (unlike TRPO's single update), and trivially supports any policy parameterisation including recurrent and attention-based ones. This combination — strong empirical performance plus low engineering tax — is why PPO became the default.
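
A sketch of that multi-epoch, minibatched update over one batch of rollouts, reusing the `ppo_loss` helper above; `policy`, `value_net`, and the batch keys are assumptions for illustration:

```python
import torch

def ppo_update(policy, value_net, optimizer, batch, epochs=4, minibatch_size=64):
    for _ in range(epochs):                                       # several passes, unlike TRPO's single step
        for idx in torch.randperm(len(batch["obs"])).split(minibatch_size):
            dist = policy(batch["obs"][idx])                      # e.g. a Categorical distribution
            loss = ppo_loss(
                new_logp=dist.log_prob(batch["actions"][idx]),
                old_logp=batch["old_logp"][idx],                  # stored at rollout time
                advantages=batch["advantages"][idx],
                values=value_net(batch["obs"][idx]).squeeze(-1),
                returns=batch["returns"][idx],
                entropy=dist.entropy(),
            )
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(                       # one of the "details that matter"
                list(policy.parameters()) + list(value_net.parameters()), 0.5)
            optimizer.step()
```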

What PPO is and isn't

PPO is not a theoretical improvement over TRPO: TRPO's monotonic-improvement bound was rigorous, while PPO's clipping is a heuristic. What PPO offers is something much easier to engineer correctly. The "37 implementation details" list (Huang et al., ICLR Blog Track 2022) catalogues the engineering choices that separate a working PPO from one that diverges silently: orthogonal initialisation, advantage normalisation, gradient clipping, learning-rate annealing, and several others all matter.

PPO in RLHF

The RLHF pipeline that produced ChatGPT runs PPO with a learned reward model. Specifics:

  • Policy = the language model.
  • Action = next token.
  • Reward = scalar from a separate reward model fitted on human preference labels.
  • KL penalty between the policy and a frozen SFT reference model, implemented as an additive per-token reward term $-\beta \log\big[\pi_\theta(y \mid x)/\pi_{\text{ref}}(y \mid x)\big]$ rather than a hard constraint (see the sketch after this list).
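
A minimal sketch of how this per-token KL-shaped reward is commonly assembled, assuming per-token log-probabilities of the sampled tokens under the policy and the frozen reference, and a hypothetical `reward_model_score` scalar:

```python
import torch

def rlhf_rewards(policy_logp, ref_logp, reward_model_score, beta=0.1):
    """policy_logp, ref_logp: shape [T] log-probs of the sampled response tokens."""
    kl_penalty = -beta * (policy_logp - ref_logp)   # -beta * log(pi_theta / pi_ref), per token
    rewards = kl_penalty.clone()
    rewards[-1] += reward_model_score               # RM score credited at the final token
    return rewards                                  # per-token rewards fed to PPO's advantage estimator
```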

The KL-to-reference penalty is the analogue of PPO's trust region in the language-model setting: it prevents the policy from drifting arbitrarily far from a known-good initialisation. DPO and its follow-ups have since removed the PPO machinery entirely for many cases, but PPO-RLHF was the recipe that first made aligned LLMs viable.

Related

  • Actor-Critic — the structural template both TRPO and PPO sit in.
  • DDPG, TD3, SAC — the off-policy continuous-control alternative.
  • RLHF — PPO applied to language models.
