
Policy Gradient & REINFORCE

Policy gradient methods are the alternative to value-based RL (Q-learning, DQN). Instead of learning a value function and acting greedily with respect to it, they parameterise the policy directly and follow the gradient of expected return. This makes them natural for continuous action spaces and stochastic policies, and lifts directly to modern actor-critic stacks like PPO and SAC, as well as to RLHF.

The objective

Parameterise the policy as $\pi_\theta(a \mid s)$ — typically a neural network outputting action logits or distribution parameters. The objective is expected return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^t\, r(s_t, a_t)\right].$$

We want $\nabla_\theta J(\theta)$ so we can ascend it by stochastic gradient ascent.
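To make "parameterise the policy" concrete, here is a minimal sketch of a discrete-action policy network in PyTorch (the architecture, sizes, and names are illustrative, not from the original text):

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class DiscretePolicy(nn.Module):
    """pi_theta(a | s): a small MLP mapping observations to action logits."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),  # logits, one per discrete action
        )

    def dist(self, obs: torch.Tensor) -> Categorical:
        return Categorical(logits=self.net(obs))

    def act(self, obs: torch.Tensor):
        d = self.dist(obs)
        a = d.sample()                     # stochastic action
        return a, d.log_prob(a)            # log pi_theta(a | s), needed for the gradient
```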

The policy gradient theorem

The naive expression $\nabla_\theta J = \nabla_\theta \mathbb{E}_\tau[R(\tau)]$ is hard because the trajectory distribution depends on $\theta$. The log-derivative trick (also called REINFORCE or the score-function estimator) rewrites it as

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R_t(\tau)\right],$$

with $R_t = \sum_{k=t}^{T} \gamma^{\,k-t} r_k$ the return-to-go. The gradient is computable from sampled trajectories — no model of the environment dynamics required. The key step is $\nabla_\theta p_\theta = p_\theta \nabla_\theta \log p_\theta$, which trades the gradient of an expectation for the expectation of a (score × return) product.
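Spelled out (a standard derivation, not specific to this page): writing $p_\theta(\tau)$ for the trajectory density,

$$\nabla_\theta \mathbb{E}_{\tau \sim p_\theta}[R(\tau)] = \int \nabla_\theta p_\theta(\tau)\, R(\tau)\, d\tau = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau = \mathbb{E}_{\tau \sim p_\theta}\bigl[\nabla_\theta \log p_\theta(\tau)\, R(\tau)\bigr],$$

and because the dynamics terms in $p_\theta(\tau)$ do not depend on $\theta$, $\nabla_\theta \log p_\theta(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$.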

REINFORCE

Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning (Williams, 1992) named the algorithm. The update rule is plain SGD on the estimator above:

  1. Sample a trajectory $\tau$ by rolling out $\pi_\theta$.
  2. Compute the returns-to-go $R_t$.
  3. Update $\theta \leftarrow \theta + \alpha \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R_t$.
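A minimal PyTorch sketch of one such update, assuming per-step log-probabilities were collected during the rollout (e.g. with a policy like the DiscretePolicy sketch above); it is illustrative rather than a tuned implementation:

```python
import torch

def reinforce_update(log_probs, rewards, optimizer, gamma=0.99):
    """One REINFORCE update from a single rollout.

    log_probs: list of log pi_theta(a_t | s_t) tensors (kept on the autograd graph)
    rewards:   list of scalar rewards r_t from the same rollout
    """
    # Step 2: returns-to-go R_t = sum_{k >= t} gamma^(k-t) r_k, computed backwards.
    returns, R = [], 0.0
    for r in reversed(rewards):
        R = r + gamma * R
        returns.insert(0, R)
    returns = torch.tensor(returns)

    # Step 3: ascend E[sum_t log pi(a_t | s_t) R_t], i.e. descend its negative.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```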

REINFORCE is unbiased but has enormous variance — the same policy can yield wildly different $\nabla_\theta J$ estimates from one sampled trajectory to the next. The algorithm is correct in theory but rarely usable as-is on real problems.

Variance reduction: baselines

Subtract a state-dependent baseline $b(s_t)$ from the return without changing the gradient's expectation:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\bigl(R_t - b(s_t)\bigr)\right].$$

The unbiasedness follows from $\mathbb{E}_{a_t \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] = 0$. The variance-minimising choice for $b(s_t)$ is approximately the value function $V^\pi(s_t)$ — and using a learned $V^\pi$ as the baseline is exactly what actor-critic methods do (see Actor-Critic).

The advantage $A(s_t, a_t) = R_t - V^\pi(s_t)$ measures how much better than average action $a_t$ was. Modern policy-gradient methods all use some form of advantage estimation.
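As a sketch of how the baseline enters the loss in code (hypothetical names: value_net is a separate state-value network, and obs, returns, log_probs are tensors collected from one rollout):

```python
values = value_net(obs).squeeze(-1)            # b(s_t) ~= V^pi(s_t)
advantages = returns - values.detach()         # A_t = R_t - b(s_t); detach() keeps the
                                               # baseline out of the policy gradient
policy_loss = -(log_probs * advantages).sum()  # score * advantage, as in the formula above
value_loss = (returns - values).pow(2).mean()  # fit the baseline by regression to R_t
```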

Generalised Advantage Estimation (GAE)

High-Dimensional Continuous Control Using Generalized Advantage Estimation (Schulman et al., ICLR 2016) addresses the bias-variance trade-off in advantage estimation. The "Monte Carlo" advantage $R_t - V(s_t)$ is unbiased but high-variance; the one-step TD advantage $r_t + \gamma V(s_{t+1}) - V(s_t)$ is low-variance but biased. GAE blends them via a $\lambda$ knob:

$$\hat{A}_t^{\mathrm{GAE}(\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}, \qquad \delta_k = r_k + \gamma V(s_{k+1}) - V(s_k).$$

$\lambda = 0$ recovers the one-step TD advantage; $\lambda = 1$ recovers the Monte Carlo advantage. Typical values are $\lambda \in [0.9, 0.99]$. GAE is the default advantage estimator in PPO.
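A minimal sketch of the GAE recursion for a single finite rollout (it omits the done-flag masking at episode boundaries that a full implementation needs; names and defaults are illustrative):

```python
import torch

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalised Advantage Estimation over one rollout.

    rewards: length-T list of rewards r_t
    values:  length-(T+1) list of value estimates V(s_0..s_T); the last entry
             bootstraps the tail of the trajectory
    """
    T = len(rewards)
    advantages = torch.zeros(T)
    gae_t = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        gae_t = delta + gamma * lam * gae_t                     # discounted sum of deltas
        advantages[t] = gae_t
    return advantages
```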

Continuous actions: Gaussian policies

For continuous action spaces, the policy is typically a Gaussian $\pi_\theta(a \mid s) = \mathcal{N}\bigl(\mu_\theta(s), \Sigma_\theta(s)\bigr)$. The score function is $\nabla_\theta \log \pi_\theta(a \mid s) = \nabla_\theta \log \mathcal{N}\bigl(a;\, \mu_\theta(s), \Sigma_\theta(s)\bigr)$, which is a closed-form Gaussian log-density gradient. Continuous-action policy gradient is what enables DDPG/SAC and most robotics RL.
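A minimal sketch of a diagonal Gaussian policy head in PyTorch; as a common simplification (and an assumption here), it uses a learned, state-independent log-std rather than the fully state-dependent $\Sigma_\theta(s)$ from the formula above:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    """pi_theta(a | s) = N(mu_theta(s), diag(sigma^2)) with a state-independent log-std."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.mu = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # learned, shared across states

    def dist(self, obs: torch.Tensor) -> Normal:
        return Normal(self.mu(obs), self.log_std.exp())

    def act(self, obs: torch.Tensor):
        d = self.dist(obs)
        a = d.sample()
        # sum over action dimensions to get log pi_theta(a | s) for the full action vector
        return a, d.log_prob(a).sum(-1)
```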

The deeper view: scoring functions and RLHF

The score-function estimator generalises beyond RL. In RLHF, the same expression underlies the PPO update on a language model: the "action" is a token, the "policy" is the LM, the "reward" is from a human-feedback reward model. Reading RLHF after this page makes the connection explicit.

  • Actor-Critic — adding a learned baseline / value function.
  • PPO & TRPO — the modern policy-gradient stack with trust regions.
  • DQN — the value-based alternative.
