Policy Gradient & REINFORCE
Policy gradient methods are the main alternative to value-based RL (Q-learning, DQN). Instead of learning a value function and acting greedily with respect to it, they parameterise the policy directly and follow the gradient of expected return. This makes them natural for continuous action spaces and stochastic policies, and the same machinery carries directly into modern actor-critic stacks like PPO and SAC, as well as into RLHF.
The objective
Parameterise the policy as $\pi_\theta(a \mid s)$, a distribution over actions given the current state, with parameters $\theta$ (typically a neural network).
We want to maximise the expected return

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T} \gamma^{t} r_t\right],$$

where $\tau = (s_0, a_0, r_0, s_1, \dots)$ is a trajectory generated by running $\pi_\theta$ in the environment and $\gamma \in [0, 1]$ is the discount factor.
The policy gradient theorem
The naive expression $\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, d\tau$ is awkward because the trajectory distribution $p_\theta(\tau)$ contains the unknown environment dynamics. The log-derivative (score-function) trick rewrites it as an expectation we can sample:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right],$$

with $R(\tau) = \sum_{t=0}^{T} \gamma^{t} r_t$ the return of the trajectory. The dynamics terms $p(s_{t+1} \mid s_t, a_t)$ do not depend on $\theta$, so they drop out of $\nabla_\theta \log p_\theta(\tau)$, leaving a Monte Carlo estimator that needs only rollouts.
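Before moving on, here is a quick numerical sanity check of the score-function trick in the simplest possible setting (a Bernoulli "policy" with reward equal to the sampled action); the setup and numbers are illustrative assumptions, not from the page.

```python
# Estimate d/d_theta E_{x ~ Bernoulli(theta)}[x] via samples of x * d/d_theta log p(x; theta)
# and compare with the analytic answer, which is 1 for every theta.
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3
x = rng.binomial(1, theta, size=200_000).astype(float)

# Score function for a Bernoulli: d/d_theta log p(x; theta) = x/theta - (1 - x)/(1 - theta)
score = x / theta - (1.0 - x) / (1.0 - theta)
estimate = np.mean(x * score)  # f(x) * score, with f(x) = x

print(f"score-function estimate: {estimate:.3f}  (analytic gradient: 1.0)")
```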
REINFORCE
Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning (Williams, 1992) introduced and named the algorithm. The update rule is plain stochastic gradient ascent on the estimator above (a code sketch follows the steps):
- Sample a trajectory $\tau = (s_0, a_0, r_0, \dots, s_T, a_T, r_T)$ by rolling out $\pi_\theta$.
- Compute the returns-to-go $G_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$.
- Update $\theta \leftarrow \theta + \alpha \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t$.
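A minimal REINFORCE sketch of these three steps, assuming PyTorch and Gymnasium; the environment (CartPole-v1), network size, and learning rate are illustrative choices, not prescriptions from the page.

```python
import torch
import torch.nn as nn
import gymnasium as gym

env = gym.make("CartPole-v1")
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))  # logits over 2 actions
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99

for episode in range(500):
    # 1. Roll out one trajectory with the current policy
    obs, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))  # log pi_theta(a_t | s_t)
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # 2. Returns-to-go: G_t = sum_{t' >= t} gamma^(t'-t) r_{t'}
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))

    # 3. Gradient step: minimise the negative score-function objective,
    #    which is gradient ascent on J(theta)
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```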
REINFORCE is unbiased but has enormous variance: two rollouts of the same policy can yield wildly different returns, so the gradient estimate swings from update to update and learning is slow and unstable.
Variance reduction: baselines
Subtract a state-dependent baseline $b(s_t)$ from the return inside the estimator:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\bigl(G_t - b(s_t)\bigr)\right].$$

The unbiasedness follows from $\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\right] = \nabla_\theta \sum_a \pi_\theta(a \mid s) = \nabla_\theta 1 = 0$: any term that depends only on the state can be subtracted without changing the expectation, and a good choice shrinks the variance.
The advantage $A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$ is what remains when the baseline is the value function $V^\pi(s_t)$: it measures how much better the action was than the policy's average behaviour in that state, and estimating it with a learned critic is exactly what actor-critic methods do.
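A small numerical check that a constant baseline leaves the estimator unbiased while shrinking its variance. Setup: one state, a Bernoulli(theta) "policy", reward equal to the action plus a constant offset, baseline equal to the mean reward; all numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.4
a = rng.binomial(1, theta, size=500_000).astype(float)
reward = a + 5.0                                 # a constant reward offset inflates variance
score = a / theta - (1.0 - a) / (1.0 - theta)    # d/d_theta log pi_theta(a)

plain = reward * score                           # vanilla REINFORCE estimator
baselined = (reward - reward.mean()) * score     # subtract a constant baseline

# Both means approach the true gradient d/d_theta (theta + 5) = 1,
# but the baselined estimator has a far smaller standard deviation.
print(f"mean  (plain vs baselined): {plain.mean():.3f} vs {baselined.mean():.3f}")
print(f"stdev (plain vs baselined): {plain.std():.2f} vs {baselined.std():.2f}")
```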
Generalised Advantage Estimation (GAE)
High-Dimensional Continuous Control Using Generalized Advantage Estimation (Schulman et al., ICLR 2016) addresses the bias-variance trade-off in advantage estimation. The Monte Carlo advantage $\hat{A}_t = G_t - V(s_t)$ is unbiased but high-variance, while the one-step TD error $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is low-variance but biased by errors in $V$. GAE interpolates between them with an exponentially weighted sum of TD errors,

$$\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\, \delta_{t+l},$$

where $\lambda = 0$ recovers the one-step TD estimate and $\lambda = 1$ recovers the Monte Carlo advantage.
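A sketch of the GAE recursion computed backwards over one trajectory; the function name and the example inputs (rewards, value estimates, a bootstrap value for the final state) are illustrative assumptions.

```python
from typing import List

def gae_advantages(rewards: List[float], values: List[float],
                   gamma: float = 0.99, lam: float = 0.95) -> List[float]:
    """values has len(rewards) + 1 entries; the last is V(s_T) used to bootstrap."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum via the recursion A_t = delta_t + gamma * lambda * A_{t+1}
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# lambda = 0 gives the one-step TD advantage; lambda = 1 gives the Monte Carlo one.
print(gae_advantages([1.0, 1.0, 1.0], [0.5, 0.6, 0.7, 0.0]))
```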
Continuous actions: Gaussian policies
For continuous action spaces, the policy is typically a Gaussian $\pi_\theta(a \mid s) = \mathcal{N}\!\bigl(a;\ \mu_\theta(s),\ \sigma_\theta^{2}\bigr)$: a network outputs the mean, and the log standard deviation is often a state-independent learned parameter. The log-density is available in closed form, so $\nabla_\theta \log \pi_\theta(a \mid s)$ comes straight out of automatic differentiation and the policy gradient machinery above carries over unchanged from the discrete case.
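A minimal Gaussian policy head, assuming PyTorch; the architecture and the state-independent log standard deviation are common design choices, not the only ones.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        # Network outputs the mean; log_std is a free, state-independent parameter
        self.mean = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs: torch.Tensor) -> torch.distributions.Normal:
        return torch.distributions.Normal(self.mean(obs), self.log_std.exp())

policy = GaussianPolicy(obs_dim=3, act_dim=1)
obs = torch.randn(3)
dist = policy.dist(obs)
action = dist.sample()
# log pi_theta(a | s), differentiable in closed form, used exactly as in the discrete case
log_prob = dist.log_prob(action).sum(-1)
```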
The deeper view: scoring functions and RLHF
The score-function estimator generalises beyond RL. In RLHF, the same expression underlies the PPO update on a language model: the "action" is a token, the "policy" is the LM, and the "reward" comes from a reward model trained on human feedback. Reading RLHF after this page makes the connection explicit.
What to read next
- Actor-Critic — adding a learned baseline / value function.
- PPO & TRPO — the modern policy-gradient stack with trust regions.
- DQN — the value-based alternative.