
Policy Gradient & REINFORCE

Policy gradient methods are the alternative to value-based RL (Q-learning, DQN). Instead of learning a value function and acting greedily with respect to it, they parameterise the policy directly and follow the gradient of expected return. This makes them natural for continuous action spaces and stochastic policies, and lifts directly to modern actor-critic stacks like PPO and SAC, as well as to RLHF.

The objective

Parameterise the policy as $\pi_\theta(a \mid s)$ — typically a neural network outputting action logits or distribution parameters. The objective is expected return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^t\, r(s_t, a_t)\right].$$

We want $\nabla_\theta J(\theta)$ so we can ascend it by stochastic gradient ascent.
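To make "parameterise the policy" concrete, here is a minimal sketch of a discrete-action policy network in PyTorch (the architecture, sizes, and names are illustrative, not from the original text):

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class DiscretePolicy(nn.Module):
    """pi_theta(a | s): a small MLP mapping observations to action logits."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),  # logits, one per discrete action
        )

    def dist(self, obs: torch.Tensor) -> Categorical:
        return Categorical(logits=self.net(obs))

    def act(self, obs: torch.Tensor):
        d = self.dist(obs)
        a = d.sample()                     # stochastic action
        return a, d.log_prob(a)            # log pi_theta(a | s), needed for the gradient
```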

The policy gradient theorem

The naive expression $\nabla_\theta J = \nabla_\theta \mathbb{E}_\tau[R(\tau)]$ is hard because the trajectory distribution depends on $\theta$. The log-derivative trick (also called REINFORCE or the score-function estimator) rewrites it as

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R_t(\tau)\right],$$

with $R_t = \sum_{k=t}^{T} \gamma^{\,k-t} r_k$ the return-to-go. The gradient is computable from sampled trajectories — no model of the environment dynamics required. The key step is $\nabla_\theta p_\theta = p_\theta \nabla_\theta \log p_\theta$, which trades the gradient of an expectation for the expectation of a (score × return) product.
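Spelled out (a standard derivation, not specific to this page): writing $p_\theta(\tau)$ for the trajectory density,

$$\nabla_\theta \mathbb{E}_{\tau \sim p_\theta}[R(\tau)] = \int \nabla_\theta p_\theta(\tau)\, R(\tau)\, d\tau = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau = \mathbb{E}_{\tau \sim p_\theta}\bigl[\nabla_\theta \log p_\theta(\tau)\, R(\tau)\bigr],$$

and because the dynamics terms in $p_\theta(\tau)$ do not depend on $\theta$, $\nabla_\theta \log p_\theta(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$.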

REINFORCE

Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning (Williams, 1992) named the algorithm. The update rule is plain SGD on the estimator above:

  1. Sample a trajectory $\tau$ by rolling out $\pi_\theta$.
  2. Compute the returns-to-go $R_t$.
  3. Update $\theta \leftarrow \theta + \alpha \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R_t$.
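A minimal PyTorch sketch of one such update, assuming per-step log-probabilities were collected during the rollout (e.g. with a policy like the DiscretePolicy sketch above); it is illustrative rather than a tuned implementation:

```python
import torch

def reinforce_update(log_probs, rewards, optimizer, gamma=0.99):
    """One REINFORCE update from a single rollout.

    log_probs: list of log pi_theta(a_t | s_t) tensors (kept on the autograd graph)
    rewards:   list of scalar rewards r_t from the same rollout
    """
    # Step 2: returns-to-go R_t = sum_{k >= t} gamma^(k-t) r_k, computed backwards.
    returns, R = [], 0.0
    for r in reversed(rewards):
        R = r + gamma * R
        returns.insert(0, R)
    returns = torch.tensor(returns)

    # Step 3: ascend E[sum_t log pi(a_t | s_t) R_t], i.e. descend its negative.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```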

REINFORCE is unbiased but has enormous variance — the same policy can yield wildly different $\nabla_\theta J$ estimates from one sampled trajectory to the next. The algorithm is correct in theory but rarely usable as-is on real problems.

Variance reduction: baselines

Subtract a state-dependent baseline $b(s_t)$ from the return without changing the gradient's expectation:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\bigl(R_t - b(s_t)\bigr)\right].$$

The unbiasedness follows from $\mathbb{E}_{a_t \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] = 0$. The variance-minimising choice for $b(s_t)$ is approximately the value function $V^\pi(s_t)$ — and using a learned $V^\pi$ as the baseline is exactly what actor-critic methods do (see Actor-Critic).

The advantage $A(s_t, a_t) = R_t - V^\pi(s_t)$ measures how much better than average action $a_t$ was. Modern policy-gradient methods all use some form of advantage estimation.
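As a sketch of how the baseline enters the loss in code (hypothetical names: value_net is a separate state-value network, and obs, returns, log_probs are tensors collected from one rollout):

```python
values = value_net(obs).squeeze(-1)            # b(s_t) ~= V^pi(s_t)
advantages = returns - values.detach()         # A_t = R_t - b(s_t); detach() keeps the
                                               # baseline out of the policy gradient
policy_loss = -(log_probs * advantages).sum()  # score * advantage, as in the formula above
value_loss = (returns - values).pow(2).mean()  # fit the baseline by regression to R_t
```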

Generalised Advantage Estimation (GAE)

High-Dimensional Continuous Control Using Generalized Advantage Estimation (Schulman et al., ICLR 2016) addresses the bias-variance trade-off in advantage estimation. The "Monte Carlo" advantage $R_t - V(s_t)$ is unbiased but high-variance; the one-step TD advantage $r_t + \gamma V(s_{t+1}) - V(s_t)$ is low-variance but biased. GAE blends them via a $\lambda$ knob:

$$\hat{A}_t^{\mathrm{GAE}(\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}, \qquad \delta_k = r_k + \gamma V(s_{k+1}) - V(s_k).$$

$\lambda = 0$ recovers the one-step TD advantage; $\lambda = 1$ recovers the Monte Carlo advantage. Typical values are $\lambda \in [0.9, 0.99]$. GAE is the default advantage estimator in PPO.
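A minimal sketch of the GAE recursion for a single finite rollout (it omits the done-flag masking at episode boundaries that a full implementation needs; names and defaults are illustrative):

```python
import torch

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalised Advantage Estimation over one rollout.

    rewards: length-T list of rewards r_t
    values:  length-(T+1) list of value estimates V(s_0..s_T); the last entry
             bootstraps the tail of the trajectory
    """
    T = len(rewards)
    advantages = torch.zeros(T)
    gae_t = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        gae_t = delta + gamma * lam * gae_t                     # discounted sum of deltas
        advantages[t] = gae_t
    return advantages
```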

Continuous actions: Gaussian policies

For continuous action spaces, the policy is typically a Gaussian $\pi_\theta(a \mid s) = \mathcal{N}\bigl(\mu_\theta(s), \Sigma_\theta(s)\bigr)$. The score function is $\nabla_\theta \log \pi_\theta(a \mid s) = \nabla_\theta \log \mathcal{N}\bigl(a;\, \mu_\theta(s), \Sigma_\theta(s)\bigr)$, which is a closed-form Gaussian log-density gradient. Continuous-action policy gradient is what enables DDPG/SAC and most robotics RL.
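A minimal sketch of a diagonal Gaussian policy head in PyTorch; as a common simplification (and an assumption here), it uses a learned, state-independent log-std rather than the fully state-dependent $\Sigma_\theta(s)$ from the formula above:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    """pi_theta(a | s) = N(mu_theta(s), diag(sigma^2)) with a state-independent log-std."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.mu = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # learned, shared across states

    def dist(self, obs: torch.Tensor) -> Normal:
        return Normal(self.mu(obs), self.log_std.exp())

    def act(self, obs: torch.Tensor):
        d = self.dist(obs)
        a = d.sample()
        # sum over action dimensions to get log pi_theta(a | s) for the full action vector
        return a, d.log_prob(a).sum(-1)
```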

The deeper view: scoring functions and RLHF

The score-function estimator generalises beyond RL. In RLHF, the same expression underlies the PPO update on a language model: the "action" is a token, the "policy" is the LM, the "reward" is from a human-feedback reward model. Reading RLHF after this page makes the connection explicit.

  • Actor-Critic — adding a learned baseline / value function.
  • PPO & TRPO — the modern policy-gradient stack with trust regions.
  • DQN — the value-based alternative.
