
Actor–Critic, A2C, A3C

Actor-critic methods combine the two RL families: the actor is a parameterised policy (policy gradient), the critic is a learned value function (Q-learning/value approximation). The actor uses the critic as a baseline to reduce variance; the critic is trained with TD updates from the same trajectories the actor generates. Almost every modern RL algorithm — PPO, SAC, DDPG — is an actor-critic.

The actor-critic update

The policy-gradient theorem with a learned baseline $V_\phi(s)$ becomes:

$$\nabla_\theta J(\theta) \approx \mathbb{E}\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\Big], \qquad \hat{A}_t = \delta_t + \gamma\,\delta_{t+1} + \cdots,$$

with TD errors $\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$. The critic is trained by minimising

$$L_{\text{critic}}(\phi) = \mathbb{E}\Big[\big(r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)\big)^2\Big].$$

Both updates run together: the actor improves the policy, the critic improves the value estimate, and each provides a better training signal for the other.
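
A minimal sketch of one such joint update in PyTorch. The `policy` and `value_fn` networks, the shared optimiser, and the use of the one-step TD error as the advantage are illustrative assumptions, not a specific library's API:

```python
import torch


def actor_critic_step(policy, value_fn, optimizer, s, a, r, s_next, done, gamma=0.99):
    """One joint actor-critic update from a batch of transitions (s, a, r, s_next).

    `policy(s)` is assumed to return a torch.distributions.Distribution;
    `value_fn(s)` a state-value estimate V_phi(s).
    """
    v_s = value_fn(s)
    with torch.no_grad():
        v_next = value_fn(s_next)
        # TD target r_t + gamma * V(s_{t+1}), masked at episode ends
        td_target = r + gamma * v_next * (1.0 - done)
    delta = td_target - v_s  # TD error, used here as the advantage estimate

    # Actor: policy-gradient step weighted by the (detached) advantage
    log_prob = policy(s).log_prob(a)
    actor_loss = -(log_prob * delta.detach()).mean()

    # Critic: squared TD error (semi-gradient, target held fixed)
    critic_loss = delta.pow(2).mean()

    optimizer.zero_grad()
    (actor_loss + 0.5 * critic_loss).backward()
    optimizer.step()
    return actor_loss.item(), critic_loss.item()
```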

Why an actor-critic, not just one or the other

  • Pure policy gradient (REINFORCE) is high-variance because its gradient estimate is weighted by returns from full Monte Carlo rollouts, which fluctuate heavily from episode to episode.
  • Pure value-based methods (DQN) require argmax over actions — fine for discrete, painful for continuous.
  • Actor-critic gets policy gradient's flexibility (stochastic, continuous, structured policies) plus value-based methods' variance reduction (TD target instead of full Monte Carlo).

The price is that the critic is biased, since TD bootstrapping relies on the current value estimate. But the variance reduction wins overwhelmingly in practice, and modern estimators such as generalised advantage estimation (GAE) tune the bias-variance trade-off explicitly.
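
For context, a minimal sketch of the GAE(γ, λ) recursion: λ interpolates between the one-step TD error (λ = 0, low variance, high bias) and the full Monte Carlo return minus the baseline (λ = 1). Function and argument names here are illustrative:

```python
import numpy as np


def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalised Advantage Estimation. lam=0 reduces to the one-step TD error;
    lam=1 gives the Monte Carlo return minus the baseline.

    `values` has length T + 1: it includes the value of the state after the
    final transition, used for bootstrapping.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    return advantages
```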

A3C — asynchronous parallel actors

Asynchronous Methods for Deep Reinforcement Learning (Mnih et al., ICML 2016) introduced A3C (Asynchronous Advantage Actor-Critic). Run many actors in parallel, each interacting with its own copy of the environment, all asynchronously updating a shared parameter server:

  • Actors collect rollouts in parallel, decorrelating their on-policy data.
  • Asynchronous updates take the place of experience replay — the diversity of concurrent actors playing different parts of the state space provides the i.i.d.-like signal.

A3C ran on CPU (no GPU needed for the small networks of the era) and matched DQN on Atari at a fraction of the wall-clock cost. The asynchronous part has fallen out of favour — synchronous A2C is simpler and works as well — but the multi-actor parallelism is the standard recipe in modern policy-gradient training.
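
A rough sketch of the asynchronous pattern (Hogwild-style shared-memory updates via torch.multiprocessing). The `make_env` and `compute_loss` callables are placeholders, and the optimiser is assumed to be built over the shared model's parameters; this illustrates the update flow, not the paper's exact implementation:

```python
import copy

import torch.multiprocessing as mp


def a3c_worker(shared_model, optimizer, make_env, compute_loss, n_updates):
    """One asynchronous actor: it keeps a local copy of the network, collects a
    short rollout with it, and pushes its gradients straight into the shared
    parameters without waiting for the other workers."""
    env = make_env()
    local_model = copy.deepcopy(shared_model)
    for _ in range(n_updates):
        # Pull the latest shared parameters before every rollout.
        local_model.load_state_dict(shared_model.state_dict())
        loss = compute_loss(local_model, env)  # actor + critic loss on fresh on-policy data
        local_model.zero_grad()
        loss.backward()
        # Copy local gradients onto the shared model, then apply them.
        for local_p, shared_p in zip(local_model.parameters(), shared_model.parameters()):
            shared_p._grad = local_p.grad
        optimizer.step()


def launch_a3c(shared_model, optimizer, make_env, compute_loss,
               num_workers=8, n_updates=1000):
    shared_model.share_memory()  # parameters live in shared memory, visible to every worker
    workers = [mp.Process(target=a3c_worker,
                          args=(shared_model, optimizer, make_env, compute_loss, n_updates))
               for _ in range(num_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```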

A2C — synchronous variant

A2C (the synchronous version) does the same thing but waits for all actors to finish a rollout before applying a single batch update. Easier to reason about and reproduce, more sample-efficient on GPUs, and the standard scaffolding underneath PPO and most recent on-policy methods. The OpenAI Baselines / Stable-Baselines3 implementations are A2C-style for this reason.
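
A minimal sketch of one synchronous A2C iteration, assuming a vectorised environment that accepts a batch of actions and returns torch-compatible arrays (the four-tuple step signature and the n-step return computation are illustrative assumptions):

```python
import torch


def a2c_iteration(policy, value_fn, optimizer, vec_env, obs, n_steps=5, gamma=0.99):
    """One synchronous A2C iteration: every parallel environment steps n_steps
    with the current policy, then a single batched update is applied."""
    obs_buf, act_buf, rew_buf, done_buf = [], [], [], []
    for _ in range(n_steps):
        dist = policy(obs)                       # one action distribution per parallel env
        actions = dist.sample()
        next_obs, rewards, dones, _ = vec_env.step(actions)
        obs_buf.append(obs)
        act_buf.append(actions)
        rew_buf.append(torch.as_tensor(rewards, dtype=torch.float32))
        done_buf.append(torch.as_tensor(dones, dtype=torch.float32))
        obs = next_obs

    # Bootstrap from the value of the final observation, then build n-step returns backwards.
    with torch.no_grad():
        returns = value_fn(obs)
    actor_loss, critic_loss = 0.0, 0.0
    for t in reversed(range(n_steps)):
        returns = rew_buf[t] + gamma * (1.0 - done_buf[t]) * returns
        values = value_fn(obs_buf[t])
        advantages = (returns - values).detach()
        log_probs = policy(obs_buf[t]).log_prob(act_buf[t])
        actor_loss = actor_loss - (log_probs * advantages).mean()
        critic_loss = critic_loss + (returns - values).pow(2).mean()

    optimizer.zero_grad()
    (actor_loss + 0.5 * critic_loss).backward()
    optimizer.step()
    return obs  # carry the latest observation into the next iteration
```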

On-policy vs off-policy

Vanilla actor-critic is on-policy — the critic and actor are trained on data from the current policy, so old data must be discarded. Off-policy actor-critic methods (DDPG, SAC, IMPALA) use a replay buffer and importance corrections, trading some bias for far better sample efficiency. The on/off-policy axis is one of the major design decisions in any actor-critic algorithm.
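
One way to see the importance correction: when a transition came from an older behaviour policy μ, the actor term is reweighted by the ratio π/μ, usually clipped to bound the variance (IMPALA's V-trace applies such clipped ratios per step). A toy sketch of the idea, not any specific algorithm's exact rule:

```python
import torch


def off_policy_actor_loss(policy, s, a, advantage, behaviour_log_prob, rho_clip=1.0):
    """Importance-weighted policy-gradient term for replayed, off-policy data.

    behaviour_log_prob is log mu(a|s), recorded when the transition was stored.
    The ratio pi/mu corrects for the mismatch between the current and the
    behaviour policy; clipping it bounds the variance of the correction at
    the cost of some bias.
    """
    log_prob = policy(s).log_prob(a)
    rho = torch.exp(log_prob - behaviour_log_prob).detach()
    rho = torch.clamp(rho, max=rho_clip)
    return -(rho * log_prob * advantage.detach()).mean()
```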

Entropy regularisation

A common addition to the actor loss is an entropy bonus $\beta\,\mathcal{H}[\pi_\theta(\cdot \mid s_t)]$ that rewards stochastic policies (see the sketch after this list). Reasons:

  • Prevents premature collapse to deterministic policies that get stuck in local optima.
  • Increases exploration without explicit ϵ-greedy machinery.
  • Has a clean interpretation as maximum-entropy RL, formalised by the SAC family.
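
A minimal sketch of how the bonus enters the actor loss, assuming the policy returns a torch.distributions object with an .entropy() method:

```python
import torch


def actor_loss_with_entropy(policy, s, a, advantage, beta=0.01):
    """Policy-gradient loss with an entropy bonus: subtracting beta * H[pi(.|s)]
    from the minimised loss pushes the policy towards higher entropy."""
    dist = policy(s)
    log_prob = dist.log_prob(a)
    entropy = dist.entropy()
    return -(log_prob * advantage.detach()).mean() - beta * entropy.mean()
```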
