DDPG, TD3, SAC
DDPG, TD3, and SAC are the dominant off-policy actor-critic algorithms for continuous control. They use a DQN-style replay buffer for sample efficiency, a deterministic or Gaussian-policy actor, and a Q-function critic. Each successor fixes a specific failure mode of the previous: TD3 addresses Q-value over-estimation; SAC adds a maximum-entropy objective for principled exploration and far better robustness.
DDPG — Deep Deterministic Policy Gradient
Continuous Control with Deep Reinforcement Learning (Lillicrap et al., ICLR 2016) extends DQN to continuous action spaces. The key trick: parameterise a deterministic policy $\mu_\theta(s)$ so the intractable $\max_a Q(s, a)$ in the Q-learning target becomes simply $Q(s, \mu_\theta(s))$. The critic $Q_\phi$ regresses onto the TD target $y = r + \gamma\, Q_{\phi'}(s', \mu_{\theta'}(s'))$ using slowly-updated target networks $\phi', \theta'$, and the actor ascends the deterministic policy gradient

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \mathcal{D}}\!\left[ \nabla_a Q_\phi(s, a)\big|_{a = \mu_\theta(s)} \, \nabla_\theta \mu_\theta(s) \right],$$

with exploration supplied by noise added to actions at data-collection time.
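A minimal sketch of one DDPG update step in PyTorch, assuming `actor`/`critic` modules with target copies, one optimiser each, and a batch sampled from the replay buffer (all names are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, actor_targ, critic_targ,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    s, a, r, s2, done = batch  # tensors sampled from the replay buffer

    # Critic: regress Q(s, a) onto the bootstrap target built from target nets.
    with torch.no_grad():
        y = r + gamma * (1 - done) * critic_targ(s2, actor_targ(s2))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: ascend Q(s, mu(s)) -- the deterministic policy gradient.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Polyak-average the target networks towards the online networks.
    with torch.no_grad():
        for net, targ in ((actor, actor_targ), (critic, critic_targ)):
            for p, p_targ in zip(net.parameters(), targ.parameters()):
                p_targ.mul_(1 - tau).add_(tau * p)
```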
DDPG was the first algorithm to learn high-dimensional continuous-control policies (MuJoCo locomotion, robotic manipulation) end-to-end from low-level observations.
DDPG's failure mode: Q over-estimation
DDPG inherits Q-learning's maximisation bias in the worst possible way: it bootstraps on $Q_{\phi'}(s', \mu_{\theta'}(s'))$, and the actor $\mu_\theta$ is explicitly trained to climb the critic's value surface. Any positive approximation error the actor finds is copied into the target, amplified by bootstrapping, then exploited further at the next actor update. This over-estimation feedback loop is a large part of why DDPG is notoriously brittle and hyperparameter-sensitive.
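The bias is easy to see in isolation. A toy NumPy check (illustrative numbers, not from either paper): with ten actions whose true values are all zero and unit-Gaussian critic noise, the maximum over noisy estimates sits far above the true optimum. DDPG's actor performs exactly this kind of approximate maximisation against its critic.

```python
import numpy as np

rng = np.random.default_rng(0)
true_q = np.zeros(10)                                       # all actions equally worthless
noisy_q = true_q + rng.normal(0.0, 1.0, size=(10_000, 10))  # noisy critic estimates

print(noisy_q.max(axis=1).mean())  # ~1.54: expected max over noisy estimates
print(true_q.max())                # 0.0: the true optimal value
```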
TD3 — Twin Delayed DDPG
Addressing Function Approximation Error in Actor-Critic Methods (Fujimoto, van Hoof, Meger, ICML 2018) introduces three orthogonal fixes:
- Twin Q-networks. Maintain two independent Q-networks $Q_{\phi_1}, Q_{\phi_2}$ and use the minimum for the bootstrap target, $y = r + \gamma \min_{i=1,2} Q_{\phi_i'}(s', a')$. This Clipped Double-Q approach systematically removes positive bias.
- Delayed policy updates. Update the policy and target networks only every $d$ critic steps (typically $d = 2$). The actor moves more slowly than the critic, breaking the over-estimation feedback loop.
- Target policy smoothing. Add small clipped noise $\epsilon \sim \operatorname{clip}(\mathcal{N}(0, \sigma), -c, c)$ to the target action: $a' = \mu_{\theta'}(s') + \epsilon$. Acts as a regulariser, smoothing the value estimate over an action neighbourhood. All three fixes appear in the target computation sketched below.
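A sketch of the TD3 bootstrap target with all three fixes visible, assuming target networks `actor_targ`, `q1_targ`, `q2_targ` and actions bounded in $[-1, 1]$ (function names are illustrative; the noise hyperparameters follow the paper's defaults):

```python
import torch

def td3_target(batch, actor_targ, q1_targ, q2_targ,
               gamma=0.99, sigma=0.2, noise_clip=0.5, act_limit=1.0):
    _, _, r, s2, done = batch

    with torch.no_grad():
        # Target policy smoothing: clipped Gaussian noise on the target action.
        a2 = actor_targ(s2)
        eps = (torch.randn_like(a2) * sigma).clamp(-noise_clip, noise_clip)
        a2 = (a2 + eps).clamp(-act_limit, act_limit)

        # Clipped Double-Q: bootstrap on the minimum of the two target critics.
        q_min = torch.min(q1_targ(s2, a2), q2_targ(s2, a2))
        return r + gamma * (1 - done) * q_min
```

(Delayed updates live in the training loop: step the actor and Polyak-average the targets only on every $d$-th critic update.)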
These three changes together turn DDPG from "works with luck" to "works reliably" on MuJoCo. TD3 is the right reference for "deterministic-policy actor-critic that works".
SAC — Soft Actor-Critic
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning (Haarnoja et al., ICML 2018) takes a different angle: optimise the maximum-entropy RL objective

$$J(\pi) = \mathbb{E}_{\pi}\!\left[ \sum_t r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right],$$

where the temperature $\alpha$ trades off return against the policy's entropy $\mathcal{H}$.
The entropy bonus rewards stochastic policies. Three consequences:
- Better exploration without ad-hoc noise schedules.
- Smooth, robust policies that don't collapse to brittle deterministic optima.
- Multi-modal action distributions when multiple actions are equally good.
SAC uses a Gaussian policy $\pi_\theta(a \mid s)$ squashed through $\tanh$ to respect action bounds, together with twin Q-networks and the clipped double-Q target borrowed from TD3. The actor is trained with the reparameterisation trick to minimise $\mathbb{E}\big[\alpha \log \pi_\theta(a \mid s) - \min_i Q_{\phi_i}(s, a)\big]$; later versions also tune the temperature $\alpha$ automatically against a target entropy.
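A sketch of the SAC actor loss with the reparameterised, tanh-squashed Gaussian, assuming `policy(s)` returns the mean and log-std and `q1`/`q2` are the current critics returning shape-`(B,)` values (names are illustrative; the softplus identity is the numerically stable form of the tanh log-det correction):

```python
import math
import torch
import torch.nn.functional as F

def sac_actor_loss(s, policy, q1, q2, alpha):
    # Reparameterised sample from the squashed Gaussian policy.
    mu, log_std = policy(s)                  # per-dimension mean and log-std
    std = log_std.exp()
    u = mu + std * torch.randn_like(mu)      # rsample: gradients flow through
    a = torch.tanh(u)                        # squash into the action bounds

    # Gaussian log-prob, then the tanh change-of-variables correction:
    # log(1 - tanh(u)^2) = 2 * (log 2 - u - softplus(-2u)).
    logp = (-0.5 * ((u - mu) / std) ** 2 - log_std
            - 0.5 * math.log(2 * math.pi)).sum(-1)
    logp = logp - (2 * (math.log(2.0) - u - F.softplus(-2.0 * u))).sum(-1)

    # Minimise E[alpha * log pi(a|s) - min_i Q_i(s, a)].
    q = torch.min(q1(s, a), q2(s, a))
    return (alpha * logp - q).mean()
```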
When to use which
- DDPG — historical reference; rarely the right choice today.
- TD3 — works well when a deterministic policy is appropriate and you want simpler tuning than SAC.
- SAC — strongest default for continuous control. Use this unless you have a specific reason not to.
For discrete actions none of these apply — use DQN variants or PPO. For on-policy continuous control where sample efficiency is less critical than stability or where you cannot afford a replay buffer, PPO remains a strong choice.
What to read next
- Actor-Critic — the algorithmic family these belong to.
- DQN — the value-based predecessor whose tricks transferred.
- Offline RL — extending these methods to fixed datasets.