
DDPG, TD3, SAC

DDPG, TD3, and SAC are the dominant off-policy actor-critic algorithms for continuous control. They use a DQN-style replay buffer for sample efficiency, a deterministic or Gaussian-policy actor, and a Q-function critic. Each successor fixes a specific failure mode of the previous: TD3 addresses Q-value over-estimation; SAC adds a maximum-entropy objective for principled exploration and far better robustness.

DDPG — Deep Deterministic Policy Gradient

Continuous Control with Deep Reinforcement Learning (Lillicrap et al., ICLR 2016) extends DQN to continuous action spaces. The key trick: parameterise a deterministic policy $\mu_\theta : \mathcal{S} \to \mathcal{A}$ alongside a Q-function $Q_\phi(s, a)$, and train the actor with the deterministic policy gradient:

$$\nabla_\theta J = \mathbb{E}_s\!\left[\nabla_a Q_\phi(s, a)\big|_{a = \mu_\theta(s)} \, \nabla_\theta \mu_\theta(s)\right].$$

The $\arg\max_a Q$ that DQN runs at every step is replaced by following $\mu_\theta$, whose parameters are trained to maximise $Q$ via this gradient. Standard DQN engineering tricks transfer: replay buffer, target networks for both $\mu$ and $Q$, soft target updates. Exploration uses additive Ornstein-Uhlenbeck or Gaussian noise on the deterministic action.
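A minimal PyTorch sketch of one DDPG update step, assuming `actor`/`critic` networks, their target copies, optimisers, and a replay-buffer batch of `(s, a, r, s2, done)` tensors already exist; the names are illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, actor_targ, critic_targ,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
    s, a, r, s2, done = batch  # tensors sampled from the replay buffer

    # Critic: regress Q_phi(s, a) onto the bootstrapped target r + gamma * Q'(s', mu'(s'))
    with torch.no_grad():
        y = r + gamma * (1 - done) * critic_targ(s2, actor_targ(s2))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: deterministic policy gradient = ascend Q_phi(s, mu_theta(s))
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft (Polyak) update of both target networks
    for net, targ in ((actor, actor_targ), (critic, critic_targ)):
        for p, p_targ in zip(net.parameters(), targ.parameters()):
            p_targ.data.mul_(1 - tau).add_(tau * p.data)

    # When acting in the environment, exploration noise is added on top of the
    # deterministic action, e.g. a = actor(s) + Gaussian or OU noise.
```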

DDPG was the first algorithm to learn high-dimensional continuous-control policies (MuJoCo humanoid, robotic manipulation) end-to-end from low-level observations.

DDPG's failure mode: Q over-estimation

DDPG inherits Q-learning's maximisation bias in the worst possible way: it bootstraps from $Q_{\bar\phi}(s', \mu_{\bar\theta}(s'))$ with $\mu$ trained to maximise $Q$, so any over-estimation in $Q$ feeds back into the target and amplifies. Empirically, DDPG is brittle: careful hyperparameter tuning and per-environment tweaks are required.
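A toy NumPy illustration of the underlying maximisation bias (not from the paper): even zero-mean estimation noise makes the value of the maximising action biased upward, which is exactly what the bootstrap target inherits when the actor is trained to maximise the critic.

```python
import numpy as np

rng = np.random.default_rng(0)
true_q = np.zeros(10)  # ten candidate actions, all truly worth 0
# zero-mean noise in the Q estimates, 100k independent trials
noisy_q = true_q + rng.normal(0.0, 1.0, size=(100_000, 10))

print(noisy_q.max(axis=1).mean())  # ~1.5: the max over noisy estimates is biased upward
print(noisy_q[:, 3].mean())        # ~0.0: any single fixed action is estimated without bias
```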

TD3 — Twin Delayed DDPG

Addressing Function Approximation Error in Actor-Critic Methods (Fujimoto, van Hoof, Meger, ICML 2018) introduces three orthogonal fixes:

  • Twin Q-networks. Maintain two independent Q-networks $Q_{\phi_1}, Q_{\phi_2}$ and use the minimum for the bootstrap target, $y = r + \gamma \min_{i=1,2} Q_{\bar\phi_i}(s', \mu_{\bar\theta}(s'))$. This clipped double-Q approach systematically removes positive bias.
  • Delayed policy updates. Update the policy and target networks only every $d$ critic steps (typically $d = 2$). The actor moves more slowly than the critic, breaking the over-estimation feedback loop.
  • Target policy smoothing. Add small clipped noise $\epsilon \sim \mathrm{clip}(\mathcal{N}(0, \sigma), -c, c)$ to the target action, $\tilde{a} = \mu_{\bar\theta}(s') + \epsilon$. This acts as a regulariser, smoothing the value estimate over a neighbourhood of actions.

These three changes together turn DDPG from "works with luck" into "works reliably" on MuJoCo. TD3 is the right reference for "deterministic-policy actor-critic that works". The sketch below puts the three fixes into one update step.
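A minimal PyTorch sketch of one TD3 update, under the same assumptions as the DDPG sketch above (pre-built networks, target copies, optimisers, replay batch); hyperparameter names and defaults are illustrative.

```python
import torch
import torch.nn.functional as F

def td3_update(step, actor, q1, q2, actor_targ, q1_targ, q2_targ,
               actor_opt, q_opt, batch, gamma=0.99, tau=0.005,
               policy_delay=2, sigma=0.2, noise_clip=0.5, act_limit=1.0):
    s, a, r, s2, done = batch

    with torch.no_grad():
        # Target policy smoothing: clipped Gaussian noise on the target action
        eps = (torch.randn_like(a) * sigma).clamp(-noise_clip, noise_clip)
        a2 = (actor_targ(s2) + eps).clamp(-act_limit, act_limit)
        # Clipped double-Q: bootstrap from the minimum of the two target critics
        y = r + gamma * (1 - done) * torch.min(q1_targ(s2, a2), q2_targ(s2, a2))

    # Both critics regress onto the same conservative target
    q_loss = F.mse_loss(q1(s, a), y) + F.mse_loss(q2(s, a), y)
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()

    # Delayed policy and target updates: only every `policy_delay` critic steps
    if step % policy_delay == 0:
        actor_loss = -q1(s, actor(s)).mean()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
        for net, targ in ((actor, actor_targ), (q1, q1_targ), (q2, q2_targ)):
            for p, p_targ in zip(net.parameters(), targ.parameters()):
                p_targ.data.mul_(1 - tau).add_(tau * p.data)
```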

SAC — Soft Actor-Critic

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor (Haarnoja et al., ICML 2018) takes a different angle: optimise the maximum-entropy RL objective

$$J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\!\left[r(s_t, a_t) + \alpha\, \mathcal{H}\!\left[\pi(\cdot \mid s_t)\right]\right].$$

The entropy bonus rewards stochastic policies. Three consequences:

  • Better exploration without ad-hoc noise schedules.
  • Smooth, robust policies that don't collapse to brittle deterministic optima.
  • Multi-modal action distributions when multiple actions are equally good.

SAC uses a Gaussian policy $\pi_\theta(a \mid s) = \mathcal{N}(\mu_\theta(s), \Sigma_\theta(s))$, twin Q-networks (as in TD3), and an automatic entropy temperature that adapts $\alpha$ to maintain a target entropy. Practically it has fewer hyperparameters to tune than TD3 and is the current default for continuous-control RL in robotics and MuJoCo-style benchmarks.
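A minimal PyTorch sketch of one SAC update step under the same assumptions as above. It additionally assumes a `policy.sample(s)` method returning a reparameterised action and its log-probability, and a learnable scalar `log_alpha`; both are illustrative, not a fixed API.

```python
import torch
import torch.nn.functional as F

def sac_update(policy, q1, q2, q1_targ, q2_targ, log_alpha,
               policy_opt, q_opt, alpha_opt, batch,
               gamma=0.99, tau=0.005, target_entropy=-1.0):  # typically -action_dim
    s, a, r, s2, done = batch
    alpha = log_alpha.exp()

    # Soft Bellman target: min of twin target critics minus alpha * log pi (entropy bonus)
    with torch.no_grad():
        a2, logp2 = policy.sample(s2)  # assumed: reparameterised sample + log-prob
        q_next = torch.min(q1_targ(s2, a2), q2_targ(s2, a2)) - alpha * logp2
        y = r + gamma * (1 - done) * q_next
    q_loss = F.mse_loss(q1(s, a), y) + F.mse_loss(q2(s, a), y)
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()

    # Policy: maximise Q minus alpha * log pi (the entropy-regularised objective)
    a_new, logp = policy.sample(s)
    policy_loss = (alpha.detach() * logp - torch.min(q1(s, a_new), q2(s, a_new))).mean()
    policy_opt.zero_grad(); policy_loss.backward(); policy_opt.step()

    # Automatic temperature: push the policy's entropy toward the target entropy
    alpha_loss = -(log_alpha * (logp.detach() + target_entropy)).mean()
    alpha_opt.zero_grad(); alpha_loss.backward(); alpha_opt.step()

    # Polyak-average the target critics
    for net, targ in ((q1, q1_targ), (q2, q2_targ)):
        for p, p_targ in zip(net.parameters(), targ.parameters()):
            p_targ.data.mul_(1 - tau).add_(tau * p.data)
```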

When to use which

  • DDPG — historical reference; rarely the right choice today.
  • TD3 — works well when a deterministic policy is appropriate and you want simpler tuning than SAC.
  • SAC — strongest default for continuous control. Use this unless you have a specific reason not to.

For discrete actions none of these apply — use DQN variants or PPO. For on-policy continuous control where sample efficiency is less critical than stability or where you cannot afford a replay buffer, PPO remains a strong choice.

  • Actor-Critic — the algorithmic family these belong to.
  • DQN — the value-based predecessor whose tricks transferred.
  • Offline RL — extending these methods to fixed datasets.
