Adaptive Optimisers — RMSProp, Adam, AdamW
SGD with momentum uses a single learning rate for every parameter. Adaptive optimisers maintain per-parameter step sizes that automatically shrink in directions where recent gradients have been large and grow where they have been small, without the user tuning a per-parameter schedule. AdamW is the dominant optimiser for training Transformers; SGD remains competitive on CNN classification.
RMSProp — divide by recent gradient magnitude
Lecture 6e of Hinton's Coursera course Neural Networks for Machine Learning (Tieleman and Hinton, 2012; unpublished but widely cited) proposed a per-parameter step size based on a running estimate of the squared gradient:

$$v_t = \gamma\, v_{t-1} + (1 - \gamma)\, g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon}\, g_t$$

where $\gamma \approx 0.9$ and $\epsilon$ is a small constant for numerical stability.
Parameters with large recent gradients get small effective steps; parameters with small recent gradients get larger ones. RMSProp was one of the first widely used adaptive methods and is still a common default for training RNNs because it tames their characteristic exploding gradients.
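A minimal NumPy sketch of the update (the helper name and the toy objective are mine, not from the lecture):

```python
import numpy as np

def rmsprop_step(theta, grad, v, lr=1e-2, gamma=0.9, eps=1e-8):
    """One RMSProp step: scale each coordinate by the RMS of its recent gradients."""
    v = gamma * v + (1 - gamma) * grad**2           # running mean of squared gradient
    theta = theta - lr * grad / (np.sqrt(v) + eps)  # big recent gradient -> small step
    return theta, v

# Toy usage: a badly scaled quadratic f(x, y) = 100*x**2 + y**2.
theta, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(500):
    grad = np.array([200.0 * theta[0], 2.0 * theta[1]])
    theta, v = rmsprop_step(theta, grad, v)
print(theta)  # both coordinates approach 0 despite the 100x curvature gap
```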
Adam — momentum + RMSProp
Adam: A Method for Stochastic Optimization (Kingma, Ba, ICLR 2015) is RMSProp plus first-moment momentum. Maintain an exponential moving average of both the gradient and its elementwise square:

$$m_t = \beta_1\, m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2\, v_{t-1} + (1 - \beta_2)\, g_t^2$$
The two EMAs are initialised at zero and therefore biased toward zero at the start, so apply bias correction before the update:

$$\hat m_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat v_t = \frac{v_t}{1 - \beta_2^t}, \qquad \theta_t = \theta_{t-1} - \eta\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}$$
Default hyperparameters
The paper recommends $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, and learning rate $\eta = 10^{-3}$. These defaults transfer surprisingly well across tasks, which is a large part of Adam's appeal.
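Putting the pieces together, a minimal NumPy sketch of the resulting update (essentially Algorithm 1 of the paper; the function name is mine):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step; t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad       # EMA of the gradient (momentum)
    v = beta2 * v + (1 - beta2) * grad**2    # EMA of the squared gradient (RMSProp)
    m_hat = m / (1 - beta1**t)               # undo the bias toward zero at small t
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```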
The Adam–weight-decay trap
In the original Adam, L2 regularisation is implemented as a gradient penalty:

$$g_t \leftarrow g_t + \lambda\, \theta_{t-1}$$

The penalty then flows through the adaptive machinery: it is folded into $m_t$ and $v_t$ and divided by $\sqrt{\hat v_t}$, so parameters with large historical gradients are regularised less than parameters with small ones. The decay strength is coupled to the gradient history, which is not what L2 regularisation is supposed to do.
Decoupled Weight Decay Regularization (Loshchilov, Hutter, ICLR 2019) introduced AdamW, which applies weight decay separately from the adaptive update:

$$\theta_t = \theta_{t-1} - \eta \left( \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} + \lambda\, \theta_{t-1} \right)$$

Every parameter now decays at the same rate, regardless of its gradient history.
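A side-by-side sketch makes the one-line difference explicit (the `decoupled` flag is my own device for contrast, not an API from either paper):

```python
import numpy as np

def adam_or_adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                       eps=1e-8, wd=1e-2, decoupled=True):
    """decoupled=False: original Adam, L2 penalty folded into the gradient.
    decoupled=True: AdamW, decay applied outside the adaptive rescaling."""
    if not decoupled:
        grad = grad + wd * theta                 # penalty later divided by sqrt(v_hat)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        theta = theta - lr * wd * theta          # uniform decay for every parameter
    return theta, m, v
```

PyTorch mirrors the split: torch.optim.AdamW implements the decoupled form, while torch.optim.Adam with a nonzero weight_decay uses the coupled L2 penalty.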
For Transformers and other modern architectures, the difference between Adam and AdamW is large — AdamW is the universal default for LLM and ViT training. The original Adam should generally be considered legacy.
When Adam beats SGD, and when it doesn't
Modern empirical lore:
- Transformers, ViTs, NLP, RL — AdamW dominates. Adaptive step sizes appear necessary for the heterogeneous parameter scales in attention layers.
- CNN image classification — SGD with momentum + cosine schedule often matches or beats AdamW (The Marginal Value of Adaptive Gradient Methods in Machine Learning, Wilson et al., 2017), and produces flatter minima with better OOD generalisation.
The split has resisted unification. The pragmatic answer: AdamW is a stronger default; SGD wins when you can afford a long, careful schedule on a homogeneous architecture.
Variants worth knowing
- Adafactor (Shazeer, Stern, 2018) — Adam without a full second-moment estimate for every parameter; for each weight matrix it stores only per-row and per-column statistics and reconstructs the estimate from their outer product. Memory-efficient enough to train large Transformers without optimiser-state offloading.
- Lion (Chen et al., 2023) — sign-of-update momentum, no second moment; see the sketch after this list. Faster than AdamW on some workloads; sensitive to learning-rate tuning.
- LAMB (You et al., 2019) — layerwise-normalised AdamW for very large batch sizes, used to push BERT/ResNet training to enormous batches.
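For a sense of how spare Lion is, here is a sketch of its update following the algorithm in the paper (function name and hyperparameter defaults are illustrative):

```python
import numpy as np

def lion_step(theta, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=1e-2):
    """One Lion step: the direction is the sign of an interpolated momentum,
    so every coordinate moves by exactly lr (plus decoupled weight decay)."""
    c = beta1 * m + (1 - beta1) * grad     # blend momentum with current gradient
    theta = theta - lr * (np.sign(c) + wd * theta)
    m = beta2 * m + (1 - beta2) * grad     # momentum EMA, updated after the step
    return theta, m
```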
What to read next
- SGD, Momentum, Nesterov — the non-adaptive baseline these methods improve on.
- Learning Rate Schedules — choosing the global step size $\eta_t$ over training.
- PEFT — LoRA + AdamW is the standard fine-tuning stack.