Adaptive Optimisers — RMSProp, Adam, AdamW
SGD with momentum uses a single learning rate for every parameter. Adaptive optimisers maintain per-parameter step sizes that automatically shrink in directions where recent gradients have been large and grow where they have been small, without the user tuning a per-parameter schedule. AdamW is the dominant optimiser for training Transformers; SGD remains competitive on CNN classification.
RMSProp — divide by recent gradient magnitude
Lecture 6e of Hinton's Coursera course Neural Networks for Machine Learning (Tieleman and Hinton, 2012; unpublished but widely cited) proposed a per-parameter step size based on a running estimate of the squared gradient:

$$v_t = \gamma\, v_{t-1} + (1 - \gamma)\, g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon}\, g_t$$

where $\gamma \approx 0.9$ and $\epsilon$ is a small constant for numerical stability.
Parameters with large recent gradients get small effective steps; parameters with small recent gradients get larger ones. RMSProp was one of the first widely used adaptive methods and is still a common default for training RNNs because it tames their characteristic exploding gradients.
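A minimal NumPy sketch of the update (the helper name and the toy objective are mine, not from the lecture):

```python
import numpy as np

def rmsprop_step(theta, grad, v, lr=1e-2, gamma=0.9, eps=1e-8):
    """One RMSProp step: scale each coordinate by the RMS of its recent gradients."""
    v = gamma * v + (1 - gamma) * grad**2           # running mean of squared gradient
    theta = theta - lr * grad / (np.sqrt(v) + eps)  # big recent gradient -> small step
    return theta, v

# Toy usage: a badly scaled quadratic f(x, y) = 100*x**2 + y**2.
theta, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(500):
    grad = np.array([200.0 * theta[0], 2.0 * theta[1]])
    theta, v = rmsprop_step(theta, grad, v)
print(theta)  # both coordinates approach 0 despite the 100x curvature gap
```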
Adam — momentum + RMSProp
Adam: A Method for Stochastic Optimization (Kingma, Ba, ICLR 2015) is RMSProp plus first-moment momentum. Maintain an exponential moving average of both the gradient and its elementwise square:

$$m_t = \beta_1\, m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2\, v_{t-1} + (1 - \beta_2)\, g_t^2$$
The two EMAs are initialised at zero and therefore biased toward zero at the start, so apply bias correction before the update:

$$\hat m_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat v_t = \frac{v_t}{1 - \beta_2^t}, \qquad \theta_t = \theta_{t-1} - \eta\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}$$
Default hyperparameters
The paper recommends $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, and learning rate $\eta = 10^{-3}$. These defaults transfer surprisingly well across tasks, which is a large part of Adam's appeal.
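Putting the pieces together, a minimal NumPy sketch of the resulting update (essentially Algorithm 1 of the paper; the function name is mine):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step; t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad       # EMA of the gradient (momentum)
    v = beta2 * v + (1 - beta2) * grad**2    # EMA of the squared gradient (RMSProp)
    m_hat = m / (1 - beta1**t)               # undo the bias toward zero at small t
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```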
The Adam–weight-decay trap
In the original Adam, L2 regularisation is implemented as a gradient penalty:

$$g_t \leftarrow g_t + \lambda\, \theta_{t-1}$$

The penalty then flows through the adaptive machinery: it is folded into $m_t$ and $v_t$ and divided by $\sqrt{\hat v_t}$, so parameters with large historical gradients are regularised less than parameters with small ones. The decay strength is coupled to the gradient history, which is not what L2 regularisation is supposed to do.
Decoupled Weight Decay Regularization (Loshchilov, Hutter, ICLR 2019) introduced AdamW, which applies weight decay separately from the adaptive update:

$$\theta_t = \theta_{t-1} - \eta \left( \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} + \lambda\, \theta_{t-1} \right)$$

Every parameter now decays at the same rate, regardless of its gradient history.
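A side-by-side sketch makes the one-line difference explicit (the `decoupled` flag is my own device for contrast, not an API from either paper):

```python
import numpy as np

def adam_or_adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                       eps=1e-8, wd=1e-2, decoupled=True):
    """decoupled=False: original Adam, L2 penalty folded into the gradient.
    decoupled=True: AdamW, decay applied outside the adaptive rescaling."""
    if not decoupled:
        grad = grad + wd * theta                 # penalty later divided by sqrt(v_hat)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        theta = theta - lr * wd * theta          # uniform decay for every parameter
    return theta, m, v
```

PyTorch mirrors the split: torch.optim.AdamW implements the decoupled form, while torch.optim.Adam with a nonzero weight_decay uses the coupled L2 penalty.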
For Transformers and other modern architectures, the difference between Adam and AdamW is large — AdamW is the universal default for LLM and ViT training. The original Adam should generally be considered legacy.
When Adam beats SGD, and when it doesn't
Modern empirical lore:
- Transformers, ViTs, NLP, RL — AdamW dominates. Adaptive step sizes appear necessary for the heterogeneous parameter scales in attention layers.
- CNN image classification — SGD with momentum + cosine schedule often matches or beats AdamW (The Marginal Value of Adaptive Gradient Methods in Machine Learning, Wilson et al., 2017), and produces flatter minima with better OOD generalisation.
The split has resisted unification. The pragmatic answer: AdamW is a stronger default; SGD wins when you can afford a long, careful schedule on a homogeneous architecture.
Variants worth knowing
- Adafactor (Shazeer, Stern, 2018) — Adam without a full second-moment estimate for every parameter; for each weight matrix it stores only per-row and per-column statistics and reconstructs the estimate from their outer product. Memory-efficient enough to train large Transformers without optimiser-state offloading.
- Lion (Chen et al., 2023) — sign-of-update momentum, no second moment; see the sketch after this list. Faster than AdamW on some workloads; sensitive to learning-rate tuning.
- LAMB (You et al., 2019) — layerwise-normalised AdamW for very large batch sizes, used to push BERT/ResNet training to enormous batches.
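For a sense of how spare Lion is, here is a sketch of its update following the algorithm in the paper (function name and hyperparameter defaults are illustrative):

```python
import numpy as np

def lion_step(theta, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=1e-2):
    """One Lion step: the direction is the sign of an interpolated momentum,
    so every coordinate moves by exactly lr (plus decoupled weight decay)."""
    c = beta1 * m + (1 - beta1) * grad     # blend momentum with current gradient
    theta = theta - lr * (np.sign(c) + wd * theta)
    m = beta2 * m + (1 - beta2) * grad     # momentum EMA, updated after the step
    return theta, m
```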
What to read next
- SGD, Momentum, Nesterov — the non-adaptive baseline these methods improve on.
- Learning Rate Schedules — choosing the global step size $\eta_t$ over training.
- PEFT — LoRA + AdamW is the standard fine-tuning stack.