Learning Rate Schedules
A learning rate schedule varies the learning rate over the course of training rather than holding it fixed; the sections below cover the schedules you are most likely to meet in practice.
Step decay
The classical schedule: hold the learning rate constant, then multiply it by a fixed factor (commonly 0.1) at predetermined epoch milestones. Simple and effective, but the milestones and the drop factor are extra hyper-parameters to tune.
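A minimal sketch of step decay as a pure function of the epoch; the base rate, drop factor, and milestone epochs below are assumed values for illustration, not ones from the text.

```python
# Fixed "drop at milestones" rule: the rate is multiplied by `gamma`
# once for every milestone the current epoch has passed.
def step_decay_lr(epoch, base_lr=0.1, gamma=0.1, milestones=(30, 60, 90)):
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops
```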
Cosine annealing
SGDR: Stochastic Gradient Descent with Warm Restarts (Loshchilov & Hutter, ICLR 2017) introduced cosine schedules:
Smooth decay from the peak learning rate lr_max down to a floor lr_min over a cycle of T steps, following half a cosine curve: lr_t = lr_min + 0.5 · (lr_max − lr_min) · (1 + cos(π · t / T)). In SGDR the cycle then restarts from lr_max (a warm restart), typically with a longer period each time.
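A sketch of a single cosine-annealing cycle, assuming illustrative values for the peak and floor rates; SGDR's restart logic (resetting t and typically lengthening T) is omitted.

```python
import math

# One cosine cycle of T steps, decaying from lr_max at t=0 to lr_min at t=T.
def cosine_lr(t, T, lr_max=1e-3, lr_min=0.0):
    t = min(t, T)  # clamp so the rate stays at the floor after the cycle ends
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / T))
```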
Linear warmup
Large-model training almost always starts with a linear warmup from 0 (or some small initial value) up to the peak learning rate over the first few hundred to few thousand steps.
There are several reasons:
- Adam's moment estimates are poor at the start (the second-moment estimate v_t is built from only a few gradients, so the bias-corrected value is unreliable when t is small); large early updates can destabilise training.
- LayerNorm/BatchNorm statistics are unreliable until the network has seen enough data.
- Residual networks at init are nearly identity; large updates from the first batch can flip the sign of the residual contribution.
Combined warmup + cosine decay (linear up, cosine down) is the canonical Transformer schedule.
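A sketch of that combined schedule; the peak rate, warmup length, and floor below are assumed hyper-parameters, not values from any particular paper.

```python
import math

# Linear warmup to peak_lr over warmup_steps, then cosine decay to final_lr
# over the remaining steps.
def warmup_cosine_lr(step, total_steps, peak_lr=3e-4,
                     warmup_steps=2000, final_lr=3e-5):
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))
```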
Cyclical learning rates and one-cycle
Cyclical Learning Rates for Training Neural Networks (Smith, WACV 2017) found that oscillating the learning rate between a lower and an upper bound (a triangular wave) can train networks as fast as or faster than a monotone decay, and introduced the LR-range test for choosing those bounds.
Super-convergence and 1cycle (Smith & Topin, 2018) extend this: spend most of training at a high learning rate (one big triangular cycle), then anneal sharply at the end. It was used aggressively in the fast.ai community, including in DAWNBench-style records.
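A sketch of the triangular cyclical schedule in the spirit of Smith (2017); the bounds and cycle length are assumptions for illustration, not values from the papers.

```python
# Ramp linearly from base_lr up to max_lr over the first half of each cycle,
# then back down to base_lr over the second half, repeating indefinitely.
def triangular_clr(step, base_lr=1e-4, max_lr=1e-3, cycle_steps=4000):
    half = cycle_steps / 2
    pos = step % cycle_steps
    frac = pos / half if pos < half else (cycle_steps - pos) / half
    return base_lr + (max_lr - base_lr) * frac
```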
Inverse-square-root: the Transformer schedule
The original Transformer (Attention Is All You Need, 2017) used the inverse-square-root schedule:
lr(step) = d_model^(-0.5) · min(step^(-0.5), step · warmup_steps^(-1.5))
This is linear warmup up to warmup_steps, followed by decay proportional to 1/√step.
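A direct transcription of that formula; the defaults d_model=512 and warmup_steps=4000 match the paper's base model, and the step-0 guard is an added convenience.

```python
# Inverse-square-root ("Noam") schedule from the Transformer paper.
def noam_lr(step, d_model=512, warmup_steps=4000):
    step = max(step, 1)  # the formula is undefined at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```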
Practical recipe
For a new training run:
- LR-range test to find a stable maximum (see the sweep sketch after this list).
- Schedule = linear warmup (1–5% of total steps) → cosine decay to ~10% of peak.
- Total steps chosen by scaling laws or compute budget.
- If validation loss plateaus, try a lower peak learning rate before adding more steps.
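A sketch of the LR-range sweep mentioned in the first item; the bounds and number of steps are assumptions, and the model/loss-logging loop is left out.

```python
# Grow the learning rate exponentially from lr_start to lr_end, one value per
# mini-batch, while logging the training loss; pick the peak LR somewhat below
# the point where the loss starts to diverge.
def lr_range_sweep(step, num_steps=300, lr_start=1e-7, lr_end=1.0):
    frac = step / max(1, num_steps - 1)
    return lr_start * (lr_end / lr_start) ** frac

sweep = [lr_range_sweep(s) for s in range(300)]  # rates to try, in order
```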
What to read next
- SGD, Momentum, Nesterov — the underlying optimiser.
- Adam, AdamW, RMSProp — adaptive optimisers benefit equally from warmup + cosine.
- Scaling Laws — how to budget total training steps.