
Learning Rate Schedules

A learning rate η(t) that varies over training is almost always better than a constant. Two regimes argue for two opposite moves: early in training, gradients are noisy and parameters are far from optimal — large η explores; late in training, the loss surface is locally well-approximated by a quadratic — small η refines. This page surveys the standard schedules and the practical knobs.

Step decay

The classical schedule: hold η constant for K epochs, then divide by 10. Repeat. This was the default for ResNet/ImageNet training: η_0 = 0.1, divided at epochs 30, 60, 90. Step decay is robust and easy to reason about, but the timing is hand-tuned per dataset and the discontinuities can briefly destabilise training.
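As a minimal sketch (the model and optimizer are placeholders, and the milestones are the ResNet recipe above), this is how step decay is usually wired up with PyTorch's built-in MultiStepLR:

```python
import torch

# Placeholder model/optimizer, just to show the scheduler wiring.
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Divide the learning rate by 10 at epochs 30, 60 and 90.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30, 60, 90], gamma=0.1
)

for epoch in range(100):
    # ... one epoch of training (optimizer.step() per batch) ...
    scheduler.step()  # advance the schedule once per epoch
```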

Cosine annealing

SGDR: Stochastic Gradient Descent with Warm Restarts (Loshchilov & Hutter, ICLR 2017) introduced cosine schedules:

$$\eta(t) = \eta_{\min} + \frac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)\left(1 + \cos\frac{\pi t}{T}\right)$$

Smooth decay from η_max to η_min over T steps. Cosine is the dominant schedule for modern training (it appears in nearly every Transformer and ViT recipe) because it gives slow refinement near the end without the abrupt drops of step decay. SGDR also adds periodic restarts back to η_max ("warm restarts"), which is useful for ensembling snapshot models.
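The formula translates directly into a few lines of Python; eta_max, eta_min and T are the symbols from the equation above, and the clamp at T is an added assumption so the rate stays at eta_min if training runs longer:

```python
import math

def cosine_lr(t: int, T: int, eta_max: float, eta_min: float = 0.0) -> float:
    """Cosine annealing from eta_max to eta_min over T steps (no restarts)."""
    t = min(t, T)  # clamp so the rate stays at eta_min after T steps
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T))
```

PyTorch's CosineAnnealingLR implements the same curve, and CosineAnnealingWarmRestarts adds the SGDR-style restarts.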

Linear warmup

Large-model training almost always starts with a linear warmup from 0 (or some small η_0) up to the peak rate over the first W steps, typically W ∈ [500, 10000]:

$$\eta(t) = \eta_{\max}\cdot\min\!\left(1, \frac{t}{W}\right)$$

The reasons are several:

  • Adam's second-moment estimate v̂_t has high variance early on (roughly the first 1/(1−β₂) steps, about 1000 at the default β₂ = 0.999); a large η in that window can destabilise training.
  • LayerNorm/BatchNorm statistics are unreliable until the network has seen enough data.
  • Residual networks at init are nearly identity; large updates from the first batch can flip the sign of the residual contribution.

Combined warmup + cosine decay (linear up, cosine down) is the canonical Transformer schedule.
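A minimal sketch of that canonical schedule as a single function of the step index; total_steps, warmup_steps and the eta values are assumed hyperparameters:

```python
import math

def warmup_cosine_lr(t: int, total_steps: int, warmup_steps: int,
                     eta_max: float, eta_min: float = 0.0) -> float:
    """Linear warmup to eta_max over warmup_steps, then cosine decay to eta_min."""
    if t < warmup_steps:
        return eta_max * t / max(1, warmup_steps)
    progress = (t - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * progress))

# Typical PyTorch use: wrap it in a LambdaLR with the optimizer's base lr set to
# eta_max, so the lambda returns a multiplier in [eta_min/eta_max, 1]:
# scheduler = torch.optim.lr_scheduler.LambdaLR(
#     optimizer, lambda t: warmup_cosine_lr(t, total_steps, warmup_steps, 1.0, 0.1))
```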

Cyclical learning rates and one-cycle

Cyclical Learning Rates for Training Neural Networks (Smith, WACV 2017) found that oscillating η between low and high values often beats monotone decay for image classification. The same paper proposed the LR-range test: a short training run with η exponentially increasing each step, plotting loss vs η to find the largest stable rate. This single trick is the cheapest way to set η_max on a new dataset.
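A rough sketch of the LR-range test; train_step and data_iter are hypothetical stand-ins for one optimisation step and the batch stream, and the divergence threshold (4× the best loss seen so far) is an arbitrary choice:

```python
import math

def lr_range_test(optimizer, train_step, data_iter,
                  lr_start=1e-7, lr_end=10.0, num_steps=200):
    """Exponentially sweep the learning rate and record (lr, loss) pairs.

    train_step(batch) is assumed to run one forward/backward/update and return
    the loss as a float. Plot the result and pick eta_max just below the point
    where the loss blows up.
    """
    gamma = (lr_end / lr_start) ** (1.0 / num_steps)  # per-step multiplier
    lr = lr_start
    history = []
    for _ in range(num_steps):
        for group in optimizer.param_groups:
            group["lr"] = lr
        loss = train_step(next(data_iter))
        history.append((lr, loss))
        if not math.isfinite(loss) or loss > 4 * min(l for _, l in history):
            break  # diverged: stop the sweep
        lr *= gamma
    return history
```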

Super-convergence and the 1cycle policy (Smith & Topin, 2018) extend this: spend most of training at a high learning rate (one big triangular cycle), then anneal sharply at the end. It is used aggressively in the fast.ai community and featured in DAWNBench-style records.
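For reference, PyTorch ships an implementation of the 1cycle policy; the step budget and peak rate here are illustrative assumptions:

```python
import torch

model = torch.nn.Linear(10, 10)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

total_steps = 10_000  # assumed compute budget
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.1,           # peak rate, e.g. from an LR-range test
    total_steps=total_steps,
    pct_start=0.3,        # fraction of steps spent ramping up
    anneal_strategy="cos",
)

for step in range(total_steps):
    # ... forward/backward/optimizer.step() ...
    scheduler.step()      # OneCycleLR is stepped per batch, not per epoch
```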

Inverse-square-root: the Transformer schedule

The original Transformer (Attention is All You Need, 2017) used the inverse-square-root schedule:

$$\eta(t) = d_{\text{model}}^{-0.5}\cdot\min\!\left(t^{-0.5},\; t\cdot W^{-1.5}\right)$$

This is linear warmup followed by 1/√t decay, the classical step-size rate from convex SGD analysis. Modern LLMs have largely replaced it with linear warmup + cosine, which empirically gives lower final loss at the same compute, but inverse-sqrt remains in some translation and T5 recipes.
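The schedule above in code; the defaults mirror the base model from the paper (d_model = 512, 4000 warmup steps):

```python
def inverse_sqrt_lr(t: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Original Transformer schedule: linear warmup, then 1/sqrt(t) decay."""
    t = max(t, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(t ** -0.5, t * warmup_steps ** -1.5)
```

The two branches meet exactly at t = W, where both evaluate to d_model^-0.5 · W^-0.5.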

Practical recipe

For a new training run:

  1. LR-range test to find a stable maximum.
  2. Schedule = linear warmup (1–5% of total steps) → cosine decay to ~10% of peak.
  3. Total steps chosen by scaling laws or compute budget.
  4. If validation loss plateaus, try a lower η_min before adding more steps.
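Putting the recipe together, assuming the warmup_cosine_lr helper sketched earlier and purely illustrative numbers for the budget and peak rate:

```python
# Illustrative wiring of the recipe; none of these numbers are universal defaults.
total_steps = 100_000
warmup_steps = int(0.03 * total_steps)   # 3% warmup
eta_max = 3e-4                           # e.g. chosen via an LR-range test
eta_min = 0.1 * eta_max                  # decay to ~10% of peak

lrs = [warmup_cosine_lr(t, total_steps, warmup_steps, eta_max, eta_min)
       for t in range(total_steps)]      # full learning-rate curve, step by step
```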
