
Learning Rate Schedules

A learning rate η(t) that varies over training is almost always better than a constant. Two regimes argue for two opposite moves: early in training, gradients are noisy and parameters are far from optimal — large η explores; late in training, the loss surface is locally well-approximated by a quadratic — small η refines. This page surveys the standard schedules and the practical knobs.

Step decay

The classical schedule: hold η constant for K epochs, then divide by 10. Repeat. This was the default for ResNet/ImageNet training: η_0 = 0.1, divided at epochs 30, 60, 90. Step decay is robust and easy to reason about, but the timing is hand-tuned per dataset and the discontinuities can briefly destabilise training.
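As a minimal sketch (the model and optimizer are placeholders, and the milestones are the ResNet recipe above), this is how step decay is usually wired up with PyTorch's built-in MultiStepLR:

```python
import torch

# Placeholder model/optimizer, just to show the scheduler wiring.
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Divide the learning rate by 10 at epochs 30, 60 and 90.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30, 60, 90], gamma=0.1
)

for epoch in range(100):
    # ... one epoch of training (optimizer.step() per batch) ...
    scheduler.step()  # advance the schedule once per epoch
```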

Cosine annealing

SGDR: Stochastic Gradient Descent with Warm Restarts (Loshchilov & Hutter, ICLR 2017) introduced cosine schedules:

$$\eta(t) = \eta_{\min} + \frac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)\left(1 + \cos\frac{\pi t}{T}\right)$$

Smooth decay from η_max to η_min over T steps. Cosine is the dominant schedule for modern training (it appears in nearly every Transformer and ViT recipe) because it gives slow refinement near the end without the abrupt drops of step decay. SGDR also adds periodic restarts back to η_max ("warm restarts"), which is useful for ensembling snapshot models.
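The formula translates directly into a few lines of Python; eta_max, eta_min and T are the symbols from the equation above, and the clamp at T is an added assumption so the rate stays at eta_min if training runs longer:

```python
import math

def cosine_lr(t: int, T: int, eta_max: float, eta_min: float = 0.0) -> float:
    """Cosine annealing from eta_max to eta_min over T steps (no restarts)."""
    t = min(t, T)  # clamp so the rate stays at eta_min after T steps
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T))
```

PyTorch's CosineAnnealingLR implements the same curve, and CosineAnnealingWarmRestarts adds the SGDR-style restarts.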

Linear warmup

Large-model training almost always starts with a linear warmup from 0 (or some small η_0) up to the peak rate over the first W steps, typically W ∈ [500, 10000]:

$$\eta(t) = \eta_{\max}\cdot\min\!\left(1, \frac{t}{W}\right)$$

The reasons are several:

  • Adam's second-moment estimate v̂_t has high variance early on (roughly the first 1/(1−β₂) steps, about 1000 at the default β₂ = 0.999); a large η in that window can destabilise training.
  • LayerNorm/BatchNorm statistics are unreliable until the network has seen enough data.
  • Residual networks at init are nearly identity; large updates from the first batch can flip the sign of the residual contribution.

Combined warmup + cosine decay (linear up, cosine down) is the canonical Transformer schedule.
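A minimal sketch of that canonical schedule as a single function of the step index; total_steps, warmup_steps and the eta values are assumed hyperparameters:

```python
import math

def warmup_cosine_lr(t: int, total_steps: int, warmup_steps: int,
                     eta_max: float, eta_min: float = 0.0) -> float:
    """Linear warmup to eta_max over warmup_steps, then cosine decay to eta_min."""
    if t < warmup_steps:
        return eta_max * t / max(1, warmup_steps)
    progress = (t - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * progress))

# Typical PyTorch use: wrap it in a LambdaLR with the optimizer's base lr set to
# eta_max, so the lambda returns a multiplier in [eta_min/eta_max, 1]:
# scheduler = torch.optim.lr_scheduler.LambdaLR(
#     optimizer, lambda t: warmup_cosine_lr(t, total_steps, warmup_steps, 1.0, 0.1))
```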

Cyclical learning rates and one-cycle

Cyclical Learning Rates for Training Neural Networks (Smith, WACV 2017) found that oscillating η between low and high values often beats monotone decay for image classification. The same paper proposed the LR-range test: a short training run with η exponentially increasing each step, plotting loss vs η to find the largest stable rate. This single trick is the cheapest way to set η_max on a new dataset.
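A rough sketch of the LR-range test; train_step and data_iter are hypothetical stand-ins for one optimisation step and the batch stream, and the divergence threshold (4× the best loss seen so far) is an arbitrary choice:

```python
import math

def lr_range_test(optimizer, train_step, data_iter,
                  lr_start=1e-7, lr_end=10.0, num_steps=200):
    """Exponentially sweep the learning rate and record (lr, loss) pairs.

    train_step(batch) is assumed to run one forward/backward/update and return
    the loss as a float. Plot the result and pick eta_max just below the point
    where the loss blows up.
    """
    gamma = (lr_end / lr_start) ** (1.0 / num_steps)  # per-step multiplier
    lr = lr_start
    history = []
    for _ in range(num_steps):
        for group in optimizer.param_groups:
            group["lr"] = lr
        loss = train_step(next(data_iter))
        history.append((lr, loss))
        if not math.isfinite(loss) or loss > 4 * min(l for _, l in history):
            break  # diverged: stop the sweep
        lr *= gamma
    return history
```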

Super-convergence and the 1cycle policy (Smith & Topin, 2018) extend this: spend most of training at a high learning rate (one big triangular cycle), then anneal sharply at the end. It is used aggressively in the fast.ai community and featured in DAWNBench-style records.
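For reference, PyTorch ships an implementation of the 1cycle policy; the step budget and peak rate here are illustrative assumptions:

```python
import torch

model = torch.nn.Linear(10, 10)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

total_steps = 10_000  # assumed compute budget
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.1,           # peak rate, e.g. from an LR-range test
    total_steps=total_steps,
    pct_start=0.3,        # fraction of steps spent ramping up
    anneal_strategy="cos",
)

for step in range(total_steps):
    # ... forward/backward/optimizer.step() ...
    scheduler.step()      # OneCycleLR is stepped per batch, not per epoch
```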

Inverse-square-root: the Transformer schedule

The original Transformer (Attention is All You Need, 2017) used the inverse-square-root schedule:

$$\eta(t) = d_{\text{model}}^{-0.5}\cdot\min\!\left(t^{-0.5},\; t\cdot W^{-1.5}\right)$$

This is linear warmup followed by 1/√t decay, the classical step-size rate from convex SGD analysis. Modern LLMs have largely replaced it with linear warmup + cosine, which empirically gives lower final loss at the same compute, but inverse-sqrt remains in some translation and T5 recipes.
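The schedule above in code; the defaults mirror the base model from the paper (d_model = 512, 4000 warmup steps):

```python
def inverse_sqrt_lr(t: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Original Transformer schedule: linear warmup, then 1/sqrt(t) decay."""
    t = max(t, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(t ** -0.5, t * warmup_steps ** -1.5)
```

The two branches meet exactly at t = W, where both evaluate to d_model^-0.5 · W^-0.5.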

Practical recipe

For a new training run:

  1. LR-range test to find a stable maximum.
  2. Schedule = linear warmup (1–5% of total steps) → cosine decay to ~10% of peak.
  3. Total steps chosen by scaling laws or compute budget.
  4. If validation loss plateaus, try a lower η_min before adding more steps.
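Putting the recipe together, assuming the warmup_cosine_lr helper sketched earlier and purely illustrative numbers for the budget and peak rate:

```python
# Illustrative wiring of the recipe; none of these numbers are universal defaults.
total_steps = 100_000
warmup_steps = int(0.03 * total_steps)   # 3% warmup
eta_max = 3e-4                           # e.g. chosen via an LR-range test
eta_min = 0.1 * eta_max                  # decay to ~10% of peak

lrs = [warmup_cosine_lr(t, total_steps, warmup_steps, eta_max, eta_min)
       for t in range(total_steps)]      # full learning-rate curve, step by step
```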
