SGD, Momentum, Nesterov
Stochastic Gradient Descent (SGD) and its momentum variants are the simplest, oldest, and — for many tasks — still the best optimisers for training neural networks. The recipe is two lines of code, but the dynamics it produces (escape from saddle points, implicit regularisation toward flat minima) underpin the empirical success of deep learning.
Plain SGD
Given a loss $L(\theta) = \frac{1}{N}\sum_{i=1}^{N} \ell(\theta;\, x_i)$ over $N$ training examples, plain SGD replaces the full gradient with an estimate computed on a sampled mini-batch $\mathcal{B}_t$ and takes the step

$$\theta_{t+1} = \theta_t - \eta\,\hat{g}_t(\theta_t), \qquad \hat{g}_t(\theta) = \frac{1}{|\mathcal{B}_t|}\sum_{i \in \mathcal{B}_t} \nabla \ell(\theta;\, x_i),$$

where $\eta$ is the learning rate.
The mini-batch gradient is unbiased but noisy. The noise is not just a budget compromise — it has been shown empirically to bias SGD toward flatter minima that generalise better than the sharp minima found by full-batch optimisation (Keskar et al., 2017).
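As a minimal sketch of the "two lines of code", assuming a toy least-squares loss standing in for any differentiable objective (the data and the `minibatch_grad` helper below are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative least-squares problem standing in for any differentiable loss.
X = rng.normal(size=(1024, 10))
y = rng.normal(size=1024)

def minibatch_grad(theta, idx):
    """Gradient of 0.5 * mean((X[idx] @ theta - y[idx])**2) w.r.t. theta."""
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ theta - yb) / len(idx)

theta = np.zeros(10)
eta, batch_size = 0.1, 32            # learning rate and mini-batch size

for step in range(1000):
    idx = rng.choice(len(X), size=batch_size, replace=False)
    theta -= eta * minibatch_grad(theta, idx)   # sample a batch, take a step
```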
Heavy-ball momentum
Polyak's heavy-ball method (1964) adds a velocity term that accumulates an exponentially weighted sum of past gradients:

$$v_{t+1} = \mu v_t - \eta\,\hat{g}_t(\theta_t), \qquad \theta_{t+1} = \theta_t + v_{t+1},$$

with momentum coefficient $\mu \in [0, 1)$, typically 0.9. The velocity $v_t$ does three things:
- Smoothing — short-timescale gradient noise averages out.
- Acceleration — in directions where gradients consistently point the same way, the effective step size grows by a factor of up to $1/(1-\mu)$ (a tenfold increase at $\mu = 0.9$).
- Damping — in oscillating directions (e.g., across a narrow valley), opposing gradients cancel.
Momentum is what makes SGD competitive with second-order methods on quadratic problems, and it is the reason the standard baseline is "SGD with momentum" rather than plain SGD.
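A minimal sketch of the heavy-ball update as a reusable routine; the function name and defaults are illustrative, and `grad_fn` can be any stochastic-gradient callable such as the `minibatch_grad` from the plain-SGD sketch above:

```python
import numpy as np

def heavy_ball_sgd(grad_fn, theta0, eta=0.1, mu=0.9, steps=1000):
    """Heavy-ball SGD: v <- mu*v - eta*g(theta); theta <- theta + v."""
    theta = np.asarray(theta0, dtype=float).copy()
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = mu * v - eta * grad_fn(theta)   # exponentially weighted sum of past gradients
        theta = theta + v                   # step along the velocity, not the raw gradient
    return theta
```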
Nesterov accelerated gradient
Nesterov's twist (1983, in "A method of solving a convex programming problem with convergence rate $O(1/k^2)$") is to evaluate the gradient at the lookahead position $\theta_t + \mu v_t$ rather than at the current iterate:

$$v_{t+1} = \mu v_t - \eta\,\hat{g}_t(\theta_t + \mu v_t), \qquad \theta_{t+1} = \theta_t + v_{t+1}.$$

The lookahead lets the optimiser correct course before the momentum carries it past a curving valley. On smooth convex problems, NAG achieves the optimal $O(1/k^2)$ rate for first-order methods, versus $O(1/k)$ for plain gradient descent.
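The change from heavy-ball is a single line: the gradient is queried at the lookahead point. A minimal sketch under the same assumptions as the heavy-ball example (illustrative names, any stochastic-gradient callable for `grad_fn`):

```python
import numpy as np

def nesterov_sgd(grad_fn, theta0, eta=0.1, mu=0.9, steps=1000):
    """NAG: the gradient is evaluated at the lookahead point theta + mu*v."""
    theta = np.asarray(theta0, dtype=float).copy()
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = mu * v - eta * grad_fn(theta + mu * v)   # lookahead gradient
        theta = theta + v
    return theta
```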
Mini-batch size and the linear-scaling rule
Empirically, when the batch size $B$ is multiplied by $k$, multiplying the learning rate by the same factor $k$ keeps the training trajectory roughly unchanged: averaging over a $k\times$ larger batch cuts the gradient variance by a factor of $k$, so a proportionally larger step carries roughly the same noise per update. This linear-scaling rule (Goyal et al., 2017) holds up to a problem-dependent ceiling; very large batches need a learning-rate warm-up and eventually stop translating into faster training.
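In practice the rule is applied relative to a reference configuration. A small sketch, where the base values of 0.1 and 256 are illustrative assumptions rather than recommendations:

```python
def scaled_lr(batch_size, base_lr=0.1, base_batch=256):
    """Linear-scaling rule: scale the learning rate by batch_size / base_batch."""
    return base_lr * batch_size / base_batch

print(scaled_lr(1024))   # 0.4: a 4x larger batch gets a 4x larger learning rate
```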
Convergence theory and SGD's blind spots
For convex losses, SGD with a step-size sequence satisfying the Robbins-Monro conditions $\sum_t \eta_t = \infty$ and $\sum_t \eta_t^2 < \infty$ converges to the optimum, at rate $O(1/\sqrt{t})$ in general and $O(1/t)$ under strong convexity. For the non-convex losses of deep networks, the guarantees weaken to convergence to a stationary point, and the theory says little about which minimum is found or how well it generalises; that gap is exactly where the empirical flat-minima observations above take over.
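For example, a $1/t$-style decay satisfies both conditions. A small sketch with illustrative constants:

```python
def robbins_monro_lr(t, eta0=0.1, tau=100.0):
    """eta_t = eta0 / (1 + t/tau): sum(eta_t) diverges, sum(eta_t**2) converges."""
    return eta0 / (1.0 + t / tau)
```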
What to read next
- Adam, AdamW, RMSProp — adaptive optimisers that match or beat SGD on many modern workloads.
- Learning Rate Schedules — the third hyperparameter, after batch size and momentum.
- Backpropagation — what produces the gradients SGD consumes.