
SGD, Momentum, Nesterov

Stochastic Gradient Descent (SGD) and its momentum variants are the simplest, oldest, and — for many tasks — still the best optimisers for training neural networks. The recipe is two lines of code, but the dynamics it produces (escape from saddle points, implicit regularisation toward flat minima) underpin the empirical success of deep learning.

Plain SGD

Given a loss $L(\theta) = \mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(\theta; x, y)]$, full-batch gradient descent updates $\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$, but the dataset is too large to evaluate the full gradient each step. SGD replaces the expectation with a mini-batch estimate:

$$\theta_{t+1} = \theta_t - \eta \, \frac{1}{B} \sum_{i \in \text{batch}_t} \nabla \ell(\theta_t; x_i, y_i).$$
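A minimal sketch of this update in plain NumPy; `grad_fn` is a hypothetical per-example gradient function, not something defined in this note:

```python
import numpy as np

def sgd_step(theta, grad_fn, batch, lr=0.1):
    """One plain-SGD step: average per-example gradients over the mini-batch,
    then move against that average, scaled by the learning rate lr."""
    grads = np.stack([grad_fn(theta, x, y) for (x, y) in batch])
    return theta - lr * grads.mean(axis=0)
```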

The mini-batch gradient is unbiased but noisy. The noise is not just the price of a limited compute budget: it has been shown empirically to bias SGD toward flatter minima that generalise better than the sharp minima found by large-batch training (Keskar et al., 2017).

Heavy-ball momentum

Polyak (1964) introduced the heavy-ball method, which adds an exponentially decaying velocity to the update:

$$v_{t+1} = \mu v_t + \nabla L(\theta_t), \qquad \theta_{t+1} = \theta_t - \eta v_{t+1}.$$

The velocity $v$ accumulates past gradients, discounted by the momentum coefficient $\mu$ (typically 0.9). Three effects follow (a code sketch appears after the list):

  • Smoothing — short-timescale gradient noise averages out.
  • Acceleration — in directions where gradients consistently point the same way, the effective step size grows by a factor of $1/(1-\mu) \approx 10$ for $\mu = 0.9$.
  • Damping — in oscillating directions (e.g., across a narrow valley), opposing gradients cancel.
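A minimal heavy-ball sketch under the same assumptions as above; `grad` is the mini-batch gradient already computed at the current `theta`:

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.01, mu=0.9):
    """Heavy-ball update: fold the gradient into the velocity,
    then step along the negative velocity."""
    v = mu * v + grad            # exponentially decayed sum of past gradients
    theta = theta - lr * v
    return theta, v

# Usage: initialise v = np.zeros_like(theta) and carry (theta, v) across steps.
```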

Momentum is what makes gradient descent competitive with second-order methods on quadratic problems (with well-tuned $\mu$, the convergence rate's dependence on the condition number improves from $\kappa$ to $\sqrt{\kappa}$), and it is why "SGD with momentum", not plain SGD, is the usual baseline.

Nesterov accelerated gradient

Nesterov's twist (1983) is to evaluate the gradient at the lookahead position $\theta_t - \eta\mu v_t$ rather than at $\theta_t$:

$$v_{t+1} = \mu v_t + \nabla L(\theta_t - \eta\mu v_t), \qquad \theta_{t+1} = \theta_t - \eta v_{t+1}.$$

The lookahead lets the optimiser correct course before momentum carries it past a curving valley. On smooth convex problems, NAG achieves the optimal $O(1/T^2)$ convergence rate versus heavy-ball's $O(1/T)$. In deep learning the gain is more modest: a small but consistent improvement over plain momentum.
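A sketch of the Nesterov variant, again with a hypothetical `grad_fn(theta)` that returns the mini-batch gradient at the point it is given:

```python
def nesterov_step(theta, v, grad_fn, lr=0.01, mu=0.9):
    """Nesterov update: the gradient is evaluated at the lookahead
    point theta - lr*mu*v before the velocity is updated."""
    grad = grad_fn(theta - lr * mu * v)   # look ahead along the current velocity
    v = mu * v + grad
    theta = theta - lr * v
    return theta, v
```

Frameworks typically expose this through a flag on their SGD optimiser (e.g. `nesterov=True` in `torch.optim.SGD`), though the exact parameterisation can differ slightly from the textbook form above.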

Mini-batch size and the linear-scaling rule

Empirically, when the batch size $B$ doubles, the right learning rate roughly doubles too (Goyal et al., 2017, Accurate, Large Minibatch SGD). The intuition: the variance of the mini-batch gradient scales as $1/B$, so a larger batch tolerates a proportionally larger step, and taking fewer, larger steps per epoch keeps per-epoch progress roughly constant. Below some critical batch size this rule holds; above it, returns diminish (McCandlish et al., 2018).
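A sketch of the rule as a tiny helper; the reference values (base rate 0.1 at batch size 256) echo the ResNet-50 setup in Goyal et al. but are placeholders here:

```python
def scaled_lr(batch_size, base_lr=0.1, base_batch=256):
    """Linear-scaling rule: the learning rate grows in proportion to batch size."""
    return base_lr * batch_size / base_batch

# scaled_lr(512) -> 0.2, scaled_lr(2048) -> 0.8
```

Goyal et al. combine this with a warmup phase, ramping the learning rate up over the first few epochs, so that very large early steps do not destabilise training.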

Convergence theory and SGD's blind spots

For strongly convex losses, SGD with step size $\eta_t \propto 1/t$ converges at rate $O(1/T)$; for general convex losses the guarantee weakens to $O(1/\sqrt{T})$. For non-convex losses, only convergence to a stationary point (gradient norm $\to 0$) is guaranteed. Why SGD nonetheless finds good stationary points for deep networks (flat, generalisation-friendly minima) is a topic of implicit-bias research and is still open in full generality.
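For reference, one standard textbook form of the non-convex guarantee (not stated in the original note): for an $L$-smooth loss with bounded gradient-noise variance and an appropriately chosen constant step size,

$$\min_{0 \le t < T} \; \mathbb{E}\!\left[\|\nabla L(\theta_t)\|^2\right] \;=\; O\!\left(\frac{1}{\sqrt{T}}\right),$$

which controls how quickly the gradient norm shrinks but says nothing about which stationary point SGD ends up near.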
