Second-Order & Natural Gradient
First-order optimisers (SGD, Adam) use only the gradient. Second-order methods use the Hessian, or a structured approximation to it, to take a step that respects the local curvature of the loss. Newton-like methods need far fewer iterations, especially on ill-conditioned problems, but in their pure form they are too expensive for deep networks. The practical compromise is the K-FAC / Shampoo line: curvature-aware updates with manageable memory and compute.
Newton's method as the reference
For a quadratic loss $L(\theta) = \tfrac{1}{2}\theta^\top H \theta + g^\top \theta$ with positive-definite Hessian $H$, the Newton step $\Delta\theta = -H^{-1}\nabla L(\theta)$ solves the optimum in one step. For non-quadratic losses, the iteration $\theta_{t+1} = \theta_t - \eta\, H_t^{-1} \nabla L(\theta_t)$ converges quadratically near a minimum, but forming and inverting $H_t$ costs $O(d^2)$ memory and $O(d^3)$ time, which is hopeless when $d$ is in the millions.
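A minimal numpy sketch of the one-step property on a small quadratic (the matrices here are illustrative, not from any real model):

```python
import numpy as np

# Quadratic loss L(theta) = 0.5 theta^T H theta + g^T theta,
# with H symmetric positive definite so the Newton step is exact.
H = np.array([[3.0, 1.0], [1.0, 2.0]])   # Hessian (curvature)
g = np.array([1.0, -1.0])                # linear term; gradient at theta = 0

theta0 = np.zeros(2)
grad = H @ theta0 + g                    # gradient of L at theta0
theta1 = theta0 - np.linalg.solve(H, grad)  # one Newton step

# The gradient at theta1 vanishes: the quadratic is solved in one step.
print(np.linalg.norm(H @ theta1 + g))    # ≈ 0
```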
Quasi-Newton: BFGS and L-BFGS
BFGS approximates the inverse Hessian with a dense matrix $B_t$ updated from the curvature pairs $s_t = \theta_{t+1} - \theta_t$ and $y_t = \nabla L(\theta_{t+1}) - \nabla L(\theta_t)$ so that the secant condition $B_{t+1} y_t = s_t$ holds. L-BFGS stores only the last $m$ pairs (typically $m \approx 5$–$20$) and reconstructs the product $B_t \nabla L$ on the fly with the two-loop recursion, cutting memory from $O(d^2)$ to $O(md)$.
Natural gradient
Natural Gradient Works Efficiently in Learning (Amari, 1998) replaces the Hessian with the Fisher information matrix $F = \mathbb{E}_{x \sim p_\theta}\!\left[\nabla_\theta \log p_\theta(x)\, \nabla_\theta \log p_\theta(x)^\top\right]$.
The natural gradient $\tilde{\nabla} L = F^{-1} \nabla L$ is the steepest-descent direction when distance between models is measured by KL divergence rather than Euclidean distance in parameter space; the resulting update is invariant to smooth reparameterisations of the model.
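A scalar sketch of the effect, assuming a one-parameter Bernoulli model $p(x{=}1) = \sigma(\theta)$ fitted to a target mean (the Fisher information of the logit is $p(1-p)$, so dividing by it undoes the sigmoid's saturation):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

target = 0.9    # desired p(x = 1)
theta = -1.0    # logit parameter, deliberately far from the answer
for _ in range(50):
    p = sigmoid(theta)
    grad = p - target         # d/dtheta of KL(target || p_theta)
    fisher = p * (1.0 - p)    # Fisher information (a scalar here)
    theta -= grad / fisher    # natural-gradient step, learning rate 1

print(sigmoid(theta))  # ≈ 0.9
```

With the plain gradient the step size shrinks by the factor $p(1-p)$ whenever the sigmoid saturates; the natural step cancels that factor, which is exactly the reparameterisation invariance in miniature.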
Natural gradient is also the foundation of TRPO (trust-region policy optimisation), where it constrains each policy update to a trust region of fixed KL divergence.
K-FAC — Kronecker-factored approximate curvature
Optimizing Neural Networks with Kronecker-factored Approximate Curvature (Martens, Grosse, ICML 2015) factorises each layer's block of the Fisher as a Kronecker product, $F_\ell \approx A_{\ell-1} \otimes G_\ell$, where $A_{\ell-1} = \mathbb{E}[a a^\top]$ is the second moment of the layer's input activations and $G_\ell = \mathbb{E}[g g^\top]$ that of the gradients with respect to the layer's outputs.
Inverting a Kronecker product is the Kronecker product of the inverses, $(A \otimes G)^{-1} = A^{-1} \otimes G^{-1}$, so the preconditioned update reduces to two small matrix products: $\Delta W \propto G_\ell^{-1}\, (\nabla_W L)\, A_{\ell-1}^{-1}$.
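The identity can be checked numerically for one linear layer; this sketch assumes batches of input activations `a` and output gradients `g` are available, with an illustrative Tikhonov damping term:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_out = 64, 8, 4
a = rng.normal(size=(batch, d_in))    # layer inputs
g = rng.normal(size=(batch, d_out))   # gradients w.r.t. layer outputs
W_grad = g.T @ a / batch              # ordinary gradient of the weights

damping = 1e-2
A = a.T @ a / batch + damping * np.eye(d_in)    # input second moment
G = g.T @ g / batch + damping * np.eye(d_out)   # output-grad second moment

# (A ⊗ G)^{-1} vec(∇W) == vec(G^{-1} ∇W A^{-1}): two small solves
# instead of one (d_in * d_out)-dimensional inverse.
precond = np.linalg.solve(G, W_grad) @ np.linalg.inv(A)

# Check against the explicit Kronecker-factored Fisher (column-major vec).
F = np.kron(A, G)
explicit = np.linalg.solve(F, W_grad.flatten(order="F")).reshape(
    (d_out, d_in), order="F")
print(np.allclose(precond, explicit))  # True
```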
Shampoo — multi-layer Kronecker preconditioning
Shampoo: Preconditioned Stochastic Tensor Optimization (Gupta, Koren, Singer, ICML 2018) extends Kronecker-factored preconditioning to general tensor parameters. For a weight tensor with axes $d_1 \times \cdots \times d_k$, it maintains one preconditioner $H_i \in \mathbb{R}^{d_i \times d_i}$ per axis, accumulated as $H_i \leftarrow H_i + G_{(i)} G_{(i)}^\top$ over the mode-$i$ matricisation of the gradient, and preconditions by multiplying each mode of the gradient by $H_i^{-1/(2k)}$; for a matrix ($k = 2$) the update is $L^{-1/4}\, G\, R^{-1/4}$.
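The matrix case can be sketched as follows; the gradients here are random stand-ins, and the damping `eps` and learning rate are illustrative choices, not values from the paper:

```python
import numpy as np

def inv_fourth_root(M, eps=1e-4):
    """Symmetric M^{-1/4} via eigendecomposition, damped by eps."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag((w + eps) ** -0.25) @ V.T

rng = np.random.default_rng(0)
d_out, d_in = 4, 6
L = np.zeros((d_out, d_out))   # left (row-axis) statistic
R = np.zeros((d_in, d_in))     # right (column-axis) statistic
W = rng.normal(size=(d_out, d_in))
lr = 0.1
for step in range(10):
    G = rng.normal(size=(d_out, d_in))   # stand-in stochastic gradient
    L += G @ G.T                         # accumulate per-axis statistics
    R += G.T @ G
    # Precondition each mode by the -1/4 power (-1/(2k), k = 2 axes).
    W -= lr * inv_fourth_root(L) @ G @ inv_fourth_root(R)
```

Each preconditioner is only $d_i \times d_i$, so the memory cost scales with the sum of the axis lengths squared rather than with the full parameter count squared.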
When second-order pays off
Practical heuristics:
- Small-to-medium models, full-batch — L-BFGS is hard to beat.
- Stochastic deep training — Adam/AdamW are usually faster wall-clock than K-FAC/Shampoo despite needing more steps.
- Very large models with massive batch sizes — Shampoo and its variants close the gap with Adam and sometimes win, particularly when wall-clock is dominated by communication rather than compute.
- RL / KL-constrained optimisation — natural gradient (and its descendants TRPO/PPO) is the standard.
What to read next
- SGD, Momentum, Nesterov — the first-order baseline.
- Adam, AdamW, RMSProp — diagonal-Fisher approximation; the closest first-order method to natural gradient.
- PPO/TRPO — natural-gradient-based RL.