Second-Order & Natural Gradient

First-order optimisers (SGD, Adam) use only the gradient. Second-order methods use the Hessian — or a structured approximation to it — to take a step that respects the local curvature of the loss. Newton-like methods converge faster on well-conditioned problems but are too expensive in their pure form for deep networks. The practical compromise is the K-FAC / Shampoo line: Hessian-aware updates with manageable memory and compute.

Newton's method as the reference

For a quadratic loss $L(\theta) = \tfrac{1}{2}\theta^\top H \theta + g^\top \theta$, the Newton step

$$\theta_{t+1} = \theta_t - H^{-1} \nabla L(\theta_t)$$

reaches the optimum in one step. For non-quadratic losses, $H$ is the local Hessian and Newton converges quadratically near a local minimum, much faster than SGD's linear convergence. The cost is that storing and inverting a $P \times P$ Hessian takes $O(P^2)$ memory and $O(P^3)$ flops, infeasible for $P > 10^5$ parameters.
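
As a concrete illustration, here is a minimal NumPy sketch of the Newton step on a toy quadratic; the dimensions and the random problem instance are illustrative assumptions, not anything from the text above:

```python
import numpy as np

# Toy quadratic L(theta) = 0.5 * theta^T H theta + g^T theta
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
H = A @ A.T + 1e-3 * np.eye(5)   # symmetric positive-definite Hessian
g = rng.standard_normal(5)

theta = np.zeros(5)
grad = H @ theta + g             # gradient of the quadratic at theta

# Newton step: solve H d = grad rather than forming H^{-1} explicitly
theta_new = theta - np.linalg.solve(H, grad)

# For a quadratic, one step lands on the minimiser: the gradient at theta_new is ~0
print(np.linalg.norm(H @ theta_new + g))
```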

Quasi-Newton: BFGS and L-BFGS

BFGS approximates $H^{-1}$ with successive rank-one or rank-two updates built from observed gradient differences, never instantiating $H$ explicitly. L-BFGS (limited-memory BFGS) keeps only the last $m$ update pairs, giving $O(mP)$ memory. Both are workhorses for batch non-convex optimisation in classical ML and remain useful for fine-tuning small models, hyperparameter optimisation, and physics-informed networks. They do not work well in the stochastic setting, because noisy gradient estimates poison the curvature approximation, which is why neural-network training rarely uses them.
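
A small sketch of full-batch L-BFGS via SciPy's optimize interface; the Rosenbrock objective and the option values are illustrative choices, not anything prescribed above:

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# Full-batch L-BFGS on a classic smooth test function (Rosenbrock).
x0 = np.zeros(10)
result = minimize(
    rosen,                 # objective
    x0,
    jac=rosen_der,         # analytic gradient; L-BFGS builds curvature from gradient differences
    method="L-BFGS-B",
    options={"maxcor": 10, "maxiter": 500},  # maxcor = number m of stored update pairs
)
print(result.x, result.nit)
```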

Natural gradient

Natural Gradient Works Efficiently in Learning (Amari, 1998) replaces the Hessian with the Fisher Information Matrix:

$$F(\theta) = \mathbb{E}_{x,\, y \sim p_\theta}\!\left[\nabla_\theta \log p_\theta(y \mid x)\, \nabla_\theta \log p_\theta(y \mid x)^\top\right].$$

The natural gradient $F^{-1}\nabla L$ follows the steepest-descent direction in distribution space (measured by KL divergence) rather than in parameter space. This makes it invariant to reparameterisation: scaling weights or changing units does not change the optimisation trajectory. Natural gradient is the principled justification for many adaptive methods; Adam's second-moment term can be read as an approximation to the diagonal of $F$.

Natural gradient is also the foundation of TRPO (trust-region policy optimisation), where it constrains policy updates to a fixed KL ball.
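
A minimal sketch of a damped natural-gradient step for logistic regression, using the empirical Fisher built from per-example score vectors; the synthetic data, damping, and step size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 10))
w_true = rng.standard_normal(10)
y = (rng.random(256) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

w = np.zeros(10)
p = 1.0 / (1.0 + np.exp(-X @ w))           # model probabilities p_theta(y=1 | x)

# Per-example score vectors: d/dw log p_theta(y|x) = (y - p) * x
scores = (y - p)[:, None] * X
grad = -scores.mean(axis=0)                # gradient of the mean negative log-likelihood

# Empirical Fisher: average outer product of scores, plus damping for invertibility
F = scores.T @ scores / len(X) + 1e-3 * np.eye(10)

w = w - 1.0 * np.linalg.solve(F, grad)     # natural-gradient step: w <- w - eta * F^{-1} grad
```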

K-FAC — Kronecker-factored approximate curvature

Optimizing Neural Networks with Kronecker-factored Approximate Curvature (Martens & Grosse, ICML 2015) factorises $F$ as a Kronecker product of small matrices per layer. For a fully-connected layer with input $a$ and pre-activation gradient $g$,

$$F \approx \mathbb{E}[a a^\top] \otimes \mathbb{E}[g g^\top].$$

The inverse of a Kronecker product is the Kronecker product of the inverses, so $F^{-1}$ costs $O(d^3)$ per layer instead of $O((d^2)^3)$ for the full block. K-FAC matches Adam's per-step compute budget and converges in fewer steps on many problems.
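
A hedged sketch of the resulting per-layer update for one fully-connected layer, with damping added to both factors. The layer sizes, batch, damping, and learning rate are illustrative assumptions; under a column-major vec convention, $(A \otimes G)^{-1}\,\mathrm{vec}(\nabla W) = \mathrm{vec}(G^{-1}\,\nabla W\,A^{-1})$, which is what the two small solves below compute:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_out = 128, 64, 32

a = rng.standard_normal((batch, d_in))     # layer inputs (activations)
g = rng.standard_normal((batch, d_out))    # gradients w.r.t. pre-activations
grad_W = g.T @ a / batch                   # gradient for weight matrix W (d_out x d_in)

damping = 1e-2
A = a.T @ a / batch + damping * np.eye(d_in)    # input-side Kronecker factor  E[a a^T]
G = g.T @ g / batch + damping * np.eye(d_out)   # output-side Kronecker factor E[g g^T]

# (A kron G)^{-1} vec(grad_W) = vec(G^{-1} grad_W A^{-1}): two small solves
# instead of inverting the full (d_in*d_out) x (d_in*d_out) Fisher block.
update = np.linalg.solve(G, grad_W) @ np.linalg.inv(A)

W = np.zeros((d_out, d_in))
W -= 0.1 * update                          # illustrative learning rate
```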

Shampoo — multi-layer Kronecker preconditioning

Shampoo: Preconditioned Stochastic Tensor Optimization (Gupta, Koren & Singer, ICML 2018) extends Kronecker-factored preconditioning to general tensor parameters. For a weight tensor with axes $i, j, k, \dots$, maintain one preconditioner per axis and combine them as a Kronecker product. Recent work (Distributed Shampoo, Anil et al., 2020) ports the method to large-batch industrial training and has produced state-of-the-art results on language and recommendation models. The question of whether second-order methods are ever worth it for deep nets is currently being contested by Shampoo and its descendants.
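
A hedged sketch of the Shampoo update for a matrix-shaped parameter, with the inverse fourth roots computed by eigendecomposition; the shapes, damping, synthetic gradients, and learning rate are illustrative assumptions, and real implementations amortise the root computation over many steps:

```python
import numpy as np

def inv_root(M, p, eps=1e-6):
    """Inverse p-th root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag((vals + eps) ** (-1.0 / p)) @ vecs.T

rng = np.random.default_rng(0)
d_out, d_in = 32, 64
W = rng.standard_normal((d_out, d_in)) * 0.01

# Accumulated per-axis preconditioners, one per tensor axis (here: rows and columns)
L = 1e-4 * np.eye(d_out)
R = 1e-4 * np.eye(d_in)

for step in range(100):
    G = rng.standard_normal((d_out, d_in))   # stand-in for a stochastic gradient
    L += G @ G.T                             # row-axis statistics
    R += G.T @ G                             # column-axis statistics
    # Shampoo step: precondition each axis with the inverse 4th root of its statistics
    W -= 0.01 * inv_root(L, 4) @ G @ inv_root(R, 4)
```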

When second-order pays off

Practical heuristics:

  • Small-to-medium models, full-batch — L-BFGS is hard to beat.
  • Stochastic deep training — Adam/AdamW are usually faster wall-clock than K-FAC/Shampoo despite needing more steps.
  • Very large models with massive batch sizes — Shampoo and its variants close the gap with Adam and sometimes win, particularly when wall-clock is dominated by communication rather than compute.
  • RL / KL-constrained optimisation — natural gradient (and its descendants TRPO/PPO) is the standard.
