Multivariate Calculus & Gradients
Modern ML is gradient descent on high-dimensional surfaces. This page covers the calculus you need to follow that statement: gradients of multivariable functions, the Jacobian and Hessian, the chain rule in matrix notation, and the geometric reading that makes every later optimisation result intuitive.
Gradients
For a scalar function $f:\mathbb{R}^n \to \mathbb{R}$, the gradient collects all first partial derivatives into a vector: $\nabla f(x) = \big(\partial f/\partial x_1, \dots, \partial f/\partial x_n\big)^\top$.
The geometric reading: $\nabla f(x)$ points in the direction of steepest ascent at $x$, its norm $\|\nabla f(x)\|$ is the rate of increase in that direction, and it is orthogonal to the level set of $f$ through $x$.
Gradient descent therefore steps against the gradient, $x_{t+1} = x_t - \eta\,\nabla f(x_t)$, with learning rate $\eta > 0$.
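A minimal NumPy sketch of both ideas: a finite-difference check that the analytic gradient matches the definition, then a few descent steps. The quadratic $f(x) = \tfrac12 x^\top A x - b^\top x$, the step size, and the helper names are illustrative choices, not anything prescribed above.

```python
import numpy as np

# Illustrative quadratic f(x) = 0.5 x^T A x - b^T x, with known gradient A x - b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
b = np.array([1.0, -1.0])

def f(x):
    return 0.5 * x @ A @ x - b @ x

def grad_f(x):
    return A @ x - b

def numerical_grad(f, x, eps=1e-6):
    # Central-difference gradient: one coordinate perturbation per entry.
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

x = np.array([2.0, 2.0])
assert np.allclose(grad_f(x), numerical_grad(f, x), atol=1e-5)

# Gradient descent: x_{t+1} = x_t - eta * grad f(x_t).
eta = 0.1
for _ in range(100):
    x = x - eta * grad_f(x)

print(x, np.linalg.solve(A, b))  # iterates approach the minimiser A^{-1} b
```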
Jacobians
For a vector function $f:\mathbb{R}^n \to \mathbb{R}^m$, the Jacobian $J_f(x) \in \mathbb{R}^{m \times n}$ has entries $\big(J_f(x)\big)_{ij} = \partial f_i/\partial x_j$; row $i$ is the transposed gradient of the $i$-th output.
The first-order Taylor expansion is $f(x + \delta) \approx f(x) + J_f(x)\,\delta$: near $x$, the function behaves like the linear map $J_f(x)$.
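A quick numerical check of both statements, assuming NumPy; the particular map $f(x) = (x_1 x_2,\ \sin x_1)$ and the `jacobian` helper are illustrative choices.

```python
import numpy as np

def f(x):
    # Illustrative map R^2 -> R^2.
    return np.array([x[0] * x[1], np.sin(x[0])])

def jacobian(f, x, eps=1e-6):
    # Numerical Jacobian: column j holds df/dx_j via central differences.
    m, n = len(f(x)), len(x)
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n); e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

x = np.array([1.0, 2.0])
J = jacobian(f, x)            # analytically: [[x2, x1], [cos(x1), 0]]
print(J)

# First-order Taylor: f(x + delta) is close to f(x) + J delta for small delta.
delta = 1e-3 * np.array([1.0, -1.0])
print(f(x + delta) - (f(x) + J @ delta))   # residual is O(||delta||^2)
```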
The chain rule, matrix-style
For a composition $h = g \circ f$ with $f:\mathbb{R}^n \to \mathbb{R}^m$ and $g:\mathbb{R}^m \to \mathbb{R}^k$, the chain rule in matrix form is $J_h(x) = J_g(f(x))\,J_f(x)$.
Composition of functions is multiplication of Jacobians. This is the source of backpropagation: each layer is a function in a chain, and the gradient of the loss with respect to early-layer parameters is a product of Jacobians from the loss back to that layer.
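A small check that the Jacobian of a composition is the product of the layer Jacobians, reusing the central-difference `jacobian` helper from the previous sketch; the two maps are arbitrary illustrative choices.

```python
import numpy as np

def jacobian(f, x, eps=1e-6):
    # Central-difference Jacobian, as in the previous sketch.
    m, n = len(f(x)), len(x)
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n); e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

f = lambda x: np.array([x[0] + x[1] ** 2, x[0] * x[1]])    # R^2 -> R^2
g = lambda y: np.array([np.tanh(y[0]), y[0] * y[1], y[1]])  # R^2 -> R^3
h = lambda x: g(f(x))                                       # the composition

x = np.array([0.5, -1.0])
J_chain = jacobian(g, f(x)) @ jacobian(f, x)   # product of layer Jacobians
J_direct = jacobian(h, x)                      # Jacobian of the composition
print(np.max(np.abs(J_chain - J_direct)))      # ~0, up to finite-difference error
```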
Hessians and second-order behaviour
The Hessian of a scalar function $f:\mathbb{R}^n \to \mathbb{R}$ is the $n \times n$ matrix of second partial derivatives, $H(x)_{ij} = \partial^2 f/\partial x_i\,\partial x_j$. It is symmetric for twice continuously differentiable $f$, and it supplies the quadratic term $\tfrac12\,\delta^\top H(x)\,\delta$ in the second-order Taylor expansion.
The eigenvalues of $H$ at a critical point (a point where $\nabla f = 0$) classify it:
- All positive → local minimum.
- All negative → local maximum.
- Mixed signs → saddle point.
- Some zero → degenerate (analyse higher-order terms).
In high-dimensional non-convex landscapes (deep networks), saddle points dominate critical points (Dauphin et al., 2014). Most stuck-at-zero-gradient situations are not local minima — they are saddle points that SGD escapes on its own through stochastic noise.
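For concreteness, here is a numerical Hessian at the critical point of the classic saddle $f(x, y) = x^2 - y^2$; the function and the `hessian` helper are illustrative, not from the text above.

```python
import numpy as np

def f(x):
    return x[0] ** 2 - x[1] ** 2   # classic saddle: critical point at the origin

def hessian(f, x, eps=1e-4):
    # Numerical Hessian via central differences on each pair of coordinates.
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = eps
            ej = np.zeros(n); ej[j] = eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps ** 2)
    return H

H = hessian(f, np.zeros(2))
print(np.linalg.eigvalsh(H))   # approximately [-2, 2]: mixed signs, a saddle point
```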
Useful gradient identities
Memorise these — they appear everywhere:
- $\nabla_x\,(a^\top x) = a$.
- $\nabla_x\,(x^\top A x) = (A + A^\top)\,x$. For symmetric $A$, this is $2Ax$.
- $\nabla_x\,\|x\|_2^2 = 2x$.
- $\nabla_w\,\|Xw - y\|_2^2 = 2\,X^\top (Xw - y)$ — the OLS gradient (see OLS).
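Each identity is easy to sanity-check against finite differences; the random $A$, $X$, $y$, and the `numerical_grad` helper below are just test scaffolding.

```python
import numpy as np

def numerical_grad(f, x, eps=1e-6):
    # Central-difference gradient of a scalar function.
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
n, d = 5, 3
A = rng.normal(size=(d, d))
a = rng.normal(size=d)
x = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)

assert np.allclose(numerical_grad(lambda x: a @ x, x), a, atol=1e-5)
assert np.allclose(numerical_grad(lambda x: x @ A @ x, x), (A + A.T) @ x, atol=1e-4)
assert np.allclose(numerical_grad(lambda x: x @ x, x), 2 * x, atol=1e-5)
assert np.allclose(numerical_grad(lambda w: np.sum((X @ w - y) ** 2), w),
                   2 * X.T @ (X @ w - y), atol=1e-4)
print("all identities check out")
```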
Forward-mode and reverse-mode autodiff
Two ways to compute gradients of compositions:
- Forward mode — evaluate $f(x)$ and a Jacobian-vector product $J_f(x)\,v$ in one pass. Recovering the full gradient costs one pass per input, so $n$ passes for an $n$-input function. Right when there are few inputs (e.g., a physics simulation with a few parameters).
- Reverse mode — evaluate $f(x)$ once forward (storing intermediates), then run a second pass backward to compute the vector-Jacobian product $v^\top J_f(x)$. Cost is one backward pass per output. Right when there are few outputs (e.g., a scalar loss). This is backpropagation.
For ML, reverse mode wins: the loss is a single scalar, the parameters number in the millions, and reverse mode gives every parameter's gradient for roughly the cost of one extra forward pass.
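A hand-rolled sketch of the two modes on the tiny chain $L(w) = \tfrac12\|\tanh(Xw) - y\|^2$. The chain, the shapes, and the helper names are all illustrative; finite differences stand in for a true forward-mode JVP, and real systems use an autodiff library.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_out = 4, 3
X = rng.normal(size=(n_out, n_in))
y = rng.normal(size=n_out)
w = rng.normal(size=n_in)

def loss(w):
    return 0.5 * np.sum((np.tanh(X @ w) - y) ** 2)

# Forward mode: one directional derivative (JVP) per pass, so the full gradient
# of an n-input function needs n passes, one per basis direction v = e_i.
def jvp(w, v, eps=1e-7):
    # Finite differences stand in for a true JVP in this sketch.
    return (loss(w + eps * v) - loss(w - eps * v)) / (2 * eps)

grad_forward = np.array([jvp(w, e) for e in np.eye(n_in)])   # n_in passes

# Reverse mode: one forward pass storing intermediates, then one backward pass
# chaining vector-Jacobian products from the scalar loss back to w.
def grad_reverse(w):
    z = X @ w                      # forward pass, intermediates kept
    a = np.tanh(z)
    r = a - y
    bar_z = r * (1.0 - a ** 2)     # dL/dz = dL/da * da/dz
    return X.T @ bar_z             # dL/dw: one backward pass gives all gradients

print(np.max(np.abs(grad_forward - grad_reverse(w))))  # agree up to finite-difference error
```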
What to read next
- Linear Algebra Recap — Jacobians and Hessians live in matrix notation.
- Convex Optimization — first- and second-order conditions for the smooth-convex case.
- Backpropagation — reverse-mode autodiff applied to neural networks.