
Multivariate Calculus & Gradients

Modern ML is gradient descent on high-dimensional surfaces. This page covers the calculus you need to follow that statement: gradients of multivariable functions, the Jacobian and Hessian, the chain rule in matrix notation, and the geometric reading that makes every later optimisation result intuitive.

Gradients

For a scalar function $f:\mathbb{R}^n \to \mathbb{R}$, the gradient is the vector of partial derivatives:

$$\nabla f(x) = \left(\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n}\right).$$

The geometric reading: $\nabla f(x)$ points in the direction of steepest ascent, and its magnitude is the rate of change in that direction. The directional derivative along a unit vector $u$ is $\nabla f \cdot u$, maximised when $u \parallel \nabla f$.

Gradient descent $x_{t+1} = x_t - \eta\,\nabla f(x_t)$ steps opposite the gradient. Convergence depends on the local curvature of $f$: flat regions converge slowly, while sharply curved regions need small step sizes.
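
As a concrete check, here is a minimal NumPy sketch of gradient descent on a toy quadratic; the matrix `A`, vector `b`, step size, and iteration count are made up for illustration. The iterates converge to the point where the gradient vanishes, which for this objective is the solution of $Ax = b$.

```python
import numpy as np

# Toy objective: f(x) = 1/2 x^T A x - b^T x, with gradient A x - b.
# A is symmetric positive definite, so f has a unique minimiser.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])

def grad_f(x):
    return A @ x - b

x = np.zeros(2)
eta = 0.1                 # step size; too large diverges, too small crawls
for _ in range(500):
    x = x - eta * grad_f(x)

print(x)                          # gradient descent iterate after 500 steps
print(np.linalg.solve(A, b))      # exact minimiser, for comparison
```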

Jacobians

For a vector function $f:\mathbb{R}^n \to \mathbb{R}^m$, the Jacobian is the $m \times n$ matrix of partial derivatives:

$$J_{ij} = \frac{\partial f_i}{\partial x_j}.$$

The first-order Taylor expansion is $f(x+\delta) \approx f(x) + J(x)\,\delta$. Jacobians are how you propagate uncertainty (covariance), how reverse-mode autodiff is defined (vector-Jacobian products), and what the implicit function theorem operates on.
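
A quick way to build intuition is to approximate the Jacobian with finite differences and then verify the first-order Taylor expansion. The toy map `f` and the helper `jacobian_fd` below are invented for illustration.

```python
import numpy as np

def f(x):
    # Toy map from R^2 to R^3, used only for illustration.
    return np.array([x[0] * x[1], np.sin(x[0]), x[0] + x[1] ** 2])

def jacobian_fd(f, x, eps=1e-6):
    """Finite-difference Jacobian: column j is (f(x + eps*e_j) - f(x)) / eps."""
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        J[:, j] = (f(x + e) - fx) / eps
    return J

x = np.array([0.8, -1.3])
J = jacobian_fd(f, x)

delta = 1e-3 * np.array([1.0, 2.0])
print(f(x + delta))        # exact value
print(f(x) + J @ delta)    # first-order Taylor approximation, close for small delta
```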

The chain rule, matrix-style

For $f = g \circ h$ with $h:\mathbb{R}^n \to \mathbb{R}^k$ and $g:\mathbb{R}^k \to \mathbb{R}^m$,

$$J_f(x) = J_g(h(x))\, J_h(x).$$

Composition of functions is multiplication of Jacobians. This is the source of backpropagation: each layer is a function in a chain, and the gradient of the loss with respect to early-layer parameters is a product of Jacobians from the loss back to that layer.
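
The identity is easy to verify numerically on small functions. The maps `h` and `g` and their hand-written Jacobians below are illustrative choices, nothing canonical; the product of the two Jacobians matches a finite-difference Jacobian of the composition.

```python
import numpy as np

# h: R^2 -> R^3 and g: R^3 -> R^2; the Jacobian of g ∘ h is J_g(h(x)) @ J_h(x).
def h(x):
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

def J_h(x):
    return np.array([[x[1],          x[0]],
                     [np.cos(x[0]),  0.0],
                     [0.0,           2 * x[1]]])

def g(y):
    return np.array([y[0] + y[1] * y[2], np.exp(y[2])])

def J_g(y):
    return np.array([[1.0, y[2], y[1]],
                     [0.0, 0.0,  np.exp(y[2])]])

x = np.array([0.5, 1.5])
J_composed = J_g(h(x)) @ J_h(x)          # chain rule: multiply Jacobians

# Finite-difference Jacobian of the composition, as an independent check.
eps = 1e-6
J_fd = np.column_stack([(g(h(x + eps * e)) - g(h(x))) / eps for e in np.eye(2)])
print(np.max(np.abs(J_composed - J_fd)))  # should be tiny
```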

Hessians and second-order behaviour

The Hessian of a scalar $f$ is the matrix of second partial derivatives, $H_{ij} = \partial^2 f / \partial x_i \partial x_j$. For twice-continuously-differentiable $f$, $H$ is symmetric (Schwarz's theorem). The second-order Taylor expansion is

$$f(x+\delta) \approx f(x) + \nabla f(x)^\top \delta + \tfrac{1}{2}\,\delta^\top H(x)\,\delta.$$
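
A numerical sanity check, using a made-up smooth function with hand-derived gradient and Hessian: for a small step, the second-order expansion tracks $f$ much more closely than the first-order one.

```python
import numpy as np

def f(x):
    return np.exp(x[0]) * np.sin(x[1])

def grad(x):
    return np.array([np.exp(x[0]) * np.sin(x[1]),
                     np.exp(x[0]) * np.cos(x[1])])

def hess(x):
    return np.array([[np.exp(x[0]) * np.sin(x[1]),  np.exp(x[0]) * np.cos(x[1])],
                     [np.exp(x[0]) * np.cos(x[1]), -np.exp(x[0]) * np.sin(x[1])]])

x = np.array([0.3, 1.1])
delta = np.array([0.05, -0.02])

exact  = f(x + delta)
first  = f(x) + grad(x) @ delta
second = first + 0.5 * delta @ hess(x) @ delta
print(abs(exact - first), abs(exact - second))   # second-order error is much smaller
```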

The eigenvalues of H classify the local geometry:

  • All positive → local minimum.
  • All negative → local maximum.
  • Mixed signs → saddle point.
  • Some zero → degenerate (analyse higher-order terms).

In high-dimensional non-convex landscapes (deep networks), saddle points dominate critical points (Dauphin et al., 2014). Most stuck-at-zero-gradient situations are not local minima — they are saddle points that SGD escapes on its own through stochastic noise.
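
A minimal sketch of the classification, assuming nothing beyond NumPy: the toy function below has a critical point at the origin whose Hessian (approximated here by central differences) has mixed-sign eigenvalues, i.e. a saddle.

```python
import numpy as np

def f(x):
    # Critical point at the origin: a minimum along x[0], a maximum along x[1].
    return x[0] ** 2 - x[1] ** 2

def hessian_fd(f, x, eps=1e-4):
    """Central-difference Hessian; fine for a small toy example."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = eps
            ej = np.zeros(n); ej[j] = eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps ** 2)
    return H

H = hessian_fd(f, np.zeros(2))
print(np.linalg.eigvalsh(H))   # ≈ [-2, 2]: mixed signs, so the origin is a saddle point
```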

Useful gradient identities

Memorise these — they appear everywhere; a numerical check follows the list:

  • $\nabla_x (a^\top x) = a$.
  • $\nabla_x (x^\top A x) = (A + A^\top)\,x$. For symmetric $A$, this is $2Ax$.
  • $\nabla_x \|x\|_2^2 = 2x$.
  • $\nabla_x \|Ax - b\|_2^2 = 2A^\top(Ax - b)$, the OLS gradient (see OLS).
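
All four identities can be verified against a central-difference gradient. The sketch below uses random test vectors and matrices, and the helper `grad_fd` exists only for this check.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
x = rng.standard_normal(n)
a = rng.standard_normal(n)
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

def grad_fd(f, x, eps=1e-6):
    """Central-difference gradient, used only to check the closed-form identities."""
    return np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(n)])

print(np.allclose(grad_fd(lambda x: a @ x, x), a))                       # ∇(aᵀx) = a
print(np.allclose(grad_fd(lambda x: x @ A @ x, x), (A + A.T) @ x))       # ∇(xᵀAx) = (A+Aᵀ)x
print(np.allclose(grad_fd(lambda x: x @ x, x), 2 * x))                   # ∇‖x‖² = 2x
print(np.allclose(grad_fd(lambda x: np.sum((A @ x - b) ** 2), x),
                  2 * A.T @ (A @ x - b)))                                # ∇‖Ax−b‖² = 2Aᵀ(Ax−b)
```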

Forward-mode and reverse-mode autodiff

Two ways to compute gradients of compositions:

  • Forward mode — evaluate $f$ and a Jacobian-vector product $Jv$ in one pass. Cost is $O(n \cdot \mathrm{cost}(f))$ for an $n$-input function. The right choice when there are few inputs (e.g., a physics simulation with a few parameters).
  • Reverse mode — evaluate $f$ once forward (storing intermediates), then run a second pass backward to compute a vector-Jacobian product $v^\top J$. Cost is $O(\mathrm{cost}(f))$ per output. The right choice when there are few outputs (e.g., a scalar loss). This is backpropagation.

For ML, reverse mode wins: the loss is a single scalar, the parameters number in the millions, and reverse mode yields all of their gradients for roughly the cost of one extra forward pass.
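
To make the mechanics concrete, here is a toy reverse-mode sketch on a scalar expression graph. The `Var` class and the operations on it are invented for illustration and are nothing like a production autodiff implementation; the point is that a single backward pass yields the gradient with respect to every input.

```python
import math

class Var:
    """Toy scalar node: stores a value, an accumulated gradient, and its parents."""
    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self.parents = parents          # list of (parent_var, local_derivative)

    def backward(self, seed=1.0):
        # Accumulate d(output)/d(self), then push seed * local derivative to each parent.
        self.grad += seed
        for parent, local in self.parents:
            parent.backward(seed * local)

def add(a, b):
    return Var(a.value + b.value, [(a, 1.0), (b, 1.0)])

def mul(a, b):
    return Var(a.value * b.value, [(a, b.value), (b, a.value)])

def sin(a):
    return Var(math.sin(a.value), [(a, math.cos(a.value))])

# f(x, y) = sin(x * y) + x
x, y = Var(1.2), Var(0.7)
out = add(sin(mul(x, y)), x)
out.backward()                          # one backward pass fills in both gradients
print(x.grad, y.grad)                   # ∂f/∂x = y·cos(xy) + 1, ∂f/∂y = x·cos(xy)
```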

Released under the MIT License. Content imported and adapted from NoteNextra.