
Multivariate Calculus & Gradients

Modern ML is gradient descent on high-dimensional surfaces. This page covers the calculus you need to follow that statement: gradients of multivariable functions, the Jacobian and Hessian, the chain rule in matrix notation, and the geometric reading that makes every later optimisation result intuitive.

Gradients

For a scalar function $f:\mathbb{R}^n \to \mathbb{R}$, the gradient is the vector of partial derivatives:

$$\nabla f(x) = \left(\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n}\right).$$

The geometric reading: $\nabla f(x)$ points in the direction of steepest ascent, and its magnitude is the rate of change in that direction. The directional derivative along a unit vector $u$ is $\nabla f \cdot u$, maximised when $u \parallel \nabla f$.

Gradient descent $x_{t+1} = x_t - \eta\,\nabla f(x_t)$ steps opposite the gradient. Convergence depends on the local curvature of $f$: flat regions converge slowly, while sharply curved regions need small step sizes.
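
As a concrete check, here is a minimal NumPy sketch of gradient descent on a toy quadratic; the matrix `A`, vector `b`, step size, and iteration count are made up for illustration. The iterates converge to the point where the gradient vanishes, which for this objective is the solution of $Ax = b$.

```python
import numpy as np

# Toy objective: f(x) = 1/2 x^T A x - b^T x, with gradient A x - b.
# A is symmetric positive definite, so f has a unique minimiser.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])

def grad_f(x):
    return A @ x - b

x = np.zeros(2)
eta = 0.1                 # step size; too large diverges, too small crawls
for _ in range(500):
    x = x - eta * grad_f(x)

print(x)                          # gradient descent iterate after 500 steps
print(np.linalg.solve(A, b))      # exact minimiser, for comparison
```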

Jacobians

For a vector function $f:\mathbb{R}^n \to \mathbb{R}^m$, the Jacobian is the $m \times n$ matrix of partial derivatives:

$$J_{ij} = \frac{\partial f_i}{\partial x_j}.$$

The first-order Taylor expansion is $f(x+\delta) \approx f(x) + J(x)\,\delta$. Jacobians are how you propagate uncertainty (covariance), how reverse-mode autodiff is defined (vector-Jacobian products), and what the implicit function theorem operates on.
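
A quick way to build intuition is to approximate the Jacobian with finite differences and then verify the first-order Taylor expansion. The toy map `f` and the helper `jacobian_fd` below are invented for illustration.

```python
import numpy as np

def f(x):
    # Toy map from R^2 to R^3, used only for illustration.
    return np.array([x[0] * x[1], np.sin(x[0]), x[0] + x[1] ** 2])

def jacobian_fd(f, x, eps=1e-6):
    """Finite-difference Jacobian: column j is (f(x + eps*e_j) - f(x)) / eps."""
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        J[:, j] = (f(x + e) - fx) / eps
    return J

x = np.array([0.8, -1.3])
J = jacobian_fd(f, x)

delta = 1e-3 * np.array([1.0, 2.0])
print(f(x + delta))        # exact value
print(f(x) + J @ delta)    # first-order Taylor approximation, close for small delta
```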

The chain rule, matrix-style

For $f = g \circ h$ with $h:\mathbb{R}^n \to \mathbb{R}^k$ and $g:\mathbb{R}^k \to \mathbb{R}^m$,

$$J_f(x) = J_g(h(x))\, J_h(x).$$

Composition of functions is multiplication of Jacobians. This is the source of backpropagation: each layer is a function in a chain, and the gradient of the loss with respect to early-layer parameters is a product of Jacobians from the loss back to that layer.
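
The identity is easy to verify numerically on small functions. The maps `h` and `g` and their hand-written Jacobians below are illustrative choices, nothing canonical; the product of the two Jacobians matches a finite-difference Jacobian of the composition.

```python
import numpy as np

# h: R^2 -> R^3 and g: R^3 -> R^2; the Jacobian of g ∘ h is J_g(h(x)) @ J_h(x).
def h(x):
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

def J_h(x):
    return np.array([[x[1],          x[0]],
                     [np.cos(x[0]),  0.0],
                     [0.0,           2 * x[1]]])

def g(y):
    return np.array([y[0] + y[1] * y[2], np.exp(y[2])])

def J_g(y):
    return np.array([[1.0, y[2], y[1]],
                     [0.0, 0.0,  np.exp(y[2])]])

x = np.array([0.5, 1.5])
J_composed = J_g(h(x)) @ J_h(x)          # chain rule: multiply Jacobians

# Finite-difference Jacobian of the composition, as an independent check.
eps = 1e-6
J_fd = np.column_stack([(g(h(x + eps * e)) - g(h(x))) / eps for e in np.eye(2)])
print(np.max(np.abs(J_composed - J_fd)))  # should be tiny
```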

Hessians and second-order behaviour

The Hessian of a scalar $f$ is the matrix of second partial derivatives, $H_{ij} = \partial^2 f / \partial x_i \partial x_j$. For twice-continuously-differentiable $f$, $H$ is symmetric (Schwarz's theorem). The second-order Taylor expansion is

$$f(x+\delta) \approx f(x) + \nabla f(x)^\top \delta + \tfrac{1}{2}\,\delta^\top H(x)\,\delta.$$
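
A numerical sanity check, using a made-up smooth function with hand-derived gradient and Hessian: for a small step, the second-order expansion tracks $f$ much more closely than the first-order one.

```python
import numpy as np

def f(x):
    return np.exp(x[0]) * np.sin(x[1])

def grad(x):
    return np.array([np.exp(x[0]) * np.sin(x[1]),
                     np.exp(x[0]) * np.cos(x[1])])

def hess(x):
    return np.array([[np.exp(x[0]) * np.sin(x[1]),  np.exp(x[0]) * np.cos(x[1])],
                     [np.exp(x[0]) * np.cos(x[1]), -np.exp(x[0]) * np.sin(x[1])]])

x = np.array([0.3, 1.1])
delta = np.array([0.05, -0.02])

exact  = f(x + delta)
first  = f(x) + grad(x) @ delta
second = first + 0.5 * delta @ hess(x) @ delta
print(abs(exact - first), abs(exact - second))   # second-order error is much smaller
```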

The eigenvalues of H classify the local geometry:

  • All positive → local minimum.
  • All negative → local maximum.
  • Mixed signs → saddle point.
  • Some zero → degenerate (analyse higher-order terms).

In high-dimensional non-convex landscapes (deep networks), saddle points dominate critical points (Dauphin et al., 2014). Most stuck-at-zero-gradient situations are not local minima — they are saddle points that SGD escapes on its own through stochastic noise.
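
A minimal sketch of the classification, assuming nothing beyond NumPy: the toy function below has a critical point at the origin whose Hessian (approximated here by central differences) has mixed-sign eigenvalues, i.e. a saddle.

```python
import numpy as np

def f(x):
    # Critical point at the origin: a minimum along x[0], a maximum along x[1].
    return x[0] ** 2 - x[1] ** 2

def hessian_fd(f, x, eps=1e-4):
    """Central-difference Hessian; fine for a small toy example."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = eps
            ej = np.zeros(n); ej[j] = eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps ** 2)
    return H

H = hessian_fd(f, np.zeros(2))
print(np.linalg.eigvalsh(H))   # ≈ [-2, 2]: mixed signs, so the origin is a saddle point
```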

Useful gradient identities

Memorise these — they appear everywhere; a numerical check follows the list:

  • $\nabla_x (a^\top x) = a$.
  • $\nabla_x (x^\top A x) = (A + A^\top)\,x$. For symmetric $A$, this is $2Ax$.
  • $\nabla_x \|x\|_2^2 = 2x$.
  • $\nabla_x \|Ax - b\|_2^2 = 2A^\top(Ax - b)$, the OLS gradient (see OLS).
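
All four identities can be verified against a central-difference gradient. The sketch below uses random test vectors and matrices, and the helper `grad_fd` exists only for this check.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
x = rng.standard_normal(n)
a = rng.standard_normal(n)
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

def grad_fd(f, x, eps=1e-6):
    """Central-difference gradient, used only to check the closed-form identities."""
    return np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(n)])

print(np.allclose(grad_fd(lambda x: a @ x, x), a))                       # ∇(aᵀx) = a
print(np.allclose(grad_fd(lambda x: x @ A @ x, x), (A + A.T) @ x))       # ∇(xᵀAx) = (A+Aᵀ)x
print(np.allclose(grad_fd(lambda x: x @ x, x), 2 * x))                   # ∇‖x‖² = 2x
print(np.allclose(grad_fd(lambda x: np.sum((A @ x - b) ** 2), x),
                  2 * A.T @ (A @ x - b)))                                # ∇‖Ax−b‖² = 2Aᵀ(Ax−b)
```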

Forward-mode and reverse-mode autodiff

Two ways to compute gradients of compositions:

  • Forward mode — evaluate $f$ and a Jacobian-vector product $Jv$ in one pass. Cost is $O(n \cdot \mathrm{cost}(f))$ for an $n$-input function. The right choice when there are few inputs (e.g., a physics simulation with a few parameters).
  • Reverse mode — evaluate $f$ once forward (storing intermediates), then run a second pass backward to compute a vector-Jacobian product $v^\top J$. Cost is $O(\mathrm{cost}(f))$ per output. The right choice when there are few outputs (e.g., a scalar loss). This is backpropagation.

For ML, reverse mode wins: the loss is a single scalar, the parameters number in the millions, and reverse mode yields all of their gradients for roughly the cost of one extra forward pass.
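
To make the mechanics concrete, here is a toy reverse-mode sketch on a scalar expression graph. The `Var` class and the operations on it are invented for illustration and are nothing like a production autodiff implementation; the point is that a single backward pass yields the gradient with respect to every input.

```python
import math

class Var:
    """Toy scalar node: stores a value, an accumulated gradient, and its parents."""
    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self.parents = parents          # list of (parent_var, local_derivative)

    def backward(self, seed=1.0):
        # Accumulate d(output)/d(self), then push seed * local derivative to each parent.
        self.grad += seed
        for parent, local in self.parents:
            parent.backward(seed * local)

def add(a, b):
    return Var(a.value + b.value, [(a, 1.0), (b, 1.0)])

def mul(a, b):
    return Var(a.value * b.value, [(a, b.value), (b, a.value)])

def sin(a):
    return Var(math.sin(a.value), [(a, math.cos(a.value))])

# f(x, y) = sin(x * y) + x
x, y = Var(1.2), Var(0.7)
out = add(sin(mul(x, y)), x)
out.backward()                          # one backward pass fills in both gradients
print(x.grad, y.grad)                   # ∂f/∂x = y·cos(xy) + 1, ∂f/∂y = x·cos(xy)
```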

Released under the MIT License. Content imported and adapted from NoteNextra.