
Backpropagation

Backpropagation is the algorithm that makes deep learning possible. It is, mathematically, just reverse-mode automatic differentiation applied to a chain of differentiable operations: the chain rule, organised so that one forward pass computes the outputs and one backward pass computes gradients with respect to every parameter. The cost is a small constant multiple of the forward pass, regardless of how many parameters there are.

The setup

A feed-forward network is a sequence of operations

$$h_\ell = f_\ell(h_{\ell-1}; \theta_\ell), \qquad \ell = 1, \dots, L,$$

with $h_0 = x$ the input, $\hat{y} = h_L$ the prediction, and $\mathcal{L}(\hat{y}, y)$ the scalar loss. We want $\partial \mathcal{L} / \partial \theta_\ell$ for every layer.

Forward and backward passes

The forward pass computes and stores each intermediate $h_\ell$. The backward pass propagates the gradient of the loss with respect to each layer's output:

$$\frac{\partial \mathcal{L}}{\partial h_{\ell-1}} = \frac{\partial \mathcal{L}}{\partial h_\ell} \frac{\partial f_\ell}{\partial h_{\ell-1}}, \qquad \frac{\partial \mathcal{L}}{\partial \theta_\ell} = \frac{\partial \mathcal{L}}{\partial h_\ell} \frac{\partial f_\ell}{\partial \theta_\ell}.$$

Both quantities are vector–Jacobian products; the Jacobian matrices themselves are never formed explicitly. This is what makes the cost a small constant times the forward pass: each layer needs only the incoming output gradient and its locally stored activations to produce the input gradient and the parameter gradient.
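A minimal sketch of this two-pass structure, in NumPy only (the `Linear` and `Tanh` layer classes here are illustrative, not any framework's API): the forward loop caches what each layer will need, and the backward loop hands each layer the output gradient and collects parameter gradients as it goes.

```python
import numpy as np

class Linear:
    """Affine layer h = W x + b; caches x for the backward pass."""
    def __init__(self, n_in, n_out, rng):
        self.W = rng.standard_normal((n_out, n_in)) * 0.1
        self.b = np.zeros(n_out)

    def forward(self, x):
        self.x = x                       # store activation for backward
        return self.W @ x + self.b

    def backward(self, g):
        self.dW = np.outer(g, self.x)    # dL/dW, a vector-Jacobian product
        self.db = g                      # dL/db
        return self.W.T @ g              # dL/dx, passed to the layer below

class Tanh:
    def forward(self, x):
        self.y = np.tanh(x)
        return self.y

    def backward(self, g):
        return g * (1.0 - self.y ** 2)   # element-wise Jacobian of tanh

rng = np.random.default_rng(0)
layers = [Linear(4, 8, rng), Tanh(), Linear(8, 1, rng)]
x, y = rng.standard_normal(4), np.array([1.0])

h = x
for layer in layers:                     # forward pass: compute and store
    h = layer.forward(h)

g = 2.0 * (h - y)                        # dL/dh_L for a squared-error loss
for layer in reversed(layers):           # backward pass: one VJP per layer
    g = layer.backward(g)
```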

Worked example: one MLP layer

For a layer $h = \phi(Wx + b)$, write $z = Wx + b$ and let $g = \partial \mathcal{L} / \partial h$ flow in from above. Then

$$\frac{\partial \mathcal{L}}{\partial z} = g \odot \phi'(z), \qquad \frac{\partial \mathcal{L}}{\partial W} = \frac{\partial \mathcal{L}}{\partial z} x^\top, \qquad \frac{\partial \mathcal{L}}{\partial b} = \frac{\partial \mathcal{L}}{\partial z}, \qquad \frac{\partial \mathcal{L}}{\partial x} = W^\top \frac{\partial \mathcal{L}}{\partial z}.$$

The implementation reads off directly: store $x$ and $z$ during the forward pass, and the backward pass is one element-wise product and two matrix multiplications.
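These four formulas can also be checked numerically. A short NumPy sketch (with a toy loss $\mathcal{L} = \tfrac{1}{2}\lVert h \rVert^2$ chosen so that $g = h$; all names here are illustrative) compares the analytic $\partial \mathcal{L}/\partial W$ against a central finite difference:

```python
import numpy as np

rng = np.random.default_rng(1)
W, b = rng.standard_normal((3, 5)), rng.standard_normal(3)
x = rng.standard_normal(5)
phi, dphi = np.tanh, lambda z: 1.0 - np.tanh(z) ** 2

def loss(W):
    return 0.5 * np.sum(phi(W @ x + b) ** 2)

# Backward pass, reading off the four formulas above
z = W @ x + b
g = phi(z)                   # dL/dh = h for this toy loss
dz = g * dphi(z)             # dL/dz = g * phi'(z), element-wise
dW = np.outer(dz, x)         # dL/dW = dz x^T
db = dz                      # dL/db = dz
dx = W.T @ dz                # dL/dx = W^T dz

# Central finite difference on one entry of W
eps = 1e-6
E = np.zeros_like(W); E[1, 2] = eps
numeric = (loss(W + E) - loss(W - E)) / (2 * eps)
assert np.isclose(numeric, dW[1, 2])
```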

Computational graphs and autodiff

Modern frameworks (PyTorch, JAX, TensorFlow) build a computational graph of forward operations and produce backward code automatically. The two paradigms:

  • Define-by-run (PyTorch's autograd) — build the graph dynamically as the forward pass executes.
  • Define-then-run / tracing (JAX grad, TF tf.function) — trace the function once, compile a static graph, run.

Both implement reverse-mode autodiff and reduce to backpropagation when the graph is a feed-forward chain. The autodiff abstraction generalises to RNN unrolls (truncated BPTT), recursive networks, and arbitrary control flow.
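For comparison with the manual NumPy layer above, here is the same toy computation under define-by-run autodiff, assuming PyTorch is installed: the forward expressions record the graph as they execute, and `loss.backward()` runs the reverse-mode sweep, leaving $\partial\mathcal{L}/\partial W$ and $\partial\mathcal{L}/\partial b$ in the `.grad` fields.

```python
import torch

W = torch.randn(3, 5, requires_grad=True)
b = torch.randn(3, requires_grad=True)
x = torch.randn(5)

h = torch.tanh(W @ x + b)          # forward pass builds the graph as it runs
loss = 0.5 * (h ** 2).sum()
loss.backward()                    # reverse-mode sweep over the recorded graph

print(W.grad.shape, b.grad.shape)  # dL/dW, dL/db: same values as the manual formulas
```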

Failure modes: vanishing and exploding gradients

For a deep stack, the chain rule multiplies many Jacobians:

$$\frac{\partial \mathcal{L}}{\partial h_0} = \frac{\partial \mathcal{L}}{\partial h_L} \prod_{\ell=1}^{L} \frac{\partial f_\ell}{\partial h_{\ell-1}}.$$

If the operator norm of each Jacobian is consistently above 1, gradients explode; if consistently below 1, they vanish. The two structural fixes that made deep training possible, ReLU activations (whose Jacobian eigenvalues are 0 or 1, see activations) and residual connections (whose Jacobian is $I + \partial f / \partial h$, naturally close to the identity), both target this product. Gradient clipping is the standard runtime patch for explosions, particularly in RNNs.
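The effect is easy to see numerically. This sketch uses synthetic Jacobians (random orthogonal matrices scaled to a fixed operator norm, not a real network's Jacobians) and pushes a gradient vector backward through 100 of them:

```python
import numpy as np

rng = np.random.default_rng(2)
g = rng.standard_normal(64)

for scale, label in [(0.9, "vanishing"), (1.1, "exploding")]:
    v = g.copy()
    for _ in range(100):  # 100 stacked layers
        # Random orthogonal matrix scaled so its operator norm is `scale`
        J = scale * np.linalg.qr(rng.standard_normal((64, 64)))[0]
        v = J.T @ v       # one backward vector-Jacobian product per layer
        # Gradient clipping would rescale here, e.g. for some max_norm:
        # v *= min(1.0, max_norm / np.linalg.norm(v))
    print(f"{label}: |grad| after 100 layers = {np.linalg.norm(v):.3e}")
```

With an operator norm of 0.9 the gradient norm shrinks by roughly $0.9^{100} \approx 3 \times 10^{-5}$; at 1.1 it grows by roughly $1.1^{100} \approx 10^{4}$.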
