Backpropagation
Backpropagation is the algorithm that makes deep learning possible. Mathematically, it is just reverse-mode automatic differentiation applied to a chain of differentiable operations: one forward pass computes and caches the intermediate outputs, and one backward pass applies the chain rule to obtain the gradient of the loss with respect to every parameter. The total cost is a small constant factor times the cost of the forward pass, no matter how many parameters the network has.
The setup
A feed-forward network is a sequence of operations

$$a^{(l)} = f^{(l)}\!\big(a^{(l-1)}, \theta^{(l)}\big), \qquad l = 1, \dots, L,$$

with $a^{(0)} = x$ the input, $\hat{y} = a^{(L)}$ the output, and a scalar loss $\mathcal{L}(\hat{y}, y)$ at the end. Training needs $\partial \mathcal{L} / \partial \theta^{(l)}$ for every layer.
Forward and backward passes
The forward pass computes and stores each intermediate $a^{(l)}$. The backward pass then runs the chain in reverse, propagating the gradient of the loss from the output back toward the input:

$$\frac{\partial \mathcal{L}}{\partial a^{(l-1)}} = \left(\frac{\partial f^{(l)}}{\partial a^{(l-1)}}\right)^{\!\top} \frac{\partial \mathcal{L}}{\partial a^{(l)}}, \qquad \frac{\partial \mathcal{L}}{\partial \theta^{(l)}} = \left(\frac{\partial f^{(l)}}{\partial \theta^{(l)}}\right)^{\!\top} \frac{\partial \mathcal{L}}{\partial a^{(l)}}.$$

Both quantities are vector–Jacobian products; no explicit Jacobian matrix is ever formed. This is what keeps the cost at a small constant times the forward pass: each layer needs only the gradient with respect to its output and its locally stored activations to produce the gradient with respect to its input and its parameters.
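In code, the backward pass is just a reverse loop over the layers, each one turning the gradient with respect to its output into the gradient with respect to its input plus its parameter gradients. The sketch below assumes a hypothetical layer object with `forward`/`backward` methods; it shows the shape of the loop, not any framework's actual API.

```python
def run_forward_backward(layers, x, d_loss_d_output):
    """Sketch of backprop over a chain of layers (layer interface is hypothetical)."""
    # Forward pass: each layer caches whatever its backward step will need.
    activation = x
    for layer in layers:
        activation = layer.forward(activation)

    # Backward pass: one vector-Jacobian product per layer, walking the chain in reverse.
    grad = d_loss_d_output(activation)            # dL/d(final activation)
    parameter_grads = []
    for layer in reversed(layers):
        grad, layer_param_grad = layer.backward(grad)
        parameter_grads.append(layer_param_grad)

    parameter_grads.reverse()                     # restore front-to-back layer order
    return activation, parameter_grads
```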
Worked example: one MLP layer
For a layer $z = W a + b$, $h = \phi(z)$, with upstream gradient $g = \partial \mathcal{L} / \partial h$, the backward step is

$$\frac{\partial \mathcal{L}}{\partial z} = g \odot \phi'(z), \qquad \frac{\partial \mathcal{L}}{\partial W} = \frac{\partial \mathcal{L}}{\partial z}\, a^{\top}, \qquad \frac{\partial \mathcal{L}}{\partial b} = \frac{\partial \mathcal{L}}{\partial z}, \qquad \frac{\partial \mathcal{L}}{\partial a} = W^{\top} \frac{\partial \mathcal{L}}{\partial z}.$$

The implementation reads off directly: storing $a$ and $z$ (or $\phi'(z)$) during the forward pass is everything the backward pass needs.
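A minimal NumPy sketch of these formulas, using tanh as the activation and a throwaway loss $\mathcal{L} = \sum_i h_i$ so the upstream gradient is all ones; the finite-difference comparison at the end is only a sanity check, not part of backprop itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# One layer: z = W a + b, h = phi(z) with phi = tanh, and toy loss L = sum(h).
W = rng.standard_normal((3, 4))
b = rng.standard_normal(3)
a = rng.standard_normal(4)

def forward(W, b, a):
    z = W @ a + b
    h = np.tanh(z)
    return h, z                        # z is cached for the backward pass

h, z = forward(W, b, a)
g = np.ones_like(h)                    # dL/dh for L = sum(h)

# Backward pass, straight from the formulas above.
dz = g * (1.0 - np.tanh(z) ** 2)       # dL/dz = g * phi'(z)
dW = np.outer(dz, a)                   # dL/dW = dL/dz . a^T
db = dz                                # dL/db = dL/dz
da = W.T @ dz                          # dL/da = W^T . dL/dz

# Finite-difference check on one weight entry.
eps = 1e-6
W_pert = W.copy()
W_pert[1, 2] += eps
numeric = (forward(W_pert, b, a)[0].sum() - h.sum()) / eps
print(dW[1, 2], numeric)               # should agree to roughly 1e-5
```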
Computational graphs and autodiff
Modern frameworks (PyTorch, JAX, TensorFlow) build a computational graph of forward operations and produce backward code automatically. The two paradigms:
- Define-by-run (PyTorch's autograd) — build the graph dynamically as the forward pass executes.
- Define-then-run / tracing (JAX's `grad`, TensorFlow's `tf.function`) — trace the function once, compile a static graph, then run it.
Both implement reverse-mode autodiff and reduce to backpropagation when the graph is a feed-forward chain. The autodiff abstraction generalises to RNN unrolls (truncated BPTT), recursive networks, and arbitrary control flow.
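A hedged sketch of the same gradient computed in both styles (assuming `torch` and `jax` are installed; shapes and values are arbitrary illustration choices). First, define-by-run with PyTorch's autograd:

```python
import torch

# Define-by-run: the graph is recorded as these lines execute.
W = torch.randn(3, 4, requires_grad=True)
x = torch.randn(4)
loss = torch.tanh(W @ x).sum()
loss.backward()                  # reverse-mode sweep over the recorded graph
print(W.grad.shape)              # torch.Size([3, 4])
```

And the same computation with JAX's tracing transform:

```python
import jax
import jax.numpy as jnp

# Define-then-run: jax.grad traces f once, then differentiates the traced graph.
def f(W, x):
    return jnp.tanh(W @ x).sum()

W = jnp.ones((3, 4))
x = jnp.ones(4)
dW = jax.grad(f)(W, x)           # gradient w.r.t. the first argument
print(dW.shape)                  # (3, 4)
```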
Failure modes: vanishing and exploding gradients
For a deep stack, the chain rule multiplies many Jacobians:

$$\frac{\partial \mathcal{L}}{\partial a^{(0)}} = \left(\frac{\partial a^{(1)}}{\partial a^{(0)}}\right)^{\!\top} \cdots \left(\frac{\partial a^{(L)}}{\partial a^{(L-1)}}\right)^{\!\top} \frac{\partial \mathcal{L}}{\partial a^{(L)}}.$$

If the operator norm of each Jacobian is consistently below 1, the product shrinks exponentially with depth and gradients vanish; if it is consistently above 1, the product grows exponentially and gradients explode. Either way, early layers receive gradient signals that are far too small or far too large to train on.
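A small numeric sketch of this effect, modelling each layer's Jacobian as a random matrix scaled so that, on average, one vector–Jacobian product multiplies the gradient norm by roughly `scale` (the 0.9 / 1.0 / 1.1 factors and the depth are arbitrary illustration choices):

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 50, 64
g = rng.standard_normal(width)              # stand-in for dL/da^(L)

for scale in (0.9, 1.0, 1.1):
    grad = g.copy()
    for _ in range(depth):
        # Entries have variance scale^2 / width, so a VJP scales the norm by ~scale.
        W = rng.standard_normal((width, width)) * scale / np.sqrt(width)
        grad = W.T @ grad                   # one VJP per layer
    print(scale, np.linalg.norm(grad))      # shrinks, stays O(1), or blows up with depth
```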
What to read next
- Activation Functions — supplies $\phi'(z)$ and shapes gradient flow.
- Weight Initialization — keeps the per-layer Jacobian close to isometric at the start of training.
- SGD, Momentum, Nesterov — what consumes the gradients backprop produces.