
Backpropagation

Backpropagation is the algorithm that makes deep learning possible. It is, mathematically, just reverse-mode automatic differentiation applied to a chain of differentiable operations: the chain rule, organised so that one forward pass computes the outputs and one backward pass computes gradients with respect to every parameter. The cost is a small constant multiple of the forward pass, regardless of how many parameters there are.

The setup

A feed-forward network is a sequence of operations

$$h_\ell = f_\ell(h_{\ell-1}; \theta_\ell), \qquad \ell = 1, \dots, L,$$

with $h_0 = x$ the input, $\hat{y} = h_L$ the prediction, and $\mathcal{L}(\hat{y}, y)$ the scalar loss. We want $\partial \mathcal{L} / \partial \theta_\ell$ for every layer.

Forward and backward passes

The forward pass computes and stores each intermediate $h_\ell$. The backward pass propagates the gradient of the loss with respect to each layer's output:

$$\frac{\partial \mathcal{L}}{\partial h_{\ell-1}} = \frac{\partial \mathcal{L}}{\partial h_\ell} \frac{\partial f_\ell}{\partial h_{\ell-1}}, \qquad \frac{\partial \mathcal{L}}{\partial \theta_\ell} = \frac{\partial \mathcal{L}}{\partial h_\ell} \frac{\partial f_\ell}{\partial \theta_\ell}.$$

Both quantities are vector–Jacobian products; the Jacobian matrices themselves are never formed explicitly. This is what makes the cost a small constant times the forward pass: each layer needs only the incoming output gradient and its locally stored activations to produce the input gradient and the parameter gradient.
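A minimal sketch of this two-pass structure, in NumPy only (the `Linear` and `Tanh` layer classes here are illustrative, not any framework's API): the forward loop caches what each layer will need, and the backward loop hands each layer the output gradient and collects parameter gradients as it goes.

```python
import numpy as np

class Linear:
    """Affine layer h = W x + b; caches x for the backward pass."""
    def __init__(self, n_in, n_out, rng):
        self.W = rng.standard_normal((n_out, n_in)) * 0.1
        self.b = np.zeros(n_out)

    def forward(self, x):
        self.x = x                       # store activation for backward
        return self.W @ x + self.b

    def backward(self, g):
        self.dW = np.outer(g, self.x)    # dL/dW, a vector-Jacobian product
        self.db = g                      # dL/db
        return self.W.T @ g              # dL/dx, passed to the layer below

class Tanh:
    def forward(self, x):
        self.y = np.tanh(x)
        return self.y

    def backward(self, g):
        return g * (1.0 - self.y ** 2)   # element-wise Jacobian of tanh

rng = np.random.default_rng(0)
layers = [Linear(4, 8, rng), Tanh(), Linear(8, 1, rng)]
x, y = rng.standard_normal(4), np.array([1.0])

h = x
for layer in layers:                     # forward pass: compute and store
    h = layer.forward(h)

g = 2.0 * (h - y)                        # dL/dh_L for a squared-error loss
for layer in reversed(layers):           # backward pass: one VJP per layer
    g = layer.backward(g)
```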

Worked example: one MLP layer

For a layer $h = \phi(Wx + b)$, write $z = Wx + b$ and let $g = \partial \mathcal{L} / \partial h$ flow in from above. Then

$$\frac{\partial \mathcal{L}}{\partial z} = g \odot \phi'(z), \qquad \frac{\partial \mathcal{L}}{\partial W} = \frac{\partial \mathcal{L}}{\partial z} x^\top, \qquad \frac{\partial \mathcal{L}}{\partial b} = \frac{\partial \mathcal{L}}{\partial z}, \qquad \frac{\partial \mathcal{L}}{\partial x} = W^\top \frac{\partial \mathcal{L}}{\partial z}.$$

The implementation reads off directly: store $x$ and $z$ during the forward pass, and the backward pass is one element-wise product and two matrix multiplications.
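These four formulas can also be checked numerically. A short NumPy sketch (with a toy loss $\mathcal{L} = \tfrac{1}{2}\lVert h \rVert^2$ chosen so that $g = h$; all names here are illustrative) compares the analytic $\partial \mathcal{L}/\partial W$ against a central finite difference:

```python
import numpy as np

rng = np.random.default_rng(1)
W, b = rng.standard_normal((3, 5)), rng.standard_normal(3)
x = rng.standard_normal(5)
phi, dphi = np.tanh, lambda z: 1.0 - np.tanh(z) ** 2

def loss(W):
    return 0.5 * np.sum(phi(W @ x + b) ** 2)

# Backward pass, reading off the four formulas above
z = W @ x + b
g = phi(z)                   # dL/dh = h for this toy loss
dz = g * dphi(z)             # dL/dz = g * phi'(z), element-wise
dW = np.outer(dz, x)         # dL/dW = dz x^T
db = dz                      # dL/db = dz
dx = W.T @ dz                # dL/dx = W^T dz

# Central finite difference on one entry of W
eps = 1e-6
E = np.zeros_like(W); E[1, 2] = eps
numeric = (loss(W + E) - loss(W - E)) / (2 * eps)
assert np.isclose(numeric, dW[1, 2])
```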

Computational graphs and autodiff

Modern frameworks (PyTorch, JAX, TensorFlow) build a computational graph of forward operations and produce backward code automatically. The two paradigms:

  • Define-by-run (PyTorch's autograd) — build the graph dynamically as the forward pass executes.
  • Define-then-run / tracing (JAX grad, TF tf.function) — trace the function once, compile a static graph, run.

Both implement reverse-mode autodiff and reduce to backpropagation when the graph is a feed-forward chain. The autodiff abstraction generalises to RNN unrolls (truncated BPTT), recursive networks, and arbitrary control flow.
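For comparison with the manual NumPy layer above, here is the same toy computation under define-by-run autodiff, assuming PyTorch is installed: the forward expressions record the graph as they execute, and `loss.backward()` runs the reverse-mode sweep, leaving $\partial\mathcal{L}/\partial W$ and $\partial\mathcal{L}/\partial b$ in the `.grad` fields.

```python
import torch

W = torch.randn(3, 5, requires_grad=True)
b = torch.randn(3, requires_grad=True)
x = torch.randn(5)

h = torch.tanh(W @ x + b)          # forward pass builds the graph as it runs
loss = 0.5 * (h ** 2).sum()
loss.backward()                    # reverse-mode sweep over the recorded graph

print(W.grad.shape, b.grad.shape)  # dL/dW, dL/db: same values as the manual formulas
```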

Failure modes: vanishing and exploding gradients

For a deep stack, the chain rule multiplies many Jacobians:

$$\frac{\partial \mathcal{L}}{\partial h_0} = \frac{\partial \mathcal{L}}{\partial h_L} \prod_{\ell=1}^{L} \frac{\partial f_\ell}{\partial h_{\ell-1}}.$$

If the operator norm of each Jacobian is consistently above 1, gradients explode; if consistently below 1, they vanish. The two structural fixes that made deep training possible, ReLU activations (whose Jacobian eigenvalues are 0 or 1, see activations) and residual connections (whose Jacobian is $I + \partial f / \partial h$, naturally close to the identity), both target this product. Gradient clipping is the standard runtime patch for explosions, particularly in RNNs.
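The effect is easy to see numerically. This sketch uses synthetic Jacobians (random orthogonal matrices scaled to a fixed operator norm, not a real network's Jacobians) and pushes a gradient vector backward through 100 of them:

```python
import numpy as np

rng = np.random.default_rng(2)
g = rng.standard_normal(64)

for scale, label in [(0.9, "vanishing"), (1.1, "exploding")]:
    v = g.copy()
    for _ in range(100):  # 100 stacked layers
        # Random orthogonal matrix scaled so its operator norm is `scale`
        J = scale * np.linalg.qr(rng.standard_normal((64, 64)))[0]
        v = J.T @ v       # one backward vector-Jacobian product per layer
        # Gradient clipping would rescale here, e.g. for some max_norm:
        # v *= min(1.0, max_norm / np.linalg.norm(v))
    print(f"{label}: |grad| after 100 layers = {np.linalg.norm(v):.3e}")
```

With an operator norm of 0.9 the gradient norm shrinks by roughly $0.9^{100} \approx 3 \times 10^{-5}$; at 1.1 it grows by roughly $1.1^{100} \approx 10^{4}$.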
