Weight Initialization
How you initialise a network's weights determines whether backpropagation sees usable gradients at the first step. The two failure modes are dual: weights too small produce vanishing forward activations and gradients; weights too large produce exploding activations. Modern initialisations target a specific invariant — keep the variance of activations and gradients constant across layers — and derive the right scale from the activation function.
What we want: signal preservation
For a linear layer $y = Wx$ with fan-in $n_{\text{in}}$, independent zero-mean weights of variance $\sigma^2$, and independent inputs, each output has variance $\operatorname{Var}(y_i) = n_{\text{in}}\,\sigma^2\,\operatorname{Var}(x)$. Keeping the forward activation variance constant across layers therefore requires $n_{\text{in}}\,\sigma^2 = 1$.
For the backward pass the same argument runs through $W^\top$ with fan-out $n_{\text{out}}$, so keeping gradient variance constant requires $n_{\text{out}}\,\sigma^2 = 1$. The two conditions conflict whenever $n_{\text{in}} \neq n_{\text{out}}$.
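A minimal numerical sketch of this argument (PyTorch assumed; the width, depth, and batch size are arbitrary choices): push unit-variance inputs through a stack of linear layers and watch the variance for weights scaled below, at, and above $1/\sqrt{n_{\text{in}}}$.

```python
import torch

n, depth = 512, 30
x0 = torch.randn(1024, n)  # batch of unit-variance inputs

for label, sigma in [("too small (0.1/sqrt(n))", 0.1 / n**0.5),
                     ("variance-preserving (1/sqrt(n))", 1.0 / n**0.5),
                     ("too large (3/sqrt(n))", 3.0 / n**0.5)]:
    x = x0
    for _ in range(depth):
        W = torch.randn(n, n) * sigma  # entries with variance sigma^2
        x = x @ W.T                    # variance scales by n * sigma^2 per layer
    print(f"{label:32s} variance after {depth} layers: {x.var().item():.3e}")
```

Only the middle setting keeps the output variance near 1; the other two collapse towards zero or blow up within a few tens of layers.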
Xavier (Glorot) initialisation
Understanding the difficulty of training deep feedforward neural networks (Glorot, Bengio, AISTATS 2010) proposed a compromise between the forward and backward conditions:

$$\sigma^2 = \frac{2}{n_{\text{in}} + n_{\text{out}}}$$
Sampled from a uniform distribution this becomes

$$W \sim \mathcal{U}\!\left[-\sqrt{\tfrac{6}{n_{\text{in}} + n_{\text{out}}}},\ \sqrt{\tfrac{6}{n_{\text{in}} + n_{\text{out}}}}\right],$$

since a uniform distribution on $[-a, a]$ has variance $a^2/3$.
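As a sketch of what this looks like in code (PyTorch assumed; the layer sizes are arbitrary), the formula can be applied by hand or via the built-in helper:

```python
import math
import torch
import torch.nn as nn

layer = nn.Linear(256, 128)

# By hand: Var(W) = 2 / (fan_in + fan_out), uniform bound sqrt(6 / (fan_in + fan_out)).
fan_in, fan_out = layer.in_features, layer.out_features
bound = math.sqrt(6.0 / (fan_in + fan_out))
with torch.no_grad():
    layer.weight.uniform_(-bound, bound)
    layer.bias.zero_()

# Equivalent built-in; gain=1.0 (the default) reproduces the formula above.
nn.init.xavier_uniform_(layer.weight, gain=1.0)
```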
He (Kaiming) initialisation
Delving Deep into Rectifiers (He, Zhang, Ren, Sun, ICCV 2015) extended the analysis to ReLU. Because ReLU zeroes half its input, it halves the activation variance, and compensating requires doubling the weight variance:

$$\sigma^2 = \frac{2}{n_{\text{in}}}, \qquad W \sim \mathcal{N}\!\left(0,\ \tfrac{2}{n_{\text{in}}}\right).$$
He init is the default for ReLU networks and is built into PyTorch's `kaiming_normal_` / `kaiming_uniform_`. The same analysis extended to leaky ReLU with negative slope $a$ replaces the factor $2$ by $2/(1 + a^2)$.
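A short sketch of both calls (the conv shape and the negative slope 0.01 are arbitrary choices):

```python
import torch.nn as nn

conv = nn.Conv2d(64, 128, kernel_size=3, padding=1)

# ReLU: Var(W) = 2 / fan_in
nn.init.kaiming_normal_(conv.weight, mode="fan_in", nonlinearity="relu")

# Leaky ReLU with negative slope a: the factor 2 becomes 2 / (1 + a^2),
# which PyTorch folds into the gain via the `a` argument.
nn.init.kaiming_normal_(conv.weight, a=0.01, mode="fan_in", nonlinearity="leaky_relu")

if conv.bias is not None:
    nn.init.zeros_(conv.bias)
```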
Orthogonal init for RNNs
Recurrent networks apply the same weight matrix $W_{hh}$ at every time step, so the hidden state is repeatedly multiplied by it: directions with singular values above 1 explode and directions below 1 vanish over long sequences. Initialising $W_{hh}$ as an orthogonal matrix makes every singular value exactly 1, so repeated application preserves the norm of the signal at the start of training. PyTorch exposes this as `orthogonal_`.
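A sketch of how this is typically applied to PyTorch's `nn.RNN`, whose recurrent matrices are the parameters named `weight_hh_*` (the sizes are arbitrary):

```python
import torch.nn as nn

rnn = nn.RNN(input_size=128, hidden_size=256, num_layers=2)

for name, param in rnn.named_parameters():
    if "weight_hh" in name:         # recurrent (hidden-to-hidden) matrix
        nn.init.orthogonal_(param)  # all singular values equal 1
    elif "weight_ih" in name:       # input-to-hidden matrix
        nn.init.xavier_uniform_(param)
    elif "bias" in name:
        nn.init.zeros_(param)
```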
Residual / Transformer initialisation
For residual networks the branch output is added onto the identity path, so activation variance grows with depth even when every individual layer preserves variance. Two common fixes: initialise the last layer of each residual branch to zero, so every block starts as the identity, or scale the residual-branch weights by $1/\sqrt{N}$, where $N$ is the number of residual layers (the scheme used by GPT-2). Transformer stacks typically combine a small Gaussian init (std $\approx 0.02$) with one of these residual corrections and a learning-rate warmup.
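A sketch of a GPT-2-style scheme under these assumptions; the function name and the `is_residual_proj` flag are illustrative, not from any particular library:

```python
import math
import torch.nn as nn

def init_transformer_linear(layer: nn.Linear, num_residual_layers: int,
                            is_residual_proj: bool = False) -> None:
    std = 0.02
    if is_residual_proj:
        # Each block adds to the residual stream, so variance grows roughly
        # linearly with depth; scaling by 1/sqrt(N) keeps it roughly constant.
        std /= math.sqrt(num_residual_layers)
    nn.init.normal_(layer.weight, mean=0.0, std=std)
    if layer.bias is not None:
        nn.init.zeros_(layer.bias)
```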
Practical defaults
- ReLU/LeakyReLU MLPs and CNNs — He normal/uniform with the right fan mode (`fan_in` for typical training, `fan_out` for transposed conv).
- Tanh/Sigmoid networks — Xavier (Glorot).
- Vanilla RNN recurrent matrix — orthogonal.
- Transformer / large LM — small Gaussian (~0.02 std), residual-branch zero init, LR warmup.
Biases default to zero everywhere, except the LSTM forget-gate bias, which is initialised to 1 so the cell retains its memory early in training; a sketch combining these defaults follows below.
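A sketch that gathers these defaults into one function for `Module.apply` (the module choices and sizes are illustrative), including the forget-gate bias trick:

```python
import torch
import torch.nn as nn

def init_weights(m: nn.Module) -> None:
    if isinstance(m, (nn.Linear, nn.Conv2d)):
        # He init for ReLU-family networks; swap in xavier_uniform_ for tanh/sigmoid.
        nn.init.kaiming_normal_(m.weight, mode="fan_in", nonlinearity="relu")
        if m.bias is not None:
            nn.init.zeros_(m.bias)
    elif isinstance(m, nn.LSTM):
        for name, param in m.named_parameters():
            if "weight_hh" in name:
                nn.init.orthogonal_(param)
            elif "weight_ih" in name:
                nn.init.xavier_uniform_(param)
            elif "bias" in name:
                nn.init.zeros_(param)
                if "bias_hh" in name:
                    # PyTorch orders LSTM gates (input, forget, cell, output);
                    # set the forget-gate slice to 1 so memory is kept early on.
                    hidden = m.hidden_size
                    with torch.no_grad():
                        param[hidden:2 * hidden].fill_(1.0)

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
model.apply(init_weights)
```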
What to read next
- Activation Functions — the choice of nonlinearity determines the right init.
- Backpropagation — the algorithm that initialisation feeds gradients into.
- Normalization — batch/layer norm makes init less critical, but does not replace it.