Normalization Layers
A normalization layer rescales activations so that downstream layers see a stable distribution regardless of how the upstream weights drift during training. The four canonical variants — Batch, Layer, Instance, and Group Norm — differ only in which axes they normalise over. The right choice depends on architecture and batch regime.
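As a quick orientation, here is a minimal sketch (PyTorch assumed; not from the original text) of which axes each variant reduces over for an activation tensor of shape $(N, C, H, W)$; the `dim` tuples are the only thing that changes between variants:

```python
import torch

x = torch.randn(8, 32, 16, 16)  # (N, C, H, W) activations

# Same transform everywhere, (x - mean) / sqrt(var + eps); the variants
# differ only in the axes the statistics are computed over:
bn_mean = x.mean(dim=(0, 2, 3), keepdim=True)  # Batch Norm: per channel, across the batch
ln_mean = x.mean(dim=(1, 2, 3), keepdim=True)  # Layer Norm: per example, across all features
in_mean = x.mean(dim=(2, 3), keepdim=True)     # Instance Norm: per example and per channel

# Group Norm: split C into groups (8 here, an arbitrary choice), reduce within each group
g = 8
gn_mean = x.view(8, g, 32 // g, 16, 16).mean(dim=(2, 3, 4), keepdim=True)
```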
Batch Normalization
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (Ioffe, Szegedy, ICML 2015) was the first and most consequential normalisation technique. For a feature map $x \in \mathbb{R}^{N \times C \times H \times W}$, BN computes a mean $\mu_c$ and variance $\sigma_c^2$ per channel over the $(N, H, W)$ axes and normalises $\hat{x} = (x - \mu_c)/\sqrt{\sigma_c^2 + \varepsilon}$. A per-channel learned scale $\gamma_c$ and shift $\beta_c$ then restore representational capacity: $y = \gamma_c \hat{x} + \beta_c$.
Why it works: BN keeps the optimisation landscape locally smooth — How Does Batch Normalization Help Optimization? (Santurkar et al., NeurIPS 2018) showed the original "internal covariate shift" framing is misleading and that BN's actual effect is to bound the loss-landscape curvature.
Limitations: BN's per-batch statistics break under tiny batches (e.g., per-GPU batch of 2 in detection) and at inference, where the model uses moving-average statistics from training. Both motivate the alternatives below.
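A small sketch of that train/inference asymmetry, assuming PyTorch's `nn.BatchNorm2d`: in training mode it normalises with the current batch's statistics while updating running averages, and in eval mode it switches to those running averages, so a tiny batch makes the two modes visibly disagree:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(num_features=32)  # one learned (gamma, beta) pair per channel
x = torch.randn(2, 32, 16, 16)        # per-GPU batch of 2, as in detection

bn.train()
y_train = bn(x)  # uses this batch's mean/var; running averages get updated

bn.eval()
y_eval = bn(x)   # uses the accumulated running mean/var instead

# With only 2 samples the batch statistics are noisy, so the outputs differ:
print((y_train - y_eval).abs().max())
```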
Layer Normalization
Layer Normalization (Ba, Kiros, Hinton, 2016) normalises along the feature axes of each example independently: over $(C, H, W)$ for a convolutional feature map, or over the hidden dimension $d$ for a token vector, with a learned elementwise scale $\gamma$ and shift $\beta$.
LayerNorm has no batch dependence, so train and inference behave identically and small batches don't hurt. It is the universal default in Transformers — every major LLM and ViT uses LayerNorm (or a close variant) inside each block.
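For the Transformer case, a sketch assuming a `(batch, seq_len, hidden)` activation tensor: `nn.LayerNorm` normalises every token vector over its last dimension, so the result is identical whatever the batch size:

```python
import torch
import torch.nn as nn

d_model = 64
ln = nn.LayerNorm(d_model)       # learned per-feature gamma and beta
x = torch.randn(4, 10, d_model)  # (batch, seq_len, hidden)

y_full = ln(x)
y_single = ln(x[:1])  # batch of 1

# Each token is normalised on its own, so batch composition is irrelevant:
print(torch.allclose(y_full[:1], y_single))  # True
```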
Instance and Group Normalization
- Instance Norm (Ulyanov et al., 2016) normalises per-example and per-channel: along $(H, W)$ only. Used in style-transfer networks where per-image style statistics matter.
- Group Norm (Wu, He, ECCV 2018) groups channels and normalises within each group — along $(H, W)$ together with the channels in the group. Recovers BN's per-channel structure without the batch dependency. The standard recommendation when batch size per device is small, e.g. 1–8 (detection, segmentation, video); see the sketch after this list.
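A sketch of how the two relate, assuming PyTorch's `nn.GroupNorm` and `nn.InstanceNorm2d`: at the extremes, one group per channel reproduces Instance Norm, and a single group normalises over all of $(C, H, W)$ at once:

```python
import torch
import torch.nn as nn

C = 32
x = torch.randn(4, C, 16, 16)

gn = nn.GroupNorm(num_groups=8, num_channels=C)  # stats over (H, W) plus 4 channels per group

# GroupNorm spans the two extremes:
per_channel = nn.GroupNorm(num_groups=C, num_channels=C)  # behaves like Instance Norm
one_group = nn.GroupNorm(num_groups=1, num_channels=C)    # normalises over all of (C, H, W)

print(torch.allclose(per_channel(x), nn.InstanceNorm2d(C)(x), atol=1e-5))  # True at identity affine init

# None of these reduce over the batch axis, so per-device batch size never matters.
```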
RMSNorm — modern Transformers
Root Mean Square Layer Normalization (Zhang, Sennrich, NeurIPS 2019) drops the mean-centring step from LayerNorm: $y = \gamma \odot \frac{x}{\mathrm{RMS}(x)}$, with $\mathrm{RMS}(x) = \sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \varepsilon}$.
Empirically matches LayerNorm at slightly lower compute. LLaMA, Qwen, and most modern open LLMs use RMSNorm.
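A minimal module sketch of the formula above (recent PyTorch versions also ship an `nn.RMSNorm`, but the whole computation fits in a few lines):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned scale; no shift, no centring

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Divide by the root-mean-square over the last dimension
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

y = RMSNorm(64)(torch.randn(2, 10, 64))  # same shapes as LayerNorm, slightly less work
```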
Pre-norm vs post-norm
In a Transformer block, post-norm (the original arrangement) normalises after the residual addition, $x \leftarrow \mathrm{Norm}(x + \mathrm{Sublayer}(x))$, while pre-norm normalises the sublayer input, $x \leftarrow x + \mathrm{Sublayer}(\mathrm{Norm}(x))$. Pre-norm leaves an unmodified identity path through the residual stream, which lets deep stacks train stably without careful learning-rate warmup; essentially all modern LLMs use pre-norm.
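A wiring sketch of the two orderings (the `sublayer` and `norm` arguments are placeholders for attention/MLP and a LayerNorm/RMSNorm instance):

```python
def post_norm_block(x, sublayer, norm):
    # Original Transformer: normalise after the residual addition
    return norm(x + sublayer(x))

def pre_norm_block(x, sublayer, norm):
    # Modern default: normalise only the sublayer input; the residual
    # path stays a clean identity, which stabilises deep stacks
    return x + sublayer(norm(x))
```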
Practical defaults
- CNN classification, batch size per device comfortably large (roughly $\geq 16$) — BatchNorm.
- CNN with small batches (detection, segmentation, video) — GroupNorm.
- Transformers / LLMs — pre-norm RMSNorm or LayerNorm.
- Style transfer / GANs — InstanceNorm (or AdaIN, conditional variants).
What to read next
- Dropout — the older noise-based regulariser; partially redundant with BN.
- Weight Initialization — normalisation reduces but does not remove the need for good init.
- Transformer (LLM) — pre-norm RMSNorm is part of the modern architecture.