
Normalization Layers

A normalization layer rescales activations so that downstream layers see a stable distribution regardless of how the upstream weights drift during training. The four canonical variants — Batch, Layer, Instance, and Group Norm — differ only in which axes they normalise over. The right choice depends on architecture and batch regime.

Batch Normalization

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (Ioffe, Szegedy, ICML 2015) was the first and most consequential normalisation technique. For a feature map $x \in \mathbb{R}^{N \times C \times H \times W}$, BN normalises along $(N, H, W)$ — across the batch and spatial dimensions — independently for each channel:

$$
\mu_c = \frac{1}{NHW}\sum_{n,h,w} x_{n,c,h,w},
\qquad
\sigma_c^2 = \frac{1}{NHW}\sum_{n,h,w}\left(x_{n,c,h,w} - \mu_c\right)^2,
$$

$$
\hat{x}_{n,c,h,w} = \frac{x_{n,c,h,w} - \mu_c}{\sqrt{\sigma_c^2 + \epsilon}},
\qquad
y_{n,c,h,w} = \gamma_c\,\hat{x}_{n,c,h,w} + \beta_c.
$$

Per-channel learned scale $\gamma_c$ and shift $\beta_c$ restore representational power.
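As a sanity check, the per-channel statistics can be computed by hand and compared against PyTorch's nn.BatchNorm2d; a minimal sketch (shapes and tolerances are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(8, 3, 16, 16)                 # (N, C, H, W)

# Manual BN: one mean/variance per channel, pooled over (N, H, W).
mu = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), keepdim=True, unbiased=False)
x_hat = (x - mu) / torch.sqrt(var + 1e-5)

bn = nn.BatchNorm2d(3)                        # gamma=1, beta=0 at init, eps=1e-5
bn.train()                                    # use batch statistics
torch.testing.assert_close(bn(x), x_hat, atol=1e-5, rtol=1e-4)
```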

Why it works: BN keeps the optimisation landscape locally smooth — How Does Batch Normalization Help Optimization? (Santurkar et al., NeurIPS 2018) showed the original "internal covariate shift" framing is misleading and that BN's actual effect is to bound the loss-landscape curvature.

Limitations: BN's per-batch statistics break under tiny batches (e.g., per-GPU batch of 2 in detection) and at inference, where the model uses moving-average statistics from training. Both motivate the alternatives below.
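The train/inference asymmetry is easy to see in code: in eval() mode the layer ignores the current batch and uses the running averages accumulated during training. A small sketch (the data distribution is made up for illustration):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(3)                        # running stats updated with momentum 0.1

# "Training": feed batches with per-channel mean of about 1 and variance of about 4,
# so the running averages converge toward those values.
for _ in range(200):
    bn(torch.randn(32, 3, 8, 8) * 2.0 + 1.0)

bn.eval()                                     # switch to running statistics
y = bn(torch.randn(2, 3, 8, 8))               # a tiny batch is fine now
print(bn.running_mean)                        # roughly [1., 1., 1.]
print(bn.running_var)                         # roughly [4., 4., 4.]
```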

Layer Normalization

Layer Normalization (Ba, Kiros, Hinton, 2016) normalises along $(C, H, W)$ instead — across all features within a single example:

$$
\mu_n = \frac{1}{CHW}\sum_{c,h,w} x_{n,c,h,w},
\qquad
\sigma_n^2 = \frac{1}{CHW}\sum_{c,h,w}\left(x_{n,c,h,w} - \mu_n\right)^2.
$$

LayerNorm has no batch dependence, so train and inference behave identically and small batches don't hurt. It is the universal default in Transformers — every major LLM and ViT uses LayerNorm (or a close variant) inside each block.
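The same hand check works for LayerNorm; this sketch assumes a Transformer-style (batch, seq_len, d_model) layout, where the normalised axis is the feature dimension:

```python
import torch
import torch.nn as nn

x = torch.randn(4, 128, 512)                  # (batch, seq_len, d_model)

# Manual LN: one mean/variance per token, over the feature dimension.
mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
x_hat = (x - mu) / torch.sqrt(var + 1e-5)

ln = nn.LayerNorm(512)                        # gamma=1, beta=0 at init
torch.testing.assert_close(ln(x), x_hat, atol=1e-5, rtol=1e-4)
```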

Instance and Group Normalization

  • Instance Norm (Ulyanov et al., 2016) normalises per-example and per-channel: along $(H, W)$ only. Used in style-transfer networks where per-image style statistics matter.
  • Group Norm (Wu, He, ECCV 2018) groups channels and normalises within each group — along $(H, W, C/g)$. Recovers BN's per-channel structure without the batch dependency, and is the standard recommendation when the batch size per device is small, e.g. ≤ 4 (detection, segmentation, video); see the sketch below.
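A short sketch of how the group count interpolates between the schemes above (shapes are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(2, 32, 16, 16)                # per-device batch of 2: too small for BN

gn = nn.GroupNorm(num_groups=8, num_channels=32)  # 4 channels per group
y = gn(x)                                     # statistics pooled over (H, W, C/g)

# The group count interpolates between the other schemes:
ln_like = nn.GroupNorm(1, 32)                 # one group -> LayerNorm over (C, H, W)
in_like = nn.GroupNorm(32, 32)                # C groups  -> InstanceNorm, per channel
```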

RMSNorm — modern Transformers

Root Mean Square Layer Normalization (Zhang, Sennrich, NeurIPS 2019) drops the mean-centring step from LayerNorm:

$$
y = \gamma \odot \frac{x}{\sqrt{\frac{1}{d}\sum_{i} x_i^2 + \epsilon}}.
$$

Empirically matches LayerNorm at slightly lower compute. LLaMA, Qwen, and most modern open LLMs use RMSNorm.
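A minimal RMSNorm module following the formula above (a sketch, not any particular model's implementation; the epsilon sits inside the square root, matching the equation):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """y = gamma * x / sqrt(mean(x^2) + eps): no mean-centring, no bias."""

    def __init__(self, d: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(d))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalise by the root mean square over the feature dimension.
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.gamma * (x / rms)

norm = RMSNorm(512)
print(norm(torch.randn(4, 128, 512)).shape)   # torch.Size([4, 128, 512])
```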

Pre-norm vs post-norm

A Transformer block can place normalisation as $y = x + f(\mathrm{Norm}(x))$ (pre-norm) or $y = \mathrm{Norm}(x + f(x))$ (post-norm), and the placement matters at scale. Pre-norm lets gradients flow through the residual unchanged and is what makes training stable for very deep Transformers. The original Attention is All You Need used post-norm; every modern LLM uses pre-norm.
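Schematically, with Norm and f standing for any normalisation layer and sublayer (attention or MLP):

```python
# Pre-norm: the residual path is an identity, so gradients reach early
# layers without passing through any normalisation layer.
def prenorm_block(x, f, norm):
    return x + f(norm(x))

# Post-norm: the whole block output (residual included) is renormalised,
# which can attenuate gradients in very deep stacks.
def postnorm_block(x, f, norm):
    return norm(x + f(x))
```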

Practical defaults

  • CNN classification, batch size ≥ 16 — BatchNorm.
  • CNN with small batches (detection, segmentation, video) — GroupNorm.
  • Transformers / LLMs — pre-norm RMSNorm or LayerNorm.
  • Style transfer / GANs — InstanceNorm (or AdaIN and other conditional variants).

Related

  • Dropout — the older noise-based regulariser; partially redundant with BN.
  • Weight Initialization — normalisation reduces but does not remove the need for good init.
  • Transformer (LLM) — pre-norm RMSNorm is part of the modern architecture.
