The Transformer
The 2017 architecture that replaced both convolutions and recurrence in NLP, then in vision, then in essentially everything. The whole rest of this site is a chronicle of what people built on top of it.
The core idea
A self-attention layer turns a sequence of input vectors $x_1, \dots, x_n$ into a same-length sequence of outputs, each one a weighted combination of value vectors computed from the whole input:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where $Q = XW_Q$, $K = XW_K$, and $V = XW_V$ are learned linear projections of the stacked inputs $X$. The $\sqrt{d_k}$ scaling keeps the dot products from growing with the key dimension, which would otherwise push the softmax toward near-one-hot weights and vanishing gradients.
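A minimal NumPy sketch of that computation (toy sizes and random matrices stand in for learned weights; NumPy and the helper names here are illustrative, not from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention; X has shape (n, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v      # project inputs to queries, keys, values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (n, n) pairwise similarities
    weights = softmax(scores, axis=-1)       # each row is a distribution over positions
    return weights @ V                       # per-position weighted average of values

# Toy usage: 5 tokens, model width 8, head width 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (5, 4)
```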
Multi-head attention
Run $h$ attention heads in parallel, each with its own lower-dimensional projections $W_Q^i$, $W_K^i$, $W_V^i$; concatenate the head outputs and project back to the model dimension with $W_O$:

$$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W_O$$
Each head can specialise (e.g. one tracks subject–verb agreement, another tracks coreference); empirically the model exploits this without supervision.
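A sketch of multi-head attention in NumPy, extending the single-head version above (splitting the model dimension evenly across heads is the standard choice; the weight shapes and names here are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """X: (n, d_model); W_q, W_k, W_v, W_o: (d_model, d_model)."""
    n, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Split the model dimension into heads so each attends over its own slice.
    Q = Q.reshape(n, n_heads, d_head).transpose(1, 0, 2)   # (heads, n, d_head)
    K = K.reshape(n, n_heads, d_head).transpose(1, 0, 2)
    V = V.reshape(n, n_heads, d_head).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (heads, n, n)
    heads = softmax(scores, axis=-1) @ V                   # (heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # concatenate head outputs
    return concat @ W_o                                    # final output projection

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                                # 5 tokens, d_model = 8
W_q, W_k, W_v, W_o = (rng.normal(size=(8, 8)) for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads=2).shape)  # (5, 8)
```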
A Transformer block
Each block is two residual sub-layers:

x ← x + MHA(LayerNorm(x))
x ← x + FFN(LayerNorm(x))

with a position-wise feed-forward network $\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2$ applied identically at every position (the inner dimension is typically four times the model width).
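A sketch of one block in NumPy, reusing the multi_head_attention function from the sketch above (this follows the pre-LayerNorm ordering shown in the update rules; the 2017 paper applied LayerNorm after each residual instead):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each position's vector to zero mean, unit variance (learned scale/shift omitted).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ffn(x, W1, b1, W2, b2):
    # Position-wise feed-forward network: two linear maps with a ReLU in between.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def transformer_block(x, attn_weights, ffn_weights):
    """One block: attention sub-layer, then FFN sub-layer, each wrapped in a residual connection.
    attn_weights packs (W_q, W_k, W_v, W_o, n_heads); ffn_weights packs (W1, b1, W2, b2)."""
    x = x + multi_head_attention(layer_norm(x), *attn_weights)
    x = x + ffn(layer_norm(x), *ffn_weights)
    return x
```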
Positional information
Self-attention is permutation-equivariant — without help, the model can't tell "the dog bit the man" from "the man bit the dog". The original paper added sinusoidal positional encodings to the token embeddings; later variants (RoPE, ALiBi — see long-context) inject position directly into the attention computation instead.
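A NumPy sketch of the sinusoidal encodings from the original paper (even dimensions get a sine, odd dimensions a cosine, at wavelengths forming a geometric progression; the function name is illustrative):

```python
import numpy as np

def sinusoidal_positions(n_positions, d_model):
    """pe[pos, 2i] = sin(pos / 10000^(2i/d_model)); pe[pos, 2i+1] = cos of the same argument."""
    pos = np.arange(n_positions)[:, None]               # (n, 1)
    i = np.arange(d_model // 2)[None, :]                 # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)    # (n, d_model/2)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# The encodings are simply added to the token embeddings before the first block:
# x = token_embeddings + sinusoidal_positions(seq_len, d_model)
```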
Why it won
- Parallelism. Unlike RNNs, every position can be processed in parallel — a perfect fit for GPU hardware.
- Long-range dependencies. Information moves between any two positions in $O(1)$ layers, not $O(n)$ recurrent steps.
- Scaling. The architecture scales smoothly to billions of parameters and trillions of tokens — see Scaling Laws.
Reading list
- Attention Is All You Need — Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin, NeurIPS 2017.
What to read next
- Pre-training — what to do with a Transformer once you have one.
- Scaling Laws & Emergent Abilities — why bigger Transformers behave qualitatively differently.