The Transformer
The 2017 architecture that replaced both convolutions and recurrence in NLP, then in vision, then in essentially everything. The whole rest of this site is a chronicle of what people built on top of it.
The core idea
A self-attention layer turns a sequence of input vectors $x_1, \dots, x_n$ into a same-length sequence of outputs, each one a weighted combination of value vectors computed from the whole input:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where $Q = XW_Q$, $K = XW_K$, and $V = XW_V$ are learned linear projections of the stacked inputs $X$. The $\sqrt{d_k}$ scaling keeps the dot products from growing with the key dimension, which would otherwise push the softmax toward near-one-hot weights and vanishing gradients.
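A minimal NumPy sketch of that computation (toy sizes and random matrices stand in for learned weights; NumPy and the helper names here are illustrative, not from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention; X has shape (n, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v      # project inputs to queries, keys, values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (n, n) pairwise similarities
    weights = softmax(scores, axis=-1)       # each row is a distribution over positions
    return weights @ V                       # per-position weighted average of values

# Toy usage: 5 tokens, model width 8, head width 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (5, 4)
```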
Multi-head attention
Run $h$ attention heads in parallel, each with its own lower-dimensional projections $W_Q^i$, $W_K^i$, $W_V^i$; concatenate the head outputs and project back to the model dimension with $W_O$:

$$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W_O$$
Each head can specialise (e.g. one tracks subject–verb agreement, another tracks coreference); empirically the model exploits this without supervision.
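A sketch of multi-head attention in NumPy, extending the single-head version above (splitting the model dimension evenly across heads is the standard choice; the weight shapes and names here are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """X: (n, d_model); W_q, W_k, W_v, W_o: (d_model, d_model)."""
    n, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Split the model dimension into heads so each attends over its own slice.
    Q = Q.reshape(n, n_heads, d_head).transpose(1, 0, 2)   # (heads, n, d_head)
    K = K.reshape(n, n_heads, d_head).transpose(1, 0, 2)
    V = V.reshape(n, n_heads, d_head).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (heads, n, n)
    heads = softmax(scores, axis=-1) @ V                   # (heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # concatenate head outputs
    return concat @ W_o                                    # final output projection

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                                # 5 tokens, d_model = 8
W_q, W_k, W_v, W_o = (rng.normal(size=(8, 8)) for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads=2).shape)  # (5, 8)
```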
A Transformer block
Each block is two residual sub-layers:

x ← x + MHA(LayerNorm(x))
x ← x + FFN(LayerNorm(x))

with a position-wise feed-forward network $\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2$ applied identically at every position (the inner dimension is typically four times the model width).
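A sketch of one block in NumPy, reusing the multi_head_attention function from the sketch above (this follows the pre-LayerNorm ordering shown in the update rules; the 2017 paper applied LayerNorm after each residual instead):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each position's vector to zero mean, unit variance (learned scale/shift omitted).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ffn(x, W1, b1, W2, b2):
    # Position-wise feed-forward network: two linear maps with a ReLU in between.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def transformer_block(x, attn_weights, ffn_weights):
    """One block: attention sub-layer, then FFN sub-layer, each wrapped in a residual connection.
    attn_weights packs (W_q, W_k, W_v, W_o, n_heads); ffn_weights packs (W1, b1, W2, b2)."""
    x = x + multi_head_attention(layer_norm(x), *attn_weights)
    x = x + ffn(layer_norm(x), *ffn_weights)
    return x
```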
Positional information
Self-attention is permutation-equivariant — without help, the model can't tell "the dog bit the man" from "the man bit the dog". The original paper added sinusoidal positional encodings to the token embeddings; later variants (RoPE, ALiBi — see long-context) inject position directly into the attention computation instead.
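A NumPy sketch of the sinusoidal encodings from the original paper (even dimensions get a sine, odd dimensions a cosine, at wavelengths forming a geometric progression; the function name is illustrative):

```python
import numpy as np

def sinusoidal_positions(n_positions, d_model):
    """pe[pos, 2i] = sin(pos / 10000^(2i/d_model)); pe[pos, 2i+1] = cos of the same argument."""
    pos = np.arange(n_positions)[:, None]               # (n, 1)
    i = np.arange(d_model // 2)[None, :]                 # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)    # (n, d_model/2)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# The encodings are simply added to the token embeddings before the first block:
# x = token_embeddings + sinusoidal_positions(seq_len, d_model)
```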
Why it won
- Parallelism. Unlike RNNs, every position can be processed in parallel — a perfect fit for GPU hardware.
- Long-range dependencies. Information moves between any two positions in $O(1)$ layers, not $O(n)$ recurrent steps.
- Scaling. The architecture scales smoothly to billions of parameters and trillions of tokens — see Scaling Laws.
Reading list
- Attention Is All You Need — Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin, NeurIPS 2017.
What to read next
- Pre-training — what to do with a Transformer once you have one.
- Scaling Laws & Emergent Abilities — why bigger Transformers behave qualitatively differently.