RoPE, ALiBi & Position Extension
The original Transformer's sinusoidal positional encodings had a known weakness — they struggle to extrapolate beyond training-set sequence lengths. The 2021–2024 generation of LLMs replaced them with two main alternatives: Rotary Position Embeddings (RoPE) and Attention with Linear Biases (ALiBi). RoPE won at frontier scale; both gave rise to a literature on post-hoc context extension that lets a model trained at 4K context serve at 32K, 128K, or longer.
Why position matters
Self-attention is permutation-equivariant — it treats input tokens as a set. To reason about order ("the cat sat on the mat" ≠ "mat the on sat cat the"), a Transformer needs positional information injected somewhere. The choices:
- Absolute — each position $i$ gets a fixed or learned vector added to the embedding.
- Relative — attention scores are biased by a function of the offset $i - j$.
- Rotary — apply position-dependent rotations in two-dimensional subspaces of the queries and keys.
Each has different generalisation behaviour at lengths beyond training.
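For reference, the absolute baseline is the original Transformer's fixed sinusoidal table, added directly to the token embeddings. A minimal NumPy sketch (the function name is mine):

```python
import numpy as np

def sinusoidal_encoding(num_positions: int, d_model: int) -> np.ndarray:
    """Original-Transformer absolute encodings: a fixed sin/cos table added to embeddings."""
    pos = np.arange(num_positions)[:, None]              # (P, 1)
    dim = np.arange(0, d_model, 2)[None, :]               # (1, d/2)
    angles = pos / (10000.0 ** (dim / d_model))           # (P, d/2)
    enc = np.zeros((num_positions, d_model))
    enc[:, 0::2] = np.sin(angles)                          # even dims: sine
    enc[:, 1::2] = np.cos(angles)                          # odd dims: cosine
    return enc

# usage: token_embeddings += sinusoidal_encoding(seq_len, d_model)
```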
RoPE — Rotary Position Embeddings
RoFormer: Enhanced Transformer with Rotary Position Embedding (Su, Lu, Pan, Murtadha, Wen, Liu, 2021). The construction:
For a position $m$ and a query or key vector $x \in \mathbb{R}^d$, split $x$ into $d/2$ two-dimensional pairs and rotate the $i$-th pair by the angle $m\theta_i$:

$$
\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix}
=
\begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix}
\begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}
$$

Frequencies follow the same geometric schedule as the original sinusoidal encodings: $\theta_i = 10000^{-2i/d}$.

The clever property: the dot product of a rotated query at position $m$ with a rotated key at position $n$ depends only on the relative offset $n - m$, because $R_m^\top R_n = R_{n-m}$ for rotation matrices:

$$
\langle R_m q,\; R_n k \rangle = \langle q,\; R_{n-m} k \rangle
$$

So RoPE encodes relative positions while staying compatible with standard attention computation. No modification to the attention math; you just rotate $q$ and $k$ before the dot product.
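A minimal NumPy sketch of that rotation, using the interleaved-pair convention (production implementations such as LLaMA's rotate half-vectors instead, which is equivalent up to a permutation of dimensions; the function name is mine):

```python
import numpy as np

def rope_rotate(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate each 2-D pair (x[2i], x[2i+1]) by angle m * theta_i, where m is that row's position.
    x: (seq_len, d), positions: (seq_len,)."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)       # (d/2,) frequencies theta_i
    angles = positions[:, None] * theta[None, :]     # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin        # 2-D rotation applied per pair
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

# Relative-position property: the rotated dot product depends only on the offset n - m.
rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)
s1 = rope_rotate(q[None], np.array([3]))[0] @ rope_rotate(k[None], np.array([10]))[0]
s2 = rope_rotate(q[None], np.array([103]))[0] @ rope_rotate(k[None], np.array([110]))[0]
assert np.allclose(s1, s2)   # same offset (7), same attention score
```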
RoPE is the default in LLaMA, Mistral, Qwen, DeepSeek, Phi, Yi, and most modern open and closed LLMs. It is the most-deployed position-encoding scheme of the 2020s.
ALiBi — Attention with Linear Biases
ALiBi (Press, Smith, Lewis, ICLR 2022). Skip positional embeddings entirely; bias each attention score by a linear function of the relative offset:

$$
\text{score}(q_i, k_j) = q_i \cdot k_j - m\,(i - j)
$$

with $m$ a fixed, head-specific slope drawn from a geometric sequence, e.g. $\tfrac{1}{2^1}, \tfrac{1}{2^2}, \ldots, \tfrac{1}{2^8}$ for 8 heads ($m_h = 2^{-8h/H}$ for head $h$ of $H$).
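A minimal sketch of the bias matrix for a power-of-two head count (the paper gives a separate recipe for other head counts; the helper names here are mine):

```python
import numpy as np

def alibi_slopes(num_heads: int) -> np.ndarray:
    """Head-specific slopes m_h = 2^(-8h/H), h = 1..H (power-of-two head counts)."""
    return np.array([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])

def alibi_bias(seq_len: int, num_heads: int) -> np.ndarray:
    """Additive bias -m_h * (i - j) for causal attention; shape (num_heads, seq_len, seq_len)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    distance = (i - j).clip(min=0)   # future positions are removed by the causal mask anyway
    return -alibi_slopes(num_heads)[:, None, None] * distance

# The bias is added to q @ k^T / sqrt(d) before softmax; no positional embeddings elsewhere.
bias = alibi_bias(seq_len=8, num_heads=4)
```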
ALiBi was simpler than RoPE and showed strong extrapolation properties — a model trained at 1024 tokens could be served at 4096 tokens with minimal quality loss. MPT and BLOOM used it. It largely lost to RoPE for in-distribution quality but remains a clean reference point for out-of-distribution generalisation.
Why RoPE won and how it failed at long context
RoPE wins on quality at training-distribution lengths. But used naively, RoPE also extrapolates poorly beyond the training length: at positions much larger than any seen during training, the rotation frequencies produce attention patterns the model was never trained to handle.
Several context-extension techniques solve this:
- Position Interpolation (Chen et al., Meta 2023) — at inference, rescale positions so a $k\times$-longer sequence still falls in the trained range $[0, L)$: feed position $m/k$ instead of $m$ (see the sketch after this list). Works surprisingly well after a brief fine-tune.
- NTK-aware RoPE — change the base of the frequency formula so longer sequences still sample frequencies the trained model is familiar with.
- YaRN (Peng, Quesnelle, Fan, Shippole, 2023) — combines NTK-aware scaling with a per-frequency interpolation strategy. Used in many production long-context models.
- LongRoPE (Microsoft 2024) — search-based discovery of an optimal frequency-rescale schedule, extending models to 2M+ tokens.
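A sketch of the two simplest rescalings, assuming a model trained at trained_len positions. Function names are mine, the NTK base adjustment is one common formulation, and production implementations (e.g. the rope-scaling options in common inference stacks) differ in details:

```python
import numpy as np

def rope_frequencies(d: int, base: float = 10000.0) -> np.ndarray:
    """Standard RoPE frequencies theta_i = base^(-2i/d)."""
    return base ** (-np.arange(0, d, 2) / d)

def interpolated_positions(seq_len: int, trained_len: int) -> np.ndarray:
    """Position Interpolation: squash positions by the extension factor so a k-times-longer
    sequence re-uses the angle range seen in training."""
    scale = max(seq_len / trained_len, 1.0)
    return np.arange(seq_len) / scale

def ntk_frequencies(d: int, seq_len: int, trained_len: int, base: float = 10000.0) -> np.ndarray:
    """NTK-aware scaling: keep positions as-is but enlarge the base so low frequencies
    stretch to cover the longer range (one common formulation)."""
    scale = max(seq_len / trained_len, 1.0)
    new_base = base * scale ** (d / (d - 2))
    return new_base ** (-np.arange(0, d, 2) / d)

# Either the rescaled positions or the rescaled frequencies then feed the usual RoPE rotation.
```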
These post-hoc methods made it cheap to extend a model trained at 4K or 8K to 128K+ — Llama 3.1's 128K context, Qwen's 1M context, and many others use one of these techniques.
Long-context training
Modern frontier models (Gemini 1.5, GPT-4 Turbo 128K, Claude 3 200K) spend part of training explicitly on long-context data. Combined with RoPE plus interpolation tricks, this routinely pushes usable context to 100K–1M+ tokens.
The remaining challenges aren't really about position encoding any more — they're about long-context attention efficiency, retrieval-quality benchmarking (needle-in-haystack tests), and effective use of long context (most user prompts don't reach it).
What to read next
- Long-Context Transformers — the broader engineering story.
- Self-Attention — the absolute-position baseline.
- LLaMA — RoPE deployed at frontier scale.