RoPE, ALiBi & Position Extension
The original Transformer's sinusoidal positional encodings had a known weakness — they struggle to extrapolate beyond training-set sequence lengths. The 2021–2024 generation of LLMs replaced them with two main alternatives: Rotary Position Embeddings (RoPE) and Attention with Linear Biases (ALiBi). RoPE won at frontier scale; both gave rise to a literature on post-hoc context extension that lets a model trained at 4K context serve at 32K, 128K, or longer.
Why position matters
Self-attention is permutation-equivariant — it treats input tokens as a set. To reason about order ("the cat sat on the mat" ≠ "mat the on sat cat the"), a Transformer needs positional information injected somewhere. The choices:
- Absolute — each position $i$ gets a fixed or learned vector added to the embedding.
- Relative — attention scores are biased by a function of the offset $i - j$.
- Rotary — apply position-dependent rotations in two-dimensional subspaces of the queries and keys.
Each has different generalisation behaviour at lengths beyond training.
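For reference, the absolute baseline is the original Transformer's fixed sinusoidal table, added directly to the token embeddings. A minimal NumPy sketch (the function name is mine):

```python
import numpy as np

def sinusoidal_encoding(num_positions: int, d_model: int) -> np.ndarray:
    """Original-Transformer absolute encodings: a fixed sin/cos table added to embeddings."""
    pos = np.arange(num_positions)[:, None]              # (P, 1)
    dim = np.arange(0, d_model, 2)[None, :]               # (1, d/2)
    angles = pos / (10000.0 ** (dim / d_model))           # (P, d/2)
    enc = np.zeros((num_positions, d_model))
    enc[:, 0::2] = np.sin(angles)                          # even dims: sine
    enc[:, 1::2] = np.cos(angles)                          # odd dims: cosine
    return enc

# usage: token_embeddings += sinusoidal_encoding(seq_len, d_model)
```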
RoPE — Rotary Position Embeddings
RoFormer: Enhanced Transformer with Rotary Position Embedding (Su, Lu, Pan, Murtadha, Wen, Liu, 2021). The construction:
For a position $m$ and a query or key vector $x \in \mathbb{R}^d$, split $x$ into $d/2$ two-dimensional pairs and rotate the $i$-th pair by the angle $m\theta_i$:

$$
\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix}
=
\begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix}
\begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}
$$

Frequencies follow the same geometric schedule as the original sinusoidal encodings: $\theta_i = 10000^{-2i/d}$.

The clever property: the dot product of a rotated query at position $m$ with a rotated key at position $n$ depends only on the relative offset $n - m$, because $R_m^\top R_n = R_{n-m}$ for rotation matrices:

$$
\langle R_m q,\; R_n k \rangle = \langle q,\; R_{n-m} k \rangle
$$

So RoPE encodes relative positions while staying compatible with standard attention computation. No modification to the attention math; you just rotate $q$ and $k$ before the dot product.
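A minimal NumPy sketch of that rotation, using the interleaved-pair convention (production implementations such as LLaMA's rotate half-vectors instead, which is equivalent up to a permutation of dimensions; the function name is mine):

```python
import numpy as np

def rope_rotate(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate each 2-D pair (x[2i], x[2i+1]) by angle m * theta_i, where m is that row's position.
    x: (seq_len, d), positions: (seq_len,)."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)       # (d/2,) frequencies theta_i
    angles = positions[:, None] * theta[None, :]     # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin        # 2-D rotation applied per pair
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

# Relative-position property: the rotated dot product depends only on the offset n - m.
rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)
s1 = rope_rotate(q[None], np.array([3]))[0] @ rope_rotate(k[None], np.array([10]))[0]
s2 = rope_rotate(q[None], np.array([103]))[0] @ rope_rotate(k[None], np.array([110]))[0]
assert np.allclose(s1, s2)   # same offset (7), same attention score
```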
RoPE is the default in LLaMA, Mistral, Qwen, DeepSeek, Phi, Yi, and most modern open and closed LLMs. It is the most-deployed position-encoding scheme of the 2020s.
ALiBi — Attention with Linear Biases
ALiBi (Press, Smith, Lewis, ICLR 2022). Skip positional embeddings entirely; bias each attention score by a linear function of the relative offset:

$$
\text{score}(q_i, k_j) = q_i \cdot k_j - m\,(i - j)
$$

with $m$ a fixed, head-specific slope drawn from a geometric sequence, e.g. $\tfrac{1}{2^1}, \tfrac{1}{2^2}, \ldots, \tfrac{1}{2^8}$ for 8 heads ($m_h = 2^{-8h/H}$ for head $h$ of $H$).
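A minimal sketch of the bias matrix for a power-of-two head count (the paper gives a separate recipe for other head counts; the helper names here are mine):

```python
import numpy as np

def alibi_slopes(num_heads: int) -> np.ndarray:
    """Head-specific slopes m_h = 2^(-8h/H), h = 1..H (power-of-two head counts)."""
    return np.array([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])

def alibi_bias(seq_len: int, num_heads: int) -> np.ndarray:
    """Additive bias -m_h * (i - j) for causal attention; shape (num_heads, seq_len, seq_len)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    distance = (i - j).clip(min=0)   # future positions are removed by the causal mask anyway
    return -alibi_slopes(num_heads)[:, None, None] * distance

# The bias is added to q @ k^T / sqrt(d) before softmax; no positional embeddings elsewhere.
bias = alibi_bias(seq_len=8, num_heads=4)
```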
ALiBi was simpler than RoPE and showed strong extrapolation properties — a model trained at 1024 tokens could be served at 4096 tokens with minimal quality loss. MPT and BLOOM used it. It largely lost to RoPE for in-distribution quality but remains a clean reference point for out-of-distribution generalisation.
Why RoPE won and how it failed at long context
RoPE wins on quality at training-distribution lengths. But used naively, RoPE also extrapolates poorly beyond the training length: at positions much larger than any seen during training, the rotation frequencies produce attention patterns the model was never trained to handle.
Several context-extension techniques solve this:
- Position Interpolation (Chen et al., Meta 2023) — at inference, rescale positions so a $k\times$-longer sequence still falls in the trained range $[0, L)$: feed position $m/k$ instead of $m$ (see the sketch after this list). Works surprisingly well after a brief fine-tune.
- NTK-aware RoPE — change the base of the frequency formula so longer sequences still sample frequencies the trained model is familiar with.
- YaRN (Peng, Quesnelle, Fan, Shippole, 2023) — combines NTK-aware scaling with a per-frequency interpolation strategy. Used in many production long-context models.
- LongRoPE (Microsoft 2024) — search-based discovery of an optimal frequency-rescale schedule, extending models to 2M+ tokens.
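A sketch of the two simplest rescalings, assuming a model trained at trained_len positions. Function names are mine, the NTK base adjustment is one common formulation, and production implementations (e.g. the rope-scaling options in common inference stacks) differ in details:

```python
import numpy as np

def rope_frequencies(d: int, base: float = 10000.0) -> np.ndarray:
    """Standard RoPE frequencies theta_i = base^(-2i/d)."""
    return base ** (-np.arange(0, d, 2) / d)

def interpolated_positions(seq_len: int, trained_len: int) -> np.ndarray:
    """Position Interpolation: squash positions by the extension factor so a k-times-longer
    sequence re-uses the angle range seen in training."""
    scale = max(seq_len / trained_len, 1.0)
    return np.arange(seq_len) / scale

def ntk_frequencies(d: int, seq_len: int, trained_len: int, base: float = 10000.0) -> np.ndarray:
    """NTK-aware scaling: keep positions as-is but enlarge the base so low frequencies
    stretch to cover the longer range (one common formulation)."""
    scale = max(seq_len / trained_len, 1.0)
    new_base = base * scale ** (d / (d - 2))
    return new_base ** (-np.arange(0, d, 2) / d)

# Either the rescaled positions or the rescaled frequencies then feed the usual RoPE rotation.
```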
These post-hoc methods made it cheap to extend a model trained at 4K or 8K to 128K+ — Llama 3.1's 128K context, Qwen's 1M context, and many others use one of these techniques.
Long-context training
Modern frontier models (Gemini 1.5, GPT-4 Turbo 128K, Claude 3 200K) spend part of training explicitly on long-context data. Combined with RoPE plus interpolation tricks, this routinely pushes usable context to 100K–1M+ tokens.
The remaining challenges aren't really about position encoding any more — they're about long-context attention efficiency, retrieval-quality benchmarking (needle-in-haystack tests), and effective use of long context (most user prompts don't reach it).
What to read next
- Long-Context Transformers — the broader engineering story.
- Self-Attention — the absolute-position baseline.
- LLaMA — RoPE deployed at frontier scale.