Linear Attention — Hyena, RetNet

The 2023 wave of "linear attention" architectures — Hyena, RetNet, GLA, and others — revisited the efficient-attention idea with a sharper pitch: Transformer-quality at O(T) training cost, with O(1) per-token state at inference. They're closely related to SSMs / Mamba — the Mamba-2 paper showed selective SSMs and certain linear attentions are mathematically dual — and together form the linear-recurrent challenger to the Transformer.

Linear attention, from the kernel view

Recall that scaled dot-product attention is

$$\mathrm{Attn}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V.$$

If we replace the softmax-of-dot-product kernel with any kernel $\phi$ that decomposes as $\phi(q,k)=\langle\psi(q),\psi(k)\rangle$ for some feature map $\psi$, attention becomes

$$\mathrm{LinAttn}_t=\frac{\sum_{s\le t}\langle\psi(q_t),\psi(k_s)\rangle\, v_s}{\sum_{s\le t}\langle\psi(q_t),\psi(k_s)\rangle}.$$

By associativity, the inner sums $\sum_{s\le t}\psi(k_s)v_s^\top$ and $\sum_{s\le t}\psi(k_s)$ can be accumulated incrementally — they're constant-size (a $d\times d_v$ matrix and a $d$-vector), not $T\times T$ matrices. Inference becomes O(1) per token.
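
In code, the recurrent form is a few lines. A minimal sketch, assuming the elu(x) + 1 feature map from Katharopoulos et al. (2020); shapes and names are illustrative:

```python
import torch

def linear_attention_recurrent(q, k, v, eps=1e-6):
    """Causal linear attention as an O(1)-state recurrence.

    q, k, v: (T, d) tensors for one head. The feature map psi(x) = elu(x) + 1
    is one common choice (Katharopoulos et al., 2020); this sketch assumes it.
    """
    psi_q = torch.nn.functional.elu(q) + 1.0
    psi_k = torch.nn.functional.elu(k) + 1.0
    S = torch.zeros(q.shape[1], v.shape[1])  # running sum of psi(k_s) v_s^T
    z = torch.zeros(q.shape[1])              # running sum of psi(k_s), the normaliser
    out = torch.zeros(v.shape)
    for t in range(q.shape[0]):
        S = S + torch.outer(psi_k[t], v[t])
        z = z + psi_k[t]
        out[t] = (psi_q[t] @ S) / (psi_q[t] @ z + eps)
    return out
```

The pair (S, z) is all that persists between tokens, which is the entire inference-time win.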

The catch: linear attention is generally less expressive than softmax attention. The softmax's hard selection between strongly- and weakly-matching keys is structurally lost, and naive linear attention underperforms.

Hyena (2023)

Hyena Hierarchy: Towards Larger Convolutional Language Models (Poli, Massaroli, Nguyen et al., ICML 2023). Hyena replaces attention with a stack of implicitly parameterised long-range convolutions evaluated in the frequency domain — closely related to S4 but with a different parametrisation.

The Hyena operator, simplified:

  • For each layer, three projections q, k, v, as in Transformers.
  • Replace the QK similarity with element-wise gating plus a long convolution with an implicitly parameterised kernel.
  • Output is $q\odot\big(h*(k\odot v)\big)$, where $h$ is the long-range convolution kernel, $*$ is causal convolution, and $\odot$ is element-wise multiplication.

The long convolutions are parameterised by a small MLP indexed by position, then evaluated efficiently in the frequency domain via FFT. The result is a sub-quadratic (O(T log T)) attention substitute that scales to long sequences.
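
A minimal sketch of that FFT path, with the positional MLP omitted (the kernel h is taken as given) and a simplified order-2 operator matching the formula above; names are illustrative:

```python
import torch

def long_conv_fft(u, h):
    """Causal long convolution y_t = sum_{s<=t} h_s u_{t-s}, via FFT in O(T log T).

    u: (T, d) input, h: (T, d) convolution kernel. In Hyena the kernel is
    implicitly parameterised (a small MLP of the position); here it is given.
    Zero-padding to length 2T turns circular FFT convolution into a linear,
    causal one.
    """
    T = u.shape[0]
    U = torch.fft.rfft(u, n=2 * T, dim=0)
    H = torch.fft.rfft(h, n=2 * T, dim=0)
    return torch.fft.irfft(U * H, n=2 * T, dim=0)[:T]

def hyena_order2(q, k, v, h):
    """Simplified order-2 Hyena operator: q ⊙ (h * (k ⊙ v))."""
    return q * long_conv_fft(k * v, h)
```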

Hyena models match Transformer perplexity on sequence modelling at small-to-medium scale (155M-1B parameters). At frontier scale, the architecture has been less explored.

RetNet — Retentive Network

Retentive Network: A Successor to Transformer for Large Language Models (Sun, Dong et al., Microsoft 2023). RetNet's contribution is offering three computational views of the same recurrent operation:

  • Parallel view — for training, similar cost to attention.
  • Recurrent view — for inference, O(1) per-token state, like an RNN.
  • Chunkwise view — for long-context training, O(T) scaling.

The retention operator:

$$\mathrm{Retention}(Q,K,V)=\big(QK^\top\odot D\big)V,\qquad D_{ij}=\gamma^{\,i-j}\,[i\ge j].$$

The matrix D is a causal mask with exponential decay γ. Different heads use different γ values, giving different time-decay rates analogous to multi-head attention's specialisation.
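
The equivalence of the parallel and recurrent views is easy to check numerically. A single-head sketch, without normalisation; names are illustrative:

```python
import torch

def retention_parallel(q, k, v, gamma):
    """Parallel (training) view: (Q K^T ⊙ D) V with the decay mask D."""
    T = q.shape[0]
    n = torch.arange(T)
    # D[i, j] = gamma^(i-j) for i >= j, 0 above the diagonal
    D = torch.tril(gamma ** (n[:, None] - n[None, :]).float())
    return (q @ k.T * D) @ v

def retention_recurrent(q, k, v, gamma):
    """Recurrent (inference) view: S_t = gamma * S_{t-1} + k_t^T v_t, o_t = q_t S_t."""
    S = torch.zeros(q.shape[1], v.shape[1])
    out = torch.zeros(q.shape[0], v.shape[1])
    for t in range(q.shape[0]):
        S = gamma * S + torch.outer(k[t], v[t])
        out[t] = q[t] @ S
    return out

# The two views agree up to float error:
# q, k, v = (torch.randn(16, 8) for _ in range(3))
# assert torch.allclose(retention_parallel(q, k, v, 0.9),
#                       retention_recurrent(q, k, v, 0.9), atol=1e-4)
```

The chunkwise view sits between the two: the parallel form within each chunk, with the decayed recurrent state carried across chunk boundaries.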

Empirically, RetNet matched Transformers at modest scale and offered substantial inference-throughput improvements due to the constant-state recurrent view. Its production deployment has been smaller than Mamba's, but the multi-view-of-the-same-op idea recurred in later work.

GLA — Gated Linear Attention

Gated Linear Attention Transformers with Hardware-Efficient Training (Yang, Wang et al., ICML 2024). GLA adds input-dependent gating to linear attention — the same selectivity idea that made Mamba competitive, applied to linear attention. At each step, gates control how much of the running state is retained.

GLA's recurrence is roughly:

$$S_t=G_t\odot S_{t-1}+k_t^\top v_t,\qquad o_t=q_t S_t,$$

with $G_t$ an input-dependent forget gate. The result is selective linear attention that competes with Transformers at language modelling.
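
A single-head sketch of that recurrence; the gate parameterisation (e.g. a sigmoid of a learned projection, assumed here) and the paper's hardware-efficient chunkwise training algorithm are omitted:

```python
import torch

def gla_recurrent(q, k, v, g):
    """Gated linear attention, recurrent view (single head).

    g: (T, d) input-dependent forget gates in (0, 1), assumed to come from
    something like sigmoid(x_t @ W_g). Each gate decays one row of the state,
    so the model chooses per channel what to keep and what to forget.
    """
    S = torch.zeros(q.shape[1], v.shape[1])
    out = torch.zeros(q.shape[0], v.shape[1])
    for t in range(q.shape[0]):
        S = g[t].unsqueeze(1) * S + torch.outer(k[t], v[t])  # S_t = G_t ⊙ S_{t-1} + k_t^T v_t
        out[t] = q[t] @ S                                    # o_t = q_t S_t
    return out
```

With g held constant at γ this reduces to RetNet's retention; making it input-dependent is the selectivity.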

The gated linear-attention recurrence is the template underlying DeltaNet, RWKV-7, and several frontier hybrid models in 2024-25.

What linear attention is for

The linear-attention family is winning on:

  • Long-sequence efficiency. O(T) training and O(1) per-token inference are real wins past 100K-token contexts.
  • Streaming inference. Constant-state recurrent inference suits voice agents, real-time analytics, low-latency agents.
  • Hybrid architectures. Modern hybrid stacks (Jamba, Zamba, Granite Mamba) interleave Transformer attention with linear-attention or SSM blocks.

It's not yet winning on:

  • Frontier-scale pure architectures. Pure linear-attention models at 100B+ are still rare.
  • In-context learning. Linear-attention ICL lags Transformer ICL at comparable scale, though the gap is closing.
  • Exact retrieval. Constant-size state cannot precisely recall arbitrary earlier tokens.

Linear attention vs SSMs vs Transformers

The 2024-25 architectural landscape has three challengers:

  • Transformers — expressive, ICL-strong, ecosystem-mature, but O(T²).
  • SSMs (Mamba/Mamba-2) — efficient, content-aware, state-of-the-art for long sequences, but ICL-weaker.
  • Linear attention (GLA, RWKV-7) — efficient, similar trade-offs to SSMs, mathematically dual to selective SSMs in some forms.

The dominant production answer in 2025 is hybrids: Transformer attention layers (for retrieval and ICL) plus linear-recurrent layers (for efficiency on long contexts).

Related

  • Mamba — selective SSMs, mathematically dual to selective linear attention.
  • RWKV — another linear-recurrent Transformer alternative.
  • Efficient Attention — the predecessor wave.
