Alternative Architectures
The dense decoder-only Transformer dominates, but several alternative architectures target its two structural costs: every parameter is activated for every token, and attention's compute and memory (the KV cache) grow with context length.
Mixture-of-Experts (MoE)
A MoE layer replaces a dense feed-forward block with a set of parallel expert FFNs plus a learned router that sends each token to a small subset of them; Mixtral 8x7B, for example, routes every token to 2 of 8 experts. Only the selected experts are executed, so total parameter count (and capacity) grows much faster than per-token compute. The costs are a larger memory footprint at serving time and an auxiliary load-balancing loss to keep the router from collapsing onto a few experts.
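A minimal sketch of the routing idea in PyTorch (module and dimension names are illustrative, not Mixtral's implementation; a production layer would batch tokens per expert rather than loop):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Top-k routed mixture-of-experts feed-forward layer (illustrative sketch)."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                      # x: (tokens, d_model), batch and seq flattened
        logits = self.router(x)                # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # renormalise over the chosen experts only
        out = torch.zeros_like(x)
        for slot in range(self.top_k):         # only the selected experts run per token
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```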
Mamba and Structured State-Space Duality
Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Gu & Dao, 2024) builds on the S4 line of state-space models but makes the state-update parameters input-dependent, so the recurrence is content-aware: the step size Δ_t and the projections B_t and C_t are computed from the current token, letting the model decide, token by token, what to write into its fixed-size state and what to forget.
Inference is a constant-size recurrence: each new token updates a fixed-dimensional state in O(1) time and memory, so there is no KV cache growing with context length.
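A single-channel sketch of that recurrence at decode time (diagonal state matrix, illustrative parameter names; the real model applies these projections across all channels and fuses the scan into a hardware-aware kernel):

```python
import numpy as np

def selective_ssm_step(h, x_t, A, w_delta, w_B, w_C, b_delta=0.0):
    """One decoding step of a simplified, single-channel selective SSM.

    h   : (d_state,) fixed-size state carried across tokens (no KV cache)
    x_t : scalar input at time t
    A   : (d_state,) diagonal of the negative continuous-time state matrix
    w_delta, w_B, w_C : toy per-channel maps that make delta, B, C depend
                        on the current input -- the "selective" part
    """
    delta = np.log1p(np.exp(w_delta * x_t + b_delta))  # softplus keeps the step size positive
    B = w_B * x_t                                       # input projection, (d_state,)
    C = w_C * x_t                                       # output projection, (d_state,)

    A_bar = np.exp(delta * A)                           # zero-order-hold discretisation of A
    B_bar = delta * B                                   # simplified (Euler) discretisation of B

    h = A_bar * h + B_bar * x_t                         # O(1) state update
    y_t = C @ h                                         # scalar output for this channel
    return h, y_t
```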
Mamba models match Transformers on perplexity at small-to-medium scale and outperform them on very-long-context retrieval, but they lag on tasks that require exact-match copying, where attention's content-addressable memory is structurally better suited. The follow-up structured state-space duality (SSD) work behind Mamba-2 shows that selective SSMs and a restricted form of attention compute the same family of sequence transformations, which both explains the connection and enables faster training algorithms.
RWKV
RWKV (Peng et al., 2023) is another linear-attention/RNN hybrid, designed to train in parallel like a Transformer and to decode with constant memory like an RNN. Self-attention is replaced by a time-mixing block with channel-wise weighted-key-value (WKV) interpolation, expressible both as a parallel matrix operation (training) and as a constant-memory recurrence (inference). RWKV was the first non-Transformer to scale to 14B parameters with competitive quality, and the community has continued to release Eagle / Finch / RWKV-7 variants.
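A per-channel sketch of the WKV recurrence in its inference form (RWKV-4-style formulation, illustrative names; real implementations keep the running state in log space for numerical stability):

```python
import numpy as np

def wkv_step(a, b, k_t, v_t, w, u):
    """One inference step of the (RWKV-4 style) weighted-key-value recurrence.

    a, b : running numerator / denominator state for one channel
    k_t, v_t : key and value for the current token
    w : per-channel decay (>= 0) applied to past contributions each step
    u : per-channel "bonus" applied only to the current token
    """
    # Output mixes the decayed history with the current token's bonus term.
    wkv = (a + np.exp(u + k_t) * v_t) / (b + np.exp(u + k_t))
    # Update the history: decay everything by e^{-w}, then add the current token.
    a = np.exp(-w) * a + np.exp(k_t) * v_t
    b = np.exp(-w) * b + np.exp(k_t)
    return wkv, a, b
```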
Hierarchical Reasoning Model
Hierarchical Reasoning Model (Wang et al., 2024) departs from the "single deep stack of identical blocks" template. Instead, two networks operate at different timescales — a fast low-level module (per-step reasoning) and a slow high-level module (planning, abstraction) — communicating via state passed at the slow module's tick rate. The architecture explicitly structures the model around subproblem decomposition rather than relying on the chain-of-thought prompt to do it implicitly. Early benchmarks show strong results on combinatorial reasoning tasks (Sudoku, ARC) at small scales; the open question is whether the inductive bias survives scaling.
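A schematic of the two-timescale idea (hypothetical module names and dimensions, not the paper's exact architecture): the low-level module updates every step, the high-level module only every few steps, and each conditions on the other's latest state.

```python
import torch
import torch.nn as nn

class TwoTimescaleCore(nn.Module):
    """Schematic fast/slow recurrent core: a low-level module ticks every step,
    a high-level module ticks every `slow_every` steps (illustrative only)."""
    def __init__(self, d=256, slow_every=4):
        super().__init__()
        self.slow_every = slow_every
        self.low = nn.GRUCell(2 * d, d)    # fast, per-step reasoning
        self.high = nn.GRUCell(d, d)       # slow planning / abstraction

    def forward(self, inputs):             # inputs: (T, batch, d)
        T, B, d = inputs.shape
        h_low = inputs.new_zeros(B, d)
        h_high = inputs.new_zeros(B, d)
        outputs = []
        for t in range(T):
            # The fast module sees the input plus the slow module's current plan.
            h_low = self.low(torch.cat([inputs[t], h_high], dim=-1), h_low)
            # The slow module updates only on its own tick, reading the fast state.
            if (t + 1) % self.slow_every == 0:
                h_high = self.high(h_low, h_high)
            outputs.append(h_low)
        return torch.stack(outputs)        # (T, batch, d)
```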
Reading list
- Mixtral of Experts — Jiang, Sablayrolles, Roux et al., 2024 (Mixtral 8x7B).
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces — Gu, Dao, COLM 2024.
- Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality — Dao, Gu, ICML 2024 (Mamba-2 / SSD).
- RWKV: Reinventing RNNs for the Transformer Era — Peng et al., EMNLP Findings 2023.
- Hierarchical Reasoning Model — Wang et al., 2024.
What to read next
- Long-Context Transformers — the scaling-laws-friendly path to long context.
- Inference Optimisation — the KV-cache tricks that these architectures reduce the need for or remove entirely.
- The Transformer — the baseline these designs depart from.