T5 & Text-to-Text Framework
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Raffel, Shazeer, Roberts, Lee, Narang, Matena, Zhou, Li, Liu, JMLR 2020) introduced T5, an encoder-decoder Transformer (released at sizes up to 11B parameters) pretrained on a massive cleaned-web corpus that frames every NLP task as text-to-text. Translation: input "translate English to German: That is good." → output "Das ist gut." Classification: input "cola sentence: The cat is sleep." → output "unacceptable". The text-to-text framing prefigured how every modern LLM is now used.
The text-to-text framing
Before T5, NLP tasks had heterogeneous formats: classification needed a softmax head, NER needed per-token labels, QA needed start/end pointers. T5 collapses all of them into one format:
- Input: task prefix + content.
- Output: target text.
Examples from the paper:
- cola sentence: ... → acceptable / unacceptable.
- summarize: <article> → <summary>.
- translate English to German: <text> → <German text>.
- stsb sentence1: <a> sentence2: <b> → 1.7 (a similarity score, formatted as text).
Even classification labels become tokens — the model emits the literal word "acceptable", not a class index. This eliminates per-task architecture; one model handles every benchmark via prefix prompts.
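A minimal sketch of this interface using the Hugging Face transformers library and the public t5-small checkpoint (the prefixes come from the paper; the exact outputs in the comments are illustrative):

```python
# pip install transformers sentencepiece torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def t5_predict(text: str) -> str:
    """Run any task by prepending its prefix; the answer is just generated text."""
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=32)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(t5_predict("translate English to German: That is good."))  # e.g. "Das ist gut."
print(t5_predict("cola sentence: The cat is sleep."))             # e.g. "unacceptable"
```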
Architecture
Standard encoder-decoder Transformer. Sizes from T5-small (60M) to T5-11B. Two notable design choices:
- Relative position bias: instead of sinusoidal or learned absolute position embeddings, a learned scalar bias (per head, indexed by the bucketed query-key offset) is added to each attention logit. Generalises better to unseen sequence lengths; a minimal sketch follows this list.
- Simplified layer norm: activations are only rescaled (no additive bias term), and normalisation is applied to the input of each sub-block. A minor simplification.
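A minimal PyTorch sketch of the relative-position-bias idea. T5's actual implementation buckets offsets logarithmically and shares the bias table across layers; here offsets are simply clipped to a fixed window to keep the mechanism visible:

```python
import torch
import torch.nn as nn

class RelativePositionBias(nn.Module):
    """Simplified relative position bias: one learned scalar per head for each
    clipped query-key offset, added directly to the attention logits.
    (T5 itself buckets offsets logarithmically; this sketch just clips them.)"""

    def __init__(self, num_heads: int, max_distance: int = 128):
        super().__init__()
        self.max_distance = max_distance
        self.bias = nn.Embedding(2 * max_distance + 1, num_heads)

    def forward(self, q_len: int, k_len: int) -> torch.Tensor:
        # offset of each key position relative to each query position
        rel = torch.arange(k_len)[None, :] - torch.arange(q_len)[:, None]
        idx = rel.clamp(-self.max_distance, self.max_distance) + self.max_distance
        # (q_len, k_len, num_heads) -> (num_heads, q_len, k_len)
        return self.bias(idx).permute(2, 0, 1)

# usage: logits = (q @ k.transpose(-1, -2)) / d_head**0.5 + bias(q_len, k_len)
```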
T5's encoder-decoder structure is the natural fit for text-to-text tasks: encoder reads the input, decoder generates the output. Decoder-only LLMs eventually subsumed this for most use cases, but encoder-decoder remains stronger when there's a clean input/output split (translation, summarisation).
Pretraining: span corruption
T5 introduced span corruption as the pretraining objective: replace contiguous spans of tokens with single sentinel tokens and train the model to generate the missing spans (a toy example follows this list). This is a generalisation of BERT's MLM that:
- Operates on spans, not single tokens. The default recipe corrupts 15% of tokens with a mean span length of 3, so each sentinel stands in for several consecutive tokens.
- Has a generative output. The model emits the missing spans in order, separated by sentinel tokens — the same generation interface used downstream.
- Efficient. The target contains only the dropped-out spans and their sentinels, so it is much shorter than the full input, which speeds up pretraining.
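A toy illustration of the objective, assuming whitespace tokens rather than SentencePiece ids; the real recipe also appends a final sentinel to the target, which this sketch omits:

```python
import random

SENTINELS = [f"<extra_id_{i}>" for i in range(100)]  # T5-style sentinel tokens

def span_corrupt(tokens, corruption_rate=0.15, mean_span_len=3, seed=0):
    """Toy span corruption: mask roughly 15% of tokens in contiguous spans,
    then build the (input, target) pair the model is trained on."""
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * corruption_rate))
    masked = set()
    while len(masked) < n_to_mask:
        start = rng.randrange(len(tokens))
        for i in range(start, min(len(tokens), start + mean_span_len)):
            masked.add(i)

    inputs, targets, sid, i = [], [], 0, 0
    while i < len(tokens):
        if i in masked:
            inputs.append(SENTINELS[sid])   # one sentinel replaces the whole span
            targets.append(SENTINELS[sid])  # target repeats the sentinel, then the span
            while i < len(tokens) and i in masked:
                targets.append(tokens[i])
                i += 1
            sid += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return " ".join(inputs), " ".join(targets)

src = "Thank you for inviting me to your party last week .".split()
print(span_corrupt(src))
# e.g. ('Thank you <extra_id_0> me to your party <extra_id_1> week .',
#       '<extra_id_0> for inviting <extra_id_1> last')
```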
Span corruption persists in some modern pretraining recipes; the broader idea — pretrain with the same generation interface used at inference — is universal.
C4: the clean corpus
The T5 paper also released C4 (Colossal Clean Crawled Corpus), a roughly 750GB cleaned subset of Common Crawl. Its cleaning rules (drop pages with placeholder or boilerplate text, filter profanity, keep only pages detected as English, deduplicate repeated text spans) set a methodological precedent for later pretraining-corpus releases, and C4 became the basis of many subsequent pretraining datasets and model evaluations.
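A rough sketch of a few such heuristics (an illustrative approximation, not the released C4 pipeline; the phrase list and thresholds here are invented):

```python
from typing import Optional

BOILERPLATE = ("lorem ipsum", "terms of use", "privacy policy", "cookie")  # invented list

def keep_line(line: str) -> bool:
    """Line-level filter: keep lines that look like natural-language sentences."""
    line = line.strip()
    if not line.endswith((".", "!", "?", '"')):           # terminal punctuation only
        return False
    if len(line.split()) < 3:                             # drop very short fragments
        return False
    if any(phrase in line.lower() for phrase in BOILERPLATE):
        return False
    return True

def clean_page(text: str) -> Optional[str]:
    """Page-level filter: drop code-like pages and pages with too little prose left."""
    if "{" in text:                                       # braces suggest code, not prose
        return None
    kept = [line for line in text.splitlines() if keep_line(line)]
    if len(kept) < 5:                                     # require a minimum of retained lines
        return None
    return "\n".join(kept)
```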
What T5 demonstrated
The paper is largely an enormous ablation study comparing pretraining objectives, model sizes, dataset sizes, and architectures. Key findings:
- Denoising objectives beat left-to-right language modelling for downstream transfer, and span corruption matches BERT-style masking while producing shorter, cheaper targets.
- Encoder-decoder beats encoder-only and decoder-only at this scale and task mix — though decoder-only would later win at the much larger scales of GPT-3+.
- More pretraining data helps, with diminishing returns.
- The 11B model significantly outperforms 3B, foreshadowing scaling laws.
Legacy
T5's text-to-text framing is now universal — every modern instruction-tuned LLM (ChatGPT, Claude, Gemini) is a "text-to-text model" in T5's sense. The encoder-decoder architecture survives in:
- Translation systems — the natural fit.
- Multilingual T5 (mT5) — covering 101 languages.
- Flan-T5 — instruction-tuned T5; was a strong open baseline before LLaMA.
- Code generation — CodeT5, CodeT5+.
Decoder-only models took over the chat and frontier spaces, but T5's framing — task as text — is what made the transition from per-task heads to LLMs conceptually smooth.
What to read next
- BERT — the encoder-only contemporary.
- GPT-2 — the decoder-only contemporary.
- Scaling Laws — the formalisation of T5's "bigger is better" finding.