
GPT-1 — Generative Pretraining

Improving Language Understanding by Generative Pre-Training (Radford, Narasimhan, Salimans, Sutskever, OpenAI 2018) introduced the Generative Pre-Training recipe: pretrain a Transformer decoder on next-token prediction over a large corpus, then fine-tune on labelled tasks. GPT-1 was modest in size (117M parameters) and modest in headline results — but the recipe it established turned out to be the right one. Every modern frontier LLM is a direct descendant of GPT-1's two-step training.

The setup

Two-stage training:

Stage 1 — generative pretraining. Train a 12-layer Transformer decoder on the BooksCorpus (~800M words, mostly fiction) by next-token prediction:

$$\mathcal{L}_{\text{pretrain}}(\theta) = \sum_{t} \log P_\theta(x_t \mid x_{<t}).$$

The autoregressive objective is the language-model objective in its simplest form. With causal attention masking, the Transformer computes the loss for all positions in parallel during training, while inference generates one token at a time.
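A minimal PyTorch sketch of this objective; the `model` interface (token ids in, next-token logits out, causal mask applied internally) is an assumption for illustration, not the paper's code:

```python
import torch.nn.functional as F

def pretrain_loss(model, tokens):
    """Next-token prediction: position t is trained to predict token t+1.

    tokens: LongTensor (batch, seq_len) of BPE token ids.
    model:  any decoder-only Transformer with causal masking that maps
            (batch, seq_len) token ids to (batch, seq_len, vocab) logits.
    """
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    logits = model(inputs)                        # causal mask applied inside the model
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),      # (batch * (seq_len - 1), vocab)
        targets.reshape(-1),                      # (batch * (seq_len - 1),)
    )
```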

Stage 2 — supervised fine-tuning. For each downstream task, format the labelled data into the model's input space (with task-specific delimiters) and fine-tune end-to-end with a small task-specific output head:

$$\mathcal{L}_{\text{ft}}(\theta) = \mathcal{L}_{\text{task}} + \lambda\,\mathcal{L}_{\text{pretrain}}.$$

The auxiliary pretraining-loss term keeps the model from forgetting general language modelling during fine-tuning.
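A hedged sketch of the combined loss. `hidden_states`, `lm_head`, and `task_head` are assumed interfaces rather than the paper's code; the weighting λ = 0.5 is the value reported in the paper:

```python
import torch.nn.functional as F

def finetune_loss(model, task_head, tokens, label, lam=0.5):
    """L_ft = L_task + lam * L_pretrain, with lam = 0.5 as in the paper.

    tokens:    (batch, seq_len) task input already formatted with delimiters.
    task_head: small linear layer from the final hidden state to label logits.
    """
    hidden = model.hidden_states(tokens)              # (batch, seq_len, d_model); assumed accessor
    l_task = F.cross_entropy(task_head(hidden[:, -1, :]), label)

    # Auxiliary language-modelling loss over the same formatted sequence.
    lm_logits = model.lm_head(hidden[:, :-1, :])      # assumed (tied) output projection
    l_pretrain = F.cross_entropy(
        lm_logits.reshape(-1, lm_logits.size(-1)),
        tokens[:, 1:].reshape(-1),
    )
    return l_task + lam * l_pretrain
```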

Architecture

12-layer Transformer decoder with masked self-attention only — no encoder, no encoder-decoder cross-attention. 768 hidden dim, 12 heads, GELU activation, learned positional embeddings. 117M parameters.
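The same hyperparameters as a configuration sketch. The dataclass is illustrative; the feed-forward width, vocabulary size, and context length are not stated above and are the commonly reported values for the released checkpoint:

```python
from dataclasses import dataclass

@dataclass
class GPT1Config:
    n_layers: int = 12        # Transformer decoder blocks
    d_model: int = 768        # hidden dimension
    n_heads: int = 12         # attention heads (64 dims per head)
    d_ff: int = 3072          # position-wise feed-forward width (4 * d_model)
    vocab_size: int = 40478   # BPE vocabulary of the released checkpoint
    n_positions: int = 512    # context length; positional embeddings are learned
    activation: str = "gelu"
```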

The decoder-only choice was significant: BERT (encoder-only) was contemporaneous, and the two papers' choices represented a fork in pretraining that played out for years.

Results

GPT-1 set new SOTA on 9 of 12 evaluated tasks (NLI, question answering, sentence-similarity, classification). The headline was not the absolute numbers — BERT-large would beat them within months — but the methodology: a single pretrained model, lightly fine-tuned, beat per-task systems engineered with heavy task-specific architecture and feature engineering.

This was the proof-of-concept that generic pretraining transfers in NLP, the same lesson that ImageNet had taught vision in 2012. By 2019 the recipe was universal.

Why decoder-only and autoregressive?

The autoregressive objective has three structural advantages that proved decisive:

  • Generative. The pretrained model can directly produce text — no architectural change needed for tasks like summarisation, translation, or generation.
  • Sample-efficient on every token. Every position contributes to the loss; in BERT's MLM only the ~15% of masked positions do (a quick count below makes this concrete).
  • Scaling-friendly. The same architecture and objective scale from 100M parameters (GPT-1) to 1T+ today, with scaling laws showing predictable improvement.

In 2018 these advantages were not obvious — encoder-only BERT initially looked stronger on understanding benchmarks, and encoder-decoder T5 looked cleaner on tasks formatted as text-to-text. The verdict came with GPT-2, GPT-3, and the rest: decoder-only autoregressive scales best.
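A back-of-the-envelope count of supervised positions per training sequence, assuming a 512-token context and BERT's default ~15% masking rate:

```python
# Rough count of positions that contribute to the loss per 512-token sequence.
seq_len = 512

ar_positions = seq_len - 1            # autoregressive LM: every next-token position
mlm_positions = int(0.15 * seq_len)   # MLM: only the ~15% masked positions

print(ar_positions, mlm_positions, round(ar_positions / mlm_positions, 1))
# -> 511 76 6.7
```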

Input formatting and "task as text"

GPT-1 evaluated on diverse downstream tasks by reformatting each as a token sequence:

  • Classification — [BOS] document $. Predict the label from the last hidden state.
  • Entailment — [BOS] premise [DELIM] hypothesis $. Predict the label from the last hidden state.
  • Multiple choice — for each option, score [BOS] context [DELIM] option $; pick the highest-scoring option (sketched below).

This unified-task format was a precursor of T5's text-to-text framing and of GPT-3's zero/few-shot prompting. The same approach generalises to in-context learning at scale.
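A sketch of this formatting and the multiple-choice scoring. The bracketed token strings, the `tokenize` function, and the `model` interface are placeholders, not GPT-1's released code:

```python
import torch
import torch.nn.functional as F

BOS, DELIM, END = "[BOS]", "[DELIM]", "$"   # the delimiter tokens from the bullets above

def format_entailment(premise, hypothesis):
    """[BOS] premise [DELIM] hypothesis $ as a single input string."""
    return f"{BOS} {premise} {DELIM} {hypothesis} {END}"

def sequence_log_prob(model, tokens):
    """Sum of log P(x_t | x_<t) over one (1, seq_len) tensor of token ids."""
    log_probs = F.log_softmax(model(tokens[:, :-1]), dim=-1)
    target_lp = log_probs.gather(-1, tokens[:, 1:].unsqueeze(-1)).squeeze(-1)
    return target_lp.sum()

def pick_answer(model, tokenize, context, options):
    """Multiple choice: score each [BOS] context [DELIM] option $ sequence, take the best."""
    scores = [
        sequence_log_prob(model, tokenize(f"{BOS} {context} {DELIM} {opt} {END}"))
        for opt in options
    ]
    return int(torch.stack(scores).argmax())
```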

What GPT-1 didn't have

  • In-context learning — GPT-1 still required gradient-based fine-tuning per task. Zero-shot capability emerged in GPT-2 and was the headline of GPT-3.
  • Instruction following — fine-tuning on natural-language instructions came with InstructGPT.
  • Alignment — preference-based fine-tuning was four years out (RLHF).

What GPT-1 got right that mattered

The choice of decoder-only Transformer + next-token prediction + transfer learning turned out to be the entire architectural commitment for the next 7+ years. GPT-2, GPT-3, GPT-4, Claude, LLaMA, Mistral, Gemini, Qwen — every frontier autoregressive LLM is structurally GPT-1, scaled and refined. The architectural choice in 2018 mattered immensely.

  • GPT-2 — the same recipe, ten times bigger, with emergent zero-shot.
  • BERT — the bidirectional contemporary.
  • T5 — encoder-decoder unification.

Released under the MIT License. Content imported and adapted from NoteNextra.