OpenAI o1 & Test-Time Compute

OpenAI released o1-preview and o1-mini on September 12, 2024, with the full o1 following in December. The line introduced a new scaling axis to LLMs: not just model size and training compute, but test-time compute — letting the model "think for longer" via extended internal chain-of-thought. o1 was the model that broke through math and competitive-programming benchmarks that had resisted previous frontier models, and it kicked off the reasoning-model era.

What's different about o1

Previous frontier LLMs (GPT-4, Claude 3, Gemini 1.5) used a fixed amount of compute per query. The model generated a bounded number of output tokens and stopped. Better answers required either a bigger model or better prompting.

o1 broke this pattern. It generates long internal chain-of-thought — typically tens of thousands of tokens of reasoning before producing a short answer. The internal reasoning is hidden from the user (only summarised in the API/UI) but consumes substantial inference compute.

The training: o1 is fine-tuned with reinforcement learning to produce useful chains of thought. The reward signal comes from verifiable correctness on math, code, and other domains where you can check the answer mechanically. See RLVR for the underlying methodology.
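The verifiable-reward idea can be sketched in a few lines. The function below is a hypothetical illustration, not OpenAI's implementation: it scores a sampled reasoning trace by mechanically checking the final answer against a known ground truth (the `####` answer delimiter is an illustrative convention, borrowed from GSM8K-style data).

```python
from fractions import Fraction

def extract_final_answer(completion: str) -> str:
    # Assume the model ends its chain of thought with a line like
    # "#### 42" (illustrative convention, not OpenAI's format).
    for line in reversed(completion.strip().splitlines()):
        if line.startswith("####"):
            return line.removeprefix("####").strip()
    return ""

def verifiable_reward(completion: str, ground_truth: str) -> float:
    # Binary reward: 1.0 iff the final answer matches the known correct
    # answer (compared as exact rationals, so "1/2" matches "0.5").
    try:
        return float(Fraction(extract_final_answer(completion))
                     == Fraction(ground_truth))
    except (ValueError, ZeroDivisionError):
        return 0.0  # unparseable or missing answer earns no reward
```

The RL loop then reinforces whole reasoning traces that earn reward 1.0 — no human preference labels are needed where correctness is mechanically checkable.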

Test-time compute scaling

The o1 release came with a new scaling-laws plot: performance vs. inference-time compute. Previous models gave better answers as you scaled training compute. o1 also gives better answers as you scale test-time compute — let it think longer (more tokens of reasoning), get a better answer, in a smooth power-law relationship.

This is qualitatively new. The implication: a smaller model with abundant test-time compute can sometimes match a much larger model running once. The reasoning-model paradigm trades a one-off bigger pretraining run for sustained higher per-query inference cost.
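A crude way to build intuition for the smooth compute–accuracy curve (not o1's actual mechanism, which extends a single chain of thought rather than sampling many): with a perfect verifier, best-of-N sampling succeeds if any of N independent attempts succeeds, so accuracy climbs steadily as inference compute scales.

```python
def best_of_n_accuracy(p_single: float, n: int) -> float:
    # Probability that at least one of n independent attempts is
    # correct, assuming a perfect verifier selects the correct one.
    return 1.0 - (1.0 - p_single) ** n

# A weak per-attempt solver (10%) improves steadily as n scales:
for n in (1, 4, 16, 64):
    print(n, round(best_of_n_accuracy(0.10, n), 3))
# prints: 1 0.1 / 4 0.344 / 16 0.815 / 64 0.999
```

The independence and perfect-verifier assumptions are what make this a toy model, but the qualitative shape — diminishing yet substantial returns to more inference — matches the published o1 scaling plots.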

For the user, this means:

  • Math, code, reasoning-heavy queries become much better, sometimes dramatically so.
  • Latency goes up — single queries can take 10-60 seconds (or longer) of thinking.
  • Cost per query goes up — you're paying for tens of thousands of "hidden" reasoning tokens.
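To make the cost point concrete, a back-of-envelope calculation. The per-token rates below are illustrative placeholders, not current OpenAI pricing; the key structural fact is that hidden reasoning tokens are billed as output tokens.

```python
def query_cost_usd(prompt_tokens: int, reasoning_tokens: int,
                   answer_tokens: int, price_in_per_m: float,
                   price_out_per_m: float) -> float:
    # Hidden reasoning tokens count as output tokens even though
    # the user never sees them.
    out_tokens = reasoning_tokens + answer_tokens
    return (prompt_tokens * price_in_per_m
            + out_tokens * price_out_per_m) / 1e6

# Hypothetical rates: $15 / 1M input tokens, $60 / 1M output tokens.
# A 500-token prompt, 30k hidden reasoning tokens, 400-token answer:
cost = query_cost_usd(500, 30_000, 400, 15.0, 60.0)
print(f"${cost:.2f}")  # prints $1.83 — reasoning tokens dominate
```

Under these assumptions, over 99% of the bill comes from tokens the user never reads.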

Capability discontinuity

o1's capability jumps were striking. On benchmarks where prior frontier models had been roughly stuck:

  • AIME 2024 (American Invitational Math Exam) — GPT-4o ~13%, o1 ~83%.
  • Codeforces — o1 reached 89th-percentile competitive-programming rating.
  • GPQA Diamond (PhD-level science) — GPT-4o ~57%, o1 ~78%, exceeding the human PhD baseline.
  • Olympiad math, physics, chemistry — o1 matched or beat top human performers on several.

The qualitative reading: o1 was the first model that could reliably do multi-step deductive reasoning with the kind of error-checking and back-tracking humans use on hard problems. The model was visibly thinking in a way previous LLMs were not.

Hidden reasoning, summarised output

OpenAI hid o1's internal reasoning from users for two reasons:

  • Safety review — the unredacted CoT can include harmful content the model "considered and rejected"; exposing it makes jailbreaking and misuse easier.
  • Competitive moat — the reasoning traces are training data for distillation; revealing them would let competitors train similar models cheaper.

Users see only a summary (or just the final answer). This was contested — researchers and developers wanted access to the reasoning for debugging and trust. Anthropic's later Claude 3.7 Sonnet "extended thinking" mode chose differently and exposed full reasoning by default.

Subsequent o-series

OpenAI's reasoning-model line evolved rapidly:

  • o1-pro (Dec 2024) — a higher-compute o1 variant for ChatGPT Pro.
  • o3 (Dec 2024 announcement, public April 2025) — substantial reasoning improvements; scored roughly 25% on FrontierMath, where earlier frontier models solved under 2%.
  • o3-mini (Jan 2025) — fast/cheap variant.
  • o4-mini (April 2025) — multimodal reasoning.
  • GPT-5 (later 2025) — unified reasoning + general model line, blurring the o-series / GPT-series distinction.

By 2025 nearly every frontier-model provider had a reasoning-model variant: Claude 3.7 Sonnet (extended thinking), Gemini 2.5 (thinking), DeepSeek R1, Qwen QwQ.

What o1 established

  • Test-time compute as a scaling axis. Not all wins come from training-compute scaling; reasoning models trade inference for capability.
  • RLVR as the right post-training recipe for reasoning. The training signal comes from verifiable rewards, not human preference judgments.
  • Reasoning models as a distinct product category. Most providers now ship "fast model + reasoning model" pairs.
  • Public-benchmark saturation. Many long-standing benchmarks fell to o1; new harder benchmarks (FrontierMath, ARC-AGI 2) became necessary.

What o1 didn't yet have

  • Tool use during reasoning. o1's reasoning is internal; it doesn't call tools mid-thought (later models do).
  • Real-time streaming of reasoning to users (intentional choice).
  • Comparable performance on non-verifiable domains — o1's gains are concentrated on math, code, and science. Open-ended writing, conversation, persuasion don't benefit as much.

Released under the MIT License. Content imported and adapted from NoteNextra.