DeepSeek-R1 & Open Reasoning Models
DeepSeek released R1 on January 20, 2025 — an open-weights reasoning model that matched o1 on most public benchmarks, with a detailed technical report explaining how it was trained. R1's release was a discontinuity event: it demonstrated that frontier reasoning capability could be reproduced by anyone, without proprietary process-reward data and with substantially less training compute. The downstream effect was an open-source explosion of reasoning models that lasted through 2025.
What R1 demonstrated
The R1 paper made three claims that mattered:
- Pure outcome-RL can elicit reasoning. No process reward model, no per-step supervision. Reward only the correctness of the final answer on math, code, and other verifiable tasks; the model figures out long-chain reasoning on its own.
- Open-weights models can match closed-source reasoning models. R1 matched o1 on AIME, MATH, GPQA, and several other reasoning benchmarks.
- Reasoning is distillable. Smaller models trained on R1-generated traces inherit much of R1's reasoning capability — opening the door to laptop-scale reasoning models.
The paper's clean methodology and open release made the reasoning paradigm immediately reproducible by the broader community.
R1-Zero — RL from cold-start
The DeepSeek R1 paper introduced R1-Zero, a training run that started from a pretrained base model and applied reinforcement learning alone, with no SFT stage in between:
- Base model — DeepSeek-V3-Base, a 671B-total / 37B-active MoE Transformer.
- Reward — verifiable rule-based rewards on math (correct final answer) and code (passing test cases), plus a small format reward for placing the reasoning between <think> and </think> tags (a toy version is sketched after this list).
- GRPO — Group Relative Policy Optimization, DeepSeek's PPO variant that requires no value model. Sample a group of candidate completions per prompt and score each one by its reward relative to the group's mean; the group itself supplies the baseline (also sketched below).
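To make the reward concrete, here is a minimal sketch of a rule-based verifiable reward. The <think> tags match R1's output format, but the weights, the \boxed{} answer convention, and the function shape are illustrative assumptions, not DeepSeek's exact implementation:

```python
import re

def verifiable_reward(completion: str, reference_answer: str) -> float:
    """Rule-based outcome reward in the spirit of R1-Zero: no learned
    reward model, just answer checking plus a small format bonus.
    (Sketch -- weights and conventions are assumptions.)"""
    reward = 0.0
    # Format reward: the reasoning must appear between <think> tags.
    if re.search(r"<think>.+?</think>", completion, flags=re.DOTALL):
        reward += 0.1
    # Accuracy reward: exact-match the extracted final answer against
    # the reference. For code tasks, this check would instead be
    # "all test cases pass".
    match = re.search(r"\\boxed\{(.+?)\}", completion)
    if match and match.group(1).strip() == reference_answer.strip():
        reward += 1.0
    return reward
```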
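And a sketch of the group-relative advantage at the heart of GRPO, following the standard formulation (each completion's reward standardized against the group's mean and standard deviation). The clipped policy-gradient update that consumes these advantages is omitted:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantage for one prompt: standardize the rewards
    of the G sampled completions against the group mean and std. The
    group baseline is what lets GRPO drop PPO's learned value model."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# One prompt, a group of 4 sampled completions, two of them correct:
rewards = torch.tensor([1.1, 0.0, 1.0, 0.1])
print(grpo_advantages(rewards))  # positive above the group mean, negative below
```

Because the baseline comes from the group itself, there is no value network to train alongside the policy, which is a large part of what makes GRPO cheaper than standard PPO at this scale.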
Result: the model spontaneously developed long-chain reasoning with self-verification, backtracking, and "aha moments" where it would catch its own errors mid-trace. With pure outcome reward and no process supervision, the model converged on the same kinds of reasoning behaviour o1 had displayed.
R1-Zero was the strongest evidence yet that long-chain reasoning is an RL-elicitable behaviour of pretrained LLMs at frontier scale, not something that requires careful per-step supervision.
R1 — practical reasoning model
R1-Zero had problems for actual deployment: low readability, language mixing, formatting issues. R1 fixed these via a multi-stage pipeline:
- Cold-start SFT — fine-tune on a small, curated corpus of high-quality reasoning traces.
- Reasoning-RL — like R1-Zero, RL on math/code/STEM.
- Rejection sampling + SFT — sample many traces from the RL checkpoint, filter for correctness and quality, and fine-tune the base model on the survivors (sketched after this list).
- General-RLHF — final pass for helpfulness and safety on broader tasks.
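A minimal sketch of the rejection-sampling step, with hypothetical `generate` and `is_correct` helpers standing in for the policy sampler and the rule-based verifier (the real pipeline also applied quality and readability filters):

```python
def rejection_sample_sft(prompts, generate, is_correct, k=16):
    """Sample k candidate reasoning traces per prompt, keep only the
    verifiably correct ones, and emit them as SFT pairs. `generate`
    and `is_correct` are hypothetical stand-ins, not DeepSeek's API."""
    examples = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(k)]
        accepted = [c for c in candidates if is_correct(prompt, c)]
        examples.extend({"prompt": prompt, "completion": c} for c in accepted)
    return examples
```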
The output is R1: a usable reasoning model that retains R1-Zero's deep capability but with formatting and behaviour suitable for production.
Distillation
The R1 paper also released R1-distilled models — Llama 3.3 70B, Qwen 2.5 32B, and several smaller sizes fine-tuned on R1-generated reasoning traces. These distilled models inherited a substantial fraction of R1's reasoning capability at much smaller scale and inference cost. Distilled R1-7B and R1-1.5B were the first reasoning models small enough to run on consumer hardware.
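Mechanically, the distillation data is just teacher text: each R1-generated trace becomes an ordinary SFT example for the student, so distillation here is supervised fine-tuning, not RL. A minimal sketch, where the exact chat template is an assumption:

```python
def format_distill_example(question: str, trace: str, answer: str) -> dict:
    """Wrap one R1-generated trace as a plain SFT example for a student
    model. The <think> framing mirrors R1's output format; the exact
    template used for the released distills is an assumption."""
    target = f"<think>\n{trace}\n</think>\n{answer}"
    return {"prompt": question, "completion": target}
```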
Within weeks of R1's release, the open-source community had:
- R1-fine-tunes of every major open base model.
- Lightweight reproductions of the R1 training pipeline (Open-R1 by Hugging Face, others).
- Domain-specific reasoning models (medical, legal, multilingual).
What this changed
R1's release was a methodological commons moment. Before R1:
- Reasoning models were an OpenAI/closed-source phenomenon.
- The methodology was heavily speculated about, not known.
- Reproducing required a frontier-lab budget.
After R1:
- The training recipe was public.
- Open-weights reasoning models were freely available.
- A 7B reasoning model could run on a laptop (see the inference sketch at the end of this section).
The competitive implication: closed-source labs lost a significant moat. Frontier reasoning models became a feature one could expect from any major LLM provider, not a proprietary advantage.
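As a concrete illustration of that last point, a distilled R1 runs locally with stock Hugging Face tooling. A minimal sketch, assuming the published DeepSeek-R1-Distill-Qwen-7B checkpoint and enough RAM/VRAM for a 7B model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Distilled R1 checkpoints are published on the Hugging Face Hub; the
# 7B Qwen-based distill fits (quantized) on consumer hardware.
model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "How many primes are there below 50?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Reasoning models emit a long <think> block before the final answer,
# so allow a generous generation budget.
output = model.generate(input_ids, max_new_tokens=4096)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```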
What R1 still didn't have
- Tool use during reasoning. R1 reasoned internally; it didn't call search engines or code execution mid-thought. (Search-R1 and follow-ups added this.)
- Multimodal reasoning. R1 was text-only.
- Long-horizon agentic capability. R1 reasoned for tens of thousands of tokens, but didn't take real-world actions.
These were addressed by subsequent open and closed work in 2025.
What R1 settled
- Outcome-RL is sufficient for eliciting reasoning at scale, given a strong base model.
- The reasoning-model paradigm is not a moat — methodologically (the recipe is published) or practically (open weights match closed models).
- Distillation works for reasoning. Reasoning-trained traces are remarkably effective fine-tuning data.
What to read next
- o1 — the closed-source predecessor.
- Process Rewards — the alternative approach R1 didn't need.
- RLVR (LLM track) — the broader curriculum-track explanation.