DeepSeek-R1 & Open Reasoning Models

DeepSeek released R1 on January 20, 2025: an open-weights reasoning model that matched o1 on most public benchmarks, accompanied by a detailed technical report explaining how it was trained. R1's release was a discontinuity event: it demonstrated that frontier reasoning capability could be reproduced by anyone, without proprietary process-reward data and with substantially less training compute. The downstream effect was an open-source explosion of reasoning models that lasted through 2025.

What R1 demonstrated

The R1 paper made three claims that mattered:

  • Pure outcome-RL can elicit reasoning. No process reward model, no per-step supervision. Reward only the correctness of the final answer on math, code, and other verifiable tasks; the model develops long-chain reasoning on its own.
  • Open-weights models can match closed-source reasoning models. R1 matched o1 on AIME, MATH, GPQA, and several other reasoning benchmarks.
  • Reasoning is distillable. Smaller models trained on R1-generated traces inherit much of R1's reasoning capability — opening the door to laptop-scale reasoning models.

The paper's clean methodology and open release made the reasoning paradigm immediately reproducible by the broader community.

R1-Zero — RL from cold-start

The DeepSeek R1 paper introduced R1-Zero, a training run that started from a pretrained base model and applied reinforcement learning directly, with no SFT stage at all:

  1. Base model — DeepSeek-V3-Base, a 671B-total / 37B-active MoE Transformer.
  2. Reward — verifiable rule-based rewards on math (correct final answer) and code (passing test cases), plus a small format reward for enclosing the reasoning in <think> tags.
  3. GRPO — Group Relative Policy Optimization, DeepSeek's PPO variant that needs no value model. Sample G candidate completions per prompt and compute each completion's advantage relative to the group's mean reward.
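The reward and advantage computation above can be sketched in a few lines. This is an illustrative toy, not DeepSeek's code: the reward weights, the `<think>` tag convention for extracting the final answer, and the exact-match checker are all assumptions.

```python
import re
import statistics

def reward(completion: str, gold_answer: str) -> float:
    """Rule-based reward: correctness of the final answer plus a small
    format bonus for keeping reasoning inside <think>...</think> tags.
    Weights (0.1 / 1.0) are illustrative, not from the paper."""
    r = 0.0
    if re.search(r"<think>.*?</think>", completion, flags=re.DOTALL):
        r += 0.1  # format reward
    # Treat whatever follows the closing tag as the final answer.
    answer = completion.split("</think>")[-1].strip()
    if answer == gold_answer:
        r += 1.0  # outcome reward: final answer only, no per-step supervision
    return r

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: each completion's reward standardised against
    the other members of its group -- no learned value model needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Example: G = 4 completions for one prompt, two of them correct.
completions = [
    "<think>2+2 ... check: yes</think>4",
    "<think>guess</think>5",
    "<think>verify twice</think>4",
    "no tags, wrong answer",
]
rewards = [reward(c, "4") for c in completions]
advantages = group_advantages(rewards)
```

Correct, well-formatted completions get positive advantages and the rest get negative ones, so the policy gradient pushes probability mass toward whatever behaviour produced correct answers, whatever that behaviour turns out to be.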

Result: the model spontaneously developed long-chain reasoning with self-verification, backtracking, and "aha moments" where it caught its own errors mid-trace. With pure outcome reward and no process supervision, the model arrived at the same kinds of reasoning behaviour o1 had exhibited.

R1-Zero's emergence was the strongest evidence yet that long-chain reasoning is an RL-elicitable behaviour of pretrained LLMs at frontier scale, not something that requires careful per-step supervision.

R1 — practical reasoning model

R1-Zero had problems for real deployment: low readability, language mixing, formatting issues. R1 fixed these via a multi-stage pipeline:

  1. Cold-start SFT — fine-tune on a small, curated corpus of high-quality reasoning traces.
  2. Reasoning-RL — the same outcome-reward RL as R1-Zero on math, code, and STEM tasks, with an added language-consistency reward to curb language mixing.
  3. Rejection sampling + SFT — sample many traces from the RL checkpoint, filter for correctness and quality, and fine-tune the base model on the survivors.
  4. General-RLHF — final pass for helpfulness and safety on broader tasks.
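Step 3 above can be sketched as a simple filter loop. A minimal sketch, assuming a sampler and a correctness checker; `sample_traces` and `is_correct` are hypothetical names, and real pipelines add quality and readability filters on top of correctness.

```python
from typing import Callable

def rejection_sample(
    prompts: list[str],
    sample_traces: Callable[[str, int], list[str]],
    is_correct: Callable[[str, str], bool],
    k: int = 16,
) -> list[tuple[str, str]]:
    """For each prompt, draw k candidate traces from the reasoning-RL model
    and keep only those that pass the correctness check. The surviving
    (prompt, trace) pairs become supervised fine-tuning data."""
    dataset = []
    for prompt in prompts:
        for trace in sample_traces(prompt, k):
            if is_correct(prompt, trace):
                dataset.append((prompt, trace))
    return dataset

# Toy stand-ins for the model and the verifier.
def fake_sampler(prompt: str, k: int) -> list[str]:
    return ["<think>...</think>4", "<think>...</think>5"][:k]

def fake_checker(prompt: str, trace: str) -> bool:
    return trace.endswith("4")

sft_data = rejection_sample(["What is 2+2?"], fake_sampler, fake_checker, k=2)
```

The point of the step is that the RL checkpoint is used as a data generator: its best outputs are distilled back into the base model through plain SFT, which is cheaper and more stable than continuing RL.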

The output is R1: a usable reasoning model that retains R1-Zero's deep capability but with formatting and behaviour suitable for production.

Distillation

The R1 paper also released R1-distilled models — Llama 3.3 70B, Qwen 2.5 32B, and several smaller sizes fine-tuned on R1-generated reasoning traces. These distilled models inherited a substantial fraction of R1's reasoning capability at much smaller scale and inference cost. Distilled R1-7B and R1-1.5B were among the first reasoning models small enough to run on consumer hardware.
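Distillation here is ordinary supervised fine-tuning on teacher outputs: the student is trained to reproduce the teacher's full chain of thought and then the answer. A minimal sketch of packaging a teacher trace as an SFT record; the chat-message schema and `<think>` wrapping are illustrative, not DeepSeek's exact format.

```python
import json

def to_sft_record(question: str, teacher_trace: str, answer: str) -> str:
    """Package one R1-generated trace as a chat-style SFT example for a
    small student model. Schema is an assumption for illustration."""
    return json.dumps({
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant",
             # Student learns to emit the reasoning, then the answer.
             "content": f"<think>{teacher_trace}</think>{answer}"},
        ]
    })

record = to_sft_record("What is 2+2?", "2+2=4, double-check: yes", "4")
```

Nothing RL-specific happens on the student side, which is why the technique transferred so easily to every open base model.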

Within weeks of R1's release, the open-source community had:

  • R1-fine-tunes of every major open base model.
  • Lightweight reproductions of the R1 training pipeline (Open-R1 by Hugging Face, others).
  • Domain-specific reasoning models (medical, legal, multilingual).

What this changed

R1's release was a methodological commons moment. Before R1:

  • Reasoning models were an OpenAI/closed-source phenomenon.
  • The methodology was heavily speculated about, not known.
  • Reproducing required a frontier-lab budget.

After R1:

  • The training recipe was public.
  • Open-weights reasoning models were freely available.
  • A 7B reasoning model could run on a laptop.

The competitive implication: closed-source labs lost a significant moat. Frontier reasoning models became a feature one could expect from any major LLM provider, not a proprietary advantage.

What R1 still didn't have

  • Tool use during reasoning. R1 reasoned internally; it didn't call search engines or code execution mid-thought. (Search-R1 and follow-ups added this.)
  • Multimodal reasoning. R1 was text-only.
  • Long-horizon agentic capability. R1 reasoned for tens of thousands of tokens, but didn't take real-world actions.

These were addressed by subsequent open and closed work in 2025.

What R1 settled

  • Outcome-RL is sufficient for eliciting reasoning at scale, given a strong base model.
  • The reasoning-model paradigm is not a moat. Both methodologically (the recipe is published) and practically (open-weights matches closed).
  • Distillation works for reasoning. Reasoning-trained traces are remarkably effective fine-tuning data.

Released under the MIT License. Content imported and adapted from NoteNextra.