Whisper & Speech
Robust Speech Recognition via Large-Scale Weak Supervision (Radford, Kim, Xu, Brockman, McLeavey, Sutskever, OpenAI, Sept 2022) introduced Whisper, a single Transformer that performs ASR (automatic speech recognition), speech translation into English, voice activity detection, and language identification across ~100 languages, all trained jointly on 680,000 hours of multilingual web audio. Whisper was the moment "general-purpose speech model" became an off-the-shelf product, and the benchmark every later open speech model had to beat.
The setup
Whisper is an encoder-decoder Transformer operating on log-Mel spectrograms:
- Input — 30-second audio chunks converted to 80-channel log-Mel spectrograms.
- Encoder — a standard Transformer encoder (no recurrence, and no convolution beyond the small two-layer input stem that downsamples the spectrogram).
- Decoder — an autoregressive Transformer decoder that emits text tokens, with task and language control via special tokens (`<|en|>`, `<|transcribe|>`, `<|translate|>`).
The architecture is unremarkable; Whisper's contribution is what to train it on, not how.
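To make the pipeline concrete, here is a minimal sketch using the open-source openai-whisper reference package; the model size and audio path are placeholders, not anything prescribed by the paper.

```python
# Minimal sketch using the open-source `openai-whisper` package
# (pip install -U openai-whisper). "sample.wav" is a placeholder file.
import whisper

model = whisper.load_model("base")                # encoder-decoder Transformer
audio = whisper.load_audio("sample.wav")          # load and resample to 16 kHz
audio = whisper.pad_or_trim(audio)                # fixed 30-second window
mel = whisper.log_mel_spectrogram(audio).to(model.device)  # 80-channel log-Mel

# Language identification from a single encoder pass
_, probs = model.detect_language(mel)
print("Detected language:", max(probs, key=probs.get))

# Decoding, with task and language selected via the special control tokens
options = whisper.DecodingOptions(language="en", task="transcribe")
result = whisper.decode(model, mel, options)
print(result.text)
```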
Weak supervision at scale
Conventional ASR training used heavily curated, hand-transcribed corpora — LibriSpeech (~1,000 hours of read audiobooks), Common Voice (at most a few thousand hours for its best-covered languages). Whisper went the opposite direction: 680K hours of audio scraped from the web with noisy, weakly-aligned transcripts.
The training data:
- Mix of speech, music, ambient noise, multiple speakers, accents, code-switching.
- Transcripts of variable quality — some clean, some autogenerated, some misaligned.
- ~117K hours of multilingual data covering 96 non-English languages.
- ~125K hours of (X→English) translation pairs from the same multilingual sources.
The bet — scale + diversity beats curation — paid off. Whisper outperformed previous SOTA on most benchmarks without any benchmark-specific fine-tuning, and degraded much less on out-of-distribution audio than supervised baselines (different accents, noisy conditions, technical jargon).
Multitask training
Whisper is trained to do, in a single model:
- Transcription — speech → text in the spoken language.
- Translation — non-English speech → English text.
- Voice activity detection — speech vs no-speech, signalled by predicting the `<|nospeech|>` control token.
- Language identification — predict the language token.
- Timestamps — segment-level start and end times, predicted as special timestamp tokens interleaved with the text.
Tasks are selected via prompt tokens at the decoder start. This is the same recipe T5 uses — task as text — applied to speech.
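As a sketch of how that looks in the reference implementation, the tokenizer assembles the decoder's start-of-transcript prompt from the chosen language and task; the language and task below are arbitrary examples.

```python
# Sketch of how task selection becomes decoder prompt tokens,
# using the tokenizer shipped with the `openai-whisper` package.
from whisper.tokenizer import get_tokenizer

tokenizer = get_tokenizer(multilingual=True, language="fr", task="translate")

# sot_sequence is the prompt the decoder is conditioned on:
# <|startoftranscript|> <|fr|> <|translate|>
print(tokenizer.sot_sequence)
print(tokenizer.decode(list(tokenizer.sot_sequence)))
```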
Sizes and releases
Whisper was released in five sizes from tiny (39M params) to large (1550M). All sizes were released open-weight under MIT licence. This was significant because:
- Speech models had previously been research-only or commercial-API-only.
- Whisper-large running locally on a consumer GPU enabled an entire ecosystem of subtitle generators, transcription tools, podcast indexers.
- The HuggingFace community quickly fine-tuned Whisper for niche domains (medical, legal, broadcast).
Subsequent checkpoints — Whisper large-v2 (Dec 2022), large-v3 (Nov 2023), and large-v3-turbo (late 2024) — improved quality and inference speed.
Robustness
The Whisper paper devotes most of its evaluation to out-of-distribution robustness:
- On LibriSpeech test-clean (clean, in-distribution read speech): Whisper-large is competitive with supervised SOTA.
- On CHiME-6 (noisy meetings): Whisper-large beats supervised baselines by a wide margin.
- On VoxPopuli (parliamentary speech, accents): same pattern.
- Across many languages: Whisper-large is among the strongest available open models for languages with little curated training data.
The pattern is consistent: weakly-supervised training on huge web audio beats narrowly-supervised training on a curated corpus, especially in distribution-shifted settings. This is the same lesson CLIP taught for vision.
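One practical detail behind numbers like these: WER is only comparable across systems after text normalization. A small sketch of that evaluation step, assuming the jiwer package and Whisper's bundled English normalizer (the transcript strings are made up):

```python
# Sketch of the WER computation behind comparisons like the ones above,
# using Whisper's bundled text normalizer and the `jiwer` package.
import jiwer
from whisper.normalizers import EnglishTextNormalizer

normalize = EnglishTextNormalizer()

reference  = "Mister Brown arrived at the station."   # ground-truth transcript
hypothesis = "mr brown arrived at the station"        # model output (placeholder)

# Normalization removes differences in casing, punctuation, and number
# formatting so that only genuine transcription errors count toward WER.
error_rate = jiwer.wer(normalize(reference), normalize(hypothesis))
print(f"WER: {error_rate:.3f}")
```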
What Whisper enabled
- Open-source speech infrastructure. Faster-Whisper, WhisperX, Whisper.cpp — efficient inference variants that run on consumer hardware, including phones (see the sketch after this list).
- Voice interfaces for LLMs. Whisper feeds speech into LLMs; OpenAI's original ChatGPT voice mode used Whisper for transcription, and many third-party voice assistants build on Whisper-derived ASR.
- Real-time captioning, podcasts, accessibility. A new generation of products.
- Speech LLM training data. Whisper-transcribed audio is a common source of the paired speech-text data used to pretrain newer speech-aware LLMs.
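For a concrete taste of that ecosystem, a sketch using the faster-whisper package (a CTranslate2 reimplementation of Whisper inference); the model size, quantization settings, and file name are placeholders:

```python
# Sketch of local transcription with the `faster-whisper` package
# (pip install faster-whisper). "podcast.mp3" is a placeholder path.
from faster_whisper import WhisperModel

# int8 quantization lets the small model run comfortably on CPU-only machines
model = WhisperModel("small", device="cpu", compute_type="int8")

segments, info = model.transcribe("podcast.mp3", beam_size=5)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")

for segment in segments:  # segments is a generator; decoding happens lazily
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```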
What's next
The post-Whisper era is moving toward end-to-end speech LLMs (GPT-4o, Gemini 1.5, Moshi, AudioLM) that ingest raw audio and produce raw audio without going through text. Whisper's separate encoder-decoder pipeline is being subsumed by Transformers that mix audio and text tokens in a single sequence. But Whisper's data approach — weakly-supervised pretraining at 100K+ hour scale — is the recipe everyone copies.
What to read next
- Multi-Modal LLMs (LLM) — how speech models integrate with LLMs.
- GPT-4o — the modern end-to-end audio successor.
- Frontier Models — speech as part of the frontier multimodal stack.