
Whisper & Speech

Robust Speech Recognition via Large-Scale Weak Supervision (Radford, Kim, Xu, Brockman, McLeavey, Sutskever, OpenAI, Sept 2022) introduced Whisper, a single Transformer that performs ASR (automatic speech recognition), translation, voice activity detection, and language identification across ~100 languages, all trained jointly on 680,000 hours of multilingual web audio. Whisper was the moment "general-purpose speech model" became an off-the-shelf product, and the benchmark every later open speech model had to beat.

The setup

Whisper is an encoder-decoder Transformer operating on log-Mel spectrograms:

  • Input — 30-second audio chunks converted to 80-channel log-Mel spectrograms.
  • Encoder — a standard Transformer encoder: a small convolutional input stem, then self-attention blocks (no recurrence, no convolution beyond the stem).
  • Decoder — autoregressive Transformer decoder that emits text tokens, with task and language control via special tokens (<|en|>, <|transcribe|>, <|translate|>).
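The input geometry above is fixed, which makes the encoder's input shape easy to pin down. A minimal sketch (constants taken from the paper: 16 kHz audio, 10 ms hop, 80 Mel channels; the helper function and names are mine, not Whisper's API):

```python
import numpy as np

# Whisper's fixed input geometry (figures from the paper):
# 30 s of 16 kHz audio, 10 ms hop between STFT frames, 80 Mel channels.
SAMPLE_RATE = 16_000
CHUNK_SECONDS = 30
HOP = 160          # 10 ms hop -> one spectrogram frame per 160 samples
N_MELS = 80

def log_mel_shape(num_samples: int) -> tuple[int, int]:
    """Shape (mel channels, frames) of the spectrogram the encoder sees."""
    return (N_MELS, num_samples // HOP)

# Shorter audio is zero-padded out to a full 30 s chunk before encoding.
chunk = np.zeros(SAMPLE_RATE * CHUNK_SECONDS, dtype=np.float32)
print(log_mel_shape(len(chunk)))  # (80, 3000)
```

Every 30-second chunk therefore becomes an 80 × 3000 spectrogram, regardless of how much actual speech it contains.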

The architecture is unremarkable; Whisper's contribution is what to train it on, not how.

Weak supervision at scale

Conventional ASR training used heavily curated, hand-transcribed corpora — LibriSpeech (960 hours), Common Voice (at most a few thousand hours for its best-covered languages). Whisper went the opposite direction: 680K hours of audio scraped from the web with noisy, weakly-aligned transcripts.

The training data:

  • Mix of speech, music, ambient noise, multiple speakers, accents, code-switching.
  • Transcripts of variable quality — some clean, some autogenerated, some misaligned.
  • ~117K hours of multilingual data covering 96 non-English languages.
  • ~125K hours of (X→English) translation pairs from the same multilingual sources.
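A quick sanity check on those figures shows what the remainder of the 680K hours is (paper numbers; the variable names are mine):

```python
# Approximate composition of Whisper's 680K-hour training set (paper figures).
total_h = 680_000
multilingual_asr_h = 117_000   # speech -> text in 96 non-English languages
translation_h = 125_000        # X -> English translation pairs
english_asr_h = total_h - multilingual_asr_h - translation_h

print(english_asr_h)           # 438000 -- the bulk is English transcription
for name, h in [("english ASR", english_asr_h),
                ("multilingual ASR", multilingual_asr_h),
                ("translation", translation_h)]:
    print(f"{name:>16}: {h / total_h:.0%}")
```

So roughly two thirds of the corpus is English transcription, with the remaining third split between multilingual transcription and X→English translation.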

The bet — scale + diversity beats curation — paid off. Whisper outperformed previous SOTA on most benchmarks without any benchmark-specific fine-tuning, and degraded much less on out-of-distribution audio than supervised baselines (different accents, noisy conditions, technical jargon).

Multitask training

Whisper is trained to do, in a single model:

  • Transcription — speech → text in the spoken language.
  • Translation — non-English speech → English text.
  • Voice activity detection — emitting a no-speech token when a chunk contains no voice.
  • Language identification — predict the language token.
  • Timestamps — segment-level start/end times emitted as special timestamp tokens (word-level alignment came later via community tools such as WhisperX).

Tasks are selected via prompt tokens at the decoder start. This is the same recipe T5 uses — task as text — applied to speech.
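That task-as-tokens recipe can be sketched as follows (the special-token strings match the released tokenizer; the helper function itself is illustrative, not Whisper's API):

```python
def decoder_prefix(language: str, task: str, timestamps: bool = False) -> list[str]:
    """Assemble the special-token prompt that selects Whisper's behaviour.
    The decoder continues this sequence autoregressively with text tokens."""
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    prefix = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        prefix.append("<|notimestamps|>")   # suppress timestamp tokens
    return prefix

# German speech -> English text, no timestamps:
print(decoder_prefix("de", "translate"))
# ['<|startoftranscript|>', '<|de|>', '<|translate|>', '<|notimestamps|>']
```

Swapping `<|translate|>` for `<|transcribe|>` (or changing the language token) switches the model's task with no change to weights or architecture.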

Sizes and releases

Whisper was released in five sizes — tiny (39M), base (74M), small (244M), medium (769M), and large (1550M parameters). All sizes were released open-weight under the MIT license. This was significant because:

  • Speech models had previously been research-only or commercial-API-only.
  • Whisper-large running locally on a consumer GPU enabled an entire ecosystem of subtitle generators, transcription tools, podcast indexers.
  • The HuggingFace community quickly fine-tuned Whisper for niche domains (medical, legal, broadcast).

Subsequent releases — large-v2 (Dec 2022), large-v3 (Nov 2023), large-v3-turbo (Sept 2024) — improved quality and inference speed.

Robustness

The Whisper paper devotes most of its evaluation to out-of-distribution robustness:

  • On LibriSpeech test-clean (the clean, in-distribution split): zero-shot Whisper-large is competitive with supervised SOTA.
  • On CHiME-6 (noisy meetings): Whisper-large beats supervised baselines by a wide margin.
  • On VoxPopuli (parliamentary speech, accents): same pattern.
  • Across many languages: Whisper-large is often the strongest available model for languages with little curated training data (though its error rates still track how much training audio each language got).

The pattern is consistent: weakly-supervised training on huge web audio beats narrowly-supervised training on a curated corpus, especially in distribution-shifted settings. This is the same lesson CLIP taught for vision.

What Whisper enabled

  • Open-source speech infrastructure. faster-whisper, WhisperX, whisper.cpp — efficient inference variants that run on consumer hardware, including phones.
  • Voice interfaces for LLMs. Whisper feeds speech into LLMs; downstream voice products (ChatGPT's original voice mode among them) used Whisper for the ASR leg.
  • Real-time captioning, podcasts, accessibility. A new generation of products.
  • Speech LLM training data. Whisper-transcribed audio is the source of much of the speech text used to pretrain newer speech-aware LLMs.

What's next

The post-Whisper era is moving toward end-to-end speech LLMs (GPT-4o, Gemini 1.5, Moshi, AudioLM) that ingest raw audio and produce raw audio without going through text. Whisper's encoder-decoder split is being subsumed by Transformers that mix audio and text tokens in a single sequence. But Whisper's data approach — weakly-supervised pretraining at 100K+ hour scale — is the recipe everyone copies.

Released under the MIT License. Content imported and adapted from NoteNextra.