DDPM & Score-Based Models
Denoising Diffusion Probabilistic Models (Ho, Jain, Abbeel, NeurIPS 2020) and Score-Based Generative Modeling through Stochastic Differential Equations (Song et al., ICLR 2021) crystallised diffusion as a generative paradigm. The two papers showed, from independent angles, that learning to denoise images at multiple noise scales produces a generative model that rivals GANs on sample quality while admitting a tractable likelihood bound. Within about two years, diffusion had taken over image generation; it is now the dominant paradigm for images, video, and (increasingly) other modalities.
The forward process
Define a fixed Markov chain that adds Gaussian noise to a clean image $x_0$:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big),$$

with a small noise schedule $\beta_1 < \dots < \beta_T$ (linear from $10^{-4}$ to $0.02$ over $T = 1000$ steps in the original paper).

A useful identity: thanks to the Gaussian closure of the chain, with $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$,

$$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t) I\big).$$

You can sample $x_t$ for any $t$ in a single step: $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$.
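In code, the closed-form marginal makes noisy training pairs cheap to generate. A minimal NumPy sketch, using the original paper's linear schedule (function names are mine, not from the paper):

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    # Linear beta schedule from the DDPM paper; returns betas and the
    # cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s).
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    # Draw x_t ~ q(x_t | x_0) in one step via the Gaussian closure.
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps  # return the noise too: it is the regression target
```

Note that $\bar\alpha_t$ decays toward zero, so late-timestep samples are almost pure noise.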
The reverse process
To sample, learn the reverse transitions

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\big).$$

The variance is fixed ($\sigma_t^2 = \beta_t$, or the posterior variance $\tilde\beta_t$); the mean is learned. Sample by starting from $x_T \sim \mathcal{N}(0, I)$ and iterating $t = T, T-1, \dots, 1$.
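Put together, ancestral sampling is a simple loop. A sketch assuming a callable `eps_model(x, t)` that predicts the injected noise (a hypothetical interface), with the fixed-variance choice $\sigma_t^2 = \beta_t$:

```python
import numpy as np

def ddpm_sample(eps_model, shape, betas, alpha_bars, rng):
    # Ancestral sampling: start from pure noise x_T ~ N(0, I),
    # then iterate t = T, ..., 1.
    x = rng.standard_normal(shape)
    for t in range(len(betas) - 1, -1, -1):
        eps = eps_model(x, t)
        # Mean of p_theta(x_{t-1} | x_t), written via the predicted noise:
        # mu = (x - beta_t / sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) \
               / np.sqrt(1.0 - betas[t])
        if t > 0:
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(shape)
        else:
            x = mean  # no noise is added at the final step
    return x
```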
The simplified loss
The training objective derives from the variational bound on $-\log p_\theta(x_0)$. Ho et al. showed it simplifies (up to per-timestep weighting) to

$$L_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\varepsilon}\Big[\big\lVert \varepsilon - \varepsilon_\theta\big(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\varepsilon,\ t\big)\big\rVert^2\Big]:$$

a weighted MSE on noise prediction. This is the loss every modern diffusion model uses (with various weighting schemes).
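One training step is then a one-liner of algebra. A NumPy sketch with a placeholder `eps_model` (a real implementation would backpropagate through the network):

```python
import numpy as np

def simple_loss(eps_model, x0, alpha_bars, rng):
    # L_simple: noise-prediction MSE at a uniformly random timestep.
    t = int(rng.integers(len(alpha_bars)))
    eps = rng.standard_normal(x0.shape)
    # Corrupt x0 to x_t in one step via the closed-form marginal.
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return np.mean((eps - eps_model(xt, t)) ** 2)
```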
Why this works: score matching
Song et al.'s independent line of work cast diffusion as score-based generative modelling. The score is $\nabla_x \log p(x)$, the gradient of the log-density; a network trained to predict the added noise is, up to a scale factor, estimating the score of the noised data, since $\nabla_{x_t} \log q(x_t \mid x_0) = -\varepsilon / \sqrt{1-\bar\alpha_t}$.
Sampling is then equivalent to running Langevin dynamics along the learned score. The connection unifies DDPM with the contemporaneous NCSN (Song & Ermon, 2019) and gives diffusion a clean foundation in stochastic-differential-equation theory.
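To make the Langevin connection concrete, here is unadjusted Langevin dynamics for a known score; NCSN anneals the noise scale down and uses the learned, noise-conditional score instead. This is an illustrative sketch, not the annealed sampler from the paper:

```python
import numpy as np

def langevin(score_fn, x_init, step_size=1e-2, n_steps=1000, rng=None):
    # x <- x + (step/2) * score(x) + sqrt(step) * z,  z ~ N(0, I)
    rng = rng or np.random.default_rng(0)
    x = np.array(x_init, dtype=float)
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)
        x = x + 0.5 * step_size * score_fn(x) + np.sqrt(step_size) * z
    return x

# For a standard Gaussian target, the score is analytic: score(x) = -x,
# so chains initialised far away drift back to N(0, 1).
```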
The continuous-time view (Song et al., 2021) parameterises the forward process as an SDE; the reverse-time SDE has a clean form involving the score. Deterministic sampling along the corresponding probability-flow ODE, closely related to Denoising Diffusion Implicit Models (DDIM; Song, Meng & Ermon, 2021), follows from this view and gives much faster sampling.
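A single deterministic DDIM update ($\eta = 0$) first predicts $x_0$ from the current sample and noise estimate, then jumps directly to an earlier timestep. A sketch:

```python
import numpy as np

def ddim_step(x_t, eps_pred, a_bar_t, a_bar_prev):
    # Predict x0 by inverting x_t = sqrt(a_bar_t) x0 + sqrt(1 - a_bar_t) eps,
    # then move to the earlier timestep deterministically (no fresh noise).
    x0_pred = (x_t - np.sqrt(1.0 - a_bar_t) * eps_pred) / np.sqrt(a_bar_t)
    return np.sqrt(a_bar_prev) * x0_pred + np.sqrt(1.0 - a_bar_prev) * eps_pred
```

Because no noise is injected, large jumps between timesteps are possible, which is what enables sampling in tens of steps instead of a thousand.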
Architecture
The denoising network is typically a U-Net (Ronneberger et al., 2015) with:
- Convolutional blocks at multiple resolutions.
- Self-attention at lower resolutions to capture global structure.
- Time-step embedding (sinusoidal + MLP) injected via FiLM-style modulation into every block.
Modern systems use Diffusion Transformers (DiT) — pure Transformers operating on patch tokens — for higher quality at scale.
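The sinusoidal time-step embedding in the list above follows the Transformer positional-encoding recipe; a sketch (the dimension and base period are common defaults, not mandated by the paper):

```python
import numpy as np

def timestep_embedding(t, dim=128, max_period=10000.0):
    # Sinusoidal features of a scalar timestep t at geometrically spaced
    # frequencies; an MLP on top of this vector is what FiLM-modulates
    # each U-Net block.
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])
```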
Why diffusion beat GANs
Three structural advantages:
- Stable training. No min-max game; just MSE on noise. Mode collapse and saturation are non-issues.
- High sample diversity. Multi-step generation explores the data distribution rather than collapsing to a few modes.
- Tractable likelihood / coverage. The variational bound gives a principled likelihood; samples cover the data distribution rather than concentrating on a few high-density regions.
The downside: 50–1000 forward passes per sample (vs a GAN's single pass). DDIM, progressive distillation (Salimans & Ho, 2022), consistency models (Song et al., 2023), and rectified flow have steadily reduced this cost.
What diffusion enabled
- Photorealistic text-to-image. DALL·E 2, Imagen, Stable Diffusion (latent diffusion).
- Video generation. Sora, Wan, Runway.
- 3D generation. Diffusion priors over neural fields, Gaussian splats, or meshes.
- Audio, molecules, protein structures. Same mathematical machinery, different data.
DDPM is to modern generative AI what AlexNet was to vision: the proof of concept that triggered everything else.
What to read next
- Latent Diffusion — diffusion in a learned compressed space.
- Classifier-Free Guidance — the conditioning trick.
- DALL·E 2 / Imagen — text-to-image diffusion at frontier scale.