DDPM & Score-Based Models
Denoising Diffusion Probabilistic Models (Ho, Jain, Abbeel, NeurIPS 2020) and Score-Based Generative Modeling through Stochastic Differential Equations (Song et al., ICLR 2021) crystallised diffusion as a generative paradigm. The two papers showed, from independent angles, that learning to denoise images at multiple noise scales produces a generative model that rivals GANs on sample quality while admitting a tractable likelihood bound. Within about two years, diffusion had taken over image generation; it is now the dominant paradigm for images, video, and (increasingly) other modalities.
The forward process
Define a fixed Markov chain that adds Gaussian noise to a clean image $x_0$:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big),$$

with a small noise schedule $\beta_1 < \dots < \beta_T$ (linear from $10^{-4}$ to $0.02$ over $T = 1000$ steps in the original paper).

A useful identity: thanks to the Gaussian closure of the chain, with $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$,

$$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t) I\big).$$

You can sample $x_t$ for any $t$ in a single step: $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$.
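In code, the closed-form marginal makes noisy training pairs cheap to generate. A minimal NumPy sketch, using the original paper's linear schedule (function names are mine, not from the paper):

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    # Linear beta schedule from the DDPM paper; returns betas and the
    # cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s).
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    # Draw x_t ~ q(x_t | x_0) in one step via the Gaussian closure.
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps  # return the noise too: it is the regression target
```

Note that $\bar\alpha_t$ decays toward zero, so late-timestep samples are almost pure noise.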
The reverse process
To sample, learn the reverse transitions

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\big).$$

The variance is fixed ($\sigma_t^2 = \beta_t$, or the posterior variance $\tilde\beta_t$); the mean is learned. Sample by starting from $x_T \sim \mathcal{N}(0, I)$ and iterating $t = T, T-1, \dots, 1$.
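Put together, ancestral sampling is a simple loop. A sketch assuming a callable `eps_model(x, t)` that predicts the injected noise (a hypothetical interface), with the fixed-variance choice $\sigma_t^2 = \beta_t$:

```python
import numpy as np

def ddpm_sample(eps_model, shape, betas, alpha_bars, rng):
    # Ancestral sampling: start from pure noise x_T ~ N(0, I),
    # then iterate t = T, ..., 1.
    x = rng.standard_normal(shape)
    for t in range(len(betas) - 1, -1, -1):
        eps = eps_model(x, t)
        # Mean of p_theta(x_{t-1} | x_t), written via the predicted noise:
        # mu = (x - beta_t / sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) \
               / np.sqrt(1.0 - betas[t])
        if t > 0:
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(shape)
        else:
            x = mean  # no noise is added at the final step
    return x
```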
The simplified loss
The training objective derives from the variational bound on $-\log p_\theta(x_0)$. Ho et al. showed it simplifies (up to per-timestep weighting) to

$$L_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\varepsilon}\Big[\big\lVert \varepsilon - \varepsilon_\theta\big(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\varepsilon,\ t\big)\big\rVert^2\Big]:$$

a weighted MSE on noise prediction. This is the loss every modern diffusion model uses (with various weighting schemes).
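One training step is then a one-liner of algebra. A NumPy sketch with a placeholder `eps_model` (a real implementation would backpropagate through the network):

```python
import numpy as np

def simple_loss(eps_model, x0, alpha_bars, rng):
    # L_simple: noise-prediction MSE at a uniformly random timestep.
    t = int(rng.integers(len(alpha_bars)))
    eps = rng.standard_normal(x0.shape)
    # Corrupt x0 to x_t in one step via the closed-form marginal.
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return np.mean((eps - eps_model(xt, t)) ** 2)
```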
Why this works: score matching
Song et al.'s independent line of work cast diffusion as score-based generative modelling. The score is $\nabla_x \log p(x)$, the gradient of the log-density; a network trained to predict the added noise is, up to a scale factor, estimating the score of the noised data, since $\nabla_{x_t} \log q(x_t \mid x_0) = -\varepsilon / \sqrt{1-\bar\alpha_t}$.
Sampling is then equivalent to running Langevin dynamics along the learned score. The connection unifies DDPM with the contemporaneous NCSN (Song & Ermon, 2019) and gives diffusion a clean foundation in stochastic-differential-equation theory.
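To make the Langevin connection concrete, here is unadjusted Langevin dynamics for a known score; NCSN anneals the noise scale down and uses the learned, noise-conditional score instead. This is an illustrative sketch, not the annealed sampler from the paper:

```python
import numpy as np

def langevin(score_fn, x_init, step_size=1e-2, n_steps=1000, rng=None):
    # x <- x + (step/2) * score(x) + sqrt(step) * z,  z ~ N(0, I)
    rng = rng or np.random.default_rng(0)
    x = np.array(x_init, dtype=float)
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)
        x = x + 0.5 * step_size * score_fn(x) + np.sqrt(step_size) * z
    return x

# For a standard Gaussian target, the score is analytic: score(x) = -x,
# so chains initialised far away drift back to N(0, 1).
```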
The continuous-time view (Song et al., 2021) parameterises the forward process as an SDE; the reverse-time SDE has a clean form involving the score. Deterministic sampling along the corresponding probability-flow ODE, closely related to Denoising Diffusion Implicit Models (DDIM; Song, Meng & Ermon, 2021), follows from this view and gives much faster sampling.
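A single deterministic DDIM update ($\eta = 0$) first predicts $x_0$ from the current sample and noise estimate, then jumps directly to an earlier timestep. A sketch:

```python
import numpy as np

def ddim_step(x_t, eps_pred, a_bar_t, a_bar_prev):
    # Predict x0 by inverting x_t = sqrt(a_bar_t) x0 + sqrt(1 - a_bar_t) eps,
    # then move to the earlier timestep deterministically (no fresh noise).
    x0_pred = (x_t - np.sqrt(1.0 - a_bar_t) * eps_pred) / np.sqrt(a_bar_t)
    return np.sqrt(a_bar_prev) * x0_pred + np.sqrt(1.0 - a_bar_prev) * eps_pred
```

Because no noise is injected, large jumps between timesteps are possible, which is what enables sampling in tens of steps instead of a thousand.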
Architecture
The denoising network is typically a U-Net (Ronneberger et al., 2015) with:
- Convolutional blocks at multiple resolutions.
- Self-attention at lower resolutions to capture global structure.
- Time-step embedding (sinusoidal + MLP) injected via FiLM-style modulation into every block.
Modern systems use Diffusion Transformers (DiT) — pure Transformers operating on patch tokens — for higher quality at scale.
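The sinusoidal time-step embedding in the list above follows the Transformer positional-encoding recipe; a sketch (the dimension and base period are common defaults, not mandated by the paper):

```python
import numpy as np

def timestep_embedding(t, dim=128, max_period=10000.0):
    # Sinusoidal features of a scalar timestep t at geometrically spaced
    # frequencies; an MLP on top of this vector is what FiLM-modulates
    # each U-Net block.
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])
```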
Why diffusion beat GANs
Three structural advantages:
- Stable training. No min-max game; just MSE on noise. Mode collapse and saturation are non-issues.
- High sample diversity. Multi-step generation explores the data distribution rather than collapsing to a few modes.
- Tractable likelihood / coverage. The variational bound gives a principled likelihood; samples cover the data distribution rather than concentrating on a few high-density regions.
The downside: 50–1000 forward passes per sample (vs a GAN's single pass). DDIM, progressive distillation (Salimans & Ho, 2022), consistency models (Song et al., 2023), and rectified flow have steadily reduced this cost.
What diffusion enabled
- Photorealistic text-to-image. DALL·E 2, Imagen, Stable Diffusion (latent diffusion).
- Video generation. Sora, Wan, Runway.
- 3D generation. Diffusion priors over neural fields, Gaussian splats, or meshes.
- Audio, molecules, protein structures. Same mathematical machinery, different data.
DDPM is to modern generative AI what AlexNet was to vision: the proof of concept that triggered everything else.
What to read next
- Latent Diffusion — diffusion in a learned compressed space.
- Classifier-Free Guidance — the conditioning trick.
- DALL·E 2 / Imagen — text-to-image diffusion at frontier scale.