Classifier-Free Guidance
Classifier-Free Diffusion Guidance (Ho & Salimans, NeurIPS-W 2021) is the technique that made conditional diffusion actually work for text-to-image. Every modern diffusion-based generator — Stable Diffusion, DALL·E 2, Imagen, Sora — uses CFG. The idea is two pages of math, but its practical impact is hard to overstate: without CFG, text-conditioned diffusion produces blurry, prompt-ignoring outputs.
The conditioning problem
A conditional diffusion model wants to sample from p(x | c), where c is the condition (a class label or a text prompt). The naive approach is to feed c into the denoiser and train ε_θ(x_t, c) as usual. This works but produces samples that insufficiently respect the condition: naive conditional diffusion follows the data distribution faithfully, including the parts of it that don't strongly depend on c.
Classifier guidance
The first fix, classifier guidance (Dhariwal & Nichol, NeurIPS 2021), used Bayes' rule:

∇_{x_t} log p(x_t | c) = ∇_{x_t} log p(x_t) + ∇_{x_t} log p(c | x_t)

Train an unconditional diffusion model and a separate noise-aware classifier p_φ(c | x_t), then add the scaled classifier gradient to the model's prediction at each sampling step:

ε̂(x_t, c) = ε_θ(x_t) − s σ_t ∇_{x_t} log p_φ(c | x_t)

The guidance scale s > 1 over-weights the classifier term, trading diversity for fidelity. The catch: you need a classifier that works on noisy images at every noise level, trained separately from the diffusion model.
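Under this parametrisation, the guided step is a one-line adjustment to the unconditional noise prediction. A minimal numpy sketch (function name and shapes are illustrative, not from the paper's code):

```python
import numpy as np

def classifier_guided_eps(eps_uncond, grad_log_p_c, sigma_t, s):
    """eps_hat = eps_theta(x_t) - s * sigma_t * grad_x log p_phi(c | x_t).

    eps_uncond:   unconditional noise prediction, e.g. shape (H, W, C)
    grad_log_p_c: classifier gradient w.r.t. x_t, same shape
    sigma_t:      noise level at the current step
    s:            guidance scale (s > 1 strengthens the condition)
    """
    return eps_uncond - s * sigma_t * grad_log_p_c

# With s = 0 the classifier is ignored and plain unconditional
# sampling is recovered.
eps = np.zeros((4, 4, 3))
grad = np.ones((4, 4, 3))
guided = classifier_guided_eps(eps, grad, sigma_t=0.5, s=2.0)
```

Note that the guided prediction moves opposite to the classifier gradient in ε-space, which corresponds to moving x_t toward higher classifier likelihood.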
Classifier-free guidance
Ho and Salimans realised the classifier could be derived implicitly from the diffusion model itself, via Bayes' rule in reverse:

∇_{x_t} log p(c | x_t) = ∇_{x_t} log p(x_t | c) − ∇_{x_t} log p(x_t)

Train a single network ε_θ that predicts noise both with and without the condition (achieved by randomly dropping the condition, replacing it with a learned null embedding ∅, for ~10–20% of training examples). At sampling time, combine the two predictions:
ε̃(x_t, c) = ε_θ(x_t, ∅) + w · (ε_θ(x_t, c) − ε_θ(x_t, ∅))

The CFG scale w controls the strength:
- w = 1 — pure conditional sampling, no boost.
- w ≈ 5–10 — typical text-to-image regime; samples are sharp and prompt-adherent.
- w ≫ 10 — sampling collapses to the highest-likelihood mode, often producing oversaturated, nonsensical images.
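Both halves of the recipe fit in a few lines. A numpy sketch under assumed conventions (a null-token id of 0 and illustrative function names), showing the train-time condition dropout and the sampling-time combination:

```python
import numpy as np

NULL_ID = 0  # assumed id of the learned "unconditional" embedding

def drop_conditions(cond_ids, p_drop=0.1, rng=None):
    """Training: replace ~p_drop of the condition ids with the null id,
    so one network learns both eps(x_t, c) and eps(x_t, null)."""
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(cond_ids.shape) < p_drop
    return np.where(mask, NULL_ID, cond_ids)

def cfg_combine(eps_cond, eps_uncond, w):
    """Sampling: eps_uncond + w * (eps_cond - eps_uncond)."""
    return eps_uncond + w * (eps_cond - eps_uncond)
```

At w = 1 the unconditional term cancels and the pure conditional prediction comes back; larger w extrapolates past it, which is where the sharpening comes from.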
Why CFG works so well
Mathematically, CFG samples from a sharpened distribution proportional to p(x_t | c) · p(c | x_t)^(w−1): it up-weights regions where the implicit classifier (the ratio of conditional to unconditional density) is high. Practically:
- Single network — no separate classifier to train.
- Sharp outputs — high w produces images that strongly match the prompt.
- Trade-off knob — w is set at inference time and trades sample diversity for prompt adherence. Run the same prompt at a low and a high scale (say w = 1 versus w = 10) to see the trade-off in action.
The cost: two forward passes per denoising step (one with condition, one without). Modern systems batch the two for efficient inference.
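The batching trick is to stack the conditional and unconditional inputs into one batch of size 2N. A sketch with a stand-in `model` callable (a real system would call its denoising network here):

```python
import numpy as np

def cfg_step_batched(model, x_t, cond_emb, null_emb, w):
    """One guided noise prediction via a single forward pass over 2N inputs."""
    x2 = np.concatenate([x_t, x_t], axis=0)
    c2 = np.concatenate([cond_emb, null_emb], axis=0)
    eps = model(x2, c2)                       # one call instead of two
    eps_cond, eps_uncond = np.split(eps, 2, axis=0)
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy "denoiser" that predicts x + c, so the arithmetic is easy to check.
toy = lambda x, c: x + c
x = np.ones((2, 3))
cond = np.full((2, 3), 2.0)
null = np.zeros((2, 3))
out = cfg_step_batched(toy, x, cond, null, w=3.0)
```

Doubling the batch trades memory for latency: one wide forward pass usually beats two sequential ones on a GPU.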
Negative prompts
CFG generalises naturally to negative prompts — content the user wants the model to avoid. Replace the unconditional pass with a pass conditioned on a negative prompt c_neg:

ε̃(x_t, c) = ε_θ(x_t, c_neg) + w · (ε_θ(x_t, c) − ε_θ(x_t, c_neg))

The model is pushed toward c and away from c_neg.
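In code the change from plain CFG is a single substitution: the "away" prediction comes from the negative prompt's embedding rather than the null embedding. A sketch with illustrative names:

```python
import numpy as np

def cfg_negative(eps_pos, eps_neg, w):
    """eps_neg + w * (eps_pos - eps_neg): toward the prompt, away from
    the negative prompt. Passing the unconditional prediction as
    eps_neg recovers ordinary CFG."""
    return eps_neg + w * (eps_pos - eps_neg)
```

This is why negative prompts cost nothing extra: the second forward pass was already being made for the unconditional branch.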
Limitations and follow-ups
- Computational cost. Two forward passes per step. Distillation work (Meng et al., 2022) compresses CFG into a single pass.
- Saturation at high w. Large scales push outputs toward unrealistic over-saturation; modern guidance variants (rescaled guidance, dynamic thresholding) address this.
- Doesn't always help text. Classifier-free guidance helps less in language modelling, where the analogous role of guided sampling is played by RLHF or rejection sampling.
What CFG is for, today
CFG is now a default ingredient of every text-to-image and text-to-video diffusion system. Its conceptual descendants include:
- Image-conditional guidance — guide on a reference image as well as text.
- Multi-condition guidance — combine several conditions with weighted CFG.
- Inversion + guidance for editing — inject CFG during DDIM inversion to edit existing images.
A two-page paper that became a load-bearing piece of every modern generative-image stack.
What to read next
- DDPM & Score-Based Models — the underlying generative paradigm.
- Latent Diffusion — CFG in latent space, the Stable Diffusion recipe.
- DALL·E 2 / Imagen — frontier T2I systems built on CFG.