Variational Autoencoders
The Variational Autoencoder (VAE) is the probabilistic relative of the vanilla autoencoder. It learns a generative model $p_\theta(x)$ of the data, not just a compression of it, by pairing a decoder network with an approximate-inference encoder.
The setup
We posit a latent variable $z$ with prior $p(z) = \mathcal{N}(0, I)$ and a decoder $p_\theta(x \mid z)$. The marginal likelihood

$$p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz$$

is intractable. Introduce an encoder $q_\phi(z \mid x)$ and maximise the evidence lower bound (ELBO):

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big) \le \log p_\theta(x)$$
Maximising the ELBO simultaneously fits a likelihood model (the first, reconstruction term) and an encoder that approximates the true posterior: the gap between the ELBO and $\log p_\theta(x)$ is exactly $\mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)$.
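For the usual diagonal-Gaussian encoder $q_\phi(z \mid x) = \mathcal{N}\big(\mu_\phi(x), \mathrm{diag}(\sigma_\phi^2(x))\big)$ and standard-normal prior, the KL term has a closed form, which is what implementations typically compute:

$$\mathrm{KL}\big(\mathcal{N}(\mu, \mathrm{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I)\big) = \frac{1}{2} \sum_{j=1}^{d} \big(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\big)$$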
The reparameterisation trick
Naively, the expectation in the ELBO requires sampling $z \sim q_\phi(z \mid x)$, and gradients cannot flow through a sampling operation back to $\phi$. The trick is to rewrite the sample as

$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

Now the randomness enters as an external input: $z$ is a deterministic, differentiable function of $\phi$, so ordinary backpropagation gives a low-variance gradient estimate.
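A minimal PyTorch sketch of the trick and the resulting (negative) ELBO loss; the architecture and dimensions are illustrative, not from any particular implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Minimal MLP VAE with a diagonal-Gaussian posterior."""
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        self.enc = nn.Linear(x_dim, h_dim)
        self.mu = nn.Linear(h_dim, z_dim)      # posterior mean
        self.logvar = nn.Linear(h_dim, z_dim)  # posterior log-variance
        self.dec = nn.Sequential(
            nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim))

    def reparameterise(self, mu, logvar):
        # z = mu + sigma * eps: eps is the only source of randomness,
        # so gradients flow through mu and sigma to the encoder weights.
        eps = torch.randn_like(mu)
        return mu + torch.exp(0.5 * logvar) * eps

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = self.reparameterise(mu, logvar)
        return self.dec(z), mu, logvar

def elbo_loss(x_logits, x, mu, logvar):
    # Negative ELBO: Bernoulli reconstruction term (targets in [0, 1])
    # plus the closed-form Gaussian KL from above.
    recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```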
What VAEs learn
The KL term acts as an information-theoretic regulariser, pulling the posterior $q_\phi(z \mid x)$ toward the prior $p(z)$:
- Smooth latent space — interpolations between two encoded points decode to plausible interpolations of the data, unlike vanilla autoencoders.
- Sampling works — drawing $z \sim \mathcal{N}(0, I)$ and decoding produces samples from the learned distribution. (Sample quality on natural images is markedly worse than GANs or diffusion, but the probabilistic guarantee is what VAEs gave the field.)
- Posterior collapse — when the decoder is very expressive, the model can learn to ignore $z$ and put all the information into the decoder, letting the posterior collapse to the prior. This is a recurring failure mode, addressed by KL annealing, free bits, and stronger encoders (see the sketch after this list).
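A sketch of the two most common mitigations; the warm-up length and free-bits floor are illustrative constants, and the per-dimension KL reuses the closed form above:

```python
import torch

def annealed_kl_weight(step, warmup_steps=10_000):
    # KL annealing: ramp the KL coefficient from 0 to 1 so the decoder
    # learns to use z before the prior starts pulling hard.
    return min(1.0, step / warmup_steps)

def free_bits_kl(mu, logvar, free_bits=0.5):
    # Free bits: clamp each latent dimension's average KL at a floor,
    # so the optimiser gains nothing by collapsing a dimension all the
    # way to the prior.
    kl_per_dim = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp())  # [B, d]
    return torch.clamp(kl_per_dim.mean(dim=0), min=free_bits).sum()
```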
β-VAE and disentanglement
β-VAE (Higgins et al., ICLR 2017) multiplies the KL term by a coefficient $\beta$:

$$\mathcal{L}_\beta = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \beta\, \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$

Larger $\beta$ (typically $\beta > 1$) tightens the information bottleneck, encouraging latent dimensions that disentangle independent factors of variation, at the cost of reconstruction quality.
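In code this is a one-line change to the loss. A sketch reusing the shapes from the VAE above (the default $\beta = 4$ is an illustrative value, not a universal recommendation):

```python
import torch
import torch.nn.functional as F

def beta_elbo_loss(x_logits, x, mu, logvar, beta=4.0):
    # Same negative ELBO as before, with the KL term scaled by beta.
    recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```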
VQ-VAE — discrete latents
Neural Discrete Representation Learning (van den Oord, Vinyals, Kavukcuoglu, NeurIPS 2017) replaces the Gaussian latent with a codebook of discrete vectors. The encoder output is snapped to the nearest codebook entry; gradients are passed through the non-differentiable snap with a straight-through estimator. Discrete latents enable downstream autoregressive priors (like the dVAE in DALL·E), underpin neural audio codecs in the WaveNet/SoundStream lineage, and power the video tokenisation in modern video models.
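A minimal sketch of the quantisation step and its straight-through gradient; the codebook size, dimensions, and commitment weight are illustrative:

```python
import torch
import torch.nn as nn

class VectorQuantiser(nn.Module):
    """Snap encoder outputs to nearest codebook entries (VQ-VAE style)."""
    def __init__(self, num_codes=512, code_dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.beta = beta  # commitment-loss weight

    def forward(self, z_e):  # z_e: [batch, code_dim] encoder output
        # Nearest codebook entry by Euclidean distance.
        dists = torch.cdist(z_e, self.codebook.weight)  # [batch, num_codes]
        idx = dists.argmin(dim=1)
        z_q = self.codebook(idx)

        # Codebook loss moves entries toward encoder outputs; the
        # commitment loss keeps encoder outputs near their entries.
        codebook_loss = (z_q - z_e.detach()).pow(2).mean()
        commit_loss = (z_e - z_q.detach()).pow(2).mean()

        # Straight-through estimator: forward pass uses z_q, backward
        # pass copies gradients from z_q to z_e unchanged.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, idx, codebook_loss + self.beta * commit_loss
```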
Why VAEs lost on samples but won on infrastructure
VAE samples on natural images are notoriously blurry — the Gaussian likelihood averages over details the encoder can't perfectly reconstruct, penalising sharp boundaries. GANs and diffusion produce visibly sharper outputs. But VAEs persist as the latent-space stage of latent diffusion models (Stable Diffusion uses a VAE-style autoencoder to compress images into a lower-dimensional latent space), as the discrete tokeniser in image-generating Transformers (DALL·E's dVAE), and as a probabilistic baseline wherever you want a likelihood, sampling, and a smooth latent — all in one model.
What to read next
- Generative Adversarial Networks — sharper samples without an explicit likelihood.
- Normalizing Flows — exact-likelihood generative models.
- Autoencoders — the deterministic predecessor.