
Variational Autoencoders

The Variational Autoencoder (VAE) is the probabilistic relative of the vanilla autoencoder. It learns a generative model $p_\theta(x, z) = p(z)\, p_\theta(x \mid z)$ that supports both reconstruction and ancestral sampling, by training a recognition network $q_\phi(z \mid x)$ to approximate the posterior. The two-network setup, the reparameterisation trick, and the ELBO objective are now standard machinery in many later models.

The setup

We posit a latent variable $z \sim p(z) = \mathcal{N}(0, I)$ and a likelihood $p_\theta(x \mid z)$ — a decoder network parameterising, e.g., a Gaussian over $x$. The marginal log-likelihood

$$\log p_\theta(x) = \log \int p_\theta(x \mid z)\, p(z)\, dz$$

is intractable. Introduce an encoder $q_\phi(z \mid x)$ — usually a Gaussian whose mean and log-variance are output by a network — and use Jensen's inequality to derive the Evidence Lower BOund (ELBO):

$$\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right).$$
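
The Jensen step in full: multiply and divide by $q_\phi(z \mid x)$ inside the integral, then move the logarithm inside the expectation,

$$\log \int p_\theta(x \mid z)\, p(z)\, dz \;=\; \log \mathbb{E}_{q_\phi(z \mid x)}\!\left[\frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\right] \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\right],$$

and splitting the log of the ratio gives exactly the reconstruction term minus the KL term above.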

Maximising the ELBO simultaneously fits a likelihood model (first term) and an encoder that approximates the true posterior (the gap is a KL).
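
Equivalently, the slack in the bound is itself a KL divergence (a standard identity):

$$\log p_\theta(x) - \mathrm{ELBO}(x) \;=\; \mathrm{KL}\!\left(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\right) \;\ge\; 0,$$

so the bound is tight exactly when the encoder matches the true posterior $p_\theta(z \mid x)$.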

The reparameterisation trick

Naively, $\nabla_\phi\, \mathbb{E}_{q_\phi(z \mid x)}[\cdot]$ requires REINFORCE-style, high-variance gradient estimators. Auto-Encoding Variational Bayes (Kingma and Welling, ICLR 2014) replaces sampling $z \sim q_\phi$ with a deterministic transformation of fixed-distribution noise:

$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I).$$

Now $\nabla_\phi$ flows through $\mu_\phi, \sigma_\phi$ to the encoder weights via standard backprop. This single trick is what made variational inference a tractable deep-learning method, and the reparameterisation idea generalises to discrete (Gumbel-Softmax), structured, and amortised settings.
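
A minimal PyTorch-style sketch of the reparameterised forward pass and the resulting (negative) ELBO loss. The `encoder` and `decoder` modules are assumptions, not defined here, and the decoder is taken to output Bernoulli logits over binarised pixels:

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    """z = mu + sigma * eps with eps ~ N(0, I); gradients flow into mu and logvar."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps

def negative_elbo(x, encoder, decoder):
    """Per-batch negative ELBO: reconstruction term plus analytic KL(q(z|x) || N(0, I))."""
    mu, logvar = encoder(x)          # assumed encoder returning mean and log-variance
    z = reparameterize(mu, logvar)
    x_logits = decoder(z)            # assumed decoder returning Bernoulli logits
    recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return (recon + kl) / x.size(0)  # average negative ELBO per example
```

Minimising this quantity with any standard optimiser maximises the ELBO; the only stochastic node, `eps`, carries no parameters, so backprop never differentiates through sampling.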

What VAEs learn

The KL term acts as an information-theoretic regulariser, pulling the posterior $q_\phi(z \mid x)$ toward the prior $p(z)$. Empirically:

  • Smooth latent space — interpolations between two encoded points decode to plausible interpolations of the data, unlike vanilla autoencoders.
  • Sampling works — drawing $z \sim p(z)$ and decoding produces samples from the learned distribution. (Quality on natural images is markedly worse than GANs or diffusion, but the probabilistic guarantee is what VAEs gave the field.)
  • Posterior collapse — when the decoder is very expressive, the model can learn to ignore $z$ and explain the data with the decoder alone, so $q_\phi(z \mid x)$ collapses to the prior. A recurring failure mode addressed by KL annealing, free-bits, and stronger encoders (a free-bits sketch follows this list).
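
The free-bits trick clamps the per-dimension KL from below, so the optimiser gains nothing by driving any dimension's KL all the way to zero. A sketch building on the loss above; the threshold `free_bits` is an illustrative hyperparameter, and implementations differ on whether the clamp is applied per example or per minibatch:

```python
import torch

def kl_free_bits(mu, logvar, free_bits=0.5):
    """KL(q(z|x) || N(0, I)) per latent dimension, clamped below at `free_bits` nats."""
    kl_per_dim = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp())  # shape (batch, latent_dim)
    return torch.clamp(kl_per_dim, min=free_bits).sum(dim=1).mean()
```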

β-VAE and disentanglement

β-VAE (Higgins et al., ICLR 2017) multiplies the KL term by $\beta > 1$:

$$\mathcal{L}_{\beta\text{-VAE}} = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] - \beta\, \mathrm{KL}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right).$$
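
In code, the change from the standard objective is a single weighting factor; a sketch reusing `recon` and `kl` from the ELBO snippet above (the default value of `beta` here is only illustrative):

```python
def beta_vae_loss(recon, kl, beta=4.0):
    """Negative beta-VAE objective; beta = 1 recovers the standard negative ELBO."""
    return recon + beta * kl
```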

Larger $\beta$ trades reconstruction quality for tighter posterior–prior alignment, which empirically encourages disentangled latent factors — individual latent dimensions corresponding to distinct generative factors of variation (rotation, scale, lighting). Subsequent work (Locatello et al., ICML 2019) showed that fully unsupervised disentanglement is fundamentally unidentifiable without inductive biases, but the qualitative behaviour holds.

VQ-VAE — discrete latents

Neural Discrete Representation Learning (van den Oord, Vinyals, Kavukcuoglu, NeurIPS 2017) replaces the Gaussian latent with a codebook of discrete vectors. The encoder output is snapped to the nearest codebook entry; gradients are passed through the non-differentiable quantisation with a straight-through estimator. Discrete latents enable downstream autoregressive priors (like the dVAE in DALL·E), underpin audio compression with WaveNet decoders and later codecs such as SoundStream, and power the video tokenisation in modern video models.
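
A minimal sketch of the quantisation step with the straight-through gradient; shapes and names are illustrative, and `commitment_cost` is a hyperparameter (values around 0.25 are common):

```python
import torch
import torch.nn.functional as F

def vector_quantize(z_e, codebook, commitment_cost=0.25):
    """Snap encoder outputs z_e (batch, d) to the nearest rows of codebook (K, d)."""
    distances = torch.cdist(z_e, codebook)   # (batch, K) Euclidean distances
    indices = distances.argmin(dim=1)        # index of nearest codebook entry
    z_q = codebook[indices]                  # quantised latents
    # Codebook loss pulls embeddings toward encoder outputs; commitment loss does the reverse.
    vq_loss = F.mse_loss(z_q, z_e.detach()) + commitment_cost * F.mse_loss(z_e, z_q.detach())
    # Straight-through estimator: forward pass uses z_q, backward pass copies gradients to z_e.
    z_q = z_e + (z_q - z_e).detach()
    return z_q, vq_loss, indices
```

The downstream autoregressive prior is then fit over the integer `indices`, not over the continuous encoder outputs.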

Why VAEs lost on samples but won on infrastructure

VAE samples on natural images are notoriously blurry — the pixel-wise Gaussian likelihood makes the decoder average over residual uncertainty, which washes out sharp edges. GANs and diffusion models produce visibly sharper outputs. But VAEs persist as the latent-space stage of latent diffusion models (Stable Diffusion uses a VAE-style autoencoder to compress images into a lower-dimensional latent space), as the discrete tokeniser in image-generating Transformers (DALL·E's dVAE), and as a probabilistic baseline wherever you want a likelihood, sampling, and a smooth latent space — all in one model.
