Autoencoders & Denoising AEs
An autoencoder is a network trained to reconstruct its own input through a low-dimensional bottleneck. The bottleneck forces the model to compress the data into a useful latent representation; the reconstruction loss pushes that compression to be as close to lossless as possible on the training distribution. Autoencoders are the structural ancestor of VAEs, MAE, and the encoder-decoder denoiser at the heart of latent-diffusion image generators.
The basic autoencoder
Given an input $x \in \mathbb{R}^d$, the encoder produces a latent code $z = f_\theta(x) \in \mathbb{R}^k$ and the decoder maps it back to a reconstruction $\hat{x} = g_\phi(z)$; training minimises a reconstruction loss such as $\|x - \hat{x}\|_2^2$ over the dataset.
The bottleneck dimension $k$ is chosen smaller than the input dimension $d$, so the network cannot simply copy its input and must instead learn which structure in the data is worth keeping.
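A minimal sketch of this setup in PyTorch; the layer widths, 784-dimensional input, and 32-dimensional bottleneck are illustrative choices, not anything prescribed here:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Minimal fully connected autoencoder: x -> z -> x_hat."""
    def __init__(self, d_in=784, d_latent=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(d_in, 256), nn.ReLU(),
            nn.Linear(256, d_latent),            # bottleneck code z
        )
        self.decoder = nn.Sequential(
            nn.Linear(d_latent, 256), nn.ReLU(),
            nn.Linear(256, d_in),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# One training step: reconstruct the input, minimise squared error.
model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)                          # stand-in batch; real data goes here
loss = ((model(x) - x) ** 2).mean()
opt.zero_grad()
loss.backward()
opt.step()
```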
Denoising autoencoders
Extracting and Composing Robust Features with Denoising Autoencoders (Vincent, Larochelle, Bengio, Manzagol, ICML 2008) trains an autoencoder to reconstruct the clean input $x$ from a corrupted version $\tilde{x} \sim q(\tilde{x} \mid x)$; the loss compares the reconstruction $g_\phi(f_\theta(\tilde{x}))$ against the clean $x$, not against the corrupted input the network actually sees.
Common corruptions: Gaussian noise, salt-and-pepper noise, masking. Forcing the model to "undo" the corruption eliminates the trivial identity-map solution and makes the bottleneck unnecessary: DAEs work even when the latent code is as wide as, or wider than, the input.
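A sketch of one DAE training step with Gaussian corruption, reusing the Autoencoder and optimiser from the sketch above; the noise level sigma is an illustrative choice:

```python
def dae_step(model, opt, x, sigma=0.3):
    """One denoising-autoencoder update: corrupt the input, reconstruct the clean target."""
    x_tilde = x + sigma * torch.randn_like(x)    # Gaussian corruption q(x_tilde | x)
    # masking corruption would instead zero a random subset: x * (torch.rand_like(x) > p)
    x_hat = model(x_tilde)
    loss = ((x_hat - x) ** 2).mean()             # loss is against the clean x, not x_tilde
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```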
The deeper result (Vincent 2011, Alain & Bengio 2014) is that a denoising autoencoder trained on Gaussian noise of small variance $\sigma^2$ learns a reconstruction function $r = g_\phi \circ f_\theta$ whose residual approximates the score of the data distribution:

$$r(x) - x \approx \sigma^2 \, \nabla_x \log p(x).$$
This is the connection to score-based and diffusion models — denoising at multiple noise scales is the entire training objective of DDPM and friends.
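As an illustration of that relation (not anything from the papers' code), a trained DAE can be read as a rough score estimator, assuming sigma matches the noise level used in training:

```python
def score_estimate(model, x, sigma=0.1):
    """Rough estimate of grad_x log p(x) from a trained denoiser,
    via the relation r(x) - x ~ sigma^2 * score (Alain & Bengio, 2014)."""
    with torch.no_grad():
        return (model(x) - x) / sigma ** 2
```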
Sparse, contractive, and stacked autoencoders
The 2008–2013 literature explored several variants to encourage useful latent structure:
- Sparse autoencoders — add an L1 (or KL-from-Bernoulli) penalty on hidden activations, forcing each input to use only a few latent dimensions.
- Contractive autoencoders — add a Frobenius-norm penalty on the encoder Jacobian, $\|\partial f_\theta(x)/\partial x\|_F^2$, encouraging the encoder to be insensitive to input perturbations except along the data manifold (a penalty sketch follows after this list).
- Stacked autoencoders — pretrain layer-by-layer (greedy unsupervised pretraining), then fine-tune. This was the dominant deep-learning recipe in 2007–2010, before ReLU and good initialisation made direct supervised training work without pretraining.
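A sketch of the sparse and contractive penalty terms; the encoder interface and the weights lam are assumptions for illustration, and the exact-Jacobian version is only practical for small models:

```python
import torch

def sparse_penalty(z, lam=1e-3):
    """L1 sparsity penalty on the latent activations z (sparse autoencoder)."""
    return lam * z.abs().mean()

def contractive_penalty(encoder, x, lam=1e-3):
    """Squared Frobenius norm of the encoder Jacobian dz/dx (contractive autoencoder)."""
    jac = torch.autograd.functional.jacobian(encoder, x)   # exact Jacobian; expensive
    return lam * (jac ** 2).sum()

# Usage: add one of these terms to the reconstruction loss before calling backward(), e.g.
#   loss = ((x_hat - x) ** 2).mean() + sparse_penalty(z)
```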
Most of these ideas survive in spirit (sparsity penalties, manifold-respecting representations) but rarely as named pipelines. The descendants people actually use today are VAEs, MAEs, and the denoiser inside latent diffusion.
What modern systems use AEs for
- Latent-diffusion models (Stable Diffusion etc.) — train a VAE-style autoencoder once, then run all subsequent diffusion in the compact latent space. The autoencoder is the reason diffusion is computationally tractable at high resolution.
- MAE (see representation learning) — an asymmetric denoising autoencoder where the corruption is masking 75% of patches.
- Anomaly detection — train on in-distribution data; high reconstruction error at test time flags outliers (see the sketch after this list).
- Compression — neural image and video codecs (Ballé et al.) are autoencoders with a learned entropy model on the latents.
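A sketch of the anomaly-detection recipe; model is any autoencoder trained on in-distribution data, x_train and x_test are assumed tensors, and the 0.99 quantile threshold is an illustrative choice:

```python
def reconstruction_error(model, x):
    """Per-example reconstruction error; large values suggest out-of-distribution inputs."""
    with torch.no_grad():
        return ((model(x) - x) ** 2).mean(dim=-1)

# Calibrate a threshold on held-out in-distribution data, then flag test-time outliers.
threshold = reconstruction_error(model, x_train).quantile(0.99)
is_anomaly = reconstruction_error(model, x_test) > threshold
```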
What to read next
- Variational Autoencoders — the probabilistic generative version.
- Generative Adversarial Networks — the alternative approach to learning a sampler.
- PixelRNN / PixelCNN — autoregressive image modelling, a third axis in the generative-model space.