Gaussian Mixture Models & EM
A Gaussian Mixture Model (GMM) is a probabilistic version of k-means: instead of assigning each point to a single cluster, it models the data as a mixture of Gaussian densities and produces soft assignments. The Expectation-Maximization (EM) algorithm — used to fit GMMs — generalises far beyond clustering, becoming a universal recipe for maximum-likelihood estimation with latent variables.
The model
A GMM with $K$ components models the density of a point $x \in \mathbb{R}^d$ as

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),$$

with mixing weights $\pi_k \ge 0$, $\sum_{k=1}^{K} \pi_k = 1$.

The latent variable $z_n \in \{1, \dots, K\}$ records which component generated point $x_n$: $p(z_n = k) = \pi_k$ and $p(x_n \mid z_n = k) = \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$.
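To make the generative story concrete, here is a minimal sampling sketch in NumPy. The two-component parameters are hypothetical, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-component GMM in 2-D (illustrative parameters, not from the text).
pi = np.array([0.3, 0.7])                       # mixing weights, sum to 1
mu = np.array([[0.0, 0.0], [4.0, 4.0]])         # component means
Sigma = np.array([np.eye(2), 0.5 * np.eye(2)])  # component covariances

def sample_gmm(n):
    # Ancestral sampling: draw z_n ~ Categorical(pi), then x_n ~ N(mu_z, Sigma_z).
    z = rng.choice(len(pi), size=n, p=pi)
    X = np.array([rng.multivariate_normal(mu[k], Sigma[k]) for k in z])
    return X, z

X, z = sample_gmm(500)
```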
The EM algorithm
We want the maximum-likelihood parameters $\theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K}$, i.e. to maximise

$$\log p(X \mid \theta) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k).$$

The sum inside the log has no closed-form maximiser, so EM alternates two steps.

E-step. Compute the responsibilities — posterior probabilities of each component given the data:

$$\gamma_{nk} = p(z_n = k \mid x_n, \theta) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}.$$

M-step. Update parameters as if the responsibilities were fractional labels, with $N_k = \sum_n \gamma_{nk}$:

$$\mu_k = \frac{1}{N_k} \sum_{n} \gamma_{nk}\, x_n, \qquad \Sigma_k = \frac{1}{N_k} \sum_{n} \gamma_{nk}\, (x_n - \mu_k)(x_n - \mu_k)^\top, \qquad \pi_k = \frac{N_k}{N}.$$
Iterate until the log-likelihood converges.
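As a concrete reference, here is a minimal NumPy/SciPy sketch of these two steps (function and variable names are my own; a careful implementation would compute responsibilities in log-space for numerical stability):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """Fit a K-component GMM to X (N x d) with plain EM. Illustrative sketch."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Naive init: random data points as means, shared covariance, uniform weights.
    mu = X[rng.choice(N, size=K, replace=False)]
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * K)
    pi = np.full(K, 1.0 / K)

    for _ in range(n_iter):
        # E-step: responsibilities gamma[n, k] = p(z_n = k | x_n, theta).
        dens = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                         for k in range(K)], axis=1)        # (N, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)

        # M-step: weighted MLE updates using gamma as fractional labels.
        Nk = gamma.sum(axis=0)                              # (K,)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] \
                       + 1e-6 * np.eye(d)                   # small ridge vs. degeneracy
        pi = Nk / N

    return pi, mu, Sigma, gamma
```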
Why EM works: the ELBO
EM is coordinate ascent on a lower bound of the log-likelihood. For any distribution $q(z)$ over the latent variable,

$$\log p(x \mid \theta) \;\ge\; \mathbb{E}_{q(z)}\!\left[ \log \frac{p(x, z \mid \theta)}{q(z)} \right] \;=\; \mathcal{L}(q, \theta),$$

and the gap is exactly $\mathrm{KL}\!\left(q(z) \,\|\, p(z \mid x, \theta)\right)$. This is the ELBO (evidence lower bound) — also the centre of VAE training.

- E-step sets $q(z) = p(z \mid x, \theta^{\text{old}})$, making the bound tight.
- M-step maximises $\mathcal{L}(q, \theta)$ over $\theta$ with $q$ fixed.

Each iteration cannot decrease $\log p(x \mid \theta)$, so the log-likelihood climbs monotonically to a local optimum.
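The identity behind both bullet points, written out as the standard textbook decomposition (not specific to this article's notation):

```latex
% For any q(z): log-likelihood = ELBO + KL gap.
% KL >= 0 gives the bound; KL = 0 iff q equals the exact posterior,
% which is precisely what the E-step sets.
\begin{align*}
\log p(x \mid \theta)
  &= \underbrace{\mathbb{E}_{q(z)}\!\left[\log \frac{p(x, z \mid \theta)}{q(z)}\right]}_{\mathcal{L}(q,\,\theta)\ \text{(ELBO)}}
   \;+\;
   \underbrace{\mathrm{KL}\!\left(q(z) \,\middle\|\, p(z \mid x, \theta)\right)}_{\ge\, 0}
\end{align*}
```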
GMM vs k-Means
GMM with isotropic, equal-variance Gaussians and "hard" responsibilities reduces to k-means. The differences:
- Soft assignments. $\gamma_{nk}$ is a probability, not a one-hot label — points near a cluster boundary contribute to multiple clusters proportionally.
- Anisotropic clusters. Full covariances $\Sigma_k$ capture elongated, oriented cluster shapes that k-means cannot.
- Probabilistic output. GMMs give a density estimate $p(x)$, useful for outlier detection and generative sampling.
- Higher computational cost. $O(NKd^2)$ per E-step instead of $O(NKd)$ per k-means assignment step.
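These differences are easy to see in practice with scikit-learn; a short sketch (the elongated two-cluster dataset is purely illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Illustrative data: two elongated, overlapping clusters.
A = rng.normal([0.0, 0.0], [3.0, 0.5], size=(300, 2))
B = rng.normal([2.0, 3.0], [3.0, 0.5], size=(300, 2))
X = np.vstack([A, B])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

hard = km.predict(X)            # k-means: one label per point
soft = gmm.predict_proba(X)     # GMM: a probability per component
logdens = gmm.score_samples(X)  # log p(x): usable for outlier detection
print(soft[:3].round(2), logdens[:3].round(2))
```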
Initialisation and degeneracies
EM is sensitive to initialisation:
- Run k-means first; use its centroids and per-cluster covariances as the GMM initialisation.
- Multiple random restarts; pick the highest-likelihood solution.
GMMs have a famous degeneracy: a Gaussian that centres itself on a single training point can drive the likelihood to infinity as its covariance shrinks (the density blows up). Standard fix: add a small regularisation term $\epsilon I$ to each $\Sigma_k$ in the M-step.
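Both the initialisation advice and the regularisation fix map directly onto library options; for instance, with scikit-learn (the parameter values here are illustrative):

```python
from sklearn.mixture import GaussianMixture

# k-means initialisation, multiple restarts, and a covariance ridge that
# implements the epsilon * I fix against the single-point degeneracy.
gmm = GaussianMixture(
    n_components=5,
    init_params="kmeans",  # initialise from a k-means run (sklearn's default)
    n_init=10,             # random restarts; the best-likelihood fit is kept
    reg_covar=1e-6,        # epsilon added to the diagonal of each covariance
    random_state=0,
)
# gmm.fit(X)  # X: (N, d) data array
```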
Beyond clustering
EM works for any latent-variable model with tractable conditional posteriors:
- Hidden Markov Models (see HMM) — the latent is a sequence; EM is the Baum-Welch algorithm.
- Probabilistic PCA — Gaussian latent, Gaussian likelihood; EM recovers standard PCA in the zero-noise limit.
- Topic models (LDA) — EM-style variational inference over latent topic assignments.
The EM template — bound the log-likelihood with the ELBO, alternate E and M — is one of the most general inference recipes in machine learning.
What to read next
- k-Means — the hard-assignment special case.
- Bayes Nets — graphical models where EM applies broadly.
- Variational Autoencoders — modern amortised EM with neural networks.