
Gaussian Mixture Models & EM

A Gaussian Mixture Model (GMM) is a probabilistic version of k-means: instead of assigning each point to a single cluster, it models the data as a mixture of Gaussian densities and produces soft assignments. The Expectation-Maximization (EM) algorithm — used to fit GMMs — generalises far beyond clustering and serves as a general recipe for maximum-likelihood estimation with latent variables.

The model

A K-component GMM is

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x;\mu_k,\Sigma_k),$$

with mixing weights $\pi_k \ge 0$ and $\sum_k \pi_k = 1$. Generative story: pick a component $k$ with probability $\pi_k$, then sample $x \sim \mathcal{N}(\mu_k, \Sigma_k)$.

The latent variable $z \in \{1,\dots,K\}$ — which component generated $x$ — is not observed. This is what makes the likelihood non-convex and EM necessary.
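
To make the generative story concrete, here is a minimal NumPy sketch that samples from a small 2-component mixture; the parameter values are arbitrary illustrations, and `sample_gmm` is just a name chosen here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-component GMM in 2 dimensions.
pi = np.array([0.3, 0.7])                      # mixing weights (sum to 1)
mu = np.array([[0.0, 0.0], [3.0, 3.0]])        # component means
Sigma = np.array([[[1.0, 0.2], [0.2, 1.0]],    # component covariances
                  [[0.5, 0.0], [0.0, 2.0]]])

def sample_gmm(n):
    """Generative story: z ~ Categorical(pi), then x ~ N(mu_z, Sigma_z)."""
    z = rng.choice(len(pi), size=n, p=pi)                          # latent component per point
    x = np.stack([rng.multivariate_normal(mu[k], Sigma[k]) for k in z])
    return x, z

X, z_true = sample_gmm(500)   # the clustering task observes only X; z_true stays hidden
```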

The EM algorithm

We want $\arg\max_\theta \log p(X;\theta) = \arg\max_\theta \sum_i \log \sum_k \pi_k \, \mathcal{N}(x_i;\mu_k,\Sigma_k)$. The log of a sum does not decouple across components, so there is no closed-form maximiser. EM exploits the latent-variable structure.

E-step. Compute the responsibilities — posterior probabilities of each component given the data:

$$\gamma_{ik} = P(z_i = k \mid x_i;\theta^{\text{old}}) = \frac{\pi_k^{\text{old}} \, \mathcal{N}(x_i;\mu_k^{\text{old}},\Sigma_k^{\text{old}})}{\sum_j \pi_j^{\text{old}} \, \mathcal{N}(x_i;\mu_j^{\text{old}},\Sigma_j^{\text{old}})}.$$

M-step. Update parameters as if the responsibilities were the labels:

$$N_k = \sum_i \gamma_{ik}, \qquad \mu_k = \frac{1}{N_k}\sum_i \gamma_{ik}\, x_i, \qquad \Sigma_k = \frac{1}{N_k}\sum_i \gamma_{ik}\,(x_i-\mu_k)(x_i-\mu_k)^\top, \qquad \pi_k = \frac{N_k}{N}.$$

Iterate until the log-likelihood converges.
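
The two steps translate almost line-for-line into code. Below is a compact NumPy/SciPy sketch of one way to implement them (the function name `fit_gmm_em`, the naive initialisation, and the small `eps` jitter on the covariances are choices made here, not part of the algorithm's definition):

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gmm_em(X, K, n_iters=100, tol=1e-6, eps=1e-6, seed=0):
    """Fit a K-component GMM by EM; returns (pi, mu, Sigma, log-likelihood trace)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Naive init: K random points as means, shared data covariance, uniform weights.
    mu = X[rng.choice(n, size=K, replace=False)].copy()
    Sigma = np.stack([np.cov(X.T) + eps * np.eye(d)] * K)
    pi = np.full(K, 1.0 / K)

    lls = []
    for _ in range(n_iters):
        # E-step: responsibilities gamma[i, k] proportional to pi_k * N(x_i; mu_k, Sigma_k).
        dens = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                         for k in range(K)], axis=1)            # shape (n, K)
        ll = np.log(dens.sum(axis=1)).sum()                     # current log-likelihood
        gamma = dens / dens.sum(axis=1, keepdims=True)

        # M-step: weighted maximum-likelihood updates, as if gamma were soft labels.
        Nk = gamma.sum(axis=0)                                  # effective counts per component
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + eps * np.eye(d)
        pi = Nk / n

        lls.append(ll)
        if len(lls) > 1 and abs(lls[-1] - lls[-2]) < tol:       # convergence check
            break
    return pi, mu, Sigma, lls
```

A useful sanity check: the log-likelihood trace `lls` should be non-decreasing (up to floating-point noise), which is exactly the monotonicity property discussed below.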

Why EM works: the ELBO

EM is coordinate ascent on a lower bound of the log-likelihood. For any distribution $q(z)$,

$$\log p(x;\theta) \;\ge\; \mathbb{E}_q[\log p(x,z;\theta)] - \mathbb{E}_q[\log q(z)] \;\equiv\; \mathcal{L}(q,\theta).$$

This is the ELBO — also the centre of VAE training.

  • E-step sets $q(z) = p(z \mid x;\theta^{\text{old}})$, making the bound tight.
  • M-step maximises $\mathcal{L}$ over $\theta$ with $q$ fixed.

Each iteration cannot decrease $\log p(x;\theta)$: EM monotonically improves the likelihood until it reaches a stationary point. (It does not necessarily find the global optimum; multiple restarts and good initialisation matter.)
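
The reason both properties hold is a standard identity: for any $q(z)$,

$$\log p(x;\theta) = \mathcal{L}(q,\theta) + \mathrm{KL}\big(q(z)\,\|\,p(z \mid x;\theta)\big).$$

The KL term is non-negative, so $\mathcal{L}$ is a lower bound; choosing $q(z) = p(z \mid x;\theta^{\text{old}})$ makes the KL term zero, so the bound touches $\log p$ at $\theta^{\text{old}}$, and the subsequent M-step can then only push $\log p$ upward.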

GMM vs k-Means

GMM with isotropic, equal-variance Gaussians and "hard" responsibilities reduces to k-means. The differences:

  • Soft assignments. $\gamma_{ik}$ is a probability, not a one-hot vector; points near a cluster boundary contribute to multiple clusters proportionally (see the sketch after this list).
  • Anisotropic clusters. A full $\Sigma_k$ captures elongated, oriented cluster shapes that k-means cannot.
  • Probabilistic output. GMMs give a density estimate $p(x)$, useful for outlier detection and generative sampling.
  • Higher computational cost. $O(Kd^2)$ per point per E-step instead of $O(Kd)$.
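
The first three differences are visible directly in scikit-learn's estimators. The sketch below assumes scikit-learn is installed; the class and method names (`KMeans`, `GaussianMixture`, `predict_proba`, `score_samples`, `sample`) are scikit-learn's, while the data and hyperparameter values are purely illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two elongated, overlapping blobs (illustrative data).
X = np.vstack([
    rng.multivariate_normal([0.0, 0.0], [[4.0, 1.5], [1.5, 0.8]], size=300),
    rng.multivariate_normal([4.0, 1.0], [[0.5, 0.0], [0.0, 4.0]], size=300),
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
hard_labels = km.labels_                       # hard, one-hot assignments only

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
soft = gmm.predict_proba(X)                    # responsibilities gamma_ik; rows sum to 1
log_density = gmm.score_samples(X)             # log p(x), usable for outlier scoring
new_points, _ = gmm.sample(100)                # generative sampling from the fitted mixture
```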

Initialisation and degeneracies

EM is sensitive to initialisation:

  • Run k-means first; use its centroids and per-cluster covariance as the GMM init.
  • Multiple random restarts; pick the highest-likelihood solution.

GMMs have a famous degeneracy: a Gaussian centred on a single training point with vanishing covariance achieves unbounded likelihood (the density blows up as the covariance shrinks). Standard fix: add a small regularisation to each $\Sigma_k$ ($\Sigma_k \leftarrow \Sigma_k + \epsilon I$) or use a Bayesian prior (Bayesian GMM, variational mixture of Gaussians).
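
With scikit-learn, both remedies map onto constructor arguments. In the sketch below, `means_init`, `n_init`, and `reg_covar` are real `GaussianMixture` parameters; `X` and the value of `K` are assumed from the earlier examples:

```python
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

K = 2  # assumed number of components, matching the earlier illustrative data X

# k-means warm start: feed the centroids in as the initial means.
centroids = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X).cluster_centers_
gmm = GaussianMixture(
    n_components=K,
    means_init=centroids,
    reg_covar=1e-6,          # adds eps to each covariance diagonal: the degeneracy fix
    random_state=0,
).fit(X)

# Alternative: several random restarts; the highest-likelihood run is kept.
gmm_restarts = GaussianMixture(n_components=K, n_init=10, random_state=0).fit(X)
```

Note that scikit-learn already defaults to a k-means-based initialisation (`init_params="kmeans"`) and to a small `reg_covar`, so the explicit arguments above mainly make these choices visible.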

Beyond clustering

EM works for any latent-variable model with tractable conditional posteriors:

  • Hidden Markov Models (see HMM) — the latent is a sequence; EM is the Baum-Welch algorithm.
  • Probabilistic PCA — Gaussian latent, Gaussian likelihood; EM recovers standard PCA in the zero-noise limit.
  • Topic models (LDA) — EM-style variational inference over latent topic assignments.

The EM template — bound the log-likelihood with the ELBO, alternate E and M — is one of the most general inference recipes in machine learning.
