
Gaussian Mixture Models & EM

A Gaussian Mixture Model (GMM) is a probabilistic version of k-means: instead of assigning each point to a single cluster, it models the data as a mixture of Gaussian densities and produces soft assignments. The Expectation-Maximization (EM) algorithm — used to fit GMMs — generalises far beyond clustering and serves as a general recipe for maximum-likelihood estimation with latent variables.

The model

A K-component GMM is

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x;\mu_k,\Sigma_k),$$

with mixing weights $\pi_k \ge 0$ and $\sum_k \pi_k = 1$. Generative story: pick a component $k$ with probability $\pi_k$, then sample $x \sim \mathcal{N}(\mu_k, \Sigma_k)$.

The latent variable $z \in \{1,\dots,K\}$ — which component generated $x$ — is not observed. This is what makes the likelihood non-convex and EM necessary.
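
To make the generative story concrete, here is a minimal NumPy sketch that samples from a small 2-component mixture; the parameter values are arbitrary illustrations, and `sample_gmm` is just a name chosen here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-component GMM in 2 dimensions.
pi = np.array([0.3, 0.7])                      # mixing weights (sum to 1)
mu = np.array([[0.0, 0.0], [3.0, 3.0]])        # component means
Sigma = np.array([[[1.0, 0.2], [0.2, 1.0]],    # component covariances
                  [[0.5, 0.0], [0.0, 2.0]]])

def sample_gmm(n):
    """Generative story: z ~ Categorical(pi), then x ~ N(mu_z, Sigma_z)."""
    z = rng.choice(len(pi), size=n, p=pi)                          # latent component per point
    x = np.stack([rng.multivariate_normal(mu[k], Sigma[k]) for k in z])
    return x, z

X, z_true = sample_gmm(500)   # the clustering task observes only X; z_true stays hidden
```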

The EM algorithm

We want $\arg\max_\theta \log p(X;\theta) = \arg\max_\theta \sum_i \log \sum_k \pi_k \, \mathcal{N}(x_i;\mu_k,\Sigma_k)$. The log of a sum does not decouple across components, so there is no closed-form maximiser. EM exploits the latent-variable structure.

E-step. Compute the responsibilities — posterior probabilities of each component given the data:

$$\gamma_{ik} = P(z_i = k \mid x_i;\theta^{\text{old}}) = \frac{\pi_k^{\text{old}} \, \mathcal{N}(x_i;\mu_k^{\text{old}},\Sigma_k^{\text{old}})}{\sum_j \pi_j^{\text{old}} \, \mathcal{N}(x_i;\mu_j^{\text{old}},\Sigma_j^{\text{old}})}.$$

M-step. Update parameters as if the responsibilities were the labels:

$$N_k = \sum_i \gamma_{ik}, \qquad \mu_k = \frac{1}{N_k}\sum_i \gamma_{ik}\, x_i, \qquad \Sigma_k = \frac{1}{N_k}\sum_i \gamma_{ik}\,(x_i-\mu_k)(x_i-\mu_k)^\top, \qquad \pi_k = \frac{N_k}{N}.$$

Iterate until the log-likelihood converges.
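
The two steps translate almost line-for-line into code. Below is a compact NumPy/SciPy sketch of one way to implement them (the function name `fit_gmm_em`, the naive initialisation, and the small `eps` jitter on the covariances are choices made here, not part of the algorithm's definition):

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gmm_em(X, K, n_iters=100, tol=1e-6, eps=1e-6, seed=0):
    """Fit a K-component GMM by EM; returns (pi, mu, Sigma, log-likelihood trace)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Naive init: K random points as means, shared data covariance, uniform weights.
    mu = X[rng.choice(n, size=K, replace=False)].copy()
    Sigma = np.stack([np.cov(X.T) + eps * np.eye(d)] * K)
    pi = np.full(K, 1.0 / K)

    lls = []
    for _ in range(n_iters):
        # E-step: responsibilities gamma[i, k] proportional to pi_k * N(x_i; mu_k, Sigma_k).
        dens = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                         for k in range(K)], axis=1)            # shape (n, K)
        ll = np.log(dens.sum(axis=1)).sum()                     # current log-likelihood
        gamma = dens / dens.sum(axis=1, keepdims=True)

        # M-step: weighted maximum-likelihood updates, as if gamma were soft labels.
        Nk = gamma.sum(axis=0)                                  # effective counts per component
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + eps * np.eye(d)
        pi = Nk / n

        lls.append(ll)
        if len(lls) > 1 and abs(lls[-1] - lls[-2]) < tol:       # convergence check
            break
    return pi, mu, Sigma, lls
```

A useful sanity check: the log-likelihood trace `lls` should be non-decreasing (up to floating-point noise), which is exactly the monotonicity property discussed below.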

Why EM works: the ELBO

EM is coordinate ascent on a lower bound of the log-likelihood. For any distribution $q(z)$,

$$\log p(x;\theta) \;\ge\; \mathbb{E}_q[\log p(x,z;\theta)] - \mathbb{E}_q[\log q(z)] \;\equiv\; \mathcal{L}(q,\theta).$$

This is the ELBO — also the centre of VAE training.

  • E-step sets $q(z) = p(z \mid x;\theta^{\text{old}})$, making the bound tight.
  • M-step maximises $\mathcal{L}$ over $\theta$ with $q$ fixed.

Each iteration cannot decrease $\log p(x;\theta)$: EM monotonically improves the likelihood until it reaches a stationary point. (It does not necessarily find the global optimum; multiple restarts and good initialisation matter.)
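
The reason both properties hold is a standard identity: for any $q(z)$,

$$\log p(x;\theta) = \mathcal{L}(q,\theta) + \mathrm{KL}\big(q(z)\,\|\,p(z \mid x;\theta)\big).$$

The KL term is non-negative, so $\mathcal{L}$ is a lower bound; choosing $q(z) = p(z \mid x;\theta^{\text{old}})$ makes the KL term zero, so the bound touches $\log p$ at $\theta^{\text{old}}$, and the subsequent M-step can then only push $\log p$ upward.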

GMM vs k-Means

GMM with isotropic, equal-variance Gaussians and "hard" responsibilities reduces to k-means. The differences:

  • Soft assignments. $\gamma_{ik}$ is a probability, not a one-hot vector; points near a cluster boundary contribute to multiple clusters proportionally (see the sketch after this list).
  • Anisotropic clusters. A full $\Sigma_k$ captures elongated, oriented cluster shapes that k-means cannot.
  • Probabilistic output. GMMs give a density estimate $p(x)$, useful for outlier detection and generative sampling.
  • Higher computational cost. $O(Kd^2)$ per point per E-step instead of $O(Kd)$.
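
The first three differences are visible directly in scikit-learn's estimators. The sketch below assumes scikit-learn is installed; the class and method names (`KMeans`, `GaussianMixture`, `predict_proba`, `score_samples`, `sample`) are scikit-learn's, while the data and hyperparameter values are purely illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two elongated, overlapping blobs (illustrative data).
X = np.vstack([
    rng.multivariate_normal([0.0, 0.0], [[4.0, 1.5], [1.5, 0.8]], size=300),
    rng.multivariate_normal([4.0, 1.0], [[0.5, 0.0], [0.0, 4.0]], size=300),
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
hard_labels = km.labels_                       # hard, one-hot assignments only

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
soft = gmm.predict_proba(X)                    # responsibilities gamma_ik; rows sum to 1
log_density = gmm.score_samples(X)             # log p(x), usable for outlier scoring
new_points, _ = gmm.sample(100)                # generative sampling from the fitted mixture
```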

Initialisation and degeneracies

EM is sensitive to initialisation:

  • Run k-means first; use its centroids and per-cluster covariance as the GMM init.
  • Multiple random restarts; pick the highest-likelihood solution.

GMMs have a famous degeneracy: a Gaussian centred on a single training point with vanishing covariance achieves unbounded likelihood (the density blows up as the covariance shrinks). Standard fix: add a small regularisation to each $\Sigma_k$ ($\Sigma_k \leftarrow \Sigma_k + \epsilon I$) or use a Bayesian prior (Bayesian GMM, variational mixture of Gaussians).
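
With scikit-learn, both remedies map onto constructor arguments. In the sketch below, `means_init`, `n_init`, and `reg_covar` are real `GaussianMixture` parameters; `X` and the value of `K` are assumed from the earlier examples:

```python
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

K = 2  # assumed number of components, matching the earlier illustrative data X

# k-means warm start: feed the centroids in as the initial means.
centroids = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X).cluster_centers_
gmm = GaussianMixture(
    n_components=K,
    means_init=centroids,
    reg_covar=1e-6,          # adds eps to each covariance diagonal: the degeneracy fix
    random_state=0,
).fit(X)

# Alternative: several random restarts; the highest-likelihood run is kept.
gmm_restarts = GaussianMixture(n_components=K, n_init=10, random_state=0).fit(X)
```

Note that scikit-learn already defaults to a k-means-based initialisation (`init_params="kmeans"`) and to a small `reg_covar`, so the explicit arguments above mainly make these choices visible.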

Beyond clustering

EM works for any latent-variable model with tractable conditional posteriors:

  • Hidden Markov Models (see HMM) — the latent is a sequence; EM is the Baum-Welch algorithm.
  • Probabilistic PCA — Gaussian latent, Gaussian likelihood; EM recovers standard PCA in the zero-noise limit.
  • Topic models (LDA) — EM-style variational inference over latent topic assignments.

The EM template — bound the log-likelihood with the ELBO, alternate E and M — is one of the most general inference recipes in machine learning.
