k-Means
k-Means is the simplest clustering algorithm — partition $n$ points into $k$ clusters so that each point belongs to the cluster with the nearest centroid.
The objective
k-Means minimises the within-cluster sum of squares (WCSS):

$$\mathrm{WCSS} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,$$

where $C_j$ is the set of points assigned to cluster $j$ and $\mu_j$ is that cluster's centroid (the mean of its points).
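As a sanity check on the definition, a minimal NumPy sketch that evaluates the WCSS for a given assignment (the names `X`, `labels`, and `centroids` are illustrative placeholders, not from the text above):

```python
import numpy as np

def wcss(X, labels, centroids):
    """Within-cluster sum of squares: squared distance from every
    point to the centroid of the cluster it is assigned to."""
    return float(np.sum((X - centroids[labels]) ** 2))
```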
Lloyd's algorithm
- Initialise $k$ centroids (random points, k-means++, or external seeds).
- Assignment step: assign each point to its nearest centroid, $c_i = \arg\min_j \lVert x_i - \mu_j \rVert^2$.
- Update step: recompute each centroid as the mean of its assigned points, $\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i$.
- Repeat until assignments don't change.
Each step decreases the WCSS — assignment because each point picks its closest centroid, update because the mean is the squared-error minimiser for a fixed cluster. The algorithm converges in finitely many steps (the assignment is one of finitely many possible partitions, and the WCSS never increases, so no partition can repeat), but only to a local optimum, not the global one.
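A minimal NumPy sketch of the loop above, assuming random-point initialisation and Euclidean distance (function and variable names are illustrative, not from any particular library):

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, seed=0):
    """Plain Lloyd's algorithm on an (n, d) array X."""
    rng = np.random.default_rng(seed)
    # Initialise: k distinct data points as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        # Stop once no assignment changes.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids
```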
k-means++ initialisation
Random initialisation is bad — clusters end up empty or poor local optima dominate. k-means++ (Arthur and Vassilvitskii, 2007) chooses initial centroids one at a time, with each new centroid sampled with probability proportional to the squared distance from the closest already-chosen centroid:

$$P(x) = \frac{D(x)^2}{\sum_{x'} D(x')^2},$$

where $D(x)$ is the distance from $x$ to the nearest centroid chosen so far. The result is provably $O(\log k)$-competitive with the optimal clustering in expectation, and in practice far more stable than uniform random seeding.
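A sketch of the seeding rule in NumPy (names are illustrative; the returned centroids would feed straight into the Lloyd loop above):

```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    """k-means++ seeding: first centroid uniform at random, each later
    centroid sampled with probability proportional to the squared
    distance to the nearest centroid already chosen."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # D(x)^2: squared distance to the nearest chosen centroid.
        diffs = X[:, None, :] - np.asarray(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.asarray(centroids, dtype=float)
```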
Choosing $k$
The number of clusters is a hyperparameter, not a learned quantity. Standard heuristics:
- Elbow method — plot WCSS vs $k$, look for the "elbow" where additional clusters give diminishing returns.
- Silhouette score — for each point, $s = \frac{b - a}{\max(a, b)}$, where $a$ is the mean intra-cluster distance and $b$ is the mean nearest-other-cluster distance. Higher is better; pick the $k$ maximising the mean silhouette (a selection loop is sketched below).
- Gap statistic (Tibshirani et al., 2001) — compare WCSS to that under a null reference distribution.
- Domain knowledge — sometimes you know how many clusters you want.
In practice, the elbow is often ambiguous and silhouette is the more reliable score. For high-dimensional data, all of these become unreliable simultaneously.
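A sketch of silhouette-based selection, assuming scikit-learn is available (the candidate range below is an arbitrary placeholder):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def pick_k_by_silhouette(X, k_candidates=range(2, 11), seed=0):
    """Fit k-means for each candidate k and keep the k with the
    highest mean silhouette (silhouette needs k >= 2)."""
    best_k, best_score = None, -1.0
    for k in k_candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score
```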
Limitations
- Spherical clusters only. k-Means assumes clusters of similar size and shape (isotropic). Elongated, non-convex, or unequal-density clusters are recovered badly (see the demo after this list).
- Sensitive to outliers. Squared distance heavily penalises far points; outliers pull centroids.
- Requires Euclidean-meaningful distances. Use after appropriate scaling, or switch to k-medoids for non-Euclidean spaces.
- No probability output. A point either belongs to a cluster or doesn't — no soft assignment. GMMs generalise this.
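The demo referenced in the first limitation: two interleaved crescents are non-convex, so the straight boundary between two centroids cuts them badly (scikit-learn's toy data assumed here):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: non-convex clusters of equal size.
X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)
y_pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Agreement with the true moons is well below 1.0, because k-means
# separates the plane with a straight cut between the two centroids.
print(adjusted_rand_score(y_true, y_pred))
```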
Variants
- k-medoids (PAM) — replace centroids with actual data points; works in any metric space, more robust to outliers.
- Mini-batch k-Means — update centroids from a random subsample at each step (see the sketch after this list). Standard for billion-point datasets.
- Soft k-Means / fuzzy c-means — assign weighted membership to all clusters, intermediate between k-means and GMMs.
- k-Means++ for general distance measures — same idea, different distance.
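The mini-batch sketch referenced above, using scikit-learn's MiniBatchKMeans (the dataset and parameters are placeholders):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Synthetic stand-in for a large embedding matrix.
X = np.random.default_rng(0).normal(size=(100_000, 64))

# Each step updates centroids from a small random batch
# instead of the full dataset.
mbk = MiniBatchKMeans(n_clusters=256, batch_size=1024, n_init=3, random_state=0)
labels = mbk.fit_predict(X)
print(mbk.cluster_centers_.shape)  # (256, 64)
```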
When k-Means is the right tool
- First-pass clustering — when you don't know the structure, k-means gives a quick partition with little tuning.
- Vector quantisation — encoding via centroid indices for compression, retrieval, or VQ-VAE (see the sketch after this list).
- Initialisation for GMMs — fitting a GMM by EM benefits from k-means initialisation of means.
- Large-scale embedding clustering — combined with mini-batch updates, k-means scales to billions of vectors.
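The vector-quantisation sketch referenced above: fit k-means, keep only each vector's centroid index, and reconstruct from the codebook (scikit-learn assumed; data and sizes are placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 32))      # vectors to compress

kmeans = KMeans(n_clusters=64, n_init=10, random_state=0).fit(X)
codes = kmeans.predict(X)              # one small integer per vector
codebook = kmeans.cluster_centers_     # (64, 32) centroid table

# Lossy reconstruction: replace each vector by its centroid.
X_hat = codebook[codes]
print(float(np.mean((X - X_hat) ** 2)))  # quantisation error
```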
What to read next
- Gaussian Mixture Models & EM — the probabilistic generalisation.
- Hierarchical Clustering — for variable-$k$ exploration.
- PCA & SVD — typical pre-processing for high-dimensional data.