Hierarchical Clustering
Hierarchical clustering builds a tree of nested groupings — a dendrogram — instead of committing to a single $k$.
Two directions: agglomerative and divisive
- Agglomerative (bottom-up) — start with each point as its own cluster, repeatedly merge the two closest clusters until one remains. The standard choice; runs in $O(n^2 \log n)$ with a priority queue.
- Divisive (top-down) — start with one cluster, recursively split. Conceptually cleaner but $O(2^n)$ in the worst case; rarely used directly.
The output is a dendrogram — a binary tree whose leaves are points and whose internal nodes record merge events with associated heights (the merge distance). Cutting the dendrogram at a chosen height gives a flat clustering.
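A minimal sketch of the agglomerative workflow using SciPy; the toy blobs and Ward linkage are illustrative choices, not prescribed by the text:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Toy data: two loose blobs in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),
               rng.normal(3, 0.5, (20, 2))])

# Agglomerative clustering with Ward's method.
# Z is an (n-1) x 4 array: each row records one merge event
# (cluster index a, cluster index b, merge height, new cluster size).
Z = linkage(X, method="ward")

dendrogram(Z)
plt.ylabel("merge distance")
plt.show()
```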
Linkage criteria
Once you've defined a distance between individual points, you still need a distance between clusters. That is the linkage criterion, and it determines which pair gets merged at each step:
- Single linkage — $d(A, B) = \min_{a \in A,\, b \in B} d(a, b)$. Distance to the nearest pair. Tends to produce long, chain-like clusters ("chaining effect"); finds non-convex shapes but is sensitive to noise.
- Complete linkage — $d(A, B) = \max_{a \in A,\, b \in B} d(a, b)$. Distance to the farthest pair. Produces compact, roughly spherical clusters.
- Average linkage — $d(A, B) = \frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} d(a, b)$. Mean pairwise distance. A reasonable compromise.
- Ward's method — merge the pair that produces the smallest increase in within-cluster variance. Equivalent to minimising squared error after each merge; produces variance-balanced clusters similar to those of k-means.
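To make the first three formulas concrete, a small NumPy sketch that computes all three pairwise-distance linkages between two toy clusters:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two small clusters of 2-D points (toy data).
A = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
B = np.array([[4.0, 4.0], [5.0, 4.0]])

D = cdist(A, B)          # |A| x |B| matrix of pointwise distances

single   = D.min()       # nearest pair
complete = D.max()       # farthest pair
average  = D.mean()      # mean over all |A|*|B| pairs

print(single, complete, average)
```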
Ward's is the default in most practical applications; use single linkage when chaining is genuinely desired (river networks, evolutionary trees), and complete linkage when you want compact balls.
Choosing where to cut
The dendrogram makes the choice of $k$ a decision you can defer until after clustering: build the tree once, then cut it. Common strategies:
- Largest gap rule — cut at the height with the largest jump between consecutive merges. The "natural" $k$ corresponds to a sharp elbow in merge heights.
- Fixed height — cut at a domain-meaningful distance threshold.
- Fixed $k$ — cut to produce exactly $k$ clusters.
Hierarchical clustering doesn't require a single $k$ at all: you can cut the same tree at several heights and read off clusterings at multiple granularities.
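All three strategies reduce to `scipy.cluster.hierarchy.fcluster` calls. A sketch, reusing `Z` and `np` from the earlier example (the gap-rule code is our own illustration):

```python
from scipy.cluster.hierarchy import fcluster

# Fixed height: split apart every merge whose distance exceeds 2.0.
labels_height = fcluster(Z, t=2.0, criterion="distance")

# Fixed k: force exactly 2 clusters.
labels_k = fcluster(Z, t=2, criterion="maxclust")

# Largest gap rule: find the biggest jump between consecutive
# merge heights and cut inside that gap.
heights = Z[:, 2]
i = np.argmax(np.diff(heights))
cut = (heights[i] + heights[i + 1]) / 2
labels_gap = fcluster(Z, t=cut, criterion="distance")
```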
Computational cost
Naive agglomerative clustering is $O(n^3)$ time and $O(n^2)$ memory. Standard improvements:
- Lance-Williams update — a recurrence that recomputes cluster-to-cluster distances in $O(n)$ work after each merge instead of rebuilding the linkage from scratch; keeps the total cost to $O(n^2 \log n)$ with a priority queue (a sketch follows this list).
- SLINK / CLINK algorithms — $O(n^2)$ time and $O(n)$ memory for single and complete linkage respectively.
- HDBSCAN — density-based hierarchical clustering with much better scalability, particularly for large noisy datasets.
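A minimal sketch of a generic agglomerative loop built on the Lance-Williams recurrence, instantiated with single-linkage coefficients ($\alpha_i = \alpha_j = 1/2$, $\beta = 0$, $\gamma = -1/2$); this is an illustration, not an optimised implementation:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def agglomerate(X):
    """Agglomerative clustering with Lance-Williams distance updates.

    Single-linkage coefficients give
    d(i∪j, k) = 0.5*d(i,k) + 0.5*d(j,k) - 0.5*|d(i,k) - d(j,k)|
              = min(d(i,k), d(j,k)).
    """
    n = len(X)
    D = squareform(pdist(X))           # pairwise distance matrix
    np.fill_diagonal(D, np.inf)
    active = list(range(n))
    merges = []                        # (cluster_a, cluster_b, height)

    while len(active) > 1:
        # Find the closest pair among active clusters.
        sub = D[np.ix_(active, active)]
        a, b = np.unravel_index(np.argmin(sub), sub.shape)
        i, j = active[a], active[b]
        merges.append((i, j, D[i, j]))

        # Lance-Williams update: distance from the merged cluster
        # (kept in slot i) to every other active cluster k, in O(1) each.
        for k in active:
            if k in (i, j):
                continue
            D[i, k] = D[k, i] = (0.5 * D[i, k] + 0.5 * D[j, k]
                                 - 0.5 * abs(D[i, k] - D[j, k]))
        active.remove(j)               # cluster j is absorbed into i
    return merges
```

In practice, `scipy.cluster.hierarchy.linkage` implements the same recurrence with proper bookkeeping; the sketch above still scans the whole matrix per merge, so it stays $O(n^3)$.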
For datasets beyond a few tens of thousands of points, the quadratic memory footprint becomes the binding constraint: subsample and assign the rest (sketched below), cluster a compressed summary such as k-means centroids, or switch to HDBSCAN.
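A minimal sketch of the sample-and-assign idea, assuming Ward linkage and nearest-centroid assignment; the function name, sample size, and $k$ are illustrative choices:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import cdist

def cluster_via_sample(X, sample_size=2000, k=5, seed=0):
    """Ward-cluster a random sample, then assign every point
    to the nearest sample-cluster centroid."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
    S = X[idx]

    Z = linkage(S, method="ward")
    sample_labels = fcluster(Z, t=k, criterion="maxclust")

    # Centroid of each sample cluster, then nearest-centroid assignment.
    centroids = np.vstack([S[sample_labels == c].mean(axis=0)
                           for c in np.unique(sample_labels)])
    return cdist(X, centroids).argmin(axis=1)
```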
Strengths and weaknesses
Strengths:
- No need to pre-specify $k$. The dendrogram captures structure at all scales.
- Deterministic — same data always gives the same tree (up to ties).
- Visualisable — dendrograms are interpretable and informative on their own.
- Works in any metric space — only a pairwise distance matrix is needed, with no Euclidean-isotropy assumption. (Ward's method is the exception: it assumes squared Euclidean distances.)
Weaknesses:
- Doesn't scale. Quadratic memory is fatal for large datasets.
- Greedy. Merge decisions are never undone; an early bad merge contaminates the rest.
- Sensitive to distance/linkage choice. Different choices give very different trees.
Where hierarchical clustering wins
- Biology — gene-expression clustering, phylogenetic trees. The hierarchical structure is real, not an artefact.
- Document organisation — topic taxonomies, news clustering at multiple granularities.
- Exploratory analysis of small ($n$ up to a few thousand) datasets where you want to see the full structure rather than commit to a single $k$.
- As a probe — running hierarchical clustering reveals whether your data has natural cluster scales at all.
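One concrete probe is the cophenetic correlation coefficient, which measures how faithfully the tree's merge heights preserve the original pairwise distances. A sketch, reusing `X` from the earlier example:

```python
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

Z = linkage(X, method="average")
c, _ = cophenet(Z, pdist(X))
print(f"cophenetic correlation: {c:.2f}")  # near 1.0 = tree fits the data well
```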
What to read next
- k-Means — the partitional alternative.
- GMM & EM — soft probabilistic clustering.
- Manifold Learning — dimensionality reduction for clustering high-dimensional data.