
Linear & Quadratic Discriminant Analysis

LDA and QDA are generative classifiers: model each class's feature distribution as a multivariate Gaussian, then classify by Bayes' rule. They sit between logistic regression (also linear, but discriminative) and Naive Bayes (also generative, but with a stronger independence assumption). LDA in particular is a workhorse baseline that doubles as a dimensionality reduction technique.

The model

For K classes, assume

$$P(x \mid y = k) = \mathcal{N}(\mu_k, \Sigma_k), \qquad P(y = k) = \pi_k.$$

By Bayes' rule, the log-posterior is

$$\log P(y = k \mid x) = -\tfrac{1}{2}(x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) - \tfrac{1}{2}\log\lvert\Sigma_k\rvert + \log \pi_k + \text{const}.$$

The classifier picks the class with the highest log-posterior. The shape of the decision boundary depends on what we assume about $\Sigma_k$.
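
In code, the decision rule is a one-liner once the parameters are known. A minimal sketch, assuming the $\mu_k$, $\Sigma_k$, $\pi_k$ are given (the function names and the use of scipy are our own, not from the original):

```python
# A minimal sketch of the generative decision rule, assuming the class
# parameters mu_k, Sigma_k, pi_k are already known. Function names are ours.
import numpy as np
from scipy.stats import multivariate_normal

def log_posteriors(x, mus, Sigmas, priors):
    """Unnormalised log P(y = k | x) for each class k."""
    return np.array([
        multivariate_normal.logpdf(x, mean=mu, cov=Sigma) + np.log(pi)
        for mu, Sigma, pi in zip(mus, Sigmas, priors)
    ])

def predict(x, mus, Sigmas, priors):
    # Bayes' rule: pick the class with the highest log-posterior.
    return int(np.argmax(log_posteriors(x, mus, Sigmas, priors)))
```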

QDA — class-specific covariance

If each class has its own covariance $\Sigma_k$, the decision boundaries are quadratic surfaces — ellipsoids, hyperboloids, paraboloids. QDA fits one mean $\mu_k$ and one covariance $\Sigma_k$ per class:

$$\hat{\mu}_k = \frac{1}{N_k}\sum_{i:\,y_i = k} x_i, \qquad \hat{\Sigma}_k = \frac{1}{N_k}\sum_{i:\,y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^\top.$$

QDA needs $K\,d(d+1)/2$ parameters for the covariances — quadratically many in the feature dimension $d$ — which makes it data-hungry.
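
A plain numpy sketch of these estimators, assuming X is an (N, d) array and y an (N,) vector of integer labels (names are illustrative, not a library API):

```python
# The QDA estimators above, in plain numpy: one mean and covariance per class.
import numpy as np

def fit_qda(X, y):
    classes = np.unique(y)
    priors, mus, Sigmas = [], [], []
    for k in classes:
        Xk = X[y == k]
        priors.append(len(Xk) / len(X))       # pi_k = N_k / N
        mus.append(Xk.mean(axis=0))           # mu_k
        # Sigma_k with the 1/N_k normalisation used in the formula above
        Sigmas.append(np.cov(Xk, rowvar=False, bias=True))
    return np.array(priors), np.array(mus), np.array(Sigmas)
```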

LDA — shared covariance

If all classes share a single covariance $\Sigma_k = \Sigma$, the quadratic terms cancel and the decision boundary becomes linear:

$$\delta_k(x) = x^\top \Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^\top \Sigma^{-1}\mu_k + \log \pi_k.$$

LDA estimates a single pooled covariance from all classes:

$$\hat{\Sigma} = \frac{1}{N - K}\sum_{k}\sum_{i:\,y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^\top.$$

This is Gaussian Naive Bayes with the diagonal covariance replaced by a full, shared one — the practical difference from Naive Bayes is that LDA captures correlations between features.
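
Continuing the numpy sketch above, a hedged implementation of the pooled estimate and the linear discriminant $\delta_k(x)$ (names are ours):

```python
# Pooled covariance and the linear discriminant delta_k(x).
import numpy as np

def fit_lda(X, y):
    classes = np.unique(y)
    N, K = len(X), len(classes)
    priors = np.array([np.mean(y == k) for k in classes])
    mus = np.array([X[y == k].mean(axis=0) for k in classes])
    # Pooled within-class scatter, normalised by N - K.
    Sigma = sum(
        (X[y == k] - mus[i]).T @ (X[y == k] - mus[i])
        for i, k in enumerate(classes)
    ) / (N - K)
    return priors, mus, Sigma

def lda_predict(X, priors, mus, Sigma):
    # delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log pi_k
    Sinv_mu = np.linalg.solve(Sigma, mus.T)                      # (d, K)
    scores = X @ Sinv_mu - 0.5 * np.sum(mus.T * Sinv_mu, axis=0) + np.log(priors)
    return scores.argmax(axis=1)
```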

LDA as dimensionality reduction

Beyond classification, LDA gives a supervised projection. The decision rules depend only on $\Sigma^{-1}\mu_k$ — at most $K-1$ independent directions. Fisher's LDA maximises the ratio of between-class to within-class variance:

$$\max_{w} \frac{w^\top S_B\, w}{w^\top S_W\, w},$$

with $S_B$ the between-class scatter and $S_W$ the within-class scatter. The solution is a generalised eigenvalue problem; the top $K-1$ eigenvectors give a $(K-1)$-dimensional projection that maximally separates the classes.

This is the supervised counterpart to PCA — PCA finds high-variance directions; LDA finds high class-separation directions. For visualisation of a labelled dataset in 2D, LDA-projected scatter plots are often more informative than PCA.
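
A sketch of the Fisher projection via the generalised eigenproblem, assuming scipy is available and $S_W$ is invertible (enough samples, no degenerate features); variable and function names are ours:

```python
# Fisher's LDA as a projection: build the scatter matrices and solve
# S_B w = lambda S_W w, keeping the top K-1 eigenvectors.
import numpy as np
from scipy.linalg import eigh

def fisher_projection(X, y, n_components=None):
    classes = np.unique(y)
    d, K = X.shape[1], len(classes)
    mean = X.mean(axis=0)
    S_W = np.zeros((d, d))   # within-class scatter
    S_B = np.zeros((d, d))   # between-class scatter
    for k in classes:
        Xk = X[y == k]
        mu_k = Xk.mean(axis=0)
        S_W += (Xk - mu_k).T @ (Xk - mu_k)
        diff = (mu_k - mean)[:, None]
        S_B += len(Xk) * (diff @ diff.T)
    # eigh solves the generalised problem; eigenvalues come back in
    # ascending order, so reverse to take the top directions.
    _, vecs = eigh(S_B, S_W)
    W = vecs[:, ::-1][:, : (n_components or K - 1)]
    return X @ W   # (K-1)-dimensional, class-separating projection
```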

Regularisation: shrinkage and RDA

When $d$ is large or $N$ is small, $\hat{\Sigma}$ may be singular or near-singular. Two fixes:

  • Shrinkage — replace $\hat{\Sigma}$ with $(1-\alpha)\,\hat{\Sigma} + \alpha\,\mathrm{diag}(\hat{\Sigma})$ for some $\alpha \in [0, 1]$, blending toward the diagonal-only (Naive Bayes) covariance.
  • Regularised Discriminant Analysis (Friedman, 1989) — interpolate between LDA and QDA: $\hat{\Sigma}_k(\gamma) = \gamma\,\hat{\Sigma}_k + (1-\gamma)\,\hat{\Sigma}$, with $\gamma \in [0, 1]$ chosen by cross-validation. Both are sketched in code below.
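
Both fixes have close counterparts in scikit-learn. A minimal sketch (the estimator names are real scikit-learn classes; the parameter values are illustrative, and the exact regularisers differ slightly from the formulas above, as noted in the comments):

```python
# scikit-learn's shrinkage blends toward a scaled identity rather than
# diag(Sigma), and QDA's reg_param shrinks each class covariance toward the
# identity rather than the pooled covariance - close relatives of the
# formulas above, not exact matches.
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

# Shrinkage LDA: 'auto' picks the blend coefficient by Ledoit-Wolf;
# a float in [0, 1] sets it by hand. Requires the 'lsqr' or 'eigen' solver.
lda = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")

# Regularised QDA, in the spirit of Friedman's RDA.
qda = QuadraticDiscriminantAnalysis(reg_param=0.1)

# Usage: lda.fit(X_train, y_train); qda.fit(X_train, y_train)
```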

When LDA / QDA win

  • Modest dimension, Gaussian-ish features — LDA matches or beats logistic regression, often with much faster training.
  • Multi-class problems with equal effort across classes — LDA naturally handles all K at once.
  • As a dimensionality-reduction step before another classifier — Fisher LDA gives strong class-separation directions almost for free.

The Gaussian assumption is rarely exactly true, but LDA is robust to mild deviations. For categorical or extremely skewed data, generalised additive models or tree-based methods are better.

  • Naive Bayes — same Gaussian likelihood, but with diagonal covariance.
  • Logistic Regression — the discriminative analogue with the same linear boundary.
  • PCA & SVD — the unsupervised analogue of Fisher LDA.
