
Linear & Quadratic Discriminant Analysis

LDA and QDA are generative classifiers: model each class's feature distribution as a multivariate Gaussian, then classify by Bayes' rule. They sit between logistic regression (also linear, but discriminative) and Naive Bayes (also generative, but with a stronger independence assumption). LDA in particular is a workhorse baseline that doubles as a dimensionality reduction technique.

The model

For K classes, assume

$$P(x \mid y = k) = \mathcal{N}(\mu_k, \Sigma_k), \qquad P(y = k) = \pi_k.$$

By Bayes' rule, the log-posterior is

$$\log P(y = k \mid x) = -\tfrac{1}{2}(x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) - \tfrac{1}{2}\log\lvert\Sigma_k\rvert + \log \pi_k + \text{const}.$$

The classifier picks the class with the highest log-posterior. The shape of the decision boundary depends on what we assume about $\Sigma_k$.
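
In code, the decision rule is a one-liner once the parameters are known. A minimal sketch, assuming the $\mu_k$, $\Sigma_k$, $\pi_k$ are given (the function names and the use of scipy are our own, not from the original):

```python
# A minimal sketch of the generative decision rule, assuming the class
# parameters mu_k, Sigma_k, pi_k are already known. Function names are ours.
import numpy as np
from scipy.stats import multivariate_normal

def log_posteriors(x, mus, Sigmas, priors):
    """Unnormalised log P(y = k | x) for each class k."""
    return np.array([
        multivariate_normal.logpdf(x, mean=mu, cov=Sigma) + np.log(pi)
        for mu, Sigma, pi in zip(mus, Sigmas, priors)
    ])

def predict(x, mus, Sigmas, priors):
    # Bayes' rule: pick the class with the highest log-posterior.
    return int(np.argmax(log_posteriors(x, mus, Sigmas, priors)))
```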

QDA — class-specific covariance

If each class has its own covariance $\Sigma_k$, the decision boundaries are quadratic surfaces — ellipsoids, hyperboloids, paraboloids. QDA fits one mean $\mu_k$ and one covariance $\Sigma_k$ per class:

$$\hat{\mu}_k = \frac{1}{N_k}\sum_{i:\,y_i = k} x_i, \qquad \hat{\Sigma}_k = \frac{1}{N_k}\sum_{i:\,y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^\top.$$

QDA needs $K\,d(d+1)/2$ parameters for the covariances — quadratically many in the feature dimension $d$ — which makes it data-hungry.
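
A plain numpy sketch of these estimators, assuming X is an (N, d) array and y an (N,) vector of integer labels (names are illustrative, not a library API):

```python
# The QDA estimators above, in plain numpy: one mean and covariance per class.
import numpy as np

def fit_qda(X, y):
    classes = np.unique(y)
    priors, mus, Sigmas = [], [], []
    for k in classes:
        Xk = X[y == k]
        priors.append(len(Xk) / len(X))       # pi_k = N_k / N
        mus.append(Xk.mean(axis=0))           # mu_k
        # Sigma_k with the 1/N_k normalisation used in the formula above
        Sigmas.append(np.cov(Xk, rowvar=False, bias=True))
    return np.array(priors), np.array(mus), np.array(Sigmas)
```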

LDA — shared covariance

If all classes share a single covariance $\Sigma_k = \Sigma$, the quadratic terms cancel and the decision boundary becomes linear:

$$\delta_k(x) = x^\top \Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^\top \Sigma^{-1}\mu_k + \log \pi_k.$$

LDA estimates a single pooled covariance from all classes:

$$\hat{\Sigma} = \frac{1}{N - K}\sum_{k}\sum_{i:\,y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^\top.$$

This is Gaussian Naive Bayes with the diagonal covariance replaced by a full, shared one — the practical difference from Naive Bayes is that LDA captures correlations between features.
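
Continuing the numpy sketch above, a hedged implementation of the pooled estimate and the linear discriminant $\delta_k(x)$ (names are ours):

```python
# Pooled covariance and the linear discriminant delta_k(x).
import numpy as np

def fit_lda(X, y):
    classes = np.unique(y)
    N, K = len(X), len(classes)
    priors = np.array([np.mean(y == k) for k in classes])
    mus = np.array([X[y == k].mean(axis=0) for k in classes])
    # Pooled within-class scatter, normalised by N - K.
    Sigma = sum(
        (X[y == k] - mus[i]).T @ (X[y == k] - mus[i])
        for i, k in enumerate(classes)
    ) / (N - K)
    return priors, mus, Sigma

def lda_predict(X, priors, mus, Sigma):
    # delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log pi_k
    Sinv_mu = np.linalg.solve(Sigma, mus.T)                      # (d, K)
    scores = X @ Sinv_mu - 0.5 * np.sum(mus.T * Sinv_mu, axis=0) + np.log(priors)
    return scores.argmax(axis=1)
```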

LDA as dimensionality reduction

Beyond classification, LDA gives a supervised projection. The decision rules depend only on $\Sigma^{-1}\mu_k$ — at most $K-1$ independent directions. Fisher's LDA maximises the ratio of between-class to within-class variance:

$$\max_{w} \frac{w^\top S_B\, w}{w^\top S_W\, w},$$

with $S_B$ the between-class scatter and $S_W$ the within-class scatter. The solution is a generalised eigenvalue problem; the top $K-1$ eigenvectors give a $(K-1)$-dimensional projection that maximally separates the classes.

This is the supervised counterpart to PCA — PCA finds high-variance directions; LDA finds high class-separation directions. For visualisation of a labelled dataset in 2D, LDA-projected scatter plots are often more informative than PCA.
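
A sketch of the Fisher projection via the generalised eigenproblem, assuming scipy is available and $S_W$ is invertible (enough samples, no degenerate features); variable and function names are ours:

```python
# Fisher's LDA as a projection: build the scatter matrices and solve
# S_B w = lambda S_W w, keeping the top K-1 eigenvectors.
import numpy as np
from scipy.linalg import eigh

def fisher_projection(X, y, n_components=None):
    classes = np.unique(y)
    d, K = X.shape[1], len(classes)
    mean = X.mean(axis=0)
    S_W = np.zeros((d, d))   # within-class scatter
    S_B = np.zeros((d, d))   # between-class scatter
    for k in classes:
        Xk = X[y == k]
        mu_k = Xk.mean(axis=0)
        S_W += (Xk - mu_k).T @ (Xk - mu_k)
        diff = (mu_k - mean)[:, None]
        S_B += len(Xk) * (diff @ diff.T)
    # eigh solves the generalised problem; eigenvalues come back in
    # ascending order, so reverse to take the top directions.
    _, vecs = eigh(S_B, S_W)
    W = vecs[:, ::-1][:, : (n_components or K - 1)]
    return X @ W   # (K-1)-dimensional, class-separating projection
```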

Regularisation: shrinkage and RDA

When $d$ is large or $N$ is small, $\hat{\Sigma}$ may be singular or near-singular. Two fixes:

  • Shrinkage — replace $\hat{\Sigma}$ with $(1-\alpha)\,\hat{\Sigma} + \alpha\,\mathrm{diag}(\hat{\Sigma})$ for some $\alpha \in [0, 1]$, blending toward the diagonal-only (Naive Bayes) covariance.
  • Regularised Discriminant Analysis (Friedman, 1989) — interpolate between LDA and QDA: $\hat{\Sigma}_k(\gamma) = \gamma\,\hat{\Sigma}_k + (1-\gamma)\,\hat{\Sigma}$, with $\gamma \in [0, 1]$ chosen by cross-validation. Both are sketched in code below.
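
Both fixes have close counterparts in scikit-learn. A minimal sketch (the estimator names are real scikit-learn classes; the parameter values are illustrative, and the exact regularisers differ slightly from the formulas above, as noted in the comments):

```python
# scikit-learn's shrinkage blends toward a scaled identity rather than
# diag(Sigma), and QDA's reg_param shrinks each class covariance toward the
# identity rather than the pooled covariance - close relatives of the
# formulas above, not exact matches.
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

# Shrinkage LDA: 'auto' picks the blend coefficient by Ledoit-Wolf;
# a float in [0, 1] sets it by hand. Requires the 'lsqr' or 'eigen' solver.
lda = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")

# Regularised QDA, in the spirit of Friedman's RDA.
qda = QuadraticDiscriminantAnalysis(reg_param=0.1)

# Usage: lda.fit(X_train, y_train); qda.fit(X_train, y_train)
```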

When LDA / QDA win

  • Modest dimension, Gaussian-ish features — LDA matches or beats logistic regression, often with much faster training.
  • Multi-class problems with equal effort across classes — LDA naturally handles all K at once.
  • As a dimensionality-reduction step before another classifier — Fisher LDA gives strong class-separation directions almost for free.

The Gaussian assumption is rarely exactly true, but LDA is robust to mild deviations. For categorical or extremely skewed data, generalised additive models or tree-based methods are better.

  • Naive Bayes — same Gaussian likelihood, but with diagonal covariance.
  • Logistic Regression — the discriminative analogue with the same linear boundary.
  • PCA & SVD — the unsupervised analogue of Fisher LDA.
