
PCA & SVD

Principal Component Analysis is the canonical linear dimensionality-reduction technique: find the directions of maximum variance in the data and project onto the top k of them. Computationally, it is just the SVD of the centred data matrix. PCA is the "first thing to try" for any high-dimensional dataset and the conceptual ancestor of every later representation-learning method.

The objective

Given $N$ data points $x_i \in \mathbb{R}^d$, find a $k$-dimensional projection that maximises projected variance — equivalently, minimises reconstruction error. Centre the data: $\tilde{x}_i = x_i - \bar{x}$. The first principal component is

$$w_1 = \underset{\|w\|=1}{\arg\max}\; \mathrm{Var}(w^\top \tilde{x}) = \underset{\|w\|=1}{\arg\max}\; w^\top S w,$$

with $S = \frac{1}{N}\sum_i \tilde{x}_i \tilde{x}_i^\top$ the (sample) covariance matrix. The top-$k$ components are the eigenvectors of $S$ corresponding to the $k$ largest eigenvalues.
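
A minimal numpy sketch of this eigendecomposition route (data and variable names are illustrative): centre, form the covariance matrix, and keep the eigenvectors with the largest eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))        # toy data: N=500 samples, d=10 features

X_c = X - X.mean(axis=0)              # centre
S = X_c.T @ X_c / len(X)              # sample covariance, d x d

eigvals, eigvecs = np.linalg.eigh(S)  # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]     # reorder to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
W = eigvecs[:, :k]                    # top-k principal components (d x k)
Z = X_c @ W                           # projected data (N x k)
```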

SVD computation

For the centred data matrix $\tilde{X} \in \mathbb{R}^{N \times d}$ (one row per sample), the SVD is $\tilde{X} = U \Sigma V^\top$. Then:

  • The columns of $V$ are the principal components (eigenvectors of $\tilde{X}^\top \tilde{X} = N S$).
  • The squared singular values, scaled as $\sigma_i^2 / N$, are the variances along each component.
  • The projected data is $\tilde{X} V_k = U_k \Sigma_k$ — the first $k$ columns of $U \Sigma$.

For $d \ll N$, work with the $d \times d$ covariance and its eigendecomposition. For $N \ll d$ (e.g., genomics with $N$ samples and $d$ genes), work with the $N \times N$ Gram matrix instead. Both routes give the same components.
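
A sketch of both routes in numpy on a toy $N \ll d$ matrix (names are illustrative; component signs can differ between routes, since singular vectors are defined only up to sign):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2000))       # N=50 samples, d=2000 features (N << d)
X_c = X - X.mean(axis=0)
N = len(X)

# Route 1: thin SVD of the centred data matrix
U, s, Vt = np.linalg.svd(X_c, full_matrices=False)
variances = s**2 / N                  # variance along each component
Z = U * s                             # projected data, equals X_c @ Vt.T

# Route 2: eigendecomposition of the N x N Gram matrix (cheap when N << d)
G = X_c @ X_c.T
gvals, gvecs = np.linalg.eigh(G)
order = np.argsort(gvals)[::-1]
gvals, gvecs = gvals[order], gvecs[:, order]
r = N - 1                             # centring removes one degree of freedom
sigmas = np.sqrt(np.clip(gvals[:r], 0, None))
V_gram = X_c.T @ gvecs[:, :r] / sigmas   # right singular vectors: v_i = X^T u_i / sigma_i

# Both routes agree on the spectrum (and on components, up to sign)
assert np.allclose(s[:r]**2, gvals[:r])
```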

Three views

PCA can be derived from three independent objectives that all agree:

  • Maximum variance — pick the projection whose projected variance is largest.
  • Minimum reconstruction error — pick the projection whose reconstruction $\hat{x} = V_k V_k^\top \tilde{x}$ minimises $\sum_i \|\tilde{x}_i - \hat{x}_i\|^2$.
  • Decorrelation — find an orthonormal basis in which features are linearly uncorrelated.

These coincide because of the Eckart-Young theorem: the best rank-$k$ approximation of $\tilde{X}$ in Frobenius norm is the truncated SVD.
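
A quick numerical check of this (a sketch; the identity holds for any matrix): the squared Frobenius error of the rank-$k$ truncation equals the sum of the discarded squared singular values.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(100, 30))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 5
A_k = (U[:, :k] * s[:k]) @ Vt[:k]         # truncated SVD: best rank-k approximation

err = np.linalg.norm(A - A_k, 'fro')**2
assert np.isclose(err, np.sum(s[k:]**2))  # discarded spectrum = Frobenius error
```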

Whitening

After projecting, you can whiten by dividing each component by its singular value:

$$z = \Sigma_k^{-1} V_k^\top \tilde{x}.$$

The result has unit covariance (up to a factor of $\sqrt{N}$, depending on the covariance normalisation convention) and is the input format expected by some downstream methods (Fisher LDA, ICA). Whitening is a regularisation choice — it equalises components, removing the natural variance-based weighting.
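
A sketch of whitening via the SVD, including the $\sqrt{N}$ factor so the sample covariance comes out exactly as the identity:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 5))  # correlated features
X_c = X - X.mean(axis=0)
N = len(X)

U, s, Vt = np.linalg.svd(X_c, full_matrices=False)
k = 5
Z = np.sqrt(N) * (X_c @ Vt[:k].T) / s[:k]   # whiten: project, then rescale per component

cov = Z.T @ Z / N                            # sample covariance of whitened data
assert np.allclose(cov, np.eye(k))
```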

Choosing k

  • Variance explained — choose $k$ such that $\sum_{i \le k} \sigma_i^2 / \sum_i \sigma_i^2 \ge 0.95$ (or 0.99). Standard but ad hoc; see the sketch after this list.
  • Scree plot — plot $\sigma_i^2$ vs $i$, look for an "elbow".
  • Cross-validate — pick $k$ minimising downstream-task error.
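
A sketch of the variance-explained rule (`choose_k` and the 0.95 default are illustrative):

```python
import numpy as np

def choose_k(X, threshold=0.95):
    """Smallest k whose components explain `threshold` of total variance."""
    X_c = X - X.mean(axis=0)
    s = np.linalg.svd(X_c, compute_uv=False)      # singular values only
    ratio = np.cumsum(s**2) / np.sum(s**2)        # cumulative variance explained
    return int(np.searchsorted(ratio, threshold) + 1)

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 50)) @ np.diag(np.linspace(3, 0.1, 50))
print(choose_k(X))                                # k hitting 95% cumulative variance
```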

Probabilistic PCA

Probabilistic PCA (Tipping & Bishop, 1999) gives a generative model:

$$x = W z + \mu + \epsilon, \qquad z \sim \mathcal{N}(0, I), \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I).$$

The maximum-likelihood $W$ spans the same subspace as the top-$k$ eigenvectors of $S$, and recovers them (up to rotation and scale) as $\sigma^2 \to 0$. The probabilistic view enables Bayesian PCA, missing-data PCA, and VAE-style extensions.
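
The ML solution is available in closed form; a sketch (the noise variance is the mean of the discarded eigenvalues, and $W$ is determined only up to a rotation $R$, taken here as the identity):

```python
import numpy as np

def ppca_ml(X, k):
    """Closed-form ML estimate for probabilistic PCA (Tipping & Bishop, 1999)."""
    X_c = X - X.mean(axis=0)
    S = X_c.T @ X_c / len(X)
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    sigma2 = eigvals[k:].mean()                   # noise variance = mean discarded eigenvalue
    # W = V_k (Lambda_k - sigma^2 I)^{1/2}, taking the arbitrary rotation R = I
    W = eigvecs[:, :k] * np.sqrt(np.maximum(eigvals[:k] - sigma2, 0.0))
    return W, sigma2
```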

Limitations

  • Linear. PCA finds the best linear subspace. Non-linear structure (manifolds, clusters) is missed.
  • Variance is not the same as meaning. High variance might come from noise or scale, not signal. Standardise inputs first if features are on different scales.
  • Sensitive to outliers. Squared error is dominated by extreme points; consider robust PCA for noisy data.

For non-linear structure use t-SNE / UMAP or autoencoders. For non-Gaussian latents, use ICA or normalising flows.

What PCA is for, today

  • Visualisation of high-dimensional data — project to 2D / 3D for inspection.
  • Compression with reconstruction guarantees — top-k SVD is the optimal rank-k approximation.
  • Pre-processing for downstream methods sensitive to dimensionality (kNN, GMM, clustering).
  • Feature analysis — examining principal components reveals dominant axes of variation.
  • Inside larger systems — covariance estimators in finance, denoising, image compression (DCT-style).

In the deep-learning era, PCA's role has shifted from feature extractor to diagnostic tool — it tells you whether low-dimensional structure exists before you commit to a more expressive model.
