
PCA & SVD

Principal Component Analysis is the canonical linear dimensionality-reduction technique: find the directions of maximum variance in the data and project onto the top k of them. Computationally, it is just the SVD of the centred data matrix. PCA is the "first thing to try" for any high-dimensional dataset and the conceptual ancestor of every later representation-learning method.

The objective

Given $N$ data points $x_i \in \mathbb{R}^d$, find a $k$-dimensional projection that maximises projected variance — equivalently, minimises reconstruction error. Centre the data: $\tilde{x}_i = x_i - \bar{x}$. The first principal component is

$$w_1 = \underset{\|w\|=1}{\arg\max}\; \mathrm{Var}(w^\top \tilde{x}) = \underset{\|w\|=1}{\arg\max}\; w^\top S w,$$

with $S = \frac{1}{N}\sum_i \tilde{x}_i \tilde{x}_i^\top$ the (sample) covariance matrix. The top-$k$ components are the eigenvectors of $S$ corresponding to the $k$ largest eigenvalues.
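
A minimal numpy sketch of this eigendecomposition route (data and variable names are illustrative): centre, form the covariance matrix, and keep the eigenvectors with the largest eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))        # toy data: N=500 samples, d=10 features

X_c = X - X.mean(axis=0)              # centre
S = X_c.T @ X_c / len(X)              # sample covariance, d x d

eigvals, eigvecs = np.linalg.eigh(S)  # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]     # reorder to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
W = eigvecs[:, :k]                    # top-k principal components (d x k)
Z = X_c @ W                           # projected data (N x k)
```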

SVD computation

For the centred data matrix $\tilde{X} \in \mathbb{R}^{N \times d}$ (one row per sample), the SVD is $\tilde{X} = U \Sigma V^\top$. Then:

  • The columns of $V$ are the principal components (eigenvectors of $\tilde{X}^\top \tilde{X} = N S$).
  • The squared singular values, scaled as $\sigma_i^2 / N$, are the variances along each component.
  • The projected data is $\tilde{X} V_k = U_k \Sigma_k$ — the first $k$ columns of $U \Sigma$.

For $d \ll N$, work with the $d \times d$ covariance and its eigendecomposition. For $N \ll d$ (e.g., genomics with $N$ samples and $d$ genes), work with the $N \times N$ Gram matrix instead. Both routes give the same components.
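
A sketch of both routes in numpy on a toy $N \ll d$ matrix (names are illustrative; component signs can differ between routes, since singular vectors are defined only up to sign):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2000))       # N=50 samples, d=2000 features (N << d)
X_c = X - X.mean(axis=0)
N = len(X)

# Route 1: thin SVD of the centred data matrix
U, s, Vt = np.linalg.svd(X_c, full_matrices=False)
variances = s**2 / N                  # variance along each component
Z = U * s                             # projected data, equals X_c @ Vt.T

# Route 2: eigendecomposition of the N x N Gram matrix (cheap when N << d)
G = X_c @ X_c.T
gvals, gvecs = np.linalg.eigh(G)
order = np.argsort(gvals)[::-1]
gvals, gvecs = gvals[order], gvecs[:, order]
r = N - 1                             # centring removes one degree of freedom
sigmas = np.sqrt(np.clip(gvals[:r], 0, None))
V_gram = X_c.T @ gvecs[:, :r] / sigmas   # right singular vectors: v_i = X^T u_i / sigma_i

# Both routes agree on the spectrum (and on components, up to sign)
assert np.allclose(s[:r]**2, gvals[:r])
```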

Three views

PCA can be derived from three independent objectives that all agree:

  • Maximum variance — pick the projection whose projected variance is largest.
  • Minimum reconstruction error — pick the projection whose reconstruction $\hat{x} = V_k V_k^\top \tilde{x}$ minimises $\sum_i \|\tilde{x}_i - \hat{x}_i\|^2$.
  • Decorrelation — find an orthonormal basis in which features are linearly uncorrelated.

These coincide because of the Eckart-Young theorem: the best rank-$k$ approximation of $\tilde{X}$ in Frobenius norm is the truncated SVD.
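
A quick numerical check of this (a sketch; the identity holds for any matrix): the squared Frobenius error of the rank-$k$ truncation equals the sum of the discarded squared singular values.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(100, 30))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 5
A_k = (U[:, :k] * s[:k]) @ Vt[:k]         # truncated SVD: best rank-k approximation

err = np.linalg.norm(A - A_k, 'fro')**2
assert np.isclose(err, np.sum(s[k:]**2))  # discarded spectrum = Frobenius error
```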

Whitening

After projecting, you can whiten by dividing each component by its singular value:

$$z = \Sigma_k^{-1} V_k^\top \tilde{x}.$$

The result has unit covariance (up to a factor of $\sqrt{N}$, depending on the covariance normalisation convention) and is the input format expected by some downstream methods (Fisher LDA, ICA). Whitening is a regularisation choice — it equalises components, removing the natural variance-based weighting.
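
A sketch of whitening via the SVD, including the $\sqrt{N}$ factor so the sample covariance comes out exactly as the identity:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 5))  # correlated features
X_c = X - X.mean(axis=0)
N = len(X)

U, s, Vt = np.linalg.svd(X_c, full_matrices=False)
k = 5
Z = np.sqrt(N) * (X_c @ Vt[:k].T) / s[:k]   # whiten: project, then rescale per component

cov = Z.T @ Z / N                            # sample covariance of whitened data
assert np.allclose(cov, np.eye(k))
```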

Choosing k

  • Variance explained — choose $k$ such that $\sum_{i \le k} \sigma_i^2 / \sum_i \sigma_i^2 \ge 0.95$ (or 0.99). Standard but ad hoc; see the sketch after this list.
  • Scree plot — plot $\sigma_i^2$ vs $i$, look for an "elbow".
  • Cross-validate — pick $k$ minimising downstream-task error.
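
A sketch of the variance-explained rule (`choose_k` and the 0.95 default are illustrative):

```python
import numpy as np

def choose_k(X, threshold=0.95):
    """Smallest k whose components explain `threshold` of total variance."""
    X_c = X - X.mean(axis=0)
    s = np.linalg.svd(X_c, compute_uv=False)      # singular values only
    ratio = np.cumsum(s**2) / np.sum(s**2)        # cumulative variance explained
    return int(np.searchsorted(ratio, threshold) + 1)

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 50)) @ np.diag(np.linspace(3, 0.1, 50))
print(choose_k(X))                                # k hitting 95% cumulative variance
```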

Probabilistic PCA

Probabilistic PCA (Tipping & Bishop, 1999) gives a generative model:

$$x = W z + \mu + \epsilon, \qquad z \sim \mathcal{N}(0, I), \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I).$$

The maximum-likelihood $W$ spans the same subspace as the top-$k$ eigenvectors of $S$, and recovers them (up to rotation and scale) as $\sigma^2 \to 0$. The probabilistic view enables Bayesian PCA, missing-data PCA, and VAE-style extensions.
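
The ML solution is available in closed form; a sketch (the noise variance is the mean of the discarded eigenvalues, and $W$ is determined only up to a rotation $R$, taken here as the identity):

```python
import numpy as np

def ppca_ml(X, k):
    """Closed-form ML estimate for probabilistic PCA (Tipping & Bishop, 1999)."""
    X_c = X - X.mean(axis=0)
    S = X_c.T @ X_c / len(X)
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    sigma2 = eigvals[k:].mean()                   # noise variance = mean discarded eigenvalue
    # W = V_k (Lambda_k - sigma^2 I)^{1/2}, taking the arbitrary rotation R = I
    W = eigvecs[:, :k] * np.sqrt(np.maximum(eigvals[:k] - sigma2, 0.0))
    return W, sigma2
```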

Limitations

  • Linear. PCA finds the best linear subspace. Non-linear structure (manifolds, clusters) is missed.
  • Variance is not the same as meaning. High variance might come from noise or scale, not signal. Standardise inputs first if features are on different scales.
  • Sensitive to outliers. Squared error is dominated by extreme points; consider robust PCA for noisy data.

For non-linear structure use t-SNE / UMAP or autoencoders. For non-Gaussian latents, use ICA or normalising flows.

What PCA is for, today

  • Visualisation of high-dimensional data — project to 2D / 3D for inspection.
  • Compression with reconstruction guarantees — top-k SVD is the optimal rank-k approximation.
  • Pre-processing for downstream methods sensitive to dimensionality (kNN, GMM, clustering).
  • Feature analysis — examining principal components reveals dominant axes of variation.
  • Inside larger systems — covariance estimators in finance, denoising, image compression (DCT-style).

In the deep-learning era, PCA's role has shifted from feature extractor to diagnostic tool — it tells you whether low-dimensional structure exists before you commit to a more expressive model.
