
Bias–Variance Tradeoff

The bias–variance decomposition is the classical lens for understanding generalisation error. It splits expected test error into three contributions — bias (systematic mismatch between model and truth), variance (sensitivity to training data), and irreducible noise. Modern over-parameterised deep learning complicates the picture (see double descent) but does not invalidate the classical case.

The decomposition

Suppose data is generated as $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2$. Train a learner on a random training set $D$ to produce a predictor $\hat{f}_D(x)$. The expected squared error at a test point $x$, averaged over the random draw of $D$, decomposes as

$$
\mathbb{E}_D\big[(y - \hat{f}_D(x))^2\big]
= \underbrace{\big(\mathbb{E}_D[\hat{f}_D(x)] - f(x)\big)^2}_{\text{Bias}^2}
+ \underbrace{\operatorname{Var}_D\big(\hat{f}_D(x)\big)}_{\text{Variance}}
+ \underbrace{\sigma^2}_{\text{Noise}}.
$$

Three sources of error: the first two depend on the learner and trade off against each other; the noise term is a floor no learner can reduce.
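
The decomposition can be checked numerically by resampling many training sets and averaging. Below is a minimal sketch, assuming a sinusoidal ground truth and ordinary least-squares polynomial fits; the degrees, noise level, and sample sizes are illustrative choices, not taken from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                               # ground truth
    return np.sin(2 * np.pi * x)

sigma = 0.3                             # noise std, so irreducible error is sigma**2
n_train, n_sets = 30, 500               # training-set size and number of resampled sets
x_test = np.linspace(0, 1, 200)

for degree in (1, 4, 9):
    preds = np.empty((n_sets, x_test.size))
    for s in range(n_sets):
        x = rng.uniform(0, 1, n_train)
        y = f(x) + rng.normal(0, sigma, n_train)
        coeffs = np.polyfit(x, y, degree)            # least-squares polynomial fit
        preds[s] = np.polyval(coeffs, x_test)
    bias2 = np.mean((preds.mean(axis=0) - f(x_test)) ** 2)   # (E_D[f_hat] - f)^2
    variance = np.mean(preds.var(axis=0))                     # Var_D(f_hat)
    print(f"degree {degree}: bias^2={bias2:.3f}  var={variance:.3f}  "
          f"sum+noise={bias2 + variance + sigma**2:.3f}")
```

The low degree shows high bias² and low variance, the high degree the reverse, and bias² + variance + σ² tracks the expected test error up to Monte Carlo noise.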

Bias and variance, intuitively

  • Bias measures how far the average model is from the truth. High bias = the model class is too restrictive. Linear regression on a sinusoidal target is high-bias.
  • Variance measures how much the model fluctuates as the training set changes. High variance = the model is too flexible relative to the data. A 1-NN classifier is high-variance.
  • Noise is the floor — irreducible from the data alone.

A high-bias model underfits; a high-variance model overfits.
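
The two bullet examples can be quantified in the same setup: a straight-line fit barely moves between training sets but misses the curve, while a 1-nearest-neighbour regressor tracks each set's noise. A small sketch under the same illustrative assumptions as before:

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)      # sinusoidal target
sigma, n_train, n_sets = 0.3, 30, 500
x_test = np.linspace(0, 1, 200)

lin_preds = np.empty((n_sets, x_test.size))
nn_preds = np.empty((n_sets, x_test.size))
for s in range(n_sets):
    x = rng.uniform(0, 1, n_train)
    y = f(x) + rng.normal(0, sigma, n_train)
    slope, intercept = np.polyfit(x, y, 1)                   # high-bias: linear fit
    lin_preds[s] = slope * x_test + intercept
    nearest = np.abs(x_test[:, None] - x[None, :]).argmin(axis=1)
    nn_preds[s] = y[nearest]                                 # high-variance: 1-NN regression

for name, preds in (("linear", lin_preds), ("1-NN", nn_preds)):
    bias2 = np.mean((preds.mean(axis=0) - f(x_test)) ** 2)
    var = np.mean(preds.var(axis=0))
    print(f"{name:6s}  bias^2={bias2:.3f}  var={var:.3f}")
```

Typically the linear fit's error is almost all bias² while the 1-NN's is almost all variance, which is exactly the underfit/overfit contrast.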

The classical U-curve

Plot test error against model complexity. The classical picture:

  • At low complexity, bias dominates — increasing complexity reduces bias faster than it increases variance.
  • At high complexity, variance dominates — the model memorises training noise.
  • The optimal complexity sits at the bottom of the U-shaped test-error curve.

This drove decades of model-selection wisdom: regularise to limit complexity, validate on held-out data, prefer the simpler model that performs comparably (Occam's razor).
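
A validation curve makes the U concrete: sweep a complexity knob (here polynomial degree, an illustrative stand-in for model capacity) and compare training error with held-out error.

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: np.sin(2 * np.pi * x)
sigma = 0.3

x_train = rng.uniform(0, 1, 40)
y_train = f(x_train) + rng.normal(0, sigma, x_train.size)
x_val = rng.uniform(0, 1, 200)                      # held-out validation set
y_val = f(x_val) + rng.normal(0, sigma, x_val.size)

for degree in range(1, 11):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    val_mse = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    print(f"degree {degree:2d}: train MSE={train_mse:.3f}  val MSE={val_mse:.3f}")
```

Training error falls monotonically with degree; validation error traces the U, and classical model selection picks the degree at its minimum.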

What deep learning broke

Modern deep networks have so many parameters that they can perfectly memorise training data — yet they generalise. This should not happen in the classical U-curve picture. The resolution, established empirically by Reconciling Modern Machine Learning Practice and the Classical Bias–Variance Trade-off (Belkin et al., PNAS 2019), is double descent: as capacity grows past the interpolation threshold, test error first peaks (the classical regime hits its maximum) and then descends again into a low-error over-parameterised regime.

The mechanism: in the over-parameterised regime, infinitely many parameter settings achieve zero training error. The optimiser picks one; SGD's implicit bias selects a flat, low-norm minimum that generalises. More capacity doesn't add variance because the optimiser doesn't use it to fit noise.
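
Double descent is easiest to demonstrate in a linear toy model rather than a deep network: fit minimum-norm least squares on random ReLU features and sweep the number of features past the number of training points. This is a standard illustrative setting, not the experiments from the paper, and the exact peak height depends on seeds and feature scaling.

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: np.sin(2 * np.pi * x)
sigma, n_train, n_test = 0.1, 40, 500

x_train = rng.uniform(-1, 1, (n_train, 1))
y_train = f(x_train[:, 0]) + rng.normal(0, sigma, n_train)
x_test = rng.uniform(-1, 1, (n_test, 1))
y_test = f(x_test[:, 0])

def relu_features(x, W, b):
    # random ReLU feature map phi(x) = max(0, xW + b); only the linear readout is trained
    return np.maximum(0.0, x @ W + b)

for p in (5, 10, 20, 40, 80, 160, 640, 2560):       # n_train = 40 is the interpolation threshold
    mses = []
    for trial in range(10):                          # average over random feature draws
        W = rng.normal(size=(1, p))
        b = rng.normal(size=p)
        Phi_tr, Phi_te = relu_features(x_train, W, b), relu_features(x_test, W, b)
        # pinv gives the least-squares solution below the threshold and the
        # minimum-norm interpolating solution above it
        theta = np.linalg.pinv(Phi_tr) @ y_train
        mses.append(np.mean((Phi_te @ theta - y_test) ** 2))
    print(f"p={p:5d}  mean test MSE={np.mean(mses):.3f}")
```

Test error typically rises sharply as p approaches n_train and then falls again as p grows well past it, mirroring the double-descent curve.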

Practical implications

The classical bias–variance frame remains correct in three settings:

  • Classical ML methods (linear/logistic regression, SVM, small MLPs) — model selection via validation curves, regularisation, early stopping all apply.
  • Small deep networks (under-parameterised) — same advice as classical.
  • Test-set evaluation in any regime — bias–variance still describes test error; the relationship to capacity is what changes.

For modern over-parameterised deep learning, the take-home is to train larger models for longer than classical wisdom suggests, and to read validation curves with care: a model that looks worse at the interpolation threshold may be much better past it.
