Empirical Risk Minimization

Empirical Risk Minimization (ERM) is the foundational principle of supervised learning: pick the hypothesis from a chosen class that minimises average loss on the training set, and hope that minimising training loss approximately minimises true loss on unseen data. Almost every supervised method in this curriculum is an instance of ERM with different choices of hypothesis class and loss.

The setup

Assume samples $(x, y)$ are drawn i.i.d. from an unknown distribution $D$. A hypothesis class $H$ is the set of candidate predictors (linear models, decision trees, neural nets of a fixed architecture). A loss function $\ell(h(x), y)$ scores how badly hypothesis $h$ predicts $y$. The true risk of $h$ is

$$R(h) = \mathbb{E}_{(x,y)\sim D}\big[\ell(h(x), y)\big].$$

We never observe $D$, so we cannot evaluate $R(h)$ directly. Given a training set $S = \{(x_i, y_i)\}_{i=1}^{N}$, the empirical risk is

$$\hat{R}_S(h) = \frac{1}{N} \sum_{i=1}^{N} \ell(h(x_i), y_i).$$

ERM picks

$$\hat{h} = \operatorname*{arg\,min}_{h \in H} \hat{R}_S(h).$$

This is the algorithmic core — train by minimising the average training-set loss.
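As a concrete sketch, here is ERM over a tiny finite class of threshold classifiers under 0-1 loss. The data-generating rule, the class, and the noise level are all illustrative toy choices, not anything prescribed by the theory above:

```python
# ERM over a finite class (assumptions: 1-D inputs, threshold classifiers,
# 0-1 loss, 10% label noise -- everything here is an illustrative toy).
import numpy as np

rng = np.random.default_rng(0)

# Training set S: true rule is 1[x > 0.3], with 10% of labels flipped.
N = 200
x = rng.uniform(-1, 1, N)
y = (x > 0.3).astype(int)
flip = rng.random(N) < 0.1
y[flip] = 1 - y[flip]

# Hypothesis class H: h_t(x) = 1[x > t] for thresholds t on a grid.
thresholds = np.linspace(-1, 1, 201)

def empirical_risk(t):
    """Average 0-1 loss of h_t on S."""
    return np.mean((x > t).astype(int) != y)

risks = np.array([empirical_risk(t) for t in thresholds])
t_hat = thresholds[np.argmin(risks)]           # the ERM hypothesis
print(f"ERM threshold {t_hat:.2f}, empirical risk {risks.min():.3f}")
```

The recovered threshold lands near the true 0.3, and the minimum empirical risk hovers near the 10% noise floor.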

Why ERM works

ERM works because of two things working together:

  1. Concentration — for fixed $h$, $\hat{R}_S(h) \to R(h)$ as $N \to \infty$ by the law of large numbers. With i.i.d. samples and bounded loss, Hoeffding's inequality gives an exponential rate.
  2. Uniform convergence — for the whole class $H$, $\sup_{h \in H} \big|\hat{R}_S(h) - R(h)\big|$ also goes to zero, provided $H$ is not too rich. This is the part that's non-trivial — it's what VC dimension and Rademacher complexity quantify.
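The concentration step can be checked numerically. This sketch assumes a fixed hypothesis whose 0-1 loss is Bernoulli with true risk 0.25, and compares the observed deviation frequency across many simulated training sets against the two-sided Hoeffding bound $2e^{-2N\varepsilon^2}$:

```python
# Concentration check (assumption: 0-1 losses are Bernoulli with true risk
# 0.25, so the two-sided Hoeffding bound 2*exp(-2*N*eps^2) applies).
import numpy as np

rng = np.random.default_rng(1)
true_risk, eps, trials = 0.25, 0.05, 1000

for N in [100, 1000, 10000]:
    losses = rng.random((trials, N)) < true_risk   # many independent sets S
    emp_risks = losses.mean(axis=1)                # R_hat_S for each S
    deviation_freq = np.mean(np.abs(emp_risks - true_risk) > eps)
    hoeffding = min(2 * np.exp(-2 * N * eps ** 2), 1.0)
    print(f"N={N:5d}  observed P(|R_hat - R| > {eps}) = {deviation_freq:.3f}  "
          f"Hoeffding bound = {hoeffding:.3g}")
```

The observed deviation frequency drops rapidly with $N$ and always sits below the bound; the bound itself is exponentially small once $N$ is large.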

Combining: if uniform convergence holds, then $\hat{h}$ — chosen by ERM — has true risk close to the best in the class. Formally, with high probability:

$$R(\hat{h}) \le \min_{h \in H} R(h) + 2 \sup_{h \in H} \big|\hat{R}_S(h) - R(h)\big|.$$

The second term is the estimation error controlled by sample size and class complexity; the first is the approximation error — how good the best hypothesis in H is.

The decomposition

$$R(\hat{h}) = \underbrace{\min_{h \in H} R(h)}_{\text{approximation}} + \underbrace{R(\hat{h}) - \min_{h \in H} R(h)}_{\text{estimation}}$$

is the formal version of the bias-variance tradeoff. A bigger H lowers approximation error (better best-in-class) but raises estimation error (harder to find the best from limited data). The art of model selection is balancing them.
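The tradeoff shows up in a toy experiment: nested polynomial classes fitted by least squares (ERM under squared loss), with a large held-out set standing in for true risk. The target function, noise level, and degrees are illustrative assumptions:

```python
# Toy bias-variance experiment (assumptions: target sin(3x), noise 0.3,
# nested polynomial classes H_d fitted by least squares -- all illustrative).
import numpy as np

rng = np.random.default_rng(2)

def sample(n):
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(0, 0.3, n)   # draws from the "unknown" D
    return x, y

x_tr, y_tr = sample(30)        # small training set S
x_te, y_te = sample(5000)      # large held-out set approximates true risk

degrees = [1, 3, 9, 15]
train_err, test_err = [], []
for d in degrees:
    coeffs = np.polyfit(x_tr, y_tr, d)          # ERM within H_d
    train_err.append(np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2))
    test_err.append(np.mean((np.polyval(coeffs, x_te) - y_te) ** 2))
    print(f"degree {d:2d}: train MSE {train_err[-1]:.3f}, "
          f"test MSE {test_err[-1]:.3f}")
```

Training error falls monotonically as the class grows, while held-out error bottoms out at an intermediate degree and then climbs — the estimation term taking over.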

Structural Risk Minimization

Structural Risk Minimization (Vapnik, 1971) is the upgrade for choosing $H$ itself. Instead of fixing one class, consider a sequence $H_1 \subseteq H_2 \subseteq \cdots$ of increasing complexity (e.g., polynomial degree 1, 2, 3, ...). For each $k$, ERM gives $\hat{h}_k$. SRM picks the level $k$ minimising

$$\hat{R}_S(\hat{h}_k) + \text{penalty}(H_k, N),$$

where the penalty is a complexity term (VC-dimension based, Rademacher, or AIC/BIC). This is the principled answer to "how big should my model be?". Modern regularisation methods (ridge, lasso, weight decay) are continuous relaxations of SRM — penalising in-class parameter norm rather than discretely choosing a class.
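A toy version of the SRM recipe, assuming nested polynomial classes and an AIC-style penalty $2k\sigma^2/N$ — both the penalty form and the assumed noise variance are illustrative stand-ins for a VC-style complexity term:

```python
# Toy SRM (assumptions: nested polynomial classes H_k, squared loss, and an
# AIC-style penalty 2*k*sigma^2/N with sigma^2 assumed known -- the penalty
# form and constants are illustrative, not Vapnik's VC bound).
import numpy as np

rng = np.random.default_rng(3)
N = 40
x = rng.uniform(-1, 1, N)
y = np.sin(3 * x) + rng.normal(0, 0.3, N)

sigma2 = 0.09                     # assumed noise variance
best_degree, best_score = None, np.inf
for degree in range(1, 13):
    coeffs = np.polyfit(x, y, degree)                  # ERM inside H_degree
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    penalty = 2 * (degree + 1) * sigma2 / N            # complexity term
    score = train_mse + penalty
    if score < best_score:
        best_degree, best_score = degree, score
print(f"SRM picks degree {best_degree} (penalised score {best_score:.3f})")
```

Raw training error alone would always favour the largest class; the penalty term is what makes an intermediate degree win.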

ERM in practice

For most modern ML, ERM looks like:

  • Hypothesis class — neural network of a chosen architecture.
  • Loss — task-appropriate (cross-entropy, MSE, contrastive).
  • Optimiser — SGD or Adam on $\hat{R}_S$.
  • Regularisation / model selection — combination of weight decay, dropout, augmentation, cross-validation.
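A minimal end-to-end sketch of that recipe, with logistic regression standing in for the network — minibatch SGD on cross-entropy plus L2 weight decay; the data and hyperparameters are toy choices:

```python
# "ERM in practice" sketch (assumptions: logistic regression stands in for
# the network, minibatch SGD, L2 weight decay as the regulariser; toy data).
import numpy as np

rng = np.random.default_rng(4)

# Toy binary classification data from a noisy linear rule.
N, d = 500, 5
X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
y = (X @ w_true + 0.5 * rng.normal(size=N) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(d)
lr, weight_decay, batch = 0.1, 1e-3, 32

for step in range(2000):
    idx = rng.integers(0, N, batch)            # sample a minibatch
    p = sigmoid(X[idx] @ w)
    grad = X[idx].T @ (p - y[idx]) / batch     # gradient of cross-entropy
    w -= lr * (grad + weight_decay * w)        # SGD step with weight decay

p_all = sigmoid(X @ w)
train_risk = np.mean(-(y * np.log(p_all + 1e-12)
                       + (1 - y) * np.log(1 - p_all + 1e-12)))
accuracy = np.mean((p_all > 0.5) == (y > 0.5))
print(f"empirical risk {train_risk:.3f}, train accuracy {accuracy:.3f}")
```

Every ingredient of the bullet list appears: a hypothesis class (linear logits), a task-appropriate loss (cross-entropy), a stochastic optimiser, and a regulariser folded into the update.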

The classical theory assumes ERM finds the global minimum of R^S. For convex problems this is true; for deep networks, SGD finds something else — a flat, low-norm minimum that the implicit-bias literature is still working to characterise. The empirical fact is that this something else generalises remarkably well, even though classical ERM theory does not directly explain why.

Released under the MIT License. Content imported and adapted from NoteNextra.