Empirical Risk Minimization
Empirical Risk Minimization (ERM) is the foundational principle of supervised learning: pick the hypothesis from a chosen class that minimises average loss on the training set, and hope that minimising training loss approximately minimises true loss on unseen data. Almost every supervised method in this curriculum is an instance of ERM with different choices of hypothesis class and loss.
The setup
Assume samples $(x_1, y_1), \dots, (x_n, y_n)$ drawn i.i.d. from an unknown distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, a hypothesis class $\mathcal{H}$, and a loss function $\ell$. The true risk of a hypothesis $h$ is $R(h) = \mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(h(x), y)]$.
We never observe $R(h)$; we only have its plug-in estimate, the empirical risk $\hat{R}_n(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i)$.
ERM picks $\hat{h} = \arg\min_{h \in \mathcal{H}} \hat{R}_n(h)$.
This is the algorithmic core — train by minimising the average training-set loss.
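As a concrete toy instance, here is a minimal sketch of ERM over a finite hypothesis class of 1-D threshold classifiers with 0-1 loss; the synthetic data and the threshold grid are illustrative assumptions, not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data: the label is 1 when x exceeds an unknown threshold, plus 10% label noise.
n = 200
x = rng.uniform(-1.0, 1.0, size=n)
y = ((x > 0.25) ^ (rng.random(n) < 0.1)).astype(int)

# Hypothesis class H: threshold classifiers h_t(x) = 1[x > t] on a finite grid.
thresholds = np.linspace(-1.0, 1.0, 201)

def empirical_risk(t: float) -> float:
    """Average 0-1 loss of the threshold classifier h_t on the training set."""
    preds = (x > t).astype(int)
    return float(np.mean(preds != y))

# ERM: pick the hypothesis in H with the smallest training loss.
risks = np.array([empirical_risk(t) for t in thresholds])
t_hat = thresholds[np.argmin(risks)]
print(f"ERM threshold: {t_hat:.3f}, training 0-1 loss: {risks.min():.3f}")
```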
Why ERM works
ERM works because of two things working together:
- Concentration — for a fixed $h$, $\hat{R}_n(h) \to R(h)$ as $n \to \infty$ by the law of large numbers. With i.i.d. samples and loss bounded in $[0, 1]$, Hoeffding's inequality gives an exponential rate: $\Pr\big(|\hat{R}_n(h) - R(h)| > \epsilon\big) \le 2e^{-2n\epsilon^2}$ (illustrated in the simulation below).
- Uniform convergence — for the whole class, $\sup_{h \in \mathcal{H}} |\hat{R}_n(h) - R(h)| \to 0$ as well, provided $\mathcal{H}$ is not too rich. This is the non-trivial part; it is what VC dimension and Rademacher complexity quantify.
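A small simulation makes the concentration point concrete: for a fixed hypothesis with true 0-1 risk $p$, the empirical risk rarely strays from $p$, and the Hoeffding bound upper-bounds the deviation probability. The specific value $p = 0.3$, the number of trials, and the tolerance $\epsilon$ are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

p = 0.3        # true risk of a fixed hypothesis (expected 0-1 loss)
epsilon = 0.05
trials = 2000

for n in [100, 1000, 10000]:
    # Simulate many training sets and measure how often |R_hat - R| exceeds epsilon.
    losses = rng.random((trials, n)) < p           # per-example 0-1 losses
    r_hat = losses.mean(axis=1)                    # empirical risk on each sample
    deviation = np.mean(np.abs(r_hat - p) > epsilon)
    hoeffding = 2 * np.exp(-2 * n * epsilon**2)    # Hoeffding upper bound
    print(f"n={n:6d}  P(|R_hat - R| > eps) ~ {deviation:.4f}  bound {hoeffding:.4f}")
```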
Combining: if uniform convergence holds, then
$$R(\hat{h}) \;\le\; \min_{h \in \mathcal{H}} R(h) \;+\; 2\sup_{h \in \mathcal{H}} \big|\hat{R}_n(h) - R(h)\big|.$$
The second term is the estimation error, controlled by sample size and class complexity; the first is the approximation error: how good the best hypothesis in $\mathcal{H}$ can ever be.
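The bound follows from a standard two-step argument. A sketch, writing $\varepsilon_n = \sup_{h \in \mathcal{H}} |\hat{R}_n(h) - R(h)|$ and $h^\star = \arg\min_{h \in \mathcal{H}} R(h)$ (notation introduced here just for the derivation):

```latex
\begin{align*}
R(\hat{h}) &\le \hat{R}_n(\hat{h}) + \varepsilon_n  && \text{(uniform convergence at } \hat{h}\text{)} \\
           &\le \hat{R}_n(h^\star) + \varepsilon_n  && \text{(}\hat{h}\text{ minimises } \hat{R}_n\text{)} \\
           &\le R(h^\star) + 2\varepsilon_n          && \text{(uniform convergence at } h^\star\text{)}.
\end{align*}
```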
Approximation vs estimation: the bias-variance link
The decomposition
$$R(\hat{h}) - R^\ast \;=\; \underbrace{\Big(\min_{h \in \mathcal{H}} R(h) - R^\ast\Big)}_{\text{approximation error}} \;+\; \underbrace{\Big(R(\hat{h}) - \min_{h \in \mathcal{H}} R(h)\Big)}_{\text{estimation error}}$$
(with $R^\ast$ the best achievable risk) is the formal version of the bias-variance tradeoff. A bigger $\mathcal{H}$ shrinks the approximation error (lower bias) but, at a fixed sample size, inflates the estimation error (higher variance).
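A quick way to see the tradeoff numerically: fit polynomials of increasing degree (a growing hypothesis class) to noisy data and compare training and test error. The sine target, noise level, and degree grid below are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(2)

def make_data(n):
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + 0.3 * rng.standard_normal(n)  # noisy target
    return x, y

x_train, y_train = make_data(30)
x_test, y_test = make_data(1000)

for degree in [1, 3, 9, 15]:
    # ERM with squared loss over H = polynomials of the given degree.
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

Training error falls monotonically with degree while test error eventually rises: approximation error shrinking, estimation error taking over.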
Structural Risk Minimization
Structural Risk Minimization (Vapnik, 1971) is the upgrade for choosing $\mathcal{H}$ itself: take a nested sequence of classes $\mathcal{H}_1 \subset \mathcal{H}_2 \subset \cdots$ and pick
$$\hat{h} = \arg\min_{k,\; h \in \mathcal{H}_k} \Big[\hat{R}_n(h) + \mathrm{pen}(\mathcal{H}_k, n)\Big],$$
where the penalty is a complexity term (VC-dimension based, Rademacher, or AIC/BIC). This is the principled answer to "how big should my model be?". Modern regularisation methods (ridge, lasso, weight decay) are continuous relaxations of SRM, penalising in-class parameter norm rather than discretely choosing a class.
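A minimal sketch of SRM-style selection, reusing the polynomial setting with degree indexing the nested classes. The AIC-like penalty $n \ln(\mathrm{RSS}/n) + 2k$ is just one convenient stand-in for the complexity term; the data and degree range are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Nested classes H_1 ⊂ H_2 ⊂ ... indexed by polynomial degree.
n = 40
x = rng.uniform(-1.0, 1.0, size=n)
y = np.sin(3.0 * x) + 0.3 * rng.standard_normal(n)

best = None
for degree in range(1, 16):
    coeffs = np.polyfit(x, y, degree)
    rss = np.sum((np.polyval(coeffs, x) - y) ** 2)
    k = degree + 1                              # number of parameters in H_degree
    score = n * np.log(rss / n) + 2 * k         # empirical fit + AIC-style complexity penalty
    if best is None or score < best[1]:
        best = (degree, score)
    print(f"degree {degree:2d}: RSS {rss:7.3f}, penalised score {score:8.2f}")

print(f"SRM-style choice: degree {best[0]}")
```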
ERM in practice
For most modern ML, ERM looks like:
- Hypothesis class — neural network of a chosen architecture.
- Loss — task-appropriate (cross-entropy, MSE, contrastive).
- Optimiser — SGD or Adam on $\hat{R}_n$ (see the training-loop sketch below).
- Regularisation / model selection — a combination of weight decay, dropout, augmentation, cross-validation.
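Putting the pieces together, a minimal training-loop sketch of this recipe, assuming PyTorch, a small MLP, and synthetic data standing in for a real dataset; every architectural and hyperparameter choice here is illustrative, not prescribed by the text.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Synthetic binary classification data (placeholder for a real dataset).
X = torch.randn(512, 20)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).long()

# Hypothesis class: a small MLP. Loss: cross-entropy. Optimiser: SGD with weight decay
# (the continuous, in-class analogue of an SRM penalty).
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
optimiser = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(50):
    optimiser.zero_grad()
    loss = loss_fn(model(X), y)   # empirical risk R_hat_n on the training set
    loss.backward()
    optimiser.step()

print(f"final training loss: {loss.item():.4f}")
```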
The classical theory assumes ERM finds the global minimum of $\hat{R}_n$; with non-convex neural networks, SGD offers no such guarantee, so modern practice is better described as approximate ERM even though the framework still describes what training is trying to do.
What to read next
- Bias-Variance Tradeoff — the same decomposition in different vocabulary.
- Generalization & VC Dimension — formal sample-complexity bounds.
- PAC Learning — ERM's success criterion in the PAC framework.