Generalization & VC Dimension
ERM works only if the gap between training and true error vanishes uniformly over the hypothesis class. The size of that gap depends on a complexity measure of the class — what we will call its capacity. VC dimension is the canonical capacity measure for classification; Rademacher complexity is the modern data-dependent successor. Together they provide the formal vocabulary for "this class is too rich to learn from $n$ samples".
What we want
A generalisation bound is a statement of the form: with probability at least $1-\delta$ over the draw of an i.i.d. training set of size $n$,

$$L(h) \;\le\; \hat{L}_n(h) + \varepsilon(n, \mathcal{H}, \delta) \quad \text{simultaneously for all } h \in \mathcal{H},$$

where $L$ is the true error, $\hat{L}_n$ the training error, and the excess term $\varepsilon$ shrinks with $n$ and grows with the capacity of $\mathcal{H}$. Capacity matters; sample size matters more (the excess term falls like $1/\sqrt{n}$ for any fixed finite capacity, so more data always helps).
The growth function and shattering
For a binary-classifier class $\mathcal{H}$, the growth function is

$$\Pi_{\mathcal{H}}(n) \;=\; \max_{x_1, \dots, x_n} \bigl|\{(h(x_1), \dots, h(x_n)) : h \in \mathcal{H}\}\bigr|.$$

It counts, in the worst case, how many distinct labellings of $n$ points the class can realise. A set of $n$ points is shattered by $\mathcal{H}$ when all $2^n$ labellings are realised on it.
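To make the definition concrete, here is a minimal brute-force sketch (plain Python; the helper names are ours, not from any library) that enumerates the labellings the 1-D threshold class $h_t(x) = \mathbb{1}[x \ge t]$ can realise on a given sample, i.e. the growth function evaluated on that sample.

```python
def threshold_labelings(points):
    """All distinct labellings of `points` realisable by h_t(x) = 1[x >= t]."""
    xs = sorted(points)
    # Candidate thresholds: one below all points, one between each consecutive pair, one above.
    thresholds = [xs[0] - 1.0] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1.0]
    return {tuple(int(x >= t) for x in points) for t in thresholds}

def is_shattered(points):
    """A set is shattered iff every one of the 2^n labellings is realisable."""
    return len(threshold_labelings(points)) == 2 ** len(points)

points = [0.3, 1.7, 2.5]
print(len(threshold_labelings(points)))   # growth function value: n + 1 = 4, not 2^3 = 8
print(is_shattered([0.3]))                # True  -> a single point is shattered
print(is_shattered([0.3, 1.7]))           # False -> no threshold realises the labelling (1, 0)
```

For thresholds the growth function is $n + 1$, far below $2^n$, which is exactly the gap the next section quantifies.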
VC dimension
The Vapnik-Chervonenkis dimension $\mathrm{VC}(\mathcal{H})$ is the size of the largest set of points that $\mathcal{H}$ shatters (infinite if arbitrarily large sets can be shattered). Some standard values:
- Threshold functions on $\mathbb{R}$ — VC = 1.
- Linear classifiers in $\mathbb{R}^d$ — VC = $d + 1$.
- Axis-aligned rectangles in $\mathbb{R}^2$ — VC = 4 (see the sketch after this list).
- Neural networks with $W$ weights — VC on the order of $W \log W$ for threshold activations, and polynomial in $W$ for other typical activations.
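As a sanity check on the rectangle entry, the sketch below (plain Python, our own helper names) uses the fact that a labelling is realisable by an axis-aligned rectangle exactly when the bounding box of the positive points contains no negative point: a "diamond" of four points is shattered, while adding a fifth point at the centre breaks shattering.

```python
from itertools import product

def rectangle_realisable(points, labels):
    """True iff some axis-aligned rectangle contains exactly the points labelled 1.

    Equivalent test: the bounding box of the positive points (the tightest
    candidate rectangle) must not contain any negative point.
    """
    pos = [p for p, y in zip(points, labels) if y == 1]
    neg = [p for p, y in zip(points, labels) if y == 0]
    if not pos:
        return True                      # an empty rectangle labels everything 0
    lo_x, hi_x = min(p[0] for p in pos), max(p[0] for p in pos)
    lo_y, hi_y = min(p[1] for p in pos), max(p[1] for p in pos)
    return not any(lo_x <= x <= hi_x and lo_y <= y <= hi_y for x, y in neg)

def shattered(points):
    return all(rectangle_realisable(points, labels)
               for labels in product([0, 1], repeat=len(points)))

diamond = [(0, 1), (0, -1), (1, 0), (-1, 0)]
print(shattered(diamond))                 # True: all 16 labellings realisable
print(shattered(diamond + [(0, 0)]))      # False: cannot label the centre 0 and the rest 1
```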
Sauer's lemma then bounds the growth function in terms of the VC dimension $d$:

$$\Pi_{\mathcal{H}}(n) \;\le\; \sum_{i=0}^{d} \binom{n}{i} \;\le\; \left(\frac{en}{d}\right)^{d} \quad \text{for } n \ge d,$$

so the growth function is polynomial in $n$ whenever the VC dimension is finite, instead of the worst-case $2^n$.
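A quick numerical illustration (plain Python, brute force; the helper names are ours): for indicators of intervals on the line, whose VC dimension is 2, the empirical growth function meets the Sauer bound $\sum_{i \le 2} \binom{n}{i}$ exactly.

```python
from math import comb

def interval_labelings(points):
    """Distinct labellings of `points` realisable by indicators of intervals [a, b]."""
    xs = sorted(points)
    cuts = [xs[0] - 1] + [(u + v) / 2 for u, v in zip(xs, xs[1:])] + [xs[-1] + 1]
    return {tuple(int(a <= x <= b) for x in points)
            for a in cuts for b in cuts if a <= b}

def sauer_bound(n, d):
    return sum(comb(n, i) for i in range(d + 1))

for n in range(1, 8):
    growth = len(interval_labelings(list(range(n))))
    print(n, growth, sauer_bound(n, 2))   # the two columns agree: intervals meet the bound
```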
The VC bound
The headline result (Vapnik & Chervonenkis, 1971): with probability at least $1-\delta$ over an i.i.d. sample of size $n$, simultaneously for every $h \in \mathcal{H}$,

$$L(h) \;\le\; \hat{L}_n(h) \;+\; \sqrt{\frac{8\left(d \ln\frac{2en}{d} + \ln\frac{4}{\delta}\right)}{n}},$$

with $d = \mathrm{VC}(\mathcal{H})$ (exact constants vary between presentations). Finite VC dimension therefore gives uniform convergence: the gap between training and true error shrinks roughly as $\sqrt{d \log n / n}$, so ERM succeeds once $n$ is comfortably larger than $d$.
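To see how the bound behaves numerically, here is a small sketch (Python; it uses the constant choice written above, and other texts differ slightly) evaluating the gap term for a class of VC dimension 10.

```python
from math import log, sqrt, e

def vc_gap(n, d, delta=0.05):
    """Uniform-convergence term sqrt(8 * (d ln(2en/d) + ln(4/delta)) / n) from the bound above."""
    return sqrt(8 * (d * log(2 * e * n / d) + log(4 / delta)) / n)

for n in [100, 1_000, 10_000, 100_000, 1_000_000]:
    print(f"n = {n:>9,d}: gap <= {vc_gap(n, d=10):.3f}")
# The gap only drops below 0.1 once n is on the order of 10^5:
# the d ln(n) numerator is why the bound scales roughly as sqrt(d log n / n).
```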
Rademacher complexity
A more refined, data-dependent capacity measure is the (empirical) Rademacher complexity

$$\hat{\mathfrak{R}}_S(\mathcal{H}) \;=\; \mathbb{E}_{\sigma}\!\left[\sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i\, h(x_i)\right],$$

with $\sigma_1, \dots, \sigma_n$ independent uniform $\{-1, +1\}$ (Rademacher) variables. It measures how strongly the class can correlate with pure noise on the actual sample $S = (x_1, \dots, x_n)$, and it replaces the worst-case VC term in the corresponding generalisation bounds.
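For small classes the definition can be estimated directly. Below is a Monte Carlo sketch (NumPy; function names are ours) for 1-D threshold classifiers with outputs in $\{-1, +1\}$, where the supremum over the class reduces to scanning finitely many candidate thresholds.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_rademacher(xs, n_draws=1000):
    """Monte Carlo estimate of the empirical Rademacher complexity of the
    class {x -> sign(x - t)} (outputs in {-1, +1}) on the sample xs."""
    xs = np.sort(xs)
    n = len(xs)
    # Candidate thresholds: one below every point, one between each pair, one above.
    ts = np.concatenate(([xs[0] - 1], (xs[:-1] + xs[1:]) / 2, [xs[-1] + 1]))
    # preds[j, i] = prediction of threshold ts[j] on point xs[i]
    preds = np.where(xs[None, :] >= ts[:, None], 1.0, -1.0)
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)       # Rademacher noise
        total += np.max(preds @ sigma) / n            # sup over the class
    return total / n_draws

for n in [10, 100, 1000]:
    xs = rng.standard_normal(n)
    print(n, round(empirical_rademacher(xs), 3))
# The estimate decays roughly like sqrt(log n / n), consistent with VC dimension 1.
```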
The deep-learning paradox
Deep networks have enormous VC dimension — typically polynomial in the number of weights, far larger than the number of training examples, so the bound above predicts nothing for them.
Understanding Deep Learning Requires Rethinking Generalization (Zhang, Bengio, Hardt, Recht, Vinyals, ICLR 2017) demonstrated this concretely: deep networks fit random labels on CIFAR-10 to zero training error (so the capacity really is that large), yet on real labels the same networks generalise. Standard generalisation bounds become vacuous.
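A scaled-down version of the random-label experiment is easy to reproduce. The sketch below (assuming scikit-learn is available; the dataset sizes and hyperparameters are illustrative, not those of the paper) fits a small over-parameterised MLP to purely random labels on random inputs and typically reaches training accuracy at or near 100%, even though nothing generalisable exists in the data.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# 200 random inputs with 20 features; labels assigned by a coin flip,
# so by construction there is nothing to generalise to.
X = rng.standard_normal((200, 20))
y = rng.integers(0, 2, size=200)

# Many more weights than training points, no regularisation penalty.
clf = MLPClassifier(hidden_layer_sizes=(256,), alpha=0.0,
                    max_iter=3000, random_state=0)
clf.fit(X, y)

print("train accuracy:", clf.score(X, y))   # typically 1.0 or very close
# High capacity is what lets the model memorise noise; capacity alone
# therefore cannot explain why the same model generalises on real labels.
```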
The resolution is partial:
- Implicit bias of SGD — among the many zero-train-loss solutions, SGD finds flat, low-norm ones (see double descent).
- Algorithm-dependent bounds — PAC-Bayes and stability-based bounds depend on the optimiser as well as the class, and give non-vacuous predictions for some deep networks.
- Data-dependent capacity — Rademacher complexity restricted to a low-norm sub-class can be much smaller than the worst-case VC dimension.
The full theoretical picture is still incomplete, but the framework — capacity controls generalisation — survives. What changed is the right notion of capacity for over-parameterised networks.
What to read next
- PAC Learning — the framework VC bounds plug into.
- ERM — the algorithm whose success VC theory analyses.
- Double Descent — the modern correction to classical generalisation bounds.