Generalization & VC Dimension

ERM works only if the gap between training and true error vanishes uniformly over the hypothesis class. The size of that gap depends on a complexity measure of the class — what we will call its capacity. VC dimension is the canonical capacity measure for classification; Rademacher complexity is the modern data-dependent successor. Together they provide the formal vocabulary for "this class is too rich to learn from $N$ samples".

What we want

A generalisation bound is a statement of the form: with probability $1 - \delta$ over a training set of size $N$,

$$\sup_{h \in \mathcal{H}} \big| \hat{R}_S(h) - R(h) \big| \;\le\; O\!\left(\sqrt{\frac{\mathrm{capacity}(\mathcal{H})}{N}}\right).$$

Capacity matters; sample size matters more (the $\sqrt{N}$ floor); the dimension of the input does not appear directly. This is the core insight: complexity controls generalisation, not raw input dimension.

The growth function and shattering

For a binary-classifier class $\mathcal{H}$ on input space $\mathcal{X}$, the growth function is

$$\Pi_{\mathcal{H}}(N) = \max_{x_1, \dots, x_N \in \mathcal{X}} \big| \{ (h(x_1), \dots, h(x_N)) : h \in \mathcal{H} \} \big|.$$

It counts, in the worst case, how many distinct labellings of $N$ points the class can produce. It is at most $2^N$. A set of $N$ points is shattered by $\mathcal{H}$ if all $2^N$ labellings are achievable.
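
As a quick sanity check, here is a minimal brute-force count for 1-D threshold classifiers (the helper name is ours): their growth function is only $N + 1$, far below $2^N$.

```python
import numpy as np

def threshold_growth(xs):
    """Count the distinct labellings h_t(x) = 1[x >= t] realises on xs."""
    xs = np.sort(np.asarray(xs, dtype=float))
    # One threshold below all points, one between each pair of neighbours,
    # one above all points: together these realise every achievable pattern.
    cuts = np.concatenate([[xs[0] - 1], (xs[:-1] + xs[1:]) / 2, [xs[-1] + 1]])
    return len({tuple(xs >= t) for t in cuts})

xs = np.random.default_rng(0).normal(size=10)
print(threshold_growth(xs), "of", 2 ** len(xs), "possible labellings")  # 11 of 1024
```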

VC dimension

The Vapnik-Chervonenkis dimension $\mathrm{VC}(\mathcal{H})$ is the largest $N$ such that some set of $N$ points is shattered by $\mathcal{H}$. Examples:

  • Threshold functions on $\mathbb{R}$ — VC = 1.
  • Linear classifiers in $\mathbb{R}^d$ — VC = $d + 1$.
  • Axis-aligned rectangles in $\mathbb{R}^2$ — VC = 4 (verified by brute force below).
  • Neural networks with $W$ weights — $O(W^2)$ for piecewise-polynomial activations, $\Theta(W \log W)$ for threshold units.
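
The rectangle entry is easy to verify by brute force. A minimal sketch (function and variable names are ours): for each labelling, the bounding box of the positive points is the only rectangle worth testing, so shattering reduces to checking that no negative point falls inside it.

```python
import itertools

def rect_shatters(points):
    """Do axis-aligned rectangles shatter `points` in R^2?"""
    for labels in itertools.product((0, 1), repeat=len(points)):
        pos = [p for p, y in zip(points, labels) if y]
        neg = [p for p, y in zip(points, labels) if not y]
        if not pos:
            continue  # the empty rectangle realises the all-negative labelling
        x_lo, x_hi = min(x for x, _ in pos), max(x for x, _ in pos)
        y_lo, y_hi = min(y for _, y in pos), max(y for _, y in pos)
        # The labelling fails iff a negative point lies inside the bounding box.
        if any(x_lo <= x <= x_hi and y_lo <= y <= y_hi for x, y in neg):
            return False
    return True

diamond = [(0, 1), (0, -1), (1, 0), (-1, 0)]
print(rect_shatters(diamond))             # True: these 4 points are shattered
print(rect_shatters(diamond + [(0, 0)]))  # False: adding the centre breaks it
```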

Sauer's lemma then bounds the growth function in terms of VC dimension:

$$\Pi_{\mathcal{H}}(N) \;\le\; \sum_{i=0}^{d} \binom{N}{i} \;\le\; \left(\frac{eN}{d}\right)^{d}, \qquad d = \mathrm{VC}(\mathcal{H}),$$

so the growth function is polynomial in $N$ once $N > d$ — even though it could in principle be exponential. This polynomial-vs-exponential gap is what gives finite-VC classes finite-sample learnability.
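
To see the gap numerically, a short sketch tabulating Sauer's bound against $2^N$ (helper name ours):

```python
from math import comb

def sauer_bound(N, d):
    """Sauer's lemma bound on the growth function: sum_{i=0}^{d} C(N, i)."""
    return sum(comb(N, i) for i in range(d + 1))

d = 4  # e.g. axis-aligned rectangles in R^2
for N in (4, 8, 16, 32, 64):
    print(f"N={N:2d}  Sauer={sauer_bound(N, d):8d}  2^N={2 ** N}")
```

At $N = 64$ the bound is $679{,}121$ labellings against $2^{64} \approx 1.8 \times 10^{19}$.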

The VC bound

The headline result (Vapnik & Chervonenkis, 1971): with probability $1 - \delta$,

$$\sup_{h \in \mathcal{H}} \big| \hat{R}_S(h) - R(h) \big| \;\le\; O\!\left(\sqrt{\frac{d \log(N/d) + \log(1/\delta)}{N}}\right),$$

with $d = \mathrm{VC}(\mathcal{H})$. Sample complexity to learn within tolerance $\epsilon$ is $N = \tilde{O}(d/\epsilon^2)$. This is the formal "you need samples to match capacity" theorem, and the basis of PAC learning.
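
A minimal evaluator for the bound, with the unspecified universal constant set to 1 purely for illustration (so the absolute numbers are indicative only):

```python
import math

def vc_deviation(N, d, delta=0.05, c=1.0):
    """Worst-case uniform gap sup_h |train - true| from the VC bound.
    c stands in for the unspecified universal constant (assumed 1 here)."""
    return c * math.sqrt((d * math.log(N / d) + math.log(1 / delta)) / N)

# Linear classifiers in R^10, so d = 11: the gap shrinks roughly as 1/sqrt(N).
for N in (100, 1_000, 10_000, 100_000):
    print(f"N={N:6d}  gap <= {vc_deviation(N, d=11):.3f}")
```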

Rademacher complexity

A more refined, data-dependent capacity measure is the Rademacher complexity

$$\mathcal{R}_N(\mathcal{H}) = \mathbb{E}_{S, \sigma}\!\left[\, \sup_{h \in \mathcal{H}} \frac{1}{N} \sum_{i=1}^{N} \sigma_i \, h(x_i) \right],$$

with $\sigma_i \in \{-1, +1\}$ independent random signs. It measures how well the class can fit random labels — high Rademacher complexity = the class can memorise noise = poor generalisation. Bounds analogous to the VC bound replace the $\sqrt{d/N}$ term with $\mathcal{R}_N(\mathcal{H})$, and are typically tighter for real distributions.
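
The empirical version (conditioned on one fixed sample) is easy to estimate by Monte Carlo for a finite class. A sketch (names ours), reusing the threshold construction from the growth-function example:

```python
import numpy as np

def empirical_rademacher(preds, n_trials=2000, seed=0):
    """Monte Carlo estimate of empirical Rademacher complexity.
    preds: array of shape (num_hypotheses, N) with entries in {-1, +1}."""
    rng = np.random.default_rng(seed)
    _, N = preds.shape
    total = 0.0
    for _ in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=N)  # random signs
        total += np.max(preds @ sigma) / N       # sup over the class
    return total / n_trials

# Finite class: every distinct 1-D threshold classifier on a fixed sample.
rng = np.random.default_rng(0)
x = np.sort(rng.normal(size=50))
cuts = np.concatenate([[x[0] - 1], (x[:-1] + x[1:]) / 2, [x[-1] + 1]])
preds = np.where(x[None, :] >= cuts[:, None], 1.0, -1.0)
print(empirical_rademacher(preds))  # well below 1: thresholds cannot fit noise
```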

The deep-learning paradox

Deep networks have enormous VC dimension — typically $\Omega(W \log W)$ in the weight count $W$, often exceeding the dataset size by orders of magnitude. Classical bounds predict no generalisation. But they generalise.

Understanding Deep Learning Requires Rethinking Generalization (Zhang, Bengio, Hardt, Recht, Vinyals, ICLR 2017) demonstrated this concretely: deep networks fit random labels on CIFAR-10 to zero training error (so VC capacity is genuinely huge) but on real labels they generalise. Standard generalisation bounds become vacuous.
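
A toy version of that experiment, a sketch assuming scikit-learn is available (the sizes and hyperparameters are ours, not the paper's):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))            # inputs with no structure
y_random = rng.integers(0, 2, size=200)   # labels carrying zero signal

# An over-parameterised MLP will typically interpolate even pure noise,
# showing that its effective capacity exceeds the sample size.
net = MLPClassifier(hidden_layer_sizes=(512,), max_iter=5000, random_state=0)
net.fit(X, y_random)
print("train accuracy on random labels:", net.score(X, y_random))  # ~1.0
```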

The resolution is partial:

  • Implicit bias of SGD — among the many zero-train-loss solutions, SGD finds flat, low-norm ones (see double descent).
  • Algorithm-dependent bounds — PAC-Bayes and stability-based bounds depend on the optimiser as well as the class, and give non-vacuous predictions for some deep networks.
  • Data-dependent capacity — Rademacher complexity restricted to a low-norm sub-class can be much smaller than the worst-case VC dimension suggests.

The full theoretical picture is still incomplete, but the framework — capacity controls generalisation — survives. What changed is the right notion of capacity for over-parameterised networks.

  • PAC Learning — the framework VC bounds plug into.
  • ERM — the algorithm whose success VC theory analyses.
  • Double Descent — the modern correction to classical generalisation bounds.

Released under the MIT License. Content imported and adapted from NoteNextra.