PAC Learning
PAC learning (Probably Approximately Correct) is the framework that made "what does it mean to learn?" a precise mathematical question. Introduced by Valiant (1984), it asks: with high probability (at least $1 - \delta$), can the learner output a hypothesis that is approximately correct (error at most $\varepsilon$)?
The PAC criterion
Fix a hypothesis class $\mathcal{H}$ and an unknown target concept $c \in \mathcal{H}$; training examples $x$ are drawn i.i.d. from an unknown distribution $D$ and labelled by $c$.
The class $\mathcal{H}$ is PAC-learnable if there is an algorithm that, for every $D$, every target $c \in \mathcal{H}$, and every $\varepsilon, \delta \in (0,1)$, outputs with probability at least $1 - \delta$ a hypothesis $h$ with error $\mathrm{err}_D(h) := \Pr_{x \sim D}[h(x) \ne c(x)] \le \varepsilon$,
whenever the training set has size $m \ge m_{\mathcal{H}}(\varepsilon, \delta)$, a sample-complexity function polynomial in $1/\varepsilon$ and $1/\delta$.
The framework is distribution-free: the learner must work for any $D$.
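In symbols (a standard formulation; $A(S)$ denotes the hypothesis the algorithm outputs on training sample $S$):

$$\Pr_{S \sim D^m}\!\Big[\,\mathrm{err}_D\big(A(S)\big) \le \varepsilon\,\Big] \;\ge\; 1 - \delta \qquad \text{for all } D,\; c \in \mathcal{H},\; \varepsilon, \delta \in (0,1),\; m \ge m_{\mathcal{H}}(\varepsilon, \delta).$$

The "probably" is the $1 - \delta$; the "approximately correct" is the $\varepsilon$.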
The basic theorem
For finite hypothesis classes, $m \ge \frac{1}{\varepsilon}\left(\ln|\mathcal{H}| + \ln\frac{1}{\delta}\right)$ samples suffice in the realisable case. The proof is one paragraph: the probability that a "bad" hypothesis (one with error greater than $\varepsilon$) is consistent with all $m$ i.i.d. training examples is at most $(1-\varepsilon)^m \le e^{-\varepsilon m}$; a union bound over the at most $|\mathcal{H}|$ bad hypotheses does the rest, as the display below spells out.
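In full:

$$\Pr\!\big[\exists\, h \in \mathcal{H}:\ \mathrm{err}_D(h) > \varepsilon \text{ and } h \text{ consistent with } S\big] \;\le\; |\mathcal{H}|\,(1-\varepsilon)^m \;\le\; |\mathcal{H}|\,e^{-\varepsilon m} \;\le\; \delta,$$

where the last inequality holds exactly when $m \ge \frac{1}{\varepsilon}\left(\ln|\mathcal{H}| + \ln\frac{1}{\delta}\right)$.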
For infinite classes, the VC dimension $d = \mathrm{VCdim}(\mathcal{H})$ plays the role of $\ln|\mathcal{H}|$: $m = O\!\left(\frac{1}{\varepsilon}\left(d\ln\frac{1}{\varepsilon} + \ln\frac{1}{\delta}\right)\right)$ samples suffice.
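For a feel of the numbers, a minimal sketch that plugs values into both bounds (the function names are invented for illustration, and the VC bound's unspecified constant is taken to be 1, so treat the outputs as order-of-magnitude only):

```python
import math

def pac_sample_size_finite(class_size, eps, delta):
    """Realisable-case bound for a finite class: m >= (ln|H| + ln(1/delta)) / eps."""
    return math.ceil((math.log(class_size) + math.log(1 / delta)) / eps)

def pac_sample_size_vc(vc_dim, eps, delta):
    """Realisable-case VC bound O((d ln(1/eps) + ln(1/delta)) / eps), constant taken as 1."""
    return math.ceil((vc_dim * math.log(1 / eps) + math.log(1 / delta)) / eps)

# Boolean conjunctions over 20 variables: each variable appears positive,
# negated, or not at all, so |H| = 3^20; the class has VC dimension 20.
print(pac_sample_size_finite(3**20, eps=0.05, delta=0.01))  # 532
print(pac_sample_size_vc(20, eps=0.05, delta=0.01))         # 1291
```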
The fundamental theorem of statistical learning (Blumer et al., 1989) says: a class is PAC-learnable iff it has finite VC dimension.
Agnostic PAC learning
The realisable case (the truth is in $\mathcal{H}$) is an idealisation; agnostic PAC learning drops it.
The learner promises to be approximately competitive with the best hypothesis in the class, not with the truth. Sample complexity becomes $m = O\!\left(\frac{1}{\varepsilon^2}\left(d + \ln\frac{1}{\delta}\right)\right)$, with $1/\varepsilon^2$ replacing the realisable case's $1/\varepsilon$.
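In symbols, the agnostic criterion (standard formulation; $D$ is now an arbitrary joint distribution over example-label pairs and $\mathrm{err}_D(h) = \Pr_{(x,y) \sim D}[h(x) \ne y]$):

$$\Pr_{S \sim D^m}\Big[\,\mathrm{err}_D\big(A(S)\big) \;\le\; \min_{h \in \mathcal{H}} \mathrm{err}_D(h) + \varepsilon\,\Big] \;\ge\; 1 - \delta.$$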
Computational efficiency
PAC learning as defined is purely statistical; efficient PAC learning additionally demands a polynomial-time algorithm. Some classes are statistically PAC-learnable but computationally hard:
- 3-term DNF formulas: finite VC dimension, but outputting a 3-term DNF hypothesis (proper learning) is NP-hard (Pitt & Valiant, 1988); the same class becomes efficiently learnable if the learner may output a 3-CNF instead.
- Parity functions under noise: statistically learnable, conjectured computationally hard. This is the Learning Parity with Noise (LPN) assumption, whose generalisation LWE (Learning With Errors) underlies lattice cryptography; a sketch of why the noiseless case is easy appears below.
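To see where the hardness lives: without noise, a parity function over $n$ bits is recoverable in polynomial time by Gaussian elimination over GF(2); with noise, no efficient algorithm is known. A minimal sketch of that contrast (function names and parameters are illustrative, not from any particular library):

```python
import random

def sample_parity(secret, m, noise_rate=0.0, seed=0):
    """Draw m examples (x, <x, secret> mod 2), flipping each label w.p. noise_rate."""
    rng = random.Random(seed)
    n = len(secret)
    data = []
    for _ in range(m):
        x = [rng.randrange(2) for _ in range(n)]
        y = (sum(a & b for a, b in zip(x, secret)) + (rng.random() < noise_rate)) % 2
        data.append((x, y))
    return data

def gauss_gf2(data, n):
    """Gauss-Jordan elimination over GF(2). Polynomial time on clean data,
    but a single flipped label silently corrupts the solution; that
    brittleness is what the LPN assumption formalises."""
    rows = [(sum(b << j for j, b in enumerate(x)), y) for x, y in data]
    solved = []
    for col in range(n):
        i = next((i for i, (m_, _) in enumerate(rows) if (m_ >> col) & 1), None)
        if i is None:
            return None  # coefficient matrix not full rank: draw more samples
        pm, pr = rows.pop(i)
        rows = [(m_ ^ pm, r ^ pr) if (m_ >> col) & 1 else (m_, r) for m_, r in rows]
        solved = [(m_ ^ pm, r ^ pr) if (m_ >> col) & 1 else (m_, r) for m_, r in solved]
        solved.append((pm, pr))
    # each solved row is now a unit vector: read the secret bits off directly
    return [next(r for m_, r in solved if m_ == 1 << j) for j in range(n)]

rng = random.Random(1)
secret = [rng.randrange(2) for _ in range(16)]
print(gauss_gf2(sample_parity(secret, 200), 16) == secret)                  # True
print(gauss_gf2(sample_parity(secret, 200, noise_rate=0.1), 16) == secret)  # almost surely False
```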
The gap between statistical and computational learnability is one of the deeper themes of learning theory, and the source of cryptographic constructions built on assumptions like Learning Parity with Noise.
Realisable vs noisy settings
Three regimes:
- Realisable: exact labels from a target in $\mathcal{H}$. Easiest case.
- Random classification noise: each label flipped independently with probability $\eta < 1/2$. Statistical query (SQ) algorithms handle this; perceptron with averaging works for separable data with random noise (see the sketch after this list).
- Adversarial / agnostic: labels arbitrary. Hardest; matches reality.
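A minimal sketch of the averaged perceptron mentioned above (assuming linearly separable data with labels in $\{-1, +1\}$; the function and parameter names are illustrative; averaging the iterates, rather than keeping only the last one, is what buys tolerance to occasional flipped labels):

```python
import numpy as np

def averaged_perceptron(X, y, epochs=10):
    """Perceptron that returns the average of all weight iterates.
    The average is far less sensitive to individual (possibly mislabeled)
    updates than the final iterate."""
    n, d = X.shape
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for _ in range(epochs):
        for i in np.random.permutation(n):
            if y[i] * (X[i] @ w) <= 0:   # mistake (or on the boundary)
                w = w + y[i] * X[i]      # standard perceptron update
            w_sum += w                   # accumulate every iterate
    return w_sum / (epochs * n)

# Toy demo: separable data, 10% of training labels flipped at random.
rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
X = rng.normal(size=(500, 5))
y = np.sign(X @ w_true)
flip = rng.random(500) < 0.10
w_hat = averaged_perceptron(X, np.where(flip, -y, y))
print(np.mean(np.sign(X @ w_hat) == y))  # accuracy against the clean labels
```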
Most modern ML lives in the agnostic regime — labels are noisy, biased, or just imperfect, and we have no guarantee the model class contains the truth.
What PAC theory does and doesn't say about deep learning
PAC theory is necessary: without finite VC dimension there is no distribution-free learning. It is not sufficient: modern neural networks have vast VC dimension, so the classical PAC bounds are vacuous (see generalization). But:
- Distribution-dependent PAC bounds (PAC-Bayes, with priors over hypotheses) give non-vacuous predictions for some deep networks (Dziugaite & Roy, 2017); one standard form of the bound is displayed after this list.
- Algorithm-dependent bounds (stability-based, Hardt et al., 2016) bypass capacity entirely and bound generalisation in terms of optimiser properties.
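For reference, a common form of the PAC-Bayes bound (a McAllester-style textbook version, not the exact variant Dziugaite & Roy optimise): with a prior $P$ over hypotheses fixed before seeing the sample $S$ of size $m$, with probability at least $1 - \delta$, simultaneously for every posterior $Q$,

$$\mathbb{E}_{h \sim Q}\big[\mathrm{err}_D(h)\big] \;\le\; \mathbb{E}_{h \sim Q}\big[\widehat{\mathrm{err}}_S(h)\big] \;+\; \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{m}{\delta}}{2(m - 1)}}.$$

The KL term is where the prior enters: posteriors close to the prior pay little, which is what makes non-vacuous bounds possible for networks whose raw capacity is huge.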
Both are active areas — PAC learning's framing of "what counts as learning?" remains central even when the original VC bounds are too loose to be useful.
What to read next
- Generalization & VC Dimension — the capacity measures PAC bounds use.
- ERM — the canonical PAC-style algorithm.
- Statistical Learning Theory (history) — the historical context.