
PAC Learning

PAC learning — Probably Approximately Correct — is the framework that made "what does it mean to learn?" a precise mathematical question. Introduced by Valiant (1984), it asks: with high probability (at least $1-\delta$) over the choice of training data, can the learner produce a hypothesis whose error is at most $\epsilon$? When the answer is "yes, for any $\epsilon, \delta > 0$, given enough samples", the class is PAC-learnable.

The PAC criterion

Fix a hypothesis class $H$, an input distribution $D$ on $X$, and a target concept $c: X \to \{0,1\}$ (the realisable case — the truth lies in $H$). Given $N$ i.i.d. samples $(x_i, c(x_i))$ with $x_i \sim D$, the learner outputs a hypothesis $\hat{h}$.

The class $H$ is PAC-learnable if there is a function $N(\epsilon, \delta)$ such that for any $D$ and any $c \in H$,

$$\mathbb{P}_{D^N}\!\left[R(\hat{h}) \le \epsilon\right] \ge 1 - \delta$$

whenever the training set has size $N(\epsilon, \delta)$. The function $N(\epsilon, \delta)$ is the sample complexity.

The framework is distribution-free — the learner must work for any $D$ — and worst-case — it must work for any $c \in H$.
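
To make the criterion concrete, here is a minimal Monte Carlo sketch. The threshold class, the uniform distribution on $[0, 1]$, the target at $0.37$, and all parameter values are illustrative choices, not part of the definition; the learner is ERM over the finite class.

```python
import numpy as np

# Illustrative setup: H = thresholds {x >= k/100 : k = 0..100} on X = [0, 1],
# D = uniform on [0, 1], target c = threshold at 0.37 (so c is in H: realisable case).
rng = np.random.default_rng(0)
thresholds = np.linspace(0.0, 1.0, 101)
c_star = 0.37

def true_risk(t):
    # Under the uniform distribution, the risk of threshold t against c_star is |t - c_star|.
    return abs(t - c_star)

def erm(x, y):
    # Empirical risk minimisation over the finite threshold class
    # (returns a consistent hypothesis whenever one exists).
    errors = [(np.mean((x >= t).astype(int) != y), t) for t in thresholds]
    return min(errors)[1]

def pac_check(N, eps, trials=2000):
    # Monte Carlo estimate of P_{D^N}[ R(h_hat) <= eps ].
    successes = 0
    for _ in range(trials):
        x = rng.random(N)
        y = (x >= c_star).astype(int)
        successes += true_risk(erm(x, y)) <= eps
    return successes / trials

for N in (10, 50, 200):
    print(N, pac_check(N, eps=0.05))   # the success probability rises toward 1 as N grows
```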

The basic theorem

For finite hypothesis classes:

$$N(\epsilon, \delta) = O\!\left(\frac{1}{\epsilon}\log\frac{|H|}{\delta}\right)$$

samples suffice. The proof is one paragraph: the probability that a "bad" hypothesis ($R(h) > \epsilon$) stays consistent with all $N$ training examples, and so could be returned by ERM, is at most $(1-\epsilon)^N \le e^{-\epsilon N}$; a union bound over the at most $|H|$ bad hypotheses and solving for $N$ gives the bound. The take-home: sample complexity grows logarithmically with class size. Doubling $|H|$ adds an additive $(\log 2)/\epsilon$ to $N$, not a multiplicative factor.
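
A quick numerical sketch of that bound (the leading constant is the one the union-bound argument yields; the specific $\epsilon$, $\delta$, and class sizes are illustrative):

```python
import math

def finite_class_sample_complexity(H_size, eps, delta):
    # N >= (1/eps) * ln(|H| / delta) suffices in the realisable case (union-bound argument).
    return math.ceil(math.log(H_size / delta) / eps)

# Doubling |H| only adds about log(2)/eps = 14 samples here, not a multiplicative factor.
for H_size in (1_000, 2_000, 1_000_000):
    print(H_size, finite_class_sample_complexity(H_size, eps=0.05, delta=0.01))
```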

For infinite classes, $\log|H|$ is replaced by the VC dimension:

$$N(\epsilon, \delta) = O\!\left(\frac{d\log(1/\epsilon) + \log(1/\delta)}{\epsilon}\right), \qquad d = \mathrm{VC}(H).$$

The fundamental theorem of statistical learning (Blumer et al., 1989) says: a class is PAC-learnable iff it has finite VC dimension.
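
The VC dimension of small classes can be checked by brute force, which makes the quantity in the bound concrete. A minimal sketch, assuming two toy classes on the real line (thresholds, VC dimension 1, and closed intervals, VC dimension 2); the point sets and the tiny offset used to place thresholds are illustrative choices:

```python
import numpy as np

def threshold_labelings(points):
    # All labelings of the points realisable by h_t(x) = 1[x >= t].
    ts = np.concatenate(([-np.inf], points + 1e-9))
    return {tuple((points >= t).astype(int)) for t in ts}

def interval_labelings(points):
    # All labelings realisable by h_{a,b}(x) = 1[a <= x <= b].
    cuts = np.concatenate(([-np.inf], points, [np.inf]))
    return {tuple(((points >= a) & (points <= b)).astype(int))
            for a in cuts for b in cuts}

def shattered(points, labelings_fn):
    # A set is shattered if every one of the 2^n labelings is realised by some hypothesis.
    return len(labelings_fn(np.sort(points))) == 2 ** len(points)

print(shattered(np.array([0.5]), threshold_labelings))       # True: VC(thresholds) >= 1
print(shattered(np.array([0.2, 0.8]), threshold_labelings))  # False: (1, 0) is unrealisable
print(shattered(np.array([0.2, 0.8]), interval_labelings))   # True: VC(intervals) >= 2
print(shattered(np.array([0.2, 0.5, 0.8]), interval_labelings))  # False: (1, 0, 1) fails
```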

Agnostic PAC learning

The realisable case (the truth is in $H$) is unrealistic. Agnostic PAC drops the assumption: the joint distribution over inputs and labels can be arbitrary, with no constraint that any $h \in H$ achieves zero error. The criterion becomes:

$$\mathbb{P}\!\left[R(\hat{h}) \le \min_{h \in H} R(h) + \epsilon\right] \ge 1 - \delta.$$

The learner promises to be approximately competitive with the best hypothesis in the class, not with the truth. Sample complexity becomes $N = \tilde{O}(d/\epsilon^2)$ — quadratic in $1/\epsilon$ instead of linear, the price for not assuming realisability.
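
The gap is easy to see numerically. A rough sketch comparing the two rates, with constants and the $\log(1/\delta)$ terms dropped, so the absolute numbers are only indicative:

```python
import math

# Order-of-magnitude comparison only: constants and log(1/delta) terms are dropped.
def realisable_N(d, eps):
    return math.ceil(d * math.log(1 / eps) / eps)   # ~ d log(1/eps) / eps

def agnostic_N(d, eps):
    return math.ceil(d / eps ** 2)                  # ~ d / eps^2

for eps in (0.1, 0.01, 0.001):
    print(eps, realisable_N(d=10, eps=eps), agnostic_N(d=10, eps=eps))
# At eps = 0.01 with d = 10: roughly 4,600 samples realisably vs 100,000 agnostically.
```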

Computational efficiency

PAC learning as defined above is purely statistical; efficient PAC learning additionally demands a polynomial-time learning algorithm. Some classes are statistically PAC-learnable but computationally hard:

  • 3-term DNF formulas — finite VC dimension, but learning them properly (outputting a 3-term DNF hypothesis) is NP-hard (Pitt & Valiant, 1988); they become efficiently learnable if the learner may output a 3-CNF instead.
  • Parity functions with noise — statistically learnable, but conjectured computationally hard (the Learning Parity with Noise problem, a close relative of the LWE problem behind lattice cryptography).

The gap between statistical and computational learnability is one of the deeper themes of learning theory and the source of cryptographic constructions like the Learning Parity with Noise assumption.
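
A small sketch of why parity with noise is the canonical example (the dimension, sample count, and noise rate below are arbitrary illustrative values): without noise, recovering the secret parity is plain linear algebra over GF(2); with noise, that attack collapses, and no polynomial-time algorithm is known.

```python
import numpy as np

rng = np.random.default_rng(0)

def lpn_samples(secret, m, eta):
    """m samples (a_i, <a_i, s> mod 2 XOR noise) with noise rate eta."""
    A = rng.integers(0, 2, size=(m, secret.size))
    noise = (rng.random(m) < eta).astype(int)
    return A, (A @ secret + noise) % 2

def solve_gf2(A, b):
    """Gaussian elimination over GF(2); a solution of A x = b, or None if inconsistent."""
    A, b = A.copy(), b.copy()
    m, n = A.shape
    row, pivots = 0, []
    for col in range(n):
        pivot = next((r for r in range(row, m) if A[r, col]), None)
        if pivot is None:
            continue
        A[[row, pivot]], b[[row, pivot]] = A[[pivot, row]], b[[pivot, row]]
        for r in range(m):
            if r != row and A[r, col]:
                A[r] ^= A[row]
                b[r] ^= b[row]
        pivots.append(col)
        row += 1
    if b[row:].any():          # a leftover equation 0 = 1: the system has no solution
        return None
    x = np.zeros(n, dtype=int)
    for r, col in enumerate(pivots):
        x[col] = b[r]
    return x

secret = rng.integers(0, 2, size=20)

A, b = lpn_samples(secret, m=60, eta=0.0)     # noiseless parity: easy
print((solve_gf2(A, b) == secret).all())      # True (A has full column rank w.h.p.)

A, b = lpn_samples(secret, m=60, eta=0.125)   # parity with noise: the attack breaks
print(solve_gf2(A, b))                        # typically None: the noisy system is inconsistent
```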

Realisable vs noisy settings

Three regimes:

  • Realisable — the target is exactly some $c \in H$ and labels are noise-free. Easiest case.
  • Random classification noise — labels flipped independently with probability $\eta < 1/2$. Statistical query (SQ) algorithms handle this; perceptron with averaging works on separable data with random label noise.
  • Adversarial / agnostic — labels arbitrary. Hardest; matches reality.

Most modern ML lives in the agnostic regime — labels are noisy, biased, or just imperfect, and we have no guarantee the model class contains the truth.
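
A minimal sketch of the middle regime, reusing the threshold class from above (the noise rate and sample sizes are illustrative): minimising empirical 0-1 error still homes in on the target even though no hypothesis is consistent with the noisy labels.

```python
import numpy as np

rng = np.random.default_rng(1)
thresholds = np.linspace(0.0, 1.0, 101)
c_star, eta = 0.37, 0.2            # target threshold and random label-flip rate

def erm_threshold(x, y):
    # Minimise empirical 0-1 error over the finite threshold class.
    return min(thresholds, key=lambda t: np.mean((x >= t).astype(int) != y))

for N in (50, 500, 5000):
    x = rng.random(N)
    y = (x >= c_star).astype(int)
    flip = rng.random(N) < eta                     # random classification noise
    y_noisy = np.where(flip, 1 - y, y)
    t_hat = erm_threshold(x, y_noisy)
    print(N, round(abs(t_hat - c_star), 3))        # distance to the noiseless target typically shrinks
```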

What PAC theory does and doesn't say about deep learning

PAC theory is necessary — without finite VC, no distribution-free learning. It is not sufficient — modern neural networks have vast VC dimension and PAC bounds are vacuous (see generalization). But:

  • Distribution-dependent PAC bounds (PAC-Bayes, with priors over hypotheses) give non-vacuous predictions for some deep networks (Dziugaite & Roy, 2017).
  • Algorithm-dependent bounds (stability-based, Hardt et al., 2016) bypass capacity entirely and bound generalisation in terms of optimiser properties.

Both are active areas — PAC learning's framing of "what counts as learning?" remains central even when the original VC bounds are too loose to be useful.
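
As a small illustration of the PAC-Bayes route: one commonly stated form (a McAllester/Maurer-style bound) says that with probability at least $1-\delta$, simultaneously for every posterior $Q$ over hypotheses, $\mathbb{E}_Q[R] \le \mathbb{E}_Q[\hat{R}] + \sqrt{\bigl(\mathrm{KL}(Q\,\|\,P) + \ln(2\sqrt{N}/\delta)\bigr)/(2N)}$. The numbers plugged into the sketch below are invented purely for illustration:

```python
import math

def pac_bayes_bound(emp_risk, kl, N, delta):
    # One commonly stated McAllester/Maurer-style PAC-Bayes bound (via Pinsker's inequality).
    return emp_risk + math.sqrt((kl + math.log(2 * math.sqrt(N) / delta)) / (2 * N))

# Invented illustrative numbers: a posterior with empirical risk 2% and KL(Q||P) = 5000 nats
# needs a large N before the bound becomes tight.
for N in (10_000, 100_000, 1_000_000):
    print(N, round(pac_bayes_bound(emp_risk=0.02, kl=5000.0, N=N, delta=0.05), 3))
```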
