Statistical Learning Theory (1990s)
The 1990s reframed machine learning around a precise mathematical question — when can you learn? — and answered it with PAC bounds, VC dimension, and the broader statistical learning theory (SLT) framework. The era's intellectual centre was Vladimir Vapnik's The Nature of Statistical Learning Theory (1995), and its concrete contribution to applied ML was the support vector machine (SVM).
The intellectual setting
By the late 1980s, neural networks worked empirically but had no general theory. Symbolic AI had failed to scale. The community wanted a framework that:
- Defined "learning" formally enough to prove things about it.
- Connected empirical performance (training error) to true performance (test error).
- Gave a recipe for choosing model complexity in a principled way.
Vapnik and Chervonenkis had been developing such a framework in the Soviet Union since the late 1960s; the field rediscovered it in the West around 1990, primarily through Vapnik's emigration to AT&T Bell Labs.
Core results
The framework had several pillars:
- PAC learning (Valiant, 1984) — formalises "learnable" as: with enough samples, with high probability, the learner produces a hypothesis with small error.
- VC dimension (Vapnik, Chervonenkis, 1971) — the right capacity measure for binary-classification hypothesis classes; later extended by Rademacher complexity for richer settings.
- VC bound — the sample complexity to learn to within tolerance $\epsilon$ with probability at least $1 - \delta$ scales as $O\big((d + \log(1/\delta))/\epsilon^2\big)$ in the agnostic setting, where $d$ is the VC dimension (see the sketch after this list).
- Structural Risk Minimisation — choose hypothesis-class complexity by trading data fit against an explicit complexity penalty.
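As a rough illustration of how the bound and SRM fit together, here is a minimal Python sketch. It suppresses constants (real bounds carry explicit ones), and the helper names, candidate classes, and numbers are invented for illustration, not taken from any library or from Vapnik's text.

```python
import math

def vc_generalisation_gap(d, m, delta):
    """Schematic agnostic VC bound, rearranged from the sample-complexity form
    m = O((d + log(1/delta)) / eps^2): with probability >= 1 - delta, the gap
    between true risk and empirical risk is O(sqrt((d + log(1/delta)) / m)).
    Constants are dropped for illustration."""
    return math.sqrt((d + math.log(1 / delta)) / m)

def srm_select(candidates, m, delta=0.05):
    """Structural Risk Minimisation, schematically: among nested hypothesis
    classes, pick the one minimising empirical risk + complexity penalty.
    candidates: list of (name, vc_dim, empirical_risk) tuples."""
    return min(candidates,
               key=lambda c: c[2] + vc_generalisation_gap(c[1], m, delta))

# Hypothetical nested classes of growing capacity, fit on m = 10,000 samples.
classes = [("linear", 50, 0.12), ("quadratic", 500, 0.06), ("rich", 50_000, 0.01)]
print(srm_select(classes, m=10_000))
```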
These results gave ML its first general distribution-free generalisation theory. Before VC theory, there was no principled answer to "is more data enough?"; afterwards, you could prove it was, with quantitative rates.
SVMs as the flagship application
The Support Vector Machine (Boser, Guyon, Vapnik, 1992; Cortes & Vapnik, 1995) was the era's poster algorithm. It married statistical learning theory with the kernel trick:
- Maximum-margin classification — geometric intuition with VC-theoretic bounds (margin generalisation bounds, Bartlett 1998).
- Convex optimisation — quadratic programming with global guarantees, unlike neural-network local minima.
- Kernel trick — non-linear classification without explicit feature maps (see the sketch after this list).
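A minimal sketch of those ingredients using scikit-learn (obviously not the 1990s toolchain, but the modern API exposes the same objects: the soft-margin penalty C, the kernel, and the support vectors). The toy dataset and hyperparameters are illustrative only.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# C trades margin width against training errors (the soft margin of Cortes & Vapnik 1995);
# the RBF kernel supplies the implicit non-linear feature map (the kernel trick).
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
print("support vectors per class:", clf.n_support_)  # only these points define the margin
```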
SVMs dominated benchmark classification through the 2000s, particularly on text categorisation, bioinformatics, and any domain with structured-but-not-too-large datasets.
The 1990s ML stack
The 1990s/early-2000s applied-ML toolkit:
- SVMs for classification.
- Decision trees and ensembles (random forests, AdaBoost) for tabular data.
- HMMs for speech and sequences.
- Hand-crafted features (SIFT, MFCC, n-gram counts, TF-IDF) feeding all of the above (see the sketch after this list).
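A minimal sketch of that pattern in modern scikit-learn terms: hand-crafted text features feeding a linear SVM. The four-document corpus and its labels are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = ["cheap meds buy now", "meeting moved to friday",
        "win a free prize now", "quarterly report attached"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # unigram + bigram counts, TF-IDF weighted
    LinearSVC(C=1.0),                     # maximum-margin linear classifier
)
pipeline.fit(docs, labels)
print(pipeline.predict(["free meds prize"]))
```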
This stack was theoretically grounded and empirically strong on small-to-medium data — better than the neural networks of the time on most benchmarks. The ML community took statistical and theoretical legitimacy seriously, partly in reaction to symbolic AI's overpromising.
What it got wrong about deep learning
The framework's predictions for the deep-learning era turned out badly:
- Capacity-based bounds are vacuous for over-parameterised neural networks — they predict no generalisation, yet the networks generalise (see the arithmetic sketch after this list).
- The bias-variance U-curve breaks — modern networks live past the interpolation threshold (see double descent).
- Convexity is not necessary for practical optimisation — SGD on non-convex deep-network losses works astoundingly well.
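To see why the bounds become vacuous, plug over-parameterised numbers into the schematic gap from the earlier sketch. The parameter count and dataset size below are only illustrative orders of magnitude, and using raw parameter count as a stand-in for VC dimension is itself a simplification.

```python
import math

def vc_generalisation_gap(d, m, delta=0.05):
    # Same schematic bound as earlier: gap ~ sqrt((d + log(1/delta)) / m).
    return math.sqrt((d + math.log(1 / delta)) / m)

# Illustrative orders of magnitude: a network with ~25 million parameters
# (a crude stand-in for capacity) trained on ~50,000 labelled images.
print(vc_generalisation_gap(d=25_000_000, m=50_000))  # ~22: far above 1, so the bound says nothing
```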
The community has spent the 2010s and 2020s rebuilding parts of statistical learning theory for the over-parameterised regime — PAC-Bayes, neural-tangent-kernel theory, stability-based bounds. The framework as Vapnik articulated it was too restrictive, but the questions it asked — when does training error predict test error, what is the right notion of capacity — remain central.
Legacy
What survives:
- The vocabulary — empirical risk, true risk, generalisation, capacity, regularisation. Modern deep learning still talks this way.
- The principles — held-out evaluation, cross-validation, regularisation. SLT made them rigorous.
- The algorithms — SVMs, kernel methods, structured-prediction CRFs. Still useful in their domains.
What was overturned:
- The belief that ML is only statistical learning theory plus convex optimisation.
- The bias-variance U-curve as the universal model selection picture.
What to read next
- The Kernel Era — SLT's algorithmic high-water mark.
- PAC Learning — the formal framework.
- Generalization & VC Dimension — the capacity theory.