The Kernel Era (1995–2010)

For about 15 years, kernel methods were the dominant paradigm in machine learning. SVMs, kernel ridge regression, kernel PCA, Gaussian processes — all built on the same idea of replacing inner products with kernels — were state of the art on many benchmarks from 1995 to 2010 and the conceptual centre of academic ML. Then deep learning displaced them, but the era's ideas, vocabulary, and pedagogical clarity survive.

What started it

Three papers set the stage:

  • Boser, Guyon & Vapnik (1992), "A Training Algorithm for Optimal Margin Classifiers" — introduced the SVM and the kernel trick.
  • Schölkopf et al. (1997), "Comparing Support Vector Machines with Gaussian Kernels to Radial Basis Function Classifiers" — showed that RBF-kernel SVMs beat hand-crafted RBF networks.
  • Cortes & Vapnik (1995), "Support-Vector Networks" — the soft-margin SVM, removing the linear-separability requirement.

By 1998, kernel-methods workshops were a regular fixture at NIPS (now NeurIPS). By 2002, Schölkopf and Smola's Learning with Kernels was the canonical textbook.

The core insight

Once you've reformulated an algorithm to depend only on inner products $\langle x_i, x_j \rangle$, you can replace those products with a kernel function $K(x_i, x_j)$. The algorithm then operates in a (potentially infinite-dimensional) feature space without ever materialising the features. By the representer theorem, the optimum has a finite-sum representation $f(x) = \sum_i \alpha_i K(x_i, x)$, so the infinite-dimensional optimisation reduces to choosing $N$ scalars.
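
To see the machinery end to end, here is a minimal kernel ridge regression sketch in NumPy. It is illustrative rather than canonical: the names (rbf_kernel, fit_kernel_ridge, bandwidth, lam) and the toy data are my own, not from the text.

```python
import numpy as np

def rbf_kernel(A, B, bandwidth=1.0):
    """Gram matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 * bandwidth^2))."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * bandwidth ** 2))

def fit_kernel_ridge(X, y, lam=1e-2, bandwidth=1.0):
    """Solve (K + lam * I) alpha = y: the whole optimisation is N scalars,
    exactly the finite-sum representation the representer theorem promises."""
    K = rbf_kernel(X, X, bandwidth)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(X_train, alpha, X_new, bandwidth=1.0):
    """f(x) = sum_i alpha_i K(x_i, x), evaluated for each row of X_new."""
    return rbf_kernel(X_new, X_train, bandwidth) @ alpha

# Toy usage: fit a noisy sine without ever materialising a feature map.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
alpha = fit_kernel_ridge(X, y)
y_hat = predict(X, alpha, X)
```

The RBF kernel's feature space is infinite-dimensional, yet nothing above ever leaves $\mathbb{R}^{N \times N}$.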

This applies to:

  • Classification (SVM).
  • Regression (kernel ridge, Gaussian processes).
  • Clustering (kernel k-means, spectral clustering).
  • Dimensionality reduction (kernel PCA).
  • Anomaly detection (one-class SVM).
  • Independence testing (HSIC, kernel two-sample tests).

A whole sub-field formed around designing kernels for structured objects — strings, trees, graphs, sets, sequences — while leaving the core algorithm unchanged.
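
As one concrete instance, here is a hedged sketch of a k-spectrum string kernel, which counts shared k-mers; the function name and parameters are illustrative:

```python
from collections import Counter

def spectrum_kernel(s: str, t: str, k: int = 3) -> int:
    """Inner product <phi(s), phi(t)> of k-mer count vectors, computed
    without materialising the |alphabet|^k-dimensional feature space."""
    counts_s = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    counts_t = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return sum(c * counts_t[kmer] for kmer, c in counts_s.items())

# Plug the resulting Gram matrix into any kernel algorithm unchanged.
print(spectrum_kernel("GATTACA", "TACCATA", k=2))  # inner product of 2-mer counts: 5
```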

Theoretical legitimacy

Kernel methods came with statistical-learning-theory guarantees:

  • Convex optimisation — global optima, polynomial-time training.
  • Margin-based generalisation bounds — generalisation improves with the margin, independently of the feature-space dimension (schematic form below).
  • VC theory — finite VC dimension despite infinite feature space (when properly regularised).
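
To make the margin bullet concrete: a representative margin bound (schematic, in the spirit of Bartlett-type results; constants and log factors vary across precise statements) says that with probability at least $1-\delta$,

$$\Pr[\, y f(x) \le 0 \,] \;\le\; \widehat{R}_\gamma(f) + \mathcal{O}\!\left( \sqrt{ \frac{ (R/\gamma)^2 \log^2 N + \log(1/\delta) }{ N } } \right),$$

where $\widehat{R}_\gamma(f)$ is the fraction of training points with margin below $\gamma$ and $R$ bounds the norm of the data. The feature-space dimension never appears; only the margin-to-radius ratio does.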

For an academic field that valued mathematical rigour, this was deeply attractive. SVMs were a theorem-friendly algorithm in a way neural networks of the era were not.

Practical strengths

  • Strong out-of-the-box performance on small-to-medium datasets — text categorisation, gene-expression classification, handwritten digit recognition, OCR.
  • Limited tuning required — the kernel bandwidth and the regularisation strength are usually all you need to cross-validate (see the sketch after this list).
  • Mature software ecosystem — LIBSVM and the R/Python wrappers built on it (e1071, scikit-learn).
  • Modular — pick a kernel for your data type, plug into any kernel algorithm.
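
A minimal sketch of that two-parameter tuning loop, using scikit-learn; the grid values and dataset are illustrative, not a recommendation:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Cross-validate the two hyperparameters that usually matter:
# C (regularisation strength) and gamma (inverse RBF bandwidth).
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10, 100], "gamma": [1e-4, 1e-3, 1e-2]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```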

What killed it

Two things, simultaneously:

  • Scaling. The Gram matrix is $N \times N$. Past $N \approx 10^5$, both memory and per-query cost become prohibitive. Approximations (Nyström, random Fourier features, sparse SVMs) reduce the cost but introduce their own errors; see the sketch after this list.
  • Representation learning. Kernels encode a fixed feature space chosen by the practitioner. Deep networks learn their feature space from data — and on tasks with abundant data and natural structure (vision, audio, NLP), the learned features dominate any hand-chosen kernel.
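
For the scaling bullet, here is a hedged sketch of random Fourier features (Rahimi & Recht, 2007) for the RBF kernel; variable names and sizes are illustrative:

```python
import numpy as np

def random_fourier_features(X, D=500, bandwidth=1.0, seed=0):
    """Map X (N x d) to Z (N x D) so that Z @ Z.T approximates the
    RBF Gram matrix exp(-||x - y||^2 / (2 * bandwidth^2))."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((d, D)) / bandwidth   # omega_j ~ N(0, bandwidth^-2 I)
    b = rng.uniform(0, 2 * np.pi, size=D)         # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# A linear model on Z approximates the kernel method at O(N * D) memory,
# so the N x N Gram matrix never materialises.
X = np.random.default_rng(1).standard_normal((1000, 5))
Z = random_fourier_features(X, D=2000)
K_block = Z[:5] @ Z[5:10].T   # approximates the exact RBF kernel values
```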

The 2012 ImageNet result was the public turning point. By 2014, CNNs dominated the major vision conferences; by 2018, NLP had largely completed the same transition. Kernel methods retreated to small-data, structured-input, and theory-of-deep-learning niches.

What survived

  • As a baseline — RBF SVMs are still a strong baseline in tabular and small-data settings.
  • Gaussian processes — Bayesian regression with kernel covariance, used in Bayesian optimisation, robotics, and hyperparameter tuning (a minimal sketch follows this list).
  • Spectral methods — kernel PCA, spectral clustering, manifold learning, Laplacian eigenmaps.
  • The Neural Tangent Kernel — infinitely wide networks trained by gradient descent behave like kernel regression with a fixed kernel (Jacot et al., 2018), giving a bridge between the two paradigms.
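
A minimal Gaussian-process regression sketch (RBF covariance, noise level assumed known; all names are illustrative):

```python
import numpy as np

def rbf_cov(A, B, bandwidth=1.0):
    """RBF covariance matrix between the rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2 * bandwidth ** 2))

def gp_posterior(X, y, X_new, noise=0.1, bandwidth=1.0):
    """Posterior mean and stddev of f(X_new), given y = f(X) + noise."""
    K = rbf_cov(X, X, bandwidth) + noise ** 2 * np.eye(len(X))
    K_s = rbf_cov(X_new, X, bandwidth)         # cross-covariance
    K_ss = rbf_cov(X_new, X_new, bandwidth)    # prior covariance at X_new
    mean = K_s @ np.linalg.solve(K, y)
    cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
    return mean, np.sqrt(np.clip(np.diag(cov), 0.0, None))

# The posterior stddev is what Bayesian optimisation uses to trade off
# exploration against exploitation.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)
mean, std = gp_posterior(X, y, np.linspace(-3, 3, 100)[:, None])
```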

The kernel era's pedagogical legacy is durable: kernels remain the cleanest way to teach high-dimensional generalisation, the role of regularisation, and the connection between optimisation and statistics.
