Kernel Methods & The Kernel Trick

A kernel method replaces inner products ⟨x, x′⟩ with a kernel function K(x, x′) that implicitly computes inner products in some (potentially infinite-dimensional) feature space. Without ever materialising the feature map, you get to do non-linear classification, regression, and density estimation. This was the dominant idea in machine learning from 1995 to about 2010: the kernel era.

Feature maps and the trick

For an input space X and feature map ϕ: X → F into a Hilbert space F, define K(x, x′) = ⟨ϕ(x), ϕ(x′)⟩.

Many algorithms (perceptron, ridge regression, SVM, PCA) depend on the data only through inner products. So we can replace every ⟨x_i, x_j⟩ with K(x_i, x_j), get a non-linear method, and never compute ϕ explicitly. This is the kernel trick.

Concretely: ordinary linear regression in ϕ-space gives a function f(x) = ⟨w, ϕ(x)⟩. By the representer theorem, the optimum has w = Σ_i α_i ϕ(x_i), so

f(x) = Σ_i α_i K(x_i, x).

You only need the Gram matrix K_ij = K(x_i, x_j), never the explicit features.
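
As a sanity check on the trick, here is a small NumPy sketch (entirely illustrative: toy data and a homogeneous degree-2 polynomial kernel, neither taken from anything above). The Gram matrix computed through K matches the one computed through the explicit feature map ϕ, even though the kernel route never builds ϕ.

```python
import numpy as np

# Toy data: five points in R^2 (illustrative values only).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))

# Kernel view: homogeneous degree-2 polynomial kernel K(x, x') = (x.T x')^2.
K = (X @ X.T) ** 2

# Explicit view: phi(x) = (x1^2, x2^2, sqrt(2) * x1 * x2).
Phi = np.column_stack([X[:, 0] ** 2,
                       X[:, 1] ** 2,
                       np.sqrt(2) * X[:, 0] * X[:, 1]])
K_explicit = Phi @ Phi.T

# Same Gram matrix either way: the kernel computes phi-space inner products
# without ever materialising phi.
assert np.allclose(K, K_explicit)
```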

What counts as a kernel: Mercer's theorem

A function K: X × X → ℝ is a valid kernel iff it is symmetric and positive semi-definite: for any finite set of inputs, the Gram matrix is PSD. Mercer's theorem then guarantees the existence of a feature space F and a map ϕ with K(x, x′) = ⟨ϕ(x), ϕ(x′)⟩.

Useful operations preserve PSD-ness: sums, products, multiplication by a positive scalar, and composition with positive-coefficient power series. So you can build complex kernels from simple ones algebraically.
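
A quick numerical illustration of these closure rules (the data, bandwidth, and tolerance below are arbitrary choices for the demo):

```python
import numpy as np

def is_psd_gram(K, tol=1e-10):
    """Empirical Mercer check: symmetric and all eigenvalues >= 0 (up to tol)."""
    return np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_lin = X @ X.T                   # linear kernel
K_rbf = np.exp(-sq_dists / 2.0)   # RBF kernel with sigma = 1

# Sums, (elementwise) products, and positive scalings of kernels are kernels.
assert is_psd_gram(K_lin + K_rbf)
assert is_psd_gram(K_lin * K_rbf)
assert is_psd_gram(3.0 * K_rbf)
```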

The standard menu

  • Linear: K(x, x′) = x⊤x′. Just standard linear methods.
  • Polynomial: K(x, x′) = (x⊤x′ + c)^d. Feature space contains all monomials up to degree d.
  • Gaussian / RBF: K(x, x′) = exp(−‖x − x′‖² / 2σ²). Feature space is infinite-dimensional; corresponds to a Gaussian-smoothed comparison.
  • Laplacian: K(x, x′) = exp(−‖x − x′‖₁ / σ). L1-based variant.
  • Sigmoid: K(x, x′) = tanh(α x⊤x′ + c). Not always PSD; historically connected to neural networks.

The RBF kernel with cross-validated bandwidth σ is the universal "I don't know what kernel to use" default.
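
Each entry translates directly into a couple of lines of NumPy. A sketch, with parameter defaults (c, d, σ, α) picked purely for illustration:

```python
import numpy as np

# x and y are 1-D NumPy vectors; parameter names mirror the formulas above.
def linear(x, y):
    return x @ y

def polynomial(x, y, c=1.0, d=3):
    return (x @ y + c) ** d

def rbf(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def laplacian(x, y, sigma=1.0):
    return np.exp(-np.sum(np.abs(x - y)) / sigma)

def sigmoid(x, y, alpha=0.01, c=0.0):
    return np.tanh(alpha * (x @ y) + c)

def gram(kernel, X, **params):
    """Gram matrix K_ij = kernel(x_i, x_j) for the rows of X."""
    return np.array([[kernel(xi, xj, **params) for xj in X] for xi in X])
```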

Representer theorem

For any regularised risk

min_f Σ_i ℓ(y_i, f(x_i)) + λ‖f‖²_H,

with H a Reproducing Kernel Hilbert Space (RKHS) for kernel K, the optimum has the form

f(x) = Σ_{i=1}^N α_i K(x_i, x).

The infinite-dimensional optimisation reduces to choosing N scalars α_i. Combined with the kernel trick, this is what makes kernel methods computationally tractable, but also what makes them scale poorly: the Gram matrix is N × N, and N² memory is the practical ceiling.
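
For squared loss the reduction is completely explicit: kernel ridge regression solves the N × N linear system (K + λI)α = y and predicts with f(x) = Σ_i α_i K(x_i, x). A minimal sketch, assuming an RBF kernel and illustrative values of σ and λ:

```python
import numpy as np

def rbf_gram(A, B, sigma=1.0):
    """Gram matrix of the RBF kernel between the rows of A and the rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

# Toy 1-D regression problem (illustrative data).
rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)

# Representer theorem + squared loss: solve (K + lambda * I) alpha = y.
lam = 1e-2
K = rbf_gram(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

# Predict at new points via f(x) = sum_i alpha_i K(x_i, x).
X_test = np.linspace(-3, 3, 200)[:, None]
f_test = rbf_gram(X_test, X) @ alpha
```

The solve on the N × N system is exactly where the scaling ceiling discussed below comes from.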

The flagship: kernel SVM

Kernel methods reached their peak in the SVM. The dual SVM optimisation is

max_α Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j)   s.t.   0 ≤ α_i ≤ C, Σ_i α_i y_i = 0.

Plug in any valid K and you get a non-linear classifier. RBF SVMs were the de facto best classifier on most tabular datasets from 2000 to about 2010, only displaced by gradient-boosted trees and (much later) deep networks.
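
In practice nobody solves this dual by hand; libraries such as scikit-learn wrap specialised solvers. A typical RBF SVM fit with the cross-validated bandwidth mentioned earlier might look like the sketch below (the data and hyperparameter grid are illustrative; in scikit-learn's parameterisation, gamma plays the role of 1/(2σ²) in the RBF formula above):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy non-linearly-separable problem: label = inside/outside the unit circle.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)

# C is the box constraint on the dual variables alpha_i.
grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```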

Why kernels lost ground

Two reasons kernel methods are no longer the default:

  • Scaling. The Gram matrix is N × N. Past N ≈ 10⁵ this becomes prohibitive without low-rank approximations (Nyström, random Fourier features; the latter sketched after this list) that introduce their own errors.
  • Representation learning. Kernels encode a fixed feature space. Deep networks learn a feature space from data, often discovering more useful representations than any hand-chosen kernel could produce. Once you can train a deep model, RBF features start to look quaint.
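
Of the low-rank workarounds, random Fourier features (Rahimi & Recht, 2007) are the easiest to sketch: draw D random frequencies so that a D-dimensional map z satisfies z(x)⊤z(x′) ≈ K(x, x′) for the RBF kernel, then run a plain linear method on z. Everything below (D, σ, the toy data) is illustrative:

```python
import numpy as np

def random_fourier_features(X, D=500, sigma=1.0, seed=0):
    """Map X (N x d) to Z (N x D) with Z @ Z.T approximating the RBF Gram matrix."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / sigma, size=(X.shape[1], D))  # frequencies ~ N(0, I / sigma^2)
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)                # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# Sanity check against the exact RBF Gram matrix (sigma = 1).
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
Z = random_fourier_features(X, D=2000)
K_exact = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)
print(np.abs(Z @ Z.T - K_exact).max())   # approximation error shrinks as D grows
```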

Kernel methods survive in:

  • Small/medium structured-data problems (tabular regression, support-vector regression in chemistry).
  • Theoretical analysis — the Neural Tangent Kernel (NTK; Jacot et al., 2018) shows that wide neural networks behave like kernel methods, providing a bridge between the two paradigms.
  • Hybrid methods — e.g., Gaussian process regression with kernels, used in Bayesian optimisation.
