Kernel Methods & The Kernel Trick

A kernel method replaces inner products ⟨x, x′⟩ with a kernel function K(x, x′) that implicitly computes inner products in some (potentially infinite-dimensional) feature space. Without ever materialising the feature map, you get to do non-linear classification, regression, and density estimation. This was the dominant idea in machine learning from 1995 to about 2010: the kernel era.

Feature maps and the trick

For an input space X and feature map ϕ: X → F into a Hilbert space F, define K(x, x′) = ⟨ϕ(x), ϕ(x′)⟩.

Many algorithms (perceptron, ridge regression, SVM, PCA) depend on the data only through inner products. So we can replace every ⟨x_i, x_j⟩ with K(x_i, x_j), get a non-linear method, and never compute ϕ explicitly. This is the kernel trick.

Concretely: ordinary linear regression in ϕ-space gives a function f(x) = ⟨w, ϕ(x)⟩. By the representer theorem, the optimum has w = Σ_i α_i ϕ(x_i), so

f(x) = Σ_i α_i K(x_i, x).

You only need the Gram matrix K_ij = K(x_i, x_j), never the explicit features.
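
As a sanity check on the trick, here is a small NumPy sketch (entirely illustrative: toy data and a homogeneous degree-2 polynomial kernel, neither taken from anything above). The Gram matrix computed through K matches the one computed through the explicit feature map ϕ, even though the kernel route never builds ϕ.

```python
import numpy as np

# Toy data: five points in R^2 (illustrative values only).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))

# Kernel view: homogeneous degree-2 polynomial kernel K(x, x') = (x.T x')^2.
K = (X @ X.T) ** 2

# Explicit view: phi(x) = (x1^2, x2^2, sqrt(2) * x1 * x2).
Phi = np.column_stack([X[:, 0] ** 2,
                       X[:, 1] ** 2,
                       np.sqrt(2) * X[:, 0] * X[:, 1]])
K_explicit = Phi @ Phi.T

# Same Gram matrix either way: the kernel computes phi-space inner products
# without ever materialising phi.
assert np.allclose(K, K_explicit)
```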

What counts as a kernel: Mercer's theorem

A function K: X × X → ℝ is a valid kernel iff it is symmetric and positive semi-definite: for any finite set of inputs, the Gram matrix is PSD. Mercer's theorem then guarantees the existence of a feature space F and a map ϕ with K(x, x′) = ⟨ϕ(x), ϕ(x′)⟩.

Useful operations preserve PSD-ness: sums, products, multiplication by a positive scalar, and composition with positive-coefficient power series. So you can build complex kernels from simple ones algebraically.
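
A quick numerical illustration of these closure rules (the data, bandwidth, and tolerance below are arbitrary choices for the demo):

```python
import numpy as np

def is_psd_gram(K, tol=1e-10):
    """Empirical Mercer check: symmetric and all eigenvalues >= 0 (up to tol)."""
    return np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_lin = X @ X.T                   # linear kernel
K_rbf = np.exp(-sq_dists / 2.0)   # RBF kernel with sigma = 1

# Sums, (elementwise) products, and positive scalings of kernels are kernels.
assert is_psd_gram(K_lin + K_rbf)
assert is_psd_gram(K_lin * K_rbf)
assert is_psd_gram(3.0 * K_rbf)
```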

The standard menu

  • Linear: K(x, x′) = x⊤x′. Just standard linear methods.
  • Polynomial: K(x, x′) = (x⊤x′ + c)^d. Feature space contains all monomials up to degree d.
  • Gaussian / RBF: K(x, x′) = exp(−‖x − x′‖² / 2σ²). Feature space is infinite-dimensional; corresponds to a Gaussian-smoothed comparison.
  • Laplacian: K(x, x′) = exp(−‖x − x′‖₁ / σ). L1-based variant.
  • Sigmoid: K(x, x′) = tanh(α x⊤x′ + c). Not always PSD; historically connected to neural networks.

The RBF kernel with cross-validated bandwidth σ is the universal "I don't know what kernel to use" default.
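
Each entry translates directly into a couple of lines of NumPy. A sketch, with parameter defaults (c, d, σ, α) picked purely for illustration:

```python
import numpy as np

# x and y are 1-D NumPy vectors; parameter names mirror the formulas above.
def linear(x, y):
    return x @ y

def polynomial(x, y, c=1.0, d=3):
    return (x @ y + c) ** d

def rbf(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def laplacian(x, y, sigma=1.0):
    return np.exp(-np.sum(np.abs(x - y)) / sigma)

def sigmoid(x, y, alpha=0.01, c=0.0):
    return np.tanh(alpha * (x @ y) + c)

def gram(kernel, X, **params):
    """Gram matrix K_ij = kernel(x_i, x_j) for the rows of X."""
    return np.array([[kernel(xi, xj, **params) for xj in X] for xi in X])
```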

Representer theorem

For any regularised risk

min_f Σ_i ℓ(y_i, f(x_i)) + λ‖f‖²_H,

with H a Reproducing Kernel Hilbert Space (RKHS) for kernel K, the optimum has the form

f(x) = Σ_{i=1}^N α_i K(x_i, x).

The infinite-dimensional optimisation reduces to choosing N scalars α_i. Combined with the kernel trick, this is what makes kernel methods computationally tractable, but also what makes them scale poorly: the Gram matrix is N × N, and N² memory is the practical ceiling.
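
For squared loss the reduction is completely explicit: kernel ridge regression solves the N × N linear system (K + λI)α = y and predicts with f(x) = Σ_i α_i K(x_i, x). A minimal sketch, assuming an RBF kernel and illustrative values of σ and λ:

```python
import numpy as np

def rbf_gram(A, B, sigma=1.0):
    """Gram matrix of the RBF kernel between the rows of A and the rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

# Toy 1-D regression problem (illustrative data).
rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)

# Representer theorem + squared loss: solve (K + lambda * I) alpha = y.
lam = 1e-2
K = rbf_gram(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

# Predict at new points via f(x) = sum_i alpha_i K(x_i, x).
X_test = np.linspace(-3, 3, 200)[:, None]
f_test = rbf_gram(X_test, X) @ alpha
```

The solve on the N × N system is exactly where the scaling ceiling discussed below comes from.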

The flagship: kernel SVM

Kernel methods reached their peak in the SVM. The dual SVM optimisation is

max_α Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j)   s.t.   0 ≤ α_i ≤ C, Σ_i α_i y_i = 0.

Plug in any valid K and you get a non-linear classifier. RBF SVMs were the de facto best classifier on most tabular datasets from 2000 to about 2010, only displaced by gradient-boosted trees and (much later) deep networks.
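
In practice nobody solves this dual by hand; libraries such as scikit-learn wrap specialised solvers. A typical RBF SVM fit with the cross-validated bandwidth mentioned earlier might look like the sketch below (the data and hyperparameter grid are illustrative; in scikit-learn's parameterisation, gamma plays the role of 1/(2σ²) in the RBF formula above):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy non-linearly-separable problem: label = inside/outside the unit circle.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)

# C is the box constraint on the dual variables alpha_i.
grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```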

Why kernels lost ground

Two reasons kernel methods are no longer the default:

  • Scaling. The Gram matrix is N × N. Past N ≈ 10⁵ this becomes prohibitive without low-rank approximations (Nyström, random Fourier features; the latter sketched after this list) that introduce their own errors.
  • Representation learning. Kernels encode a fixed feature space. Deep networks learn a feature space from data, often discovering more useful representations than any hand-chosen kernel could produce. Once you can train a deep model, RBF features start to look quaint.
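
Of the low-rank workarounds, random Fourier features (Rahimi & Recht, 2007) are the easiest to sketch: draw D random frequencies so that a D-dimensional map z satisfies z(x)⊤z(x′) ≈ K(x, x′) for the RBF kernel, then run a plain linear method on z. Everything below (D, σ, the toy data) is illustrative:

```python
import numpy as np

def random_fourier_features(X, D=500, sigma=1.0, seed=0):
    """Map X (N x d) to Z (N x D) with Z @ Z.T approximating the RBF Gram matrix."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / sigma, size=(X.shape[1], D))  # frequencies ~ N(0, I / sigma^2)
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)                # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# Sanity check against the exact RBF Gram matrix (sigma = 1).
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
Z = random_fourier_features(X, D=2000)
K_exact = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)
print(np.abs(Z @ Z.T - K_exact).max())   # approximation error shrinks as D grows
```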

Kernel methods survive in:

  • Small/medium structured-data problems (tabular regression, support-vector regression in chemistry).
  • Theoretical analysis — the Neural Tangent Kernel (NTK; Jacot et al., 2018) shows that wide neural networks behave like kernel methods, providing a bridge between the two paradigms.
  • Hybrid methods — e.g., Gaussian process regression with kernels, used in Bayesian optimisation.
