Kernel Methods & The Kernel Trick
A kernel method replaces inner products $\langle x, x' \rangle$ with evaluations of a kernel function $k(x, x')$, implicitly computing inner products in a richer feature space without ever constructing that space.
Feature maps and the trick
For an input space $\mathcal{X}$, a feature map $\phi : \mathcal{X} \to \mathcal{H}$ embeds inputs into a (possibly infinite-dimensional) Hilbert space $\mathcal{H}$, and the kernel is the induced inner product $k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}$.
Many algorithms (perceptron, ridge regression, SVM, PCA) depend on the data only through inner products. So we can replace every occurrence of $\langle x_i, x_j \rangle$ with $k(x_i, x_j)$ and run the same algorithm in $\mathcal{H}$ without ever computing $\phi$ explicitly; this substitution is the kernel trick.
Concretely: ordinary linear regression in feature space, $\min_w \lVert \Phi w - y \rVert^2 + \lambda \lVert w \rVert^2$ with rows $\Phi_{i\cdot} = \phi(x_i)$, has the dual solution $\alpha = (K + \lambda I)^{-1} y$ and predictions $f(x) = \sum_i \alpha_i \, k(x_i, x)$.
You only need the Gram matrix $K \in \mathbb{R}^{n \times n}$, $K_{ij} = k(x_i, x_j)$; the feature map itself never has to be evaluated.
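A minimal NumPy sketch of this dual computation, assuming an RBF kernel and illustrative toy data; the function names are ours, not from any particular library:

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # Pairwise squared distances, then k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)).
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def fit_kernel_ridge(X, y, lam=1e-2, sigma=1.0):
    # Dual solution: alpha = (K + lam * I)^{-1} y. Only the Gram matrix appears.
    K = rbf_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(X_train, alpha, X_new, sigma=1.0):
    # f(x) = sum_i alpha_i k(x_i, x)
    return rbf_kernel(X_new, X_train, sigma) @ alpha

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=50)   # toy regression target
alpha = fit_kernel_ridge(X, y)
print(predict(X, alpha, X[:3]))
```

Training cost is dominated by the $n \times n$ linear solve, which is exactly the scaling issue discussed further down.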
What counts as a kernel: Mercer's theorem
A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a valid kernel if and only if it is symmetric and positive semidefinite: for every finite set $x_1, \dots, x_n$, the Gram matrix $K_{ij} = k(x_i, x_j)$ is PSD. Mercer's theorem then guarantees a feature map $\phi$ with $k(x, x') = \langle \phi(x), \phi(x') \rangle$.
Useful operations preserve PSD-ness: sums, products, multiplication by a positive scalar, and composition with positive-coefficient power series. So you can build complex kernels from simple ones algebraically.
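A quick numerical check of these closure properties (a sketch; the data and the scalings are arbitrary): combine Gram matrices by a sum, an elementwise product, and an elementwise power series, then confirm the smallest eigenvalue is nonnegative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))

lin = X @ X.T                                       # linear-kernel Gram matrix
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
rbf = np.exp(-sq / 2.0)                             # RBF-kernel Gram matrix

# Positive scalar * kernel, kernel product (elementwise), and exp (a
# positive-coefficient power series) all preserve PSD-ness.
composite = 0.5 * lin + lin * rbf + np.exp(lin / 10)
print(np.linalg.eigvalsh(composite).min())          # >= 0 up to numerical noise
```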
The standard menu
- Linear: $k(x, x') = x^\top x'$. Just standard linear methods.
- Polynomial: $k(x, x') = (x^\top x' + c)^d$. Feature space contains all monomials up to degree $d$.
- Gaussian / RBF: $k(x, x') = \exp\!\left(-\lVert x - x' \rVert^2 / 2\sigma^2\right)$. Feature space is infinite-dimensional; corresponds to a Gaussian-smoothed comparison.
- Laplacian: $k(x, x') = \exp\!\left(-\lVert x - x' \rVert_1 / \sigma\right)$. L1-based variant.
- Sigmoid: $k(x, x') = \tanh(a\, x^\top x' + b)$. Not always PSD; historically connected to neural networks.
The RBF kernel with a cross-validated bandwidth $\sigma$ is the usual default when you have no prior knowledge of the problem's structure.
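One way to do that selection in practice, sketched with scikit-learn's GridSearchCV; the toy data and grid values are placeholders, and `gamma` is scikit-learn's parameterisation $\gamma = 1/(2\sigma^2)$:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] * X[:, 1] > 0).astype(int)   # a toy nonlinear target

# Small gamma = wide bandwidth (smoother), large gamma = narrow (spikier).
grid = GridSearchCV(SVC(kernel="rbf"),
                    param_grid={"gamma": np.logspace(-3, 2, 6), "C": [0.1, 1, 10]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_)
```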
Representer theorem
For any regularised risk minimisation over the reproducing kernel Hilbert space $\mathcal{H}$ of $k$,

$$\min_{f \in \mathcal{H}} \; \sum_{i=1}^{n} L\big(y_i, f(x_i)\big) + \lambda \lVert f \rVert_{\mathcal{H}}^2,$$

with $\lambda > 0$, the minimiser has the form

$$f^\star(x) = \sum_{i=1}^{n} \alpha_i \, k(x_i, x).$$

The infinite-dimensional optimisation reduces to choosing the $n$ coefficients $\alpha_1, \dots, \alpha_n$.
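The reduction can be verified numerically in a finite-dimensional case (a sketch for ridge regression with a degree-2 polynomial kernel; all names are illustrative): the primal solution over explicit monomial features and the dual solution over $n$ coefficients make identical predictions.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = rng.normal(size=40)
lam = 0.1

def phi(X):
    # Explicit feature map for k(x, x') = (x . x' + 1)^2: monomials up to
    # degree 2, with sqrt(2) scalings so phi(x) . phi(x') equals the kernel.
    x1, x2 = X[:, 0], X[:, 1]
    return np.stack([np.ones_like(x1),
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2], axis=1)

P = phi(X)
# Primal ridge: optimise explicit weights w in feature space.
w = np.linalg.solve(P.T @ P + lam * np.eye(P.shape[1]), P.T @ y)
# Dual ridge: optimise n coefficients alpha, using only the Gram matrix.
K = (X @ X.T + 1) ** 2          # equals P @ P.T
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

# Same predictions: f(x) = phi(x) . w = sum_i alpha_i k(x_i, x)
print(np.allclose(P @ w, K @ alpha))   # True
```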
The flagship: kernel SVM
Kernel methods reached their peak in the SVM. The dual SVM optimisation is

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, k(x_i, x_j) \quad \text{subject to} \quad 0 \le \alpha_i \le C, \;\; \sum_i \alpha_i y_i = 0.$$

Plug in any valid kernel $k$ and you get a maximum-margin classifier in the corresponding feature space, with decision function $f(x) = \operatorname{sign}\big(\sum_i \alpha_i y_i \, k(x_i, x) + b\big)$.
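In practice the kernel plugs straight into an off-the-shelf dual solver. A sketch using scikit-learn's SVC with a precomputed Gram matrix, so the kernel choice is fully explicit (data and bandwidth are placeholders):

```python
import numpy as np
from sklearn.svm import SVC

def rbf(A, B, sigma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2))
y_train = (np.linalg.norm(X_train, axis=1) > 1.2).astype(int)  # radial classes
X_test = rng.normal(size=(10, 2))

clf = SVC(kernel="precomputed", C=1.0)
clf.fit(rbf(X_train, X_train), y_train)      # n_train x n_train Gram matrix
print(clf.predict(rbf(X_test, X_train)))     # rows: test points, cols: train
```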
Why kernels lost ground
Two reasons kernel methods are no longer the default:
- Scaling. The Gram matrix is $n \times n$, so memory is $O(n^2)$ and exact training is typically $O(n^3)$. Past a few tens of thousands of examples this becomes prohibitive without low-rank approximations (Nyström, random Fourier features) that introduce their own errors; see the sketch after this list.
- Representation learning. Kernels encode a fixed feature space. Deep networks learn a feature space from data, often discovering more useful representations than any hand-chosen kernel could produce. Once you can train a deep model, RBF features start to look quaint.
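A sketch of one such approximation, random Fourier features for the RBF kernel (Rahimi & Recht, 2007); the dimensions and bandwidth here are illustrative:

```python
import numpy as np

def random_fourier_features(X, D=500, sigma=1.0, seed=0):
    # z(x) = sqrt(2/D) * cos(W x + b), W ~ N(0, 1/sigma^2), b ~ U[0, 2pi];
    # then z(x) . z(x') approximates exp(-||x - x'||^2 / (2 sigma^2)).
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / sigma, size=(X.shape[1], D))
    b = rng.uniform(0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
Z = random_fourier_features(X, D=2000)
K_approx = Z @ Z.T                           # O(n D) memory instead of O(n^2)
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-sq / 2.0)
print(np.abs(K_approx - K_exact).max())      # small for large D
```

With the features in hand, any linear method on $z(x)$ approximates its kernelised counterpart at linear cost in $n$.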
Kernel methods survive in:
- Small/medium structured-data problems (tabular regression, support-vector regression in chemistry).
- Theoretical analysis — the Neural Tangent Kernel (NTK; Jacot et al., 2018) shows that wide neural networks behave like kernel methods, providing a bridge between the two paradigms.
- Hybrid methods — e.g., Gaussian process regression, whose covariance function is a kernel, as used in Bayesian optimisation.
What to read next
- SVM — the canonical kernel method.
- Kernel Era (history) — the rise and fall, in context.
- Deep Learning Renaissance — what displaced kernels.