Support Vector Machines (SVM)
SVMs were the dominant supervised classifier from roughly 1995 to 2012 — the period between the perceptron's revival and the AlexNet moment. They are still the right answer for many small-data, structured problems, and the kernel trick they popularized is reborn in modern attention mechanisms.
The geometric idea
Given linearly separable data $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$ and labels $y_i \in \{-1, +1\}$, an SVM seeks the separating hyperplane $w^\top x + b = 0$ with the largest margin, i.e. the greatest distance to the closest training point. Scaling $(w, b)$ so that the closest points satisfy $y_i(w^\top x_i + b) = 1$, the margin equals $1/\lVert w\rVert$, so maximising the margin is equivalent to minimising $\lVert w\rVert^2$.
This gives the hard-margin primal:

$$
\min_{w,\,b}\ \tfrac{1}{2}\lVert w\rVert^2
\quad\text{s.t.}\quad y_i(w^\top x_i + b) \ge 1,\ \ i = 1,\dots,n.
$$
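To make the geometry concrete, here is a minimal NumPy sketch with a hand-picked hyperplane and toy points (nothing is fitted; all numbers are illustrative) that checks the margin constraints and computes the geometric margin:

```python
import numpy as np

# Hand-picked hyperplane w^T x + b = 0 and toy points -- illustrative only,
# not the output of any solver.
w = np.array([2.0, 1.0])
b = -3.0
X = np.array([[3.0, 2.0], [2.5, 1.5], [0.5, 0.5], [1.0, -1.0]])
y = np.array([+1, +1, -1, -1])

# Hard-margin constraints: y_i (w^T x_i + b) >= 1 for every training point.
functional = y * (X @ w + b)
print("functional margins:", functional)
print("all constraints satisfied:", bool(np.all(functional >= 1)))

# Geometric margin = distance from the hyperplane to the closest point.
# Under the canonical scaling min_i y_i (w^T x_i + b) = 1 this is exactly 1/||w||,
# which is why minimising ||w||^2 maximises the margin.
print("geometric margin:", functional.min() / np.linalg.norm(w))
```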
Soft margin (Cortes & Vapnik, 1995)
Real data is rarely separable. Introduce slack variables $\xi_i \ge 0$ that measure how far each point violates the margin, and penalise the total violation:

$$
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^n \xi_i
\quad\text{s.t.}\quad y_i(w^\top x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0.
$$

The hyperparameter $C > 0$ trades margin width against training violations: small $C$ favours a wide margin and tolerates misclassified points, large $C$ punishes violations heavily and approaches the hard-margin solution.
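A quick way to see what $C$ does is to fit a soft-margin linear SVM at two settings and compare the margin width and the number of support vectors. The toy data and the two values of $C$ below are arbitrary illustrations, and scikit-learn's `SVC` is just one convenient solver:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping blobs, so slack variables are actually needed.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 1.0 / np.linalg.norm(clf.coef_)          # half-width of the margin
    print(f"C={C:>6}: support vectors={clf.support_vectors_.shape[0]:3d}, "
          f"margin={margin:.3f}")

# Small C -> wide margin, many margin violations (many support vectors).
# Large C -> narrow margin, few violations; approaches the hard-margin solution.
```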
The dual & support vectors
The Lagrangian dual is

$$
\max_{\alpha}\ \sum_{i=1}^n \alpha_i \;-\; \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, \langle x_i, x_j\rangle
\quad\text{s.t.}\quad 0 \le \alpha_i \le C,\ \ \sum_i \alpha_i y_i = 0.
$$

Only points with $\alpha_i > 0$ enter the solution; these are the support vectors. Since $w = \sum_i \alpha_i y_i x_i$, the decision function $f(x) = \sum_i \alpha_i y_i \langle x_i, x\rangle + b$ depends on the training set only through the support vectors and inner products.
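A sketch of the same idea through scikit-learn's fitted attributes (toy data again, purely illustrative): `dual_coef_` stores $\alpha_i y_i$ for the support vectors only, so both $w$ and the decision function can be rebuilt from those points alone:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.5, random_state=1)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Points with alpha_i > 0 are the support vectors; everything else drops out.
print("training points:", len(X), " support vectors:", len(clf.support_))

# dual_coef_ holds alpha_i * y_i for the support vectors only,
# so w = sum_i alpha_i y_i x_i can be rebuilt from them.
w = clf.dual_coef_ @ clf.support_vectors_          # shape (1, n_features)
print("w rebuilt from support vectors matches coef_:",
      np.allclose(w, clf.coef_))

# Decision function f(x) = sum_i alpha_i y_i <x_i, x> + b, support vectors only.
x_new = X[:5]
f_manual = x_new @ w.ravel() + clf.intercept_
print("matches decision_function:",
      np.allclose(f_manual, clf.decision_function(x_new)))
```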
The kernel trick
Replace every inner product $\langle x_i, x_j\rangle$ in the dual with a kernel $k(x_i, x_j) = \langle \phi(x_i), \phi(x_j)\rangle$, which evaluates an inner product in a (possibly infinite-dimensional) feature space without ever computing the map $\phi$. Common choices (computed explicitly in the sketch below):
- Linear: $k(x, x') = x^\top x'$
- Polynomial: $k(x, x') = (\gamma\, x^\top x' + r)^d$
- RBF / Gaussian: $k(x, x') = \exp(-\gamma \lVert x - x'\rVert^2)$
The RBF kernel is universal: an RBF SVM can approximate any continuous decision function on a compact (bounded and closed) input domain.
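The three kernels can be written out directly. The sketch below (arbitrary toy data and hyperparameters) checks the hand-written formulas against scikit-learn's pairwise kernels, then shows the RBF kernel separating concentric circles that defeat a linear SVM:

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel
from sklearn.svm import SVC
from sklearn.datasets import make_circles

rng = np.random.default_rng(0)
A, B = rng.normal(size=(5, 3)), rng.normal(size=(4, 3))
gamma, r, d = 0.5, 1.0, 3

# Hand-written kernel formulas vs. library implementations.
sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
assert np.allclose(A @ B.T, linear_kernel(A, B))
assert np.allclose((gamma * A @ B.T + r) ** d,
                   polynomial_kernel(A, B, degree=d, gamma=gamma, coef0=r))
assert np.allclose(np.exp(-gamma * sq_dists), rbf_kernel(A, B, gamma=gamma))

# RBF SVM on concentric circles: linearly inseparable in input space,
# easily separated in the implicit RBF feature space.
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
print("RBF train accuracy:", SVC(kernel="rbf", gamma=2.0).fit(X, y).score(X, y))
print("linear train accuracy:", SVC(kernel="linear").fit(X, y).score(X, y))
```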
Why SVMs still matter
- They give a principled large-margin classifier for tabular and small-sample problems.
- The dual / kernel formulation generalises to many other tasks (kernel PCA, Gaussian processes, kernel ridge regression).
- Modern self-attention is, viewed from one angle, a learned kernel similarity between tokens (see Transformer Era · 2017).
What to read next
- Kernel Methods & The Kernel Trick
- The Perceptron — the SVM's ancestor.
- The Kernel Era (1995–2010) — historical context.
Stub status
Seed introduction. Expand with hinge loss / sub-gradient view, SMO solver, multi-class extensions, and structural SVMs.