Support Vector Machines (SVM)

SVMs were the dominant supervised classifier from roughly 1995 to 2012 — the period between the perceptron's revival and the AlexNet moment. They are still the right answer for many small-data, structured problems, and the kernel trick they popularized is reborn in modern attention mechanisms.

The geometric idea

Given linearly separable data $\{(x_i, y_i)\}_{i=1}^{n}$ with $y_i \in \{-1, +1\}$, find the hyperplane $w^\top x + b = 0$ that maximizes the margin, i.e. the distance to the nearest training point.

This gives the hard-margin primal:

$$\min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_i(w^\top x_i + b) \ge 1,\ \ \forall i.$$
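
Why minimizing $\|w\|$ maximizes the margin: the constraint fixes the scale of $(w, b)$ so that the closest points satisfy $y_i(w^\top x_i + b) = 1$. For such a point the distance to the hyperplane is

$$\frac{|w^\top x_i + b|}{\|w\|} = \frac{1}{\|w\|},$$

so the full margin between the two classes is $2/\|w\|$, and maximizing it is equivalent to minimizing $\tfrac{1}{2}\|w\|^2$.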

Soft margin (Cortes & Vapnik, 1995)

Real data is rarely separable. Introduce slack variables $\xi_i \ge 0$:

$$\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i \quad \text{s.t.}\quad y_i(w^\top x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0.$$

The hyperparameter $C$ trades off margin width against training error.
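
As a concrete illustration (an addition, not from the original note), here is a minimal scikit-learn sketch; the library, the toy dataset, and the specific $C$ values are assumptions chosen for the example:

```python
# Minimal soft-margin SVM sketch (assumes scikit-learn; toy data is illustrative).
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C -> wide margin, more violations; large C -> narrow margin, fewer violations.
    print(f"C={C:>6}: {clf.n_support_.sum()} support vectors, "
          f"train accuracy {clf.score(X, y):.2f}")
```

Typically the support-vector count shrinks as $C$ grows, because fewer points are allowed to sit inside the margin.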

The dual & support vectors

The Lagrangian dual is

$$\max_{\alpha}\ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i \alpha_j y_i y_j\, x_i^\top x_j \quad \text{s.t.}\quad 0 \le \alpha_i \le C,\ \ \sum_i \alpha_i y_i = 0.$$

Only points with $\alpha_i > 0$ matter; these are the support vectors. The dual depends on the $x_i$ only through inner products $x_i^\top x_j$, which is exactly what enables the kernel trick.
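
To see this sparsity concretely, the sketch below (again assuming scikit-learn and a toy dataset) inspects the dual solution of a fitted linear SVM; the model keeps only the support vectors and the products $y_i \alpha_i$:

```python
# Sketch: inspecting the dual solution of a linear SVM (assumes scikit-learn).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=50, centers=2, random_state=1)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print("support vector indices:", clf.support_)  # the points with alpha_i > 0
print("y_i * alpha_i:", clf.dual_coef_)         # each entry bounded by C in magnitude
# For a linear kernel, w is a combination of the support vectors alone:
w = clf.dual_coef_ @ clf.support_vectors_
print("w recovered from the dual matches coef_:", np.allclose(w, clf.coef_))
```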

The kernel trick

Replace $x_i^\top x_j$ with a positive-definite kernel $k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$ that implicitly maps inputs to a high-dimensional feature space. Common choices:

  • Linear: $k(x, x') = x^\top x'$
  • Polynomial: $k(x, x') = (x^\top x' + c)^d$
  • RBF / Gaussian: $k(x, x') = \exp(-\gamma\,\|x - x'\|^2)$

The RBF kernel makes SVMs universal approximators on bounded inputs.
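
The following sketch makes the "inner products only" point concrete; scikit-learn, the toy data, and $\gamma = 0.5$ are assumptions for illustration. It computes the RBF Gram matrix directly and hands it to the dual solver as a callable kernel:

```python
# Sketch: a hand-written RBF kernel plugged into SVC (assumes scikit-learn).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

def my_rbf(X, Y, gamma=0.5):
    # k(x, x') = exp(-gamma * ||x - x'||^2) for every pair of rows.
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

X, y = make_blobs(n_samples=60, centers=2, random_state=2)

# Sanity check against scikit-learn's built-in RBF kernel.
assert np.allclose(my_rbf(X, X), rbf_kernel(X, X, gamma=0.5))

clf = SVC(kernel=my_rbf, C=1.0).fit(X, y)  # the solver only ever sees k(x_i, x_j)
print("train accuracy:", clf.score(X, y))
```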

Why SVMs still matter

  1. They give a principled large-margin classifier for tabular and small-sample problems.
  2. The dual / kernel formulation generalises to many other tasks (kernel PCA, Gaussian processes, kernel ridge regression).
  3. Modern self-attention is, viewed from one angle, a learned kernel similarity between tokens (see Transformer Era · 2017).

Stub status

Seed introduction. Expand with hinge loss / sub-gradient view, SMO solver, multi-class extensions, and structural SVMs.

Released under the MIT License. Content imported and adapted from NoteNextra.