Support Vector Machines (SVM)

SVMs were the dominant supervised classifier from roughly 1995 to 2012 — the period between the perceptron's revival and the AlexNet moment. They are still the right answer for many small-data, structured problems, and the kernel trick they popularized is reborn in modern attention mechanisms.

The geometric idea

Given linearly separable data $\{(x_i, y_i)\}_{i=1}^{n}$ with $y_i \in \{-1, +1\}$, find the hyperplane $w^\top x + b = 0$ that maximizes the margin, i.e. the distance to the nearest training point.

This gives the hard-margin primal:

$$\min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_i(w^\top x_i + b) \ge 1,\ \ \forall i.$$
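
Why minimizing $\|w\|$ maximizes the margin: the constraint fixes the scale of $(w, b)$ so that the closest points satisfy $y_i(w^\top x_i + b) = 1$. For such a point the distance to the hyperplane is

$$\frac{|w^\top x_i + b|}{\|w\|} = \frac{1}{\|w\|},$$

so the full margin between the two classes is $2/\|w\|$, and maximizing it is equivalent to minimizing $\tfrac{1}{2}\|w\|^2$.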

Soft margin (Cortes & Vapnik, 1995)

Real data is rarely separable. Introduce slack variables $\xi_i \ge 0$:

$$\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i \quad \text{s.t.}\quad y_i(w^\top x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0.$$

The hyperparameter $C$ trades off margin width against training error.
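
As a concrete illustration (an addition, not from the original note), here is a minimal scikit-learn sketch; the library, the toy dataset, and the specific $C$ values are assumptions chosen for the example:

```python
# Minimal soft-margin SVM sketch (assumes scikit-learn; toy data is illustrative).
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C -> wide margin, more violations; large C -> narrow margin, fewer violations.
    print(f"C={C:>6}: {clf.n_support_.sum()} support vectors, "
          f"train accuracy {clf.score(X, y):.2f}")
```

Typically the support-vector count shrinks as $C$ grows, because fewer points are allowed to sit inside the margin.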

The dual & support vectors

The Lagrangian dual is

$$\max_{\alpha}\ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i \alpha_j y_i y_j\, x_i^\top x_j \quad \text{s.t.}\quad 0 \le \alpha_i \le C,\ \ \sum_i \alpha_i y_i = 0.$$

Only points with $\alpha_i > 0$ matter; these are the support vectors. The dual depends on the $x_i$ only through inner products $x_i^\top x_j$, which is exactly what enables the kernel trick.
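
To see this sparsity concretely, the sketch below (again assuming scikit-learn and a toy dataset) inspects the dual solution of a fitted linear SVM; the model keeps only the support vectors and the products $y_i \alpha_i$:

```python
# Sketch: inspecting the dual solution of a linear SVM (assumes scikit-learn).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=50, centers=2, random_state=1)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print("support vector indices:", clf.support_)  # the points with alpha_i > 0
print("y_i * alpha_i:", clf.dual_coef_)         # each entry bounded by C in magnitude
# For a linear kernel, w is a combination of the support vectors alone:
w = clf.dual_coef_ @ clf.support_vectors_
print("w recovered from the dual matches coef_:", np.allclose(w, clf.coef_))
```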

The kernel trick

Replace $x_i^\top x_j$ with a positive-definite kernel $k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$ that implicitly maps inputs to a high-dimensional feature space. Common choices:

  • Linear: $k(x, x') = x^\top x'$
  • Polynomial: $k(x, x') = (x^\top x' + c)^d$
  • RBF / Gaussian: $k(x, x') = \exp(-\gamma\,\|x - x'\|^2)$

The RBF kernel makes SVMs universal approximators on bounded inputs.
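
The following sketch makes the "inner products only" point concrete; scikit-learn, the toy data, and $\gamma = 0.5$ are assumptions for illustration. It computes the RBF Gram matrix directly and hands it to the dual solver as a callable kernel:

```python
# Sketch: a hand-written RBF kernel plugged into SVC (assumes scikit-learn).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

def my_rbf(X, Y, gamma=0.5):
    # k(x, x') = exp(-gamma * ||x - x'||^2) for every pair of rows.
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

X, y = make_blobs(n_samples=60, centers=2, random_state=2)

# Sanity check against scikit-learn's built-in RBF kernel.
assert np.allclose(my_rbf(X, X), rbf_kernel(X, X, gamma=0.5))

clf = SVC(kernel=my_rbf, C=1.0).fit(X, y)  # the solver only ever sees k(x_i, x_j)
print("train accuracy:", clf.score(X, y))
```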

Why SVMs still matter

  1. They give a principled large-margin classifier for tabular and small-sample problems.
  2. The dual / kernel formulation generalises to many other tasks (kernel PCA, Gaussian processes, kernel ridge regression).
  3. Modern self-attention is, viewed from one angle, a learned kernel similarity between tokens (see Transformer Era · 2017).

Stub status

Seed introduction. Expand with hinge loss / sub-gradient view, SMO solver, multi-class extensions, and structural SVMs.

Released under the MIT License. Content imported and adapted from NoteNextra.