Convex Optimization Basics
A convex optimisation problem has a convex objective on a convex feasible set. Such problems are easy in a precise sense: any local minimum is a global one, the dual cleanly lower-bounds the primal, and efficient algorithms exist. Most pre-deep-learning ML — OLS, ridge, lasso, SVM, logistic regression — fits inside this framework. Even though modern deep learning lives outside it, convex analysis is the right vocabulary for understanding what's not working.
Convex sets and functions
A set $C$ is convex if the line segment between any two of its points stays inside it: $\theta x + (1 - \theta) y \in C$ for all $x, y \in C$ and $\theta \in [0, 1]$.

A function $f$ is convex if its domain is a convex set and $f(\theta x + (1 - \theta) y) \le \theta f(x) + (1 - \theta) f(y)$ for all $x, y$ in the domain and $\theta \in [0, 1]$: the graph lies on or below every chord.

For twice-differentiable $f$, convexity is equivalent to a positive-semidefinite Hessian, $\nabla^2 f(x) \succeq 0$, everywhere on the domain.

Useful operations preserve convexity: non-negative weighted sums, pointwise max, and composition with affine maps. Log-sum-exp, $\operatorname{lse}(x) = \log \sum_i e^{x_i}$, is convex, which is why the softmax cross-entropy loss is convex in the logits.
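A minimal numerical sanity check, not a proof (numpy only; the closed-form Hessian $\nabla^2 \operatorname{lse}(x) = \operatorname{diag}(p) - pp^\top$ with $p = \operatorname{softmax}(x)$ is standard): sample points and confirm the Hessian's smallest eigenvalue is never negative.

```python
import numpy as np

def lse_hessian(x):
    """Hessian of log-sum-exp at x: diag(p) - p p^T, with p = softmax(x)."""
    z = x - x.max()                      # stabilise the exponentials
    p = np.exp(z) / np.exp(z).sum()
    return np.diag(p) - np.outer(p, p)

rng = np.random.default_rng(0)
for _ in range(1000):
    H = lse_hessian(rng.normal(size=5))
    # Smallest eigenvalue of a PSD matrix should be >= 0 (up to rounding).
    assert np.linalg.eigvalsh(H).min() > -1e-12
print("Hessian PSD at all sampled points -- consistent with convexity")
```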
The convex problem
A standard-form convex problem is

$$\min_x \; f_0(x) \quad \text{subject to} \quad f_i(x) \le 0,\; i = 1, \dots, m, \quad Ax = b,$$

with $f_0, \dots, f_m$ convex and the equality constraints affine. Three subclasses recur throughout ML:

- Linear program (LP) — linear $f_0$ and linear constraints. Solvable to optimality at scale (commercial solvers handle millions of variables).
- Quadratic program (QP) — convex quadratic $f_0$, linear constraints. The SVM dual is a QP; lasso is a QP (a modelling sketch follows this list).
- Semidefinite program (SDP) — variables include matrices, with PSD constraints. Used in some kernel learning, control, and sparse-recovery literature.
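To make the taxonomy concrete, a minimal sketch (assuming the cvxpy package; the data and problem size are made up) that poses lasso, a QP, in modelling-language form and hands it to a generic convex solver:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

x = cp.Variable(d)
lam = 0.5
# Lasso: least squares plus an l1 penalty. Convex, and expressible as a QP.
objective = cp.Minimize(cp.sum_squares(A @ x - b) + lam * cp.norm1(x))
problem = cp.Problem(objective)
problem.solve()

print("status:", problem.status)   # 'optimal' certifies global optimality
print("nonzeros:", np.sum(np.abs(x.value) > 1e-6))
```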
Why convexity matters
Two structural results:
- Local = global. A local minimum of a convex function is a global minimum. Gradient descent cannot get stuck in suboptimal basins.
- First-order conditions are sufficient. $\nabla f(x^\star) = 0$ on an unconstrained convex problem certifies optimality. With constraints, the KKT conditions generalise this: stationarity, primal/dual feasibility, complementary slackness. Solving KKT solves the problem. (A numerical check of the unconstrained certificate is sketched below.)
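A minimal numpy sketch of that certificate, using ridge regression as a made-up but representative example: solve in closed form, then confirm the gradient vanishes, which by convexity certifies a global minimum.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
y = rng.normal(size=50)
lam = 1.0

# Ridge objective: f(w) = ||Xw - y||^2 + lam * ||w||^2  (strongly convex).
w = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)  # closed-form minimiser

grad = 2 * X.T @ (X @ w - y) + 2 * lam * w
# For convex f, a (near-)zero gradient is a certificate of global optimality.
print("||grad|| =", np.linalg.norm(grad))  # numerically zero
```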
In non-convex problems (deep networks), neither holds. There can be many local minima, all stationary, with very different objective values. Why deep networks generalise well despite this is the central puzzle that double descent and implicit-bias work address.
Lagrangian duality
For a constrained problem $\min_x f_0(x)$ subject to $f_i(x) \le 0$ and $h_j(x) = 0$, form the Lagrangian

$$L(x, \lambda, \nu) = f_0(x) + \sum_i \lambda_i f_i(x) + \sum_j \nu_j h_j(x), \qquad \lambda \ge 0.$$

The dual function $g(\lambda, \nu) = \inf_x L(x, \lambda, \nu)$ is concave even when the primal is not convex, and it lower-bounds the primal optimum: $g(\lambda, \nu) \le p^\star$ for all $\lambda \ge 0$ (weak duality). For convex problems satisfying a constraint qualification such as Slater's condition, strong duality holds: the best lower bound equals $p^\star$.
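To see the machinery end to end, a worked toy example (chosen here for illustration; it is not from the source): minimise $x^2$ subject to $x \ge 1$.

```latex
% Primal: min_x x^2  s.t.  1 - x <= 0;  optimum p* = 1 at x* = 1.
\begin{align*}
L(x, \lambda) &= x^2 + \lambda(1 - x), \qquad \lambda \ge 0,\\
\partial_x L = 2x - \lambda = 0
  &\;\Rightarrow\; x = \lambda/2
  \;\Rightarrow\; g(\lambda) = \lambda - \lambda^2/4,\\
\max_{\lambda \ge 0} g(\lambda)
  &\;\Rightarrow\; \lambda^\star = 2, \quad d^\star = g(2) = 1 = p^\star.
\end{align*}
% Strong duality holds: Slater's condition is satisfied
% (e.g. x = 2 is strictly feasible).
```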
The most consequential application: SVM (see SVM). Solving its dual reveals the kernel structure — predictions depend on inner products, not raw features — which is what unlocks the kernel trick.
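As a sketch of that structural point (made-up data; `alpha` is a placeholder for the dual coefficients a real SVM solver would produce): the decision function touches training points only through kernel evaluations, never through raw feature coordinates.

```python
import numpy as np

def rbf(u, v, gamma=0.5):
    """RBF kernel: an inner product in an implicit feature space."""
    return np.exp(-gamma * np.sum((u - v) ** 2))

rng = np.random.default_rng(2)
X_train = rng.normal(size=(20, 3))
y_train = rng.choice([-1.0, 1.0], size=20)
alpha = rng.uniform(0, 1, size=20)   # placeholder for learned dual coefficients
b = 0.0                              # placeholder bias

def predict(x_new):
    # Dual-form decision function: sum_i alpha_i y_i K(x_i, x_new) + b.
    score = sum(a * y * rbf(xi, x_new)
                for a, y, xi in zip(alpha, y_train, X_train)) + b
    return np.sign(score)

print(predict(rng.normal(size=3)))
```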
Gradient methods and convergence rates
For smooth convex $f$ (gradient $L$-Lipschitz), gradient descent with step size $1/L$ converges in objective value at rate $O(1/k)$; Nesterov acceleration improves this to $O(1/k^2)$, the optimal rate for first-order methods.

For non-smooth convex objectives (lasso's $\ell_1$ term), subgradient methods slow to $O(1/\sqrt{k})$; proximal gradient methods exploit the smooth-plus-simple structure to recover $O(1/k)$.

Smooth, strongly convex problems are easiest: gradient descent converges linearly, reaching accuracy $\varepsilon$ in $O(\kappa \log(1/\varepsilon))$ iterations, where $\kappa = L/\mu$ is the condition number.
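A minimal sketch of the proximal idea on lasso, assuming made-up data and the standard $1/L$ step size (this is ISTA, the unaccelerated proximal gradient method): a gradient step on the smooth least-squares term, then the closed-form soft-thresholding prox of the $\ell_1$ term.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(100, 20))
x_true = np.zeros(20)
x_true[:3] = [2.0, -1.5, 1.0]
b = A @ x_true + 0.05 * rng.normal(size=100)
lam = 0.5

L = np.linalg.eigvalsh(A.T @ A).max()  # Lipschitz constant of the smooth gradient
t = 1.0 / L                            # standard 1/L step size

def soft_threshold(v, tau):
    """Prox of tau * ||.||_1: shrink each coordinate toward zero."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

x = np.zeros(20)
for _ in range(500):
    grad = A.T @ (A @ x - b)           # gradient of (1/2)||Ax - b||^2
    x = soft_threshold(x - t * grad, t * lam)

print("recovered support:", np.nonzero(np.abs(x) > 1e-3)[0])
```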
The take-home for ML: classical methods (OLS, ridge, logistic regression, SVM) are convex problems, typically smooth and strongly convex once regularised, with closed-form or rapidly converging solvers. Optimisation was never the bottleneck of pre-2012 ML; feature engineering was. Modern deep learning gave up convexity for representation power; understanding what was lost (the local-minimum guarantee) is the point of this page.
What to read next
- OLS — the simplest closed-form convex problem.
- SVM — convex QP whose dual unlocks kernels.
- SGD, Momentum, Nesterov — Nesterov in the stochastic deep-learning regime.