Convex Optimization Basics

A convex optimisation problem has a convex objective on a convex feasible set. Such problems are easy in a precise sense: any local minimum is a global one, the dual lower-bounds the primal cleanly, and efficient algorithms exist. Most pre-deep-learning ML — OLS, ridge, lasso, SVM, logistic regression — fits inside this framework. Even when modern deep learning lives outside it, convex analysis is the right vocabulary for understanding what's not working.

Convex sets and functions

A set C ⊆ ℝⁿ is convex if for any x, y ∈ C and λ ∈ [0, 1], λx + (1 − λ)y ∈ C. The line segment between any two points stays in the set.

A function f : ℝⁿ → ℝ is convex if its epigraph (the set of points on or above its graph) is convex; equivalently, for all x, y and λ ∈ [0, 1],

f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y).

For twice-differentiable f, convexity is equivalent to the Hessian being positive semi-definite everywhere. Strict convexity (strict inequality for x ≠ y and λ ∈ (0, 1)) guarantees that the minimum, if one exists, is unique.

Useful operations preserve convexity: non-negative weighted sums, pointwise max, and composition with affine maps. Log-sum-exp, log Σᵢ e^{xᵢ}, is convex; that's why softmax + cross-entropy is a convex loss in the logits.
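
The convexity of log-sum-exp can be sanity-checked numerically. A minimal sketch with NumPy (the `logsumexp` function here is a local helper, not a library import): sample random point pairs and mixing weights and verify Jensen's inequality holds at every sample.

```python
import numpy as np

rng = np.random.default_rng(0)

def logsumexp(x):
    # Numerically stable log(sum(exp(x))).
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

# Empirically check Jensen's inequality at random points and mixing weights.
violations = 0
for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    lam = rng.uniform()
    lhs = logsumexp(lam * x + (1 - lam) * y)
    rhs = lam * logsumexp(x) + (1 - lam) * logsumexp(y)
    if lhs > rhs + 1e-12:
        violations += 1

print(violations)  # 0: the inequality holds at every sample
```

Such a check can't prove convexity, but it catches sign errors quickly when implementing a loss you believe is convex.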

The convex problem

A standard-form convex problem is

min_x f(x)   s.t.   gᵢ(x) ≤ 0,   Ax = b,

with f and each gi convex. Three classes worth recognising:

  • Linear program (LP) — linear f and linear gᵢ. Solvable to optimality at scale (commercial solvers handle ~10⁷ variables).
  • Quadratic program (QP) — quadratic f, linear gᵢ. The SVM dual is a QP; lasso can be written as one.
  • Semidefinite program (SDP) — matrix variables with positive-semidefinite constraints. Used in some kernel learning, control, and sparse-recovery literature.
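
As a concrete illustration of the LP class, a tiny problem can be handed to an off-the-shelf solver. A sketch assuming SciPy is available (the objective and constraints are made up for illustration; `linprog` minimises, so the objective is negated to maximise):

```python
from scipy.optimize import linprog

# Tiny LP: maximise x1 + 2*x2  s.t.  x1 + x2 <= 4,  x1 <= 2,  x >= 0.
res = linprog(c=[-1.0, -2.0],
              A_ub=[[1.0, 1.0], [1.0, 0.0]],
              b_ub=[4.0, 2.0],
              bounds=[(0, None), (0, None)])

print(res.x, res.fun)  # optimum at (0, 4); res.fun = -8, so the max is 8
```

The same call pattern scales to millions of variables with sparse constraint matrices, which is what "solvable to optimality at scale" means in practice.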

Why convexity matters

Two structural results:

  • Local = global. A local minimum of a convex function is a global minimum. Gradient descent cannot get stuck in suboptimal basins.
  • First-order conditions are sufficient. ∇f(x) = 0 on an unconstrained convex problem certifies optimality. With constraints, the KKT conditions generalise this: stationarity, primal/dual feasibility, and complementary slackness. Solving the KKT system solves the problem.
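
The KKT view is directly computational. For an equality-constrained QP, min ½xᵀQx − cᵀx s.t. Ax = b, the KKT conditions are a linear system, so one linear solve produces the optimum. A minimal sketch with NumPy (the matrices Q, c, A, b are illustrative):

```python
import numpy as np

# Equality-constrained QP: min (1/2) x^T Q x - c^T x  s.t.  A x = b.
# KKT: Q x + A^T nu = c (stationarity), A x = b (primal feasibility).
Q = np.array([[2.0, 0.5], [0.5, 1.0]])   # positive definite
c = np.array([1.0, 1.0])
A = np.array([[1.0, 1.0]])               # constraint: x1 + x2 = 1
b = np.array([1.0])

# Stack the KKT system [[Q, A^T], [A, 0]] [x; nu] = [c; b] and solve.
n, m = Q.shape[0], A.shape[0]
K = np.block([[Q, A.T], [A, np.zeros((m, m))]])
sol = np.linalg.solve(K, np.concatenate([c, b]))
x, nu = sol[:n], sol[n:]

assert np.allclose(Q @ x + A.T @ nu, c)  # stationarity holds
assert np.allclose(A @ x, b)             # feasibility holds
```

Because the problem is convex, satisfying KKT is not just necessary but sufficient: x here is the global minimiser, no search required.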

In non-convex problems (deep networks), neither holds. There can be many local minima, all stationary, with very different objective values. Why deep networks generalise well despite this is the central puzzle that double descent and implicit-bias work address.

Lagrangian duality

For a constrained problem min f(x) s.t. gᵢ(x) ≤ 0, the Lagrangian is

L(x, λ) = f(x) + Σᵢ λᵢ gᵢ(x),   λᵢ ≥ 0.

The dual function g(λ) = min_x L(x, λ) lower-bounds the primal optimum for any λ ≥ 0 (weak duality). The dual problem max_{λ ≥ 0} g(λ) is always concave (regardless of primal convexity), and for convex problems satisfying a mild constraint qualification such as Slater's condition, strong duality holds: primal optimum = dual optimum.
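
A worked one-dimensional example (illustrative numbers): for min x² s.t. 1 − x ≤ 0, the primal optimum is 1 at x = 1. Minimising the Lagrangian x² + λ(1 − x) over x gives x = λ/2, so the dual function is g(λ) = λ − λ²/4. Both weak and strong duality can then be checked directly:

```python
import numpy as np

# Primal: min x^2  s.t.  1 - x <= 0  (optimum: x* = 1, value 1).
# Dual function: g(lam) = min_x [x^2 + lam*(1 - x)] = lam - lam^2 / 4.
def dual(lam):
    return lam - lam**2 / 4

primal_opt = 1.0
lams = np.linspace(0, 10, 1001)

# Weak duality: every dual value lower-bounds the primal optimum.
assert (dual(lams) <= primal_opt + 1e-12).all()
# Strong duality: the best dual value (at lam = 2) attains the primal optimum.
assert np.isclose(dual(lams).max(), primal_opt)
```

The maximising multiplier λ = 2 is exactly the "price" of the constraint: it measures how fast the optimum would improve if the constraint were relaxed.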

The most consequential application: SVM (see SVM). Solving its dual reveals the kernel structure — predictions depend on inner products, not raw features — which is what unlocks the kernel trick.

Gradient methods and convergence rates

For smooth convex f with L-Lipschitz gradient, gradient descent with η = 1/L converges as f(x_t) − f(x*) = O(1/t). For strongly convex f (Hessian ⪰ mI), convergence is linear: O((1 − m/L)^t). Nesterov's accelerated method (1983) achieves O(1/t²) for smooth convex objectives — the optimal first-order rate — and is the theoretical reason momentum-based optimisers work.
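
The strongly convex rate is easy to observe. A minimal sketch (NumPy, illustrative diagonal quadratic with m = 1 and L = 10) running gradient descent with η = 1/L and checking that the objective contracts geometrically:

```python
import numpy as np

# f(x) = 0.5 x^T H x with eigenvalues in [m, L]; eta = 1/L gives linear rate.
H = np.diag([1.0, 10.0])
m_, L_ = 1.0, 10.0
x = np.array([1.0, 1.0])

vals = []
for _ in range(100):
    vals.append(0.5 * x @ H @ x)
    x = x - (1 / L_) * (H @ x)     # gradient step with eta = 1/L

# Each eigen-component of x contracts by at most (1 - m/L) per step, so the
# function value contracts by at most (1 - m/L)^2 = 0.81 per step here.
ratios = np.array(vals[1:]) / np.array(vals[:-1])
assert (ratios <= (1 - m_ / L_) ** 2 + 1e-12).all()
```

With a condition number L/m of 10 the contraction factor is 0.9 per step in x; at L/m = 10⁶ (common in badly scaled problems) plain gradient descent crawls, which is what acceleration and preconditioning address.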

For non-smooth convex objectives (lasso's ℓ₁ penalty, the SVM hinge loss), use proximal methods or ADMM. Interior-point methods solve LP/QP/SDP to high accuracy in far fewer iterations than first-order methods, but each iteration costs roughly O(n³), so they don't scale to deep-learning sizes.
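
Proximal gradient descent (ISTA) for lasso is only a few lines: the smooth least-squares part takes an ordinary gradient step, and the ℓ₁ part is handled exactly by its proximal operator, soft-thresholding. A sketch on a small random problem (the data A, y and the penalty α are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1: shrink each entry toward zero.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# Lasso: min 0.5 * ||A x - y||^2 + alpha * ||x||_1.
A = rng.normal(size=(50, 20))
x_true = np.zeros(20)
x_true[:3] = [1.0, -2.0, 1.5]                      # sparse ground truth
y = A @ x_true + 0.01 * rng.normal(size=50)

alpha = 1.0
eta = 1.0 / np.linalg.eigvalsh(A.T @ A).max()      # step 1/L, L = ||A^T A||
x = np.zeros(20)
for _ in range(500):
    grad = A.T @ (A @ x - y)                       # gradient of smooth part
    x = soft_threshold(x - eta * grad, eta * alpha)  # prox step handles l1

# The l1 penalty yields an exactly sparse solution (most entries are 0.0).
print(np.count_nonzero(np.abs(x) > 1e-8))
```

The exact zeros are the point: gradient descent on a smoothed ℓ₁ would only produce small coefficients, while the proximal step recovers genuine sparsity.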

Smooth, strongly convex problems are easy

The take-home for ML: classical methods (OLS, ridge, logistic regression, SVM) pose convex problems, often smooth and strongly convex once regularised, with closed-form or rapidly converging solvers. Optimisation was never the bottleneck of pre-2012 ML — feature engineering was. Modern deep learning gave up convexity for representation power; understanding what was lost (the local-minimum guarantee) is the point of this page.

  • OLS — the simplest closed-form convex problem.
  • SVM — convex QP whose dual unlocks kernels.
  • SGD, Momentum, Nesterov — Nesterov in the stochastic deep-learning regime.

Released under the MIT License. Content imported and adapted from NoteNextra.