Convex Optimization Basics

A convex optimisation problem has a convex objective on a convex feasible set. Such problems are easy in a precise sense: any local minimum is a global one, the dual lower-bounds the primal cleanly, and efficient algorithms exist. Most pre-deep-learning ML — OLS, ridge, lasso, SVM, logistic regression — fits inside this framework. Even when modern deep learning lives outside it, convex analysis is the right vocabulary for understanding what's not working.

Convex sets and functions

A set C ⊆ ℝⁿ is convex if for any x, y ∈ C and λ ∈ [0, 1], λx + (1 − λ)y ∈ C. The line segment between any two points stays in the set.

A function f : ℝⁿ → ℝ is convex if its epigraph (the set of points on or above its graph) is convex; equivalently, for all x, y and λ ∈ [0, 1],

f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y).

For twice-differentiable f, convexity is equivalent to the Hessian being positive semi-definite everywhere. Strict convexity (strict inequality for x ≠ y and λ ∈ (0, 1)) guarantees that the minimum, if one exists, is unique.

Useful operations preserve convexity: non-negative weighted sums, pointwise max, and composition with affine maps. Log-sum-exp, log Σᵢ e^{xᵢ}, is convex; that's why softmax + cross-entropy is a convex loss in the logits.
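
The convexity of log-sum-exp can be sanity-checked numerically. A minimal sketch with NumPy (the `logsumexp` function here is a local helper, not a library import): sample random point pairs and mixing weights and verify Jensen's inequality holds at every sample.

```python
import numpy as np

rng = np.random.default_rng(0)

def logsumexp(x):
    # Numerically stable log(sum(exp(x))).
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

# Empirically check Jensen's inequality at random points and mixing weights.
violations = 0
for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    lam = rng.uniform()
    lhs = logsumexp(lam * x + (1 - lam) * y)
    rhs = lam * logsumexp(x) + (1 - lam) * logsumexp(y)
    if lhs > rhs + 1e-12:
        violations += 1

print(violations)  # 0: the inequality holds at every sample
```

Such a check can't prove convexity, but it catches sign errors quickly when implementing a loss you believe is convex.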

The convex problem

A standard-form convex problem is

min_x f(x)   s.t.   gᵢ(x) ≤ 0,   Ax = b,

with f and each gi convex. Three classes worth recognising:

  • Linear program (LP) — linear f and linear gᵢ. Solvable to optimality at scale (commercial solvers handle ~10⁷ variables).
  • Quadratic program (QP) — quadratic f, linear gᵢ. The SVM dual is a QP; lasso can be written as one.
  • Semidefinite program (SDP) — matrix variables with positive-semidefinite constraints. Used in some kernel learning, control, and sparse-recovery literature.
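
As a concrete illustration of the LP class, a tiny problem can be handed to an off-the-shelf solver. A sketch assuming SciPy is available (the objective and constraints are made up for illustration; `linprog` minimises, so the objective is negated to maximise):

```python
from scipy.optimize import linprog

# Tiny LP: maximise x1 + 2*x2  s.t.  x1 + x2 <= 4,  x1 <= 2,  x >= 0.
res = linprog(c=[-1.0, -2.0],
              A_ub=[[1.0, 1.0], [1.0, 0.0]],
              b_ub=[4.0, 2.0],
              bounds=[(0, None), (0, None)])

print(res.x, res.fun)  # optimum at (0, 4); res.fun = -8, so the max is 8
```

The same call pattern scales to millions of variables with sparse constraint matrices, which is what "solvable to optimality at scale" means in practice.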

Why convexity matters

Two structural results:

  • Local = global. A local minimum of a convex function is a global minimum. Gradient descent cannot get stuck in suboptimal basins.
  • First-order conditions are sufficient. ∇f(x) = 0 on an unconstrained convex problem certifies optimality. With constraints, the KKT conditions generalise this: stationarity, primal/dual feasibility, and complementary slackness. Solving the KKT system solves the problem.
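
The KKT view is directly computational. For an equality-constrained QP, min ½xᵀQx − cᵀx s.t. Ax = b, the KKT conditions are a linear system, so one linear solve produces the optimum. A minimal sketch with NumPy (the matrices Q, c, A, b are illustrative):

```python
import numpy as np

# Equality-constrained QP: min (1/2) x^T Q x - c^T x  s.t.  A x = b.
# KKT: Q x + A^T nu = c (stationarity), A x = b (primal feasibility).
Q = np.array([[2.0, 0.5], [0.5, 1.0]])   # positive definite
c = np.array([1.0, 1.0])
A = np.array([[1.0, 1.0]])               # constraint: x1 + x2 = 1
b = np.array([1.0])

# Stack the KKT system [[Q, A^T], [A, 0]] [x; nu] = [c; b] and solve.
n, m = Q.shape[0], A.shape[0]
K = np.block([[Q, A.T], [A, np.zeros((m, m))]])
sol = np.linalg.solve(K, np.concatenate([c, b]))
x, nu = sol[:n], sol[n:]

assert np.allclose(Q @ x + A.T @ nu, c)  # stationarity holds
assert np.allclose(A @ x, b)             # feasibility holds
```

Because the problem is convex, satisfying KKT is not just necessary but sufficient: x here is the global minimiser, no search required.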

In non-convex problems (deep networks), neither holds. There can be many local minima, all stationary, with very different objective values. Why deep networks generalise well despite this is the central puzzle that double descent and implicit-bias work address.

Lagrangian duality

For a constrained problem min f(x) s.t. gᵢ(x) ≤ 0, the Lagrangian is

L(x, λ) = f(x) + Σᵢ λᵢ gᵢ(x),   λᵢ ≥ 0.

The dual function g(λ) = min_x L(x, λ) lower-bounds the primal optimum for any λ ≥ 0 (weak duality). The dual problem max_{λ ≥ 0} g(λ) is always concave (regardless of primal convexity), and for convex problems satisfying a mild constraint qualification such as Slater's condition, strong duality holds: primal optimum = dual optimum.
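
A worked one-dimensional example (illustrative numbers): for min x² s.t. 1 − x ≤ 0, the primal optimum is 1 at x = 1. Minimising the Lagrangian x² + λ(1 − x) over x gives x = λ/2, so the dual function is g(λ) = λ − λ²/4. Both weak and strong duality can then be checked directly:

```python
import numpy as np

# Primal: min x^2  s.t.  1 - x <= 0  (optimum: x* = 1, value 1).
# Dual function: g(lam) = min_x [x^2 + lam*(1 - x)] = lam - lam^2 / 4.
def dual(lam):
    return lam - lam**2 / 4

primal_opt = 1.0
lams = np.linspace(0, 10, 1001)

# Weak duality: every dual value lower-bounds the primal optimum.
assert (dual(lams) <= primal_opt + 1e-12).all()
# Strong duality: the best dual value (at lam = 2) attains the primal optimum.
assert np.isclose(dual(lams).max(), primal_opt)
```

The maximising multiplier λ = 2 is exactly the "price" of the constraint: it measures how fast the optimum would improve if the constraint were relaxed.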

The most consequential application: SVM (see SVM). Solving its dual reveals the kernel structure — predictions depend on inner products, not raw features — which is what unlocks the kernel trick.

Gradient methods and convergence rates

For smooth convex f with L-Lipschitz gradient, gradient descent with η = 1/L converges as f(x_t) − f(x*) = O(1/t). For strongly convex f (Hessian ⪰ mI), convergence is linear: O((1 − m/L)^t). Nesterov's accelerated method (1983) achieves O(1/t²) for smooth convex objectives — the optimal first-order rate — and is the theoretical reason momentum-based optimisers work.
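
The strongly convex rate is easy to observe. A minimal sketch (NumPy, illustrative diagonal quadratic with m = 1 and L = 10) running gradient descent with η = 1/L and checking that the objective contracts geometrically:

```python
import numpy as np

# f(x) = 0.5 x^T H x with eigenvalues in [m, L]; eta = 1/L gives linear rate.
H = np.diag([1.0, 10.0])
m_, L_ = 1.0, 10.0
x = np.array([1.0, 1.0])

vals = []
for _ in range(100):
    vals.append(0.5 * x @ H @ x)
    x = x - (1 / L_) * (H @ x)     # gradient step with eta = 1/L

# Each eigen-component of x contracts by at most (1 - m/L) per step, so the
# function value contracts by at most (1 - m/L)^2 = 0.81 per step here.
ratios = np.array(vals[1:]) / np.array(vals[:-1])
assert (ratios <= (1 - m_ / L_) ** 2 + 1e-12).all()
```

With a condition number L/m of 10 the contraction factor is 0.9 per step in x; at L/m = 10⁶ (common in badly scaled problems) plain gradient descent crawls, which is what acceleration and preconditioning address.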

For non-smooth convex objectives (lasso's ℓ₁ penalty, the SVM hinge loss), use proximal methods or ADMM. Interior-point methods solve LP/QP/SDP to high accuracy in far fewer iterations than first-order methods, but each iteration costs roughly O(n³), so they don't scale to deep-learning sizes.
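
Proximal gradient descent (ISTA) for lasso is only a few lines: the smooth least-squares part takes an ordinary gradient step, and the ℓ₁ part is handled exactly by its proximal operator, soft-thresholding. A sketch on a small random problem (the data A, y and the penalty α are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1: shrink each entry toward zero.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# Lasso: min 0.5 * ||A x - y||^2 + alpha * ||x||_1.
A = rng.normal(size=(50, 20))
x_true = np.zeros(20)
x_true[:3] = [1.0, -2.0, 1.5]                      # sparse ground truth
y = A @ x_true + 0.01 * rng.normal(size=50)

alpha = 1.0
eta = 1.0 / np.linalg.eigvalsh(A.T @ A).max()      # step 1/L, L = ||A^T A||
x = np.zeros(20)
for _ in range(500):
    grad = A.T @ (A @ x - y)                       # gradient of smooth part
    x = soft_threshold(x - eta * grad, eta * alpha)  # prox step handles l1

# The l1 penalty yields an exactly sparse solution (most entries are 0.0).
print(np.count_nonzero(np.abs(x) > 1e-8))
```

The exact zeros are the point: gradient descent on a smoothed ℓ₁ would only produce small coefficients, while the proximal step recovers genuine sparsity.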

Smooth, strongly convex problems are easy

The take-home for ML: classical methods (OLS, ridge, logistic regression, SVM) pose convex problems, often smooth and strongly convex once regularised, with closed-form or rapidly converging solvers. Optimisation was never the bottleneck of pre-2012 ML — feature engineering was. Modern deep learning gave up convexity for representation power; understanding what was lost (the local-minimum guarantee) is the point of this page.

  • OLS — the simplest closed-form convex problem.
  • SVM — convex QP whose dual unlocks kernels.
  • SGD, Momentum, Nesterov — Nesterov in the stochastic deep-learning regime.

Released under the MIT License. Content imported and adapted from NoteNextra.