Regularization (L1, L2, Elastic Net)
Regularisation adds a penalty term to the empirical risk to discourage overly complex hypotheses. It is the continuous-relaxation answer to "how do we control model capacity without committing to a discrete model selection step?". L2 (ridge), L1 (lasso), and elastic net are the canonical penalties for linear models and generalise to weight decay in deep networks.
The penalised objective
Replace ERM with regularised empirical risk minimisation:

$$\hat{w} = \arg\min_{w}\; \frac{1}{n}\sum_{i=1}^{n} \ell(f_w(x_i), y_i) \;+\; \lambda\,\Omega(w),$$

with $\lambda \ge 0$ controlling how strongly the complexity penalty $\Omega$ is weighted against the data fit.
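As a concrete sketch for a linear model with squared-error loss (the function and its name are illustrative, not from a library):

```python
import numpy as np

def penalised_risk(w, X, y, lam, penalty="l2"):
    """Empirical risk (MSE) plus lam * Omega(w) for a linear model X @ w."""
    data_fit = np.mean((X @ w - y) ** 2)   # (1/n) * sum of squared losses
    if penalty == "l2":
        omega = np.sum(w ** 2)             # ridge penalty ||w||_2^2
    elif penalty == "l1":
        omega = np.sum(np.abs(w))          # lasso penalty ||w||_1
    else:
        raise ValueError(f"unknown penalty: {penalty}")
    return data_fit + lam * omega
```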
L2 / ridge / Tikhonov
The L2 penalty is

$$\Omega(w) = \|w\|_2^2 = \sum_j w_j^2.$$

Applied to least squares it gives ridge regression, which has the closed-form solution

$$\hat{w}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y.$$

The $\lambda I$ term makes the matrix invertible even under collinearity, and it shrinks every coefficient smoothly towards zero without setting any of them exactly to zero.
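A minimal numpy sketch of the closed form on synthetic data (all names and values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

lam = 1.0
# Closed-form ridge solution: (X^T X + lam * I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Compared with OLS, ridge strictly shrinks the coefficient norm.
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
assert np.linalg.norm(w_ridge) < np.linalg.norm(w_ols)
```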
Bayesian view: ridge is MAP estimation with a Gaussian prior $w \sim \mathcal{N}(0, \tau^2 I)$ on the weights; under a Gaussian noise model with variance $\sigma^2$, the penalty strength works out to $\lambda = \sigma^2 / \tau^2$.
In deep networks, L2 regularisation appears as weight decay in the optimiser update (with the factor of 2 from differentiating $\|w\|_2^2$ absorbed into $\lambda$):

$$w \leftarrow w - \eta\,(\nabla_w \ell + \lambda w) = (1 - \eta\lambda)\,w - \eta\,\nabla_w \ell,$$

i.e. each step first shrinks the weights by a factor $(1 - \eta\lambda)$ and then applies the usual gradient step.
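A quick numpy check that the penalty-gradient and decay forms of the step coincide for plain SGD (they diverge for adaptive optimisers such as Adam; `grad` here stands in for a real loss gradient):

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=5)
grad = rng.normal(size=5)   # stand-in for the loss gradient at w
lr, lam = 0.1, 0.01

# Route 1: add the penalty gradient lam * w to the loss gradient.
w_penalty = w - lr * (grad + lam * w)

# Route 2: decay the weights, then take the plain gradient step.
w_decay = (1 - lr * lam) * w - lr * grad

assert np.allclose(w_penalty, w_decay)  # identical for plain SGD
```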
L1 / lasso
The L1 penalty is

$$\Omega(w) = \|w\|_1 = \sum_j |w_j|.$$

Applied to least squares it gives the lasso.
L1 is non-differentiable at zero, but the optimisation problem is still convex. It is solvable with coordinate descent, ISTA / FISTA (proximal gradient methods built around soft-thresholding), or LARS (Least Angle Regression).
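A minimal ISTA sketch for the objective $\frac{1}{2n}\|y - Xw\|_2^2 + \lambda\|w\|_1$, with step size $1/L$ taken from the Lipschitz constant of the smooth part (the function names are illustrative):

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1: shrink towards zero, clip at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(X, y, lam, n_iters=500):
    """ISTA for (1/2n) * ||y - X w||^2 + lam * ||w||_1."""
    n, d = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n   # Lipschitz constant of the gradient
    w = np.zeros(d)
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y) / n    # gradient of the smooth part
        w = soft_threshold(w - grad / L, lam / L)
    return w
```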
The geometric distinction from L2: the L1 ball has corners along the axes, so the optimum often sits on a corner — i.e., some coefficients are exactly zero. L1 produces sparse solutions automatically, which makes it the right tool when you suspect only a few features are relevant.
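The exact zeros are easy to see empirically. Reusing the `ista` sketch above on synthetic data where only 3 of 50 features are relevant:

```python
rng = np.random.default_rng(2)
n, d = 100, 50
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.5, 1.0]      # only the first 3 features matter
y = X @ w_true + 0.1 * rng.normal(size=n)

w_hat = ista(X, y, lam=0.1)
print(np.sum(w_hat == 0.0))        # most coefficients come out exactly zero
```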
Elastic net
When features are correlated, lasso arbitrarily picks one of a correlated group and zeros the rest. Elastic net (Zou & Hastie, JRSS-B 2005) combines both penalties:

$$\Omega(w) = \alpha\,\|w\|_1 + (1 - \alpha)\,\|w\|_2^2, \qquad \alpha \in [0, 1].$$
The L2 component groups correlated features (they get similar coefficients), while L1 drives unimportant ones to zero. Elastic net is the practical default for high-dimensional regression with correlated features (gene expression, text features, fMRI).
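A small scikit-learn sketch of the grouping effect on two near-duplicate features (`alpha` is scikit-learn's name for the overall strength $\lambda$ and `l1_ratio` for the mix $\alpha$; the data are illustrative):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)
X = np.column_stack([x, x + 1e-6 * rng.normal(size=n)])  # near-identical pair
y = 3.0 * x + 0.1 * rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(lasso.coef_)  # tends to load one copy and zero the other
print(enet.coef_)   # spreads similar weight across the correlated pair
```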
The bias-variance reading
Regularisation increases bias (the model can't fit the training data as flexibly) and reduces variance (the predictor is more stable across training sets). Cross-validating $\lambda$ traces out this tradeoff and picks the value with the lowest estimated test error.
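A sketch of that recipe with scikit-learn's `RidgeCV`, which sweeps a grid of penalty strengths (called `alphas` there) and scores each by cross-validation:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 20))
y = X @ rng.normal(size=20) + 0.5 * rng.normal(size=100)

model = RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5).fit(X, y)
print(model.alpha_)  # the penalty strength with the best CV error
```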
Implicit regularisation
The penalties above are explicit — written into the objective. Implicit regularisation comes from the optimisation algorithm itself:
- Early stopping — stop SGD before convergence; equivalent to L2 regularisation in some convex cases (a minimal sketch follows this list).
- SGD on over-parameterised models — finds low-norm interpolating solutions even without an explicit penalty.
- Dropout, batch norm, augmentation — each injects noise that has regularising effects without an explicit penalty term.
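A minimal patience-based early-stopping sketch (`train_step` and `val_loss` are hypothetical stand-ins for a real training loop):

```python
def fit_with_early_stopping(train_step, val_loss, max_epochs=100, patience=5):
    """Stop when validation loss hasn't improved for `patience` epochs."""
    best, since_best = float("inf"), 0
    for epoch in range(max_epochs):
        train_step()                 # one epoch of SGD (stand-in)
        loss = val_loss()            # current validation loss (stand-in)
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                break                # stopping early limits effective capacity
    return best
```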
Modern deep learning relies heavily on implicit regularisation. Explicit weight decay is still used (often with a small $\lambda$), but it is one ingredient among these implicit effects rather than the whole story.
When to use what
- Linear regression on small/medium data with informative features — ridge is the safe default.
- High-dimensional regression with sparse truth — lasso or elastic net.
- Logistic regression / SVM — L2 by default; L1 if you want feature selection.
- Deep networks — weight decay (L2) at a small $\lambda$, plus implicit regularisers (dropout, augmentation, early stopping). Lasso-style sparsity is rare except in the pruning literature.
What to read next
- Ridge & Lasso Regression — these penalties applied to OLS in detail.
- Bias-Variance Tradeoff — what regularisation trades.
- Dropout — the deep-learning regulariser most analogous to noise injection.