
Regularization (L1, L2, Elastic Net)

Regularisation adds a penalty term to the empirical risk to discourage overly complex hypotheses. It is the continuous-relaxation answer to "how do we control model capacity without committing to a discrete model selection step?". L2 (ridge), L1 (lasso), and the elastic net are the canonical penalties for linear models; the L2 penalty reappears as weight decay in deep networks.

The penalised objective

Replace ERM with regularised empirical risk minimisation:

$$\hat{\theta} = \arg\min_{\theta}\; \hat{R}_S(\theta) + \lambda\,\Omega(\theta),$$

with $\Omega$ a penalty function and $\lambda \ge 0$ a hyperparameter trading data fit for complexity. As $\lambda \to 0$, recover unregularised ERM; as $\lambda \to \infty$, force $\theta$ toward whatever $\Omega$ prefers (zero, low norm, sparsity).
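
As a minimal sketch (the squared loss and the helper names here are illustrative, not from the original text), the penalised objective is just the empirical risk plus $\lambda$ times whichever penalty is plugged in:

```python
import numpy as np

def regularised_risk(theta, X, y, omega, lam):
    """Empirical risk (mean squared error here) plus lambda times the penalty omega."""
    residuals = X @ theta - y
    empirical_risk = 0.5 * np.mean(residuals ** 2)
    return empirical_risk + lam * omega(theta)

# The two canonical penalties discussed below.
l2_penalty = lambda t: np.sum(t ** 2)      # ridge
l1_penalty = lambda t: np.sum(np.abs(t))   # lasso
```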

L2 / ridge / Tikhonov

The L2 penalty is $\Omega(\theta) = \|\theta\|_2^2 = \sum_i \theta_i^2$. For linear regression with squared loss this gives ridge regression:

$$\hat{\theta}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y.$$

The $\lambda I$ term shifts the eigenvalues of $X^\top X$ away from zero, fixing the ill-conditioning that arises when columns of $X$ are correlated. Geometrically, ridge shrinks the coefficients smoothly toward zero: small coefficients stay small, large ones are scaled down, but none are set exactly to zero.
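
A minimal numpy sketch of the closed form above (the data and variable names are illustrative); solving the linear system is preferred over forming the inverse explicitly:

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) theta = X^T y instead of inverting the matrix."""
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

# Nearly collinear columns: lam > 0 keeps the system well-conditioned.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=100)
y = X @ np.array([1.0, 0.0, 2.0, 0.0, -1.0]) + 0.1 * rng.normal(size=100)
theta_hat = ridge_closed_form(X, y, lam=1.0)
```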

Bayesian view: ridge is MAP estimation with a Gaussian prior $\theta \sim \mathcal{N}\!\big(0,\, (\sigma^2/\lambda)\, I\big)$ on the parameters. The prior's tightness corresponds to the regularisation strength.
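
Spelling this out (a standard derivation, assuming a Gaussian likelihood $y \mid \theta \sim \mathcal{N}(X\theta, \sigma^2 I)$), the negative log-posterior is

$$-\log p(\theta \mid y) = \frac{1}{2\sigma^2}\,\|y - X\theta\|_2^2 + \frac{\lambda}{2\sigma^2}\,\|\theta\|_2^2 + \text{const},$$

which is the ridge objective up to the constant factor $1/(2\sigma^2)$.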

In deep networks, L2 regularisation appears as weight decay in the optimiser update: $\theta \leftarrow \theta - \eta\,(\nabla L(\theta) + \lambda\theta)$. As discussed in Adam vs AdamW, the correct implementation in adaptive optimisers is decoupled (AdamW), not via L2 in the gradient.
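
In PyTorch, for instance, the decoupled form is what `torch.optim.AdamW` exposes through its `weight_decay` argument (the model and values below are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)
# weight_decay here is the decoupled lambda: it multiplies the weights directly
# in the update rather than being added to the gradient as an L2 term.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
```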

L1 / lasso

The L1 penalty is $\Omega(\theta) = \|\theta\|_1 = \sum_i |\theta_i|$. For linear regression with squared loss this gives lasso:

$$\hat{\theta}_{\text{lasso}} = \arg\min_{\theta}\; \frac{1}{2N}\,\|y - X\theta\|_2^2 + \lambda\,\|\theta\|_1.$$

L1 is non-differentiable at zero, but the optimisation is still convex. Solvable with coordinate descent, ISTA / FISTA, or LARS (Least Angle Regression).
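
A minimal ISTA sketch (the step-size choice and names are illustrative): the soft-thresholding proximal step is what sets coefficients exactly to zero.

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1: shrink toward zero, clip small values to exactly zero."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def ista(X, y, lam, n_iters=500):
    n, d = X.shape
    theta = np.zeros(d)
    step = n / np.linalg.norm(X, 2) ** 2          # 1 / Lipschitz constant of the smooth part
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / n          # gradient of (1/2N) ||y - X theta||^2
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta
```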

The geometric distinction from L2: the L1 ball has corners along the axes, so the optimum often sits on a corner — i.e., some coefficients are exactly zero. L1 produces sparse solutions automatically, which makes it the right tool when you suspect only a few features are relevant.

Elastic net

When features are correlated, lasso arbitrarily picks one of a correlated group and zeros the rest. Elastic net (Zou, Hastie, JRSS-B 2005) combines both penalties:

$$\Omega(\theta) = \alpha\,\|\theta\|_1 + (1-\alpha)\,\tfrac{1}{2}\|\theta\|_2^2, \qquad \alpha \in [0,1].$$

The L2 component groups correlated features (they get similar coefficients), while L1 drives unimportant ones to zero. Elastic net is the practical default for high-dimensional regression with correlated features (gene expression, text features, fMRI).
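
In scikit-learn, for example, `ElasticNet` exposes the overall strength as `alpha` (the $\lambda$ of the penalised objective) and the mixing weight as `l1_ratio` (the $\alpha$ above); the synthetic data here is purely illustrative:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=200)   # a correlated pair of features
true_theta = np.zeros(50)
true_theta[[0, 1, 5]] = [1.0, 1.0, -2.0]          # sparse ground truth
y = X @ true_theta + 0.1 * rng.normal(size=200)

model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(np.flatnonzero(model.coef_))                # indices of the non-zero coefficients
```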

The bias-variance reading

Regularisation increases bias (the model can't fit the training data as flexibly) and reduces variance (the predictor is more stable across training sets). Cross-validating λ chooses the bias-variance trade-off that minimises validation error — the practical version of the bias-variance balance.
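
A sketch of this selection with scikit-learn's `RidgeCV` (the data and the grid of candidate values are illustrative); `alphas` here plays the role of the $\lambda$ grid:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20))
y = X[:, 0] - 2.0 * X[:, 3] + 0.5 * rng.normal(size=200)

alphas = np.logspace(-4, 2, 25)                   # candidate lambda values
model = RidgeCV(alphas=alphas).fit(X, y)
print(model.alpha_)                               # lambda with the lowest cross-validated error
```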

Implicit regularisation

The penalties above are explicit — written into the objective. Implicit regularisation comes from the optimisation algorithm itself:

  • Early stopping — stop SGD before convergence; equivalent to L2 regularisation in some convex cases (a minimal loop is sketched after this list).
  • SGD on over-parameterised models — finds low-norm interpolating solutions even without an explicit penalty.
  • Dropout, batch norm, augmentation — each injects noise that has a regularising effect without an explicit $\Omega$.
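
To make the first item concrete, a minimal early-stopping loop for plain gradient descent on linear regression (the synthetic data, step size, and patience value are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X_train, X_val = rng.normal(size=(100, 30)), rng.normal(size=(50, 30))
true_theta = rng.normal(size=30)
y_train = X_train @ true_theta + 0.5 * rng.normal(size=100)
y_val = X_val @ true_theta + 0.5 * rng.normal(size=50)

theta = np.zeros(30)
best_theta, best_val, patience, bad_epochs = theta.copy(), np.inf, 20, 0
for _ in range(5000):
    grad = X_train.T @ (X_train @ theta - y_train) / len(y_train)
    theta -= 0.01 * grad                          # plain gradient step, no explicit penalty
    val_loss = np.mean((X_val @ theta - y_val) ** 2)
    if val_loss < best_val:
        best_val, best_theta, bad_epochs = val_loss, theta.copy(), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                # stop before full convergence
            break
theta = best_theta
```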

Modern deep learning relies heavily on implicit regularisation. Explicit weight decay is still used (often with small $\lambda \approx 0.01$–$0.1$ in AdamW recipes), but the bulk of generalisation control comes from the optimiser-architecture-data interaction.

When to use what

  • Linear regression on small/medium data with informative features — ridge is the safe default.
  • High-dimensional regression with sparse truth — lasso or elastic net.
  • Logistic regression / SVM — L2 by default; L1 if you want feature selection.
  • Deep networks — weight decay (L2) at $\lambda \in [10^{-5}, 10^{-2}]$, plus implicit regularisers (dropout, augmentation, early stopping). Lasso-style sparsity is rare except in pruning literature.
