
Regularization (L1, L2, Elastic Net)

Regularisation adds a penalty term to the empirical risk to discourage overly complex hypotheses. It is the continuous-relaxation answer to "how do we control model capacity without committing to a discrete model selection step?". L2 (ridge), L1 (lasso), and the elastic net are the canonical penalties for linear models; the L2 penalty reappears as weight decay in deep networks.

The penalised objective

Replace ERM with regularised empirical risk minimisation:

$$\hat{\theta} = \arg\min_{\theta}\; \hat{R}_S(\theta) + \lambda\,\Omega(\theta),$$

with $\Omega$ a penalty function and $\lambda \ge 0$ a hyperparameter trading data fit for complexity. As $\lambda \to 0$, recover unregularised ERM; as $\lambda \to \infty$, force $\theta$ toward whatever $\Omega$ prefers (zero, low norm, sparsity).
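
As a minimal sketch (the squared loss and the helper names here are illustrative, not from the original text), the penalised objective is just the empirical risk plus $\lambda$ times whichever penalty is plugged in:

```python
import numpy as np

def regularised_risk(theta, X, y, omega, lam):
    """Empirical risk (mean squared error here) plus lambda times the penalty omega."""
    residuals = X @ theta - y
    empirical_risk = 0.5 * np.mean(residuals ** 2)
    return empirical_risk + lam * omega(theta)

# The two canonical penalties discussed below.
l2_penalty = lambda t: np.sum(t ** 2)      # ridge
l1_penalty = lambda t: np.sum(np.abs(t))   # lasso
```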

L2 / ridge / Tikhonov

The L2 penalty is $\Omega(\theta) = \|\theta\|_2^2 = \sum_i \theta_i^2$. For linear regression with squared loss this gives ridge regression:

$$\hat{\theta}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y.$$

The $\lambda I$ term shifts the eigenvalues of $X^\top X$ away from zero, fixing the ill-conditioning that arises when columns of $X$ are correlated. Geometrically, ridge shrinks the coefficients smoothly toward zero: small coefficients stay small, large ones are scaled down, but none are set exactly to zero.
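
A minimal numpy sketch of the closed form above (the data and variable names are illustrative); solving the linear system is preferred over forming the inverse explicitly:

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) theta = X^T y instead of inverting the matrix."""
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

# Nearly collinear columns: lam > 0 keeps the system well-conditioned.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=100)
y = X @ np.array([1.0, 0.0, 2.0, 0.0, -1.0]) + 0.1 * rng.normal(size=100)
theta_hat = ridge_closed_form(X, y, lam=1.0)
```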

Bayesian view: ridge is MAP estimation with a Gaussian prior $\theta \sim \mathcal{N}\!\big(0,\, (\sigma^2/\lambda)\, I\big)$ on the parameters. The prior's tightness corresponds to the regularisation strength.
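
Spelling this out (a standard derivation, assuming a Gaussian likelihood $y \mid \theta \sim \mathcal{N}(X\theta, \sigma^2 I)$), the negative log-posterior is

$$-\log p(\theta \mid y) = \frac{1}{2\sigma^2}\,\|y - X\theta\|_2^2 + \frac{\lambda}{2\sigma^2}\,\|\theta\|_2^2 + \text{const},$$

which is the ridge objective up to the constant factor $1/(2\sigma^2)$.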

In deep networks, L2 regularisation appears as weight decay in the optimiser update: $\theta \leftarrow \theta - \eta\,(\nabla L(\theta) + \lambda\theta)$. As discussed in Adam vs AdamW, the correct implementation in adaptive optimisers is decoupled (AdamW), not via L2 in the gradient.
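
In PyTorch, for instance, the decoupled form is what `torch.optim.AdamW` exposes through its `weight_decay` argument (the model and values below are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)
# weight_decay here is the decoupled lambda: it multiplies the weights directly
# in the update rather than being added to the gradient as an L2 term.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
```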

L1 / lasso

The L1 penalty is $\Omega(\theta) = \|\theta\|_1 = \sum_i |\theta_i|$. For linear regression with squared loss this gives lasso:

$$\hat{\theta}_{\text{lasso}} = \arg\min_{\theta}\; \frac{1}{2N}\,\|y - X\theta\|_2^2 + \lambda\,\|\theta\|_1.$$

L1 is non-differentiable at zero, but the optimisation is still convex. Solvable with coordinate descent, ISTA / FISTA, or LARS (Least Angle Regression).
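
A minimal ISTA sketch (the step-size choice and names are illustrative): the soft-thresholding proximal step is what sets coefficients exactly to zero.

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1: shrink toward zero, clip small values to exactly zero."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def ista(X, y, lam, n_iters=500):
    n, d = X.shape
    theta = np.zeros(d)
    step = n / np.linalg.norm(X, 2) ** 2          # 1 / Lipschitz constant of the smooth part
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / n          # gradient of (1/2N) ||y - X theta||^2
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta
```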

The geometric distinction from L2: the L1 ball has corners along the axes, so the optimum often sits on a corner — i.e., some coefficients are exactly zero. L1 produces sparse solutions automatically, which makes it the right tool when you suspect only a few features are relevant.

Elastic net

When features are correlated, lasso arbitrarily picks one of a correlated group and zeros the rest. Elastic net (Zou, Hastie, JRSS-B 2005) combines both penalties:

$$\Omega(\theta) = \alpha\,\|\theta\|_1 + (1-\alpha)\,\tfrac{1}{2}\|\theta\|_2^2, \qquad \alpha \in [0,1].$$

The L2 component groups correlated features (they get similar coefficients), while L1 drives unimportant ones to zero. Elastic net is the practical default for high-dimensional regression with correlated features (gene expression, text features, fMRI).
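
In scikit-learn, for example, `ElasticNet` exposes the overall strength as `alpha` (the $\lambda$ of the penalised objective) and the mixing weight as `l1_ratio` (the $\alpha$ above); the synthetic data here is purely illustrative:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=200)   # a correlated pair of features
true_theta = np.zeros(50)
true_theta[[0, 1, 5]] = [1.0, 1.0, -2.0]          # sparse ground truth
y = X @ true_theta + 0.1 * rng.normal(size=200)

model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(np.flatnonzero(model.coef_))                # indices of the non-zero coefficients
```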

The bias-variance reading

Regularisation increases bias (the model can't fit the training data as flexibly) and reduces variance (the predictor is more stable across training sets). Cross-validating λ chooses the bias-variance trade-off that minimises validation error — the practical version of the bias-variance balance.
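
A sketch of this selection with scikit-learn's `RidgeCV` (the data and the grid of candidate values are illustrative); `alphas` here plays the role of the $\lambda$ grid:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20))
y = X[:, 0] - 2.0 * X[:, 3] + 0.5 * rng.normal(size=200)

alphas = np.logspace(-4, 2, 25)                   # candidate lambda values
model = RidgeCV(alphas=alphas).fit(X, y)
print(model.alpha_)                               # lambda with the lowest cross-validated error
```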

Implicit regularisation

The penalties above are explicit — written into the objective. Implicit regularisation comes from the optimisation algorithm itself:

  • Early stopping — stop SGD before convergence; equivalent to L2 regularisation in some convex cases (a minimal loop is sketched after this list).
  • SGD on over-parameterised models — finds low-norm interpolating solutions even without an explicit penalty.
  • Dropout, batch norm, augmentation — each injects noise that has a regularising effect without an explicit $\Omega$.
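
To make the first item concrete, a minimal early-stopping loop for plain gradient descent on linear regression (the synthetic data, step size, and patience value are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X_train, X_val = rng.normal(size=(100, 30)), rng.normal(size=(50, 30))
true_theta = rng.normal(size=30)
y_train = X_train @ true_theta + 0.5 * rng.normal(size=100)
y_val = X_val @ true_theta + 0.5 * rng.normal(size=50)

theta = np.zeros(30)
best_theta, best_val, patience, bad_epochs = theta.copy(), np.inf, 20, 0
for _ in range(5000):
    grad = X_train.T @ (X_train @ theta - y_train) / len(y_train)
    theta -= 0.01 * grad                          # plain gradient step, no explicit penalty
    val_loss = np.mean((X_val @ theta - y_val) ** 2)
    if val_loss < best_val:
        best_val, best_theta, bad_epochs = val_loss, theta.copy(), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                # stop before full convergence
            break
theta = best_theta
```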

Modern deep learning relies heavily on implicit regularisation. Explicit weight decay is still used (often with small $\lambda \approx 0.01$–$0.1$ in AdamW recipes), but the bulk of generalisation control comes from the optimiser-architecture-data interaction.

When to use what

  • Linear regression on small/medium data with informative features — ridge is the safe default.
  • High-dimensional regression with sparse truth — lasso or elastic net.
  • Logistic regression / SVM — L2 by default; L1 if you want feature selection.
  • Deep networks — weight decay (L2) at $\lambda \in [10^{-5}, 10^{-2}]$, plus implicit regularisers (dropout, augmentation, early stopping). Lasso-style sparsity is rare except in pruning literature.
