
Ridge & Lasso Regression

Ordinary Least Squares breaks down when features are correlated or when there are more features than samples — the design matrix becomes ill-conditioned or rank-deficient. Ridge and lasso regression add penalty terms that fix the conditioning problem and, in lasso's case, select features automatically. Together they are the canonical examples of how regularisation tames classical regression.

Ridge regression

The ridge objective adds an L2 penalty:

$$\hat{\beta}^{\text{ridge}} = \arg\min_{\beta}\; \frac{1}{2N}\,\lVert y - X\beta\rVert_2^2 + \frac{\lambda}{2}\,\lVert \beta\rVert_2^2 .$$

Setting the gradient to zero gives the closed form

$$\hat{\beta}^{\text{ridge}} = (X^\top X + \lambda N I)^{-1} X^\top y .$$

The added λNI term shifts every eigenvalue of XᵀX up by λN, fixing ill-conditioning when columns are nearly collinear. Geometrically, ridge shrinks all coefficients smoothly toward zero: large coefficients shrink proportionally, but none reach exactly zero.
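As a minimal NumPy sketch of the closed form above (the function name and data are illustrative, and the λN scaling assumes the 1/(2N) loss convention used here):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam*N*I) beta = X^T y for the ridge coefficients."""
    N, p = X.shape
    A = X.T @ X + lam * N * np.eye(p)   # every eigenvalue of X^T X is shifted up by lam*N
    return np.linalg.solve(A, X.T @ y)  # solve the linear system rather than inverting A

# Example on nearly collinear columns, where plain OLS would be badly conditioned.
rng = np.random.default_rng(0)
x1 = rng.standard_normal(100)
X = np.column_stack([x1, x1 + 1e-6 * rng.standard_normal(100)])
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.standard_normal(100)
print(ridge_closed_form(X, y, lam=0.1))
```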

Bayesian interpretation. Ridge is the MAP estimate under a Gaussian prior β ∼ N(0, (σ²/(λN)) I). The penalty strength λ encodes the prior's tightness: larger λ means a stronger belief in small coefficients.
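Spelling out the correspondence under the conventions above (Gaussian noise with variance σ², prior variance σ²/(λN)), the negative log-posterior is, up to constants,

$$-\log p(\beta \mid y, X) = \frac{1}{2\sigma^2}\,\lVert y - X\beta\rVert_2^2 + \frac{\lambda N}{2\sigma^2}\,\lVert \beta\rVert_2^2 + \text{const},$$

and multiplying through by σ²/N recovers exactly the ridge objective written earlier.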

Effective degrees of freedom. For the ridge estimator, df(λ) = tr[X(XᵀX + λNI)⁻¹Xᵀ] = Σᵢ σᵢ²/(σᵢ² + λN), where the σᵢ are the singular values of X. Ridge spends "effective parameters" smoothly, in contrast to OLS's hard p degrees of freedom.
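A small NumPy sketch of this quantity via the SVD (names are illustrative; the λN term again follows the 1/(2N) convention):

```python
import numpy as np

def ridge_effective_df(X, lam):
    """Effective degrees of freedom: sum of s_i^2 / (s_i^2 + lam*N)."""
    N = X.shape[0]
    s = np.linalg.svd(X, compute_uv=False)   # singular values of X
    return float(np.sum(s**2 / (s**2 + lam * N)))

X = np.random.default_rng(0).standard_normal((100, 10))
print(ridge_effective_df(X, lam=0.0))   # equals p = 10, the OLS degrees of freedom
print(ridge_effective_df(X, lam=1.0))   # strictly smaller
```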

Lasso regression

The lasso objective uses an L1 penalty:

$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\; \frac{1}{2N}\,\lVert y - X\beta\rVert_2^2 + \lambda\,\lVert \beta\rVert_1 .$$

There is no closed form — the L1 penalty is non-differentiable at zero — but the problem is convex. Standard solvers: coordinate descent (Friedman et al., 2007), proximal-gradient ISTA / FISTA, and the LARS algorithm.
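A minimal NumPy sketch of the coordinate-descent idea under the 1/(2N) objective above (the soft-thresholding update it uses is discussed next); the names, fixed sweep count, and lack of convergence checks are illustrative simplifications:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t*|.|: shrink z toward zero by t, clipping at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for (1/(2N))*||y - X b||^2 + lam*||b||_1."""
    N, p = X.shape
    beta = np.zeros(p)
    col_scale = (X ** 2).sum(axis=0) / N      # (1/N) * ||x_j||^2 for each column
    resid = y - X @ beta                      # running residual
    for _ in range(n_sweeps):
        for j in range(p):
            resid += X[:, j] * beta[j]        # remove feature j from the fit
            rho = X[:, j] @ resid / N         # (1/N) * correlation with the partial residual
            beta[j] = soft_threshold(rho, lam) / col_scale[j]
            resid -= X[:, j] * beta[j]        # add updated feature j back
    return beta
```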

The geometric distinction from ridge is critical. The L1 ball has corners on the coordinate axes; the squared-error contours touch the ball preferentially at these corners, producing exactly-zero coefficients. Lasso performs automatic feature selection as part of fitting.

For orthogonal, standardised columns (XᵀX = NI, which matches the 1/(2N) loss scaling), the lasso solution is soft thresholding:

$$\hat{\beta}_j = \operatorname{sign}\!\big(\hat{\beta}_j^{\text{OLS}}\big)\,\max\!\big(\lvert \hat{\beta}_j^{\text{OLS}}\rvert - \lambda,\; 0\big).$$

Coefficients smaller in magnitude than λ are zeroed; larger ones are shrunk by λ. The general, non-orthogonal case has no such formula but is qualitatively similar.
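A quick numerical check of the formula under that assumption, compared against scikit-learn's Lasso, which minimises the same 1/(2N)-scaled objective; the data and constants here are illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, p = 200, 5
Q, _ = np.linalg.qr(rng.standard_normal((N, p)))
X = Q * np.sqrt(N)                         # orthogonal columns with X^T X = N*I
y = X @ np.array([2.0, -1.5, 0.5, 0.0, 0.0]) + 0.1 * rng.standard_normal(N)

lam = 0.8
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_soft = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)
beta_lasso = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
print(np.allclose(beta_soft, beta_lasso, atol=1e-3))   # expect True
```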

Elastic net

Lasso has two known weaknesses: (1) when features are highly correlated, lasso picks one and zeros the rest somewhat arbitrarily; (2) when p>N, lasso selects at most N features.

Elastic net (Zou & Hastie, JRSS-B 2005) combines both penalties:

$$\hat{\beta}^{\text{enet}} = \arg\min_{\beta}\; \frac{1}{2N}\,\lVert y - X\beta\rVert_2^2 + \lambda\Big(\alpha\,\lVert\beta\rVert_1 + \tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\Big), \qquad \alpha \in [0, 1].$$

The L2 component groups correlated features (their coefficients move together), while L1 drives unimportant ones to zero. Elastic net is the practical default in high-dimensional regression when features are correlated — gene expression, fMRI, NLP feature sets.
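This parameterisation maps onto scikit-learn's ElasticNet, with alpha playing the role of λ and l1_ratio the role of α; a minimal sketch with illustrative data:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
X[:, 1] = X[:, 0] + 0.01 * rng.standard_normal(100)   # two highly correlated features
y = X[:, 0] + X[:, 2] + 0.1 * rng.standard_normal(100)

model = ElasticNet(alpha=0.1, l1_ratio=0.5)            # lambda = 0.1, alpha = 0.5 in the notation above
model.fit(X, y)
print(model.coef_[:4])   # correlated columns 0 and 1 tend to share the weight
```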

When to use which

A quick guide:

  • Many predictors, all believed relevant — ridge.
  • Suspected sparse truth, uncorrelated features — lasso.
  • Suspected sparse truth, correlated features — elastic net.
  • p ≫ N — lasso or elastic net (ridge can't do feature selection but can still regularise).
  • Only goal is generalisation, not interpretability — ridge tends to give slightly lower test error in non-sparse regimes; lasso wins when the truth really is sparse.

Cross-validate λ on a log grid; for elastic net, also cross-validate α; typical values are α ∈ {0.1, 0.5, 0.9}.
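A sketch of that search with scikit-learn's ElasticNetCV (the grid bounds, fold count, and data are illustrative choices, not prescriptions):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y = X[:, :5] @ rng.standard_normal(5) + 0.5 * rng.standard_normal(200)

model = ElasticNetCV(
    alphas=np.logspace(-4, 1, 60),   # log grid for lambda
    l1_ratio=[0.1, 0.5, 0.9],        # candidate values for alpha
    cv=5,
)
model.fit(X, y)
print(model.alpha_, model.l1_ratio_)   # selected lambda and alpha
```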
