
Ridge & Lasso Regression

Ordinary Least Squares breaks down when features are correlated or when there are more features than samples — the design matrix becomes ill-conditioned or rank-deficient. Ridge and lasso regression add penalty terms that fix the conditioning problem and, in lasso's case, select features automatically. Together they are the canonical examples of how regularisation tames classical regression.

Ridge regression

The ridge objective adds an L2 penalty:

$$\hat{\beta}^{\text{ridge}} = \arg\min_{\beta}\; \frac{1}{2N}\,\lVert y - X\beta\rVert_2^2 + \frac{\lambda}{2}\,\lVert \beta\rVert_2^2 .$$

Setting the gradient to zero gives the closed form

$$\hat{\beta}^{\text{ridge}} = (X^\top X + \lambda N I)^{-1} X^\top y .$$

The added λNI term shifts every eigenvalue of XᵀX up by λN, fixing ill-conditioning when columns are nearly collinear. Geometrically, ridge shrinks all coefficients smoothly toward zero: large coefficients shrink proportionally, but none reach exactly zero.
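As a minimal NumPy sketch of the closed form above (the function name and data are illustrative, and the λN scaling assumes the 1/(2N) loss convention used here):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam*N*I) beta = X^T y for the ridge coefficients."""
    N, p = X.shape
    A = X.T @ X + lam * N * np.eye(p)   # every eigenvalue of X^T X is shifted up by lam*N
    return np.linalg.solve(A, X.T @ y)  # solve the linear system rather than inverting A

# Example on nearly collinear columns, where plain OLS would be badly conditioned.
rng = np.random.default_rng(0)
x1 = rng.standard_normal(100)
X = np.column_stack([x1, x1 + 1e-6 * rng.standard_normal(100)])
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.standard_normal(100)
print(ridge_closed_form(X, y, lam=0.1))
```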

Bayesian interpretation. Ridge is the MAP estimate under a Gaussian prior β ∼ N(0, (σ²/(λN)) I). The penalty strength λ encodes the prior's tightness: larger λ means a stronger belief in small coefficients.
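Spelling out the correspondence under the conventions above (Gaussian noise with variance σ², prior variance σ²/(λN)), the negative log-posterior is, up to constants,

$$-\log p(\beta \mid y, X) = \frac{1}{2\sigma^2}\,\lVert y - X\beta\rVert_2^2 + \frac{\lambda N}{2\sigma^2}\,\lVert \beta\rVert_2^2 + \text{const},$$

and multiplying through by σ²/N recovers exactly the ridge objective written earlier.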

Effective degrees of freedom. For the ridge estimator, df(λ) = tr[X(XᵀX + λNI)⁻¹Xᵀ] = Σᵢ σᵢ²/(σᵢ² + λN), where the σᵢ are the singular values of X. Ridge spends "effective parameters" smoothly, in contrast to OLS's hard p degrees of freedom.
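A small NumPy sketch of this quantity via the SVD (names are illustrative; the λN term again follows the 1/(2N) convention):

```python
import numpy as np

def ridge_effective_df(X, lam):
    """Effective degrees of freedom: sum of s_i^2 / (s_i^2 + lam*N)."""
    N = X.shape[0]
    s = np.linalg.svd(X, compute_uv=False)   # singular values of X
    return float(np.sum(s**2 / (s**2 + lam * N)))

X = np.random.default_rng(0).standard_normal((100, 10))
print(ridge_effective_df(X, lam=0.0))   # equals p = 10, the OLS degrees of freedom
print(ridge_effective_df(X, lam=1.0))   # strictly smaller
```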

Lasso regression

The lasso objective uses an L1 penalty:

$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\; \frac{1}{2N}\,\lVert y - X\beta\rVert_2^2 + \lambda\,\lVert \beta\rVert_1 .$$

There is no closed form — the L1 penalty is non-differentiable at zero — but the problem is convex. Standard solvers: coordinate descent (Friedman et al., 2007), proximal-gradient ISTA / FISTA, and the LARS algorithm.
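A minimal NumPy sketch of the coordinate-descent idea under the 1/(2N) objective above (the soft-thresholding update it uses is discussed next); the names, fixed sweep count, and lack of convergence checks are illustrative simplifications:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t*|.|: shrink z toward zero by t, clipping at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for (1/(2N))*||y - X b||^2 + lam*||b||_1."""
    N, p = X.shape
    beta = np.zeros(p)
    col_scale = (X ** 2).sum(axis=0) / N      # (1/N) * ||x_j||^2 for each column
    resid = y - X @ beta                      # running residual
    for _ in range(n_sweeps):
        for j in range(p):
            resid += X[:, j] * beta[j]        # remove feature j from the fit
            rho = X[:, j] @ resid / N         # (1/N) * correlation with the partial residual
            beta[j] = soft_threshold(rho, lam) / col_scale[j]
            resid -= X[:, j] * beta[j]        # add updated feature j back
    return beta
```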

The geometric distinction from ridge is critical. The L1 ball has corners on the coordinate axes; the squared-error contours touch the ball preferentially at these corners, producing exactly-zero coefficients. Lasso performs automatic feature selection as part of fitting.

For orthogonal, standardised columns (XᵀX = NI, which matches the 1/(2N) loss scaling), the lasso solution is soft thresholding:

$$\hat{\beta}_j = \operatorname{sign}\!\big(\hat{\beta}_j^{\text{OLS}}\big)\,\max\!\big(\lvert \hat{\beta}_j^{\text{OLS}}\rvert - \lambda,\; 0\big).$$

Coefficients smaller in magnitude than λ are zeroed; larger ones are shrunk by λ. The general, non-orthogonal case has no such formula but is qualitatively similar.
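A quick numerical check of the formula under that assumption, compared against scikit-learn's Lasso, which minimises the same 1/(2N)-scaled objective; the data and constants here are illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, p = 200, 5
Q, _ = np.linalg.qr(rng.standard_normal((N, p)))
X = Q * np.sqrt(N)                         # orthogonal columns with X^T X = N*I
y = X @ np.array([2.0, -1.5, 0.5, 0.0, 0.0]) + 0.1 * rng.standard_normal(N)

lam = 0.8
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_soft = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)
beta_lasso = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
print(np.allclose(beta_soft, beta_lasso, atol=1e-3))   # expect True
```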

Elastic net

Lasso has two known weaknesses: (1) when features are highly correlated, lasso picks one and zeros the rest somewhat arbitrarily; (2) when p>N, lasso selects at most N features.

Elastic net (Zou & Hastie, JRSS-B 2005) combines both penalties:

$$\hat{\beta}^{\text{enet}} = \arg\min_{\beta}\; \frac{1}{2N}\,\lVert y - X\beta\rVert_2^2 + \lambda\Big(\alpha\,\lVert\beta\rVert_1 + \tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\Big), \qquad \alpha \in [0, 1].$$

The L2 component groups correlated features (their coefficients move together), while L1 drives unimportant ones to zero. Elastic net is the practical default in high-dimensional regression when features are correlated — gene expression, fMRI, NLP feature sets.
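This parameterisation maps onto scikit-learn's ElasticNet, with alpha playing the role of λ and l1_ratio the role of α; a minimal sketch with illustrative data:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
X[:, 1] = X[:, 0] + 0.01 * rng.standard_normal(100)   # two highly correlated features
y = X[:, 0] + X[:, 2] + 0.1 * rng.standard_normal(100)

model = ElasticNet(alpha=0.1, l1_ratio=0.5)            # lambda = 0.1, alpha = 0.5 in the notation above
model.fit(X, y)
print(model.coef_[:4])   # correlated columns 0 and 1 tend to share the weight
```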

When to use which

A quick guide:

  • Many predictors, all believed relevant — ridge.
  • Suspected sparse truth, uncorrelated features — lasso.
  • Suspected sparse truth, correlated features — elastic net.
  • p ≫ N — lasso or elastic net (ridge can't do feature selection but can still regularise).
  • Only goal is generalisation, not interpretability — ridge tends to give slightly lower test error in non-sparse regimes; lasso wins when the truth really is sparse.

Cross-validate λ on a log grid; for elastic net, also cross-validate α; typical values are α ∈ {0.1, 0.5, 0.9}.
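A sketch of that search with scikit-learn's ElasticNetCV (the grid bounds, fold count, and data are illustrative choices, not prescriptions):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y = X[:, :5] @ rng.standard_normal(5) + 0.5 * rng.standard_normal(200)

model = ElasticNetCV(
    alphas=np.logspace(-4, 1, 60),   # log grid for lambda
    l1_ratio=[0.1, 0.5, 0.9],        # candidate values for alpha
    cv=5,
)
model.fit(X, y)
print(model.alpha_, model.l1_ratio_)   # selected lambda and alpha
```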
