Ridge & Lasso Regression
Ordinary Least Squares breaks down when features are correlated or when there are more features than samples — the design matrix becomes ill-conditioned or rank-deficient. Ridge and lasso regression add penalty terms that fix the conditioning problem and, in lasso's case, select features automatically. Together they are the canonical examples of how regularisation tames classical regression.
Ridge regression
The ridge objective adds an L2 penalty:

$$\hat\beta^{\text{ridge}} = \arg\min_\beta \, \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2, \qquad \lambda > 0.$$

Setting the gradient to zero gives the closed form

$$\hat\beta^{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y.$$

The added $\lambda I$ shifts every eigenvalue of $X^\top X$ up by $\lambda$, so the matrix is invertible even when $X^\top X$ is singular, and the conditioning problem caused by correlated features disappears.
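A minimal numpy sketch of the closed form (the function name and data are illustrative, not from any library):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge estimate (X^T X + lam * I)^{-1} X^T y via a linear solve."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)  # positive definite for lam > 0
    return np.linalg.solve(A, X.T @ y)      # a solve is more stable than inverting A

# Even with a duplicated (perfectly correlated) column, the solve succeeds:
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X = np.hstack([X, X[:, :1]])                # column 3 duplicates column 0
y = X @ np.array([1.0, 2.0, 0.0, 1.0]) + 0.1 * rng.normal(size=50)
print(ridge_closed_form(X, y, lam=1.0))
```

With `lam=0.0` the same solve would fail on this rank-deficient matrix; any positive `lam` repairs it.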
Bayesian interpretation. Ridge is the MAP estimate under a Gaussian prior $\beta \sim \mathcal{N}(0, \tau^2 I)$ on the coefficients with Gaussian noise $y \mid \beta \sim \mathcal{N}(X\beta, \sigma^2 I)$.
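Writing out the negative log-posterior makes this a one-line derivation:

$$-\log p(\beta \mid y) = \frac{1}{2\sigma^2}\,\|y - X\beta\|_2^2 + \frac{1}{2\tau^2}\,\|\beta\|_2^2 + \text{const};$$

multiplying through by $2\sigma^2$ recovers the ridge objective with $\lambda = \sigma^2/\tau^2$.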
Effective degrees of freedom. For the ridge estimator, $\mathrm{df}(\lambda) = \operatorname{tr}\!\big(X(X^\top X + \lambda I)^{-1}X^\top\big) = \sum_{j=1}^{p} \frac{d_j^2}{d_j^2 + \lambda}$, where $d_1, \dots, d_p$ are the singular values of $X$. The count falls smoothly from $p$ at $\lambda = 0$ (OLS) towards $0$ as $\lambda \to \infty$.
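A quick numerical check of that formula via the SVD (a sketch on synthetic data):

```python
import numpy as np

def ridge_df(X, lam):
    """Effective degrees of freedom: sum_j d_j^2 / (d_j^2 + lam)."""
    d = np.linalg.svd(X, compute_uv=False)  # singular values of X
    return float(np.sum(d**2 / (d**2 + lam)))

X = np.random.default_rng(0).normal(size=(50, 4))
for lam in [0.0, 1.0, 100.0, 1e6]:
    print(lam, round(ridge_df(X, lam), 3))  # falls from p = 4 towards 0
```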
Lasso regression
The lasso objective uses an L1 penalty:

$$\hat\beta^{\text{lasso}} = \arg\min_\beta \, \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1.$$
There is no closed form — the L1 penalty is non-differentiable at zero — but the problem is convex. Standard solvers: coordinate descent (Friedman et al., 2007), proximal-gradient ISTA / FISTA, and the LARS algorithm.
The geometric distinction from ridge is critical. The L1 ball has corners on the coordinate axes; the squared-error contours touch the ball preferentially at these corners, producing exactly-zero coefficients. Lasso performs automatic feature selection as part of fitting.
For orthonormal designs ($X^\top X = I$), the lasso solution is soft thresholding applied coordinatewise to the OLS estimate:

$$\hat\beta_j^{\text{lasso}} = \operatorname{sign}\!\big(\hat\beta_j^{\text{OLS}}\big)\,\big(|\hat\beta_j^{\text{OLS}}| - \lambda\big)_+.$$

Coefficients smaller in magnitude than $\lambda$ are set exactly to zero; larger ones are shrunk towards zero by $\lambda$.
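The same soft-thresholding operator is the core of the ISTA solver mentioned above: each iteration takes a gradient step on the smooth squared-error term, then soft-thresholds. A minimal sketch (step size from the Lipschitz constant; data and iteration count are illustrative):

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=1000):
    """ISTA for (1/2)||y - X b||^2 + lam * ||b||_1."""
    L = np.linalg.norm(X, ord=2) ** 2       # Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)         # gradient of the smooth part
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
beta_true = np.zeros(10)
beta_true[:3] = [3.0, -2.0, 1.5]            # sparse ground truth
y = X @ beta_true + 0.1 * rng.normal(size=100)
print(np.round(lasso_ista(X, y, lam=5.0), 2))  # most entries come out exactly zero
```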
Elastic net
Lasso has two known weaknesses: (1) when features are highly correlated, lasso picks one and zeros the rest somewhat arbitrarily; (2) when $p > n$, it can select at most $n$ features before the solution path saturates.
Elastic net (Zou & Hastie, JRSS-B 2005) combines both penalties:

$$\hat\beta^{\text{enet}} = \arg\min_\beta \, \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2.$$
The L2 component groups correlated features (their coefficients move together), while L1 drives unimportant ones to zero. Elastic net is the practical default in high-dimensional regression when features are correlated — gene expression, fMRI, NLP feature sets.
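A sketch of the grouping effect using scikit-learn (note its parametrisation: `alpha` scales the whole penalty and `l1_ratio` sets the L1/L2 mix; the data below are synthetic):

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
z = rng.normal(size=(200, 1))
X = np.hstack([z + 0.01 * rng.normal(size=(200, 3)),  # 3 nearly identical features
               rng.normal(size=(200, 5))])            # 5 irrelevant ones
y = z.ravel() + 0.1 * rng.normal(size=200)

# Lasso tends to concentrate weight on one of the correlated trio;
# elastic net spreads it across the group while still zeroing the rest.
print(np.round(Lasso(alpha=0.1).fit(X, y).coef_, 2))
print(np.round(ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_, 2))
```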
When to use which
A quick guide:
- Many predictors, all believed relevant — ridge.
- Suspected sparse truth, uncorrelated features — lasso.
- Suspected sparse truth, correlated features — elastic net.
- More features than samples ($p > n$) — lasso or elastic net (ridge can't do feature selection but can still regularise).
- Only goal is generalisation, not interpretability — ridge tends to give slightly lower test error in non-sparse regimes; lasso wins when the truth really is sparse.
Cross-validate the penalty strength in every case: $\lambda$ for ridge and lasso, and both $\lambda_1$ and $\lambda_2$ (or an equivalent mixing parameter) for elastic net.
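scikit-learn's built-in CV estimators bundle this search into one call (a sketch; the grid sizes and data are illustrative):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=200)

# 5-fold CV over an automatic grid of penalty strengths and three L1/L2 mixes
model = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.9], n_alphas=50, cv=5).fit(X, y)
print(model.alpha_, model.l1_ratio_)  # the selected strength and mix
```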
What to read next
- Ordinary Least Squares — the unregularised baseline.
- Regularization Theory — the broader framework these are special cases of.
- Logistic Regression — same penalties applied to classification.