
Ordinary Least Squares (OLS)

OLS is the first model worth deeply understanding. It is the simplest predictor that makes a non-trivial probabilistic assumption, and almost every modern technique — ridge, logistic regression, neural network linear layers, even the final projection of an LLM — reduces to OLS in some limit.

Setup

Given a design matrix $X \in \mathbb{R}^{n \times d}$ and a response vector $y \in \mathbb{R}^n$, OLS chooses the coefficient vector $\beta \in \mathbb{R}^d$ that minimizes the residual sum of squares

$$\hat\beta = \arg\min_{\beta}\ \lVert y - X\beta \rVert_2^2 .$$
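
As a concrete anchor for the symbols, here is a minimal numerical sketch of the objective. The shapes, true coefficients, and noise level below are illustrative assumptions, not anything specified above.

```python
# Toy data for the OLS objective ||y - X beta||_2^2.
# Shapes, true coefficients, and noise level are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))                    # design matrix: n samples by d features
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)   # noisy linear response

def rss(beta):
    """Residual sum of squares ||y - X beta||_2^2."""
    r = y - X @ beta
    return float(r @ r)

print(rss(beta_true))       # small: beta_true generated the data
print(rss(np.zeros(d)))     # much larger
```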

Closed form

Setting the gradient to zero gives the normal equations $X^\top X \beta = X^\top y$, so when $X^\top X$ is invertible

$$\hat\beta = (X^\top X)^{-1} X^\top y .$$
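
Continuing the toy example above, a sketch of the closed form via the normal equations (fine at this scale; the note below gives the numerically preferred route):

```python
# Solve the normal equations X^T X beta = X^T y directly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)                           # close to beta_true, up to the noise
print(rss(beta_hat) <= rss(beta_true))    # True: the minimizer beats beta_true in-sample
```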

In practice we never form the inverse — we solve the linear system via QR or SVD for numerical stability.
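
A sketch of both routes on the same toy data; neither forms $(X^\top X)^{-1}$ explicitly:

```python
# QR route: X = QR with Q having orthonormal columns, R upper triangular.
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)       # solve R beta = Q^T y

# SVD route: NumPy's least-squares driver is SVD-based.
beta_svd, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_qr, beta_svd))       # True
```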

Geometric view

$X\hat\beta$ is the orthogonal projection of $y$ onto the column space of $X$. The residual $y - X\hat\beta$ is orthogonal to every column of $X$; that orthogonality, written out as $X^\top (y - X\hat\beta) = 0$, is exactly the normal equations.
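
A quick numerical check of the orthogonality claim, continuing the toy example:

```python
# The residual should be orthogonal to every column of X (up to round-off),
# which is the same statement as the normal equations X^T (y - X beta_hat) = 0.
residual = y - X @ beta_hat
print(X.T @ residual)                       # entries near zero
print(np.allclose(X.T @ residual, 0.0))     # True
```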

Probabilistic view

If $y = X\beta + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$, then OLS is the MLE of $\beta$. This is the bridge to ridge (Gaussian prior), Bayesian linear regression, and ultimately to GLMs.
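
One line of algebra makes the claim concrete: under this model the log-likelihood is, up to a term that does not involve $\beta$, a negative multiple of the residual sum of squares,

$$\log p(y \mid X, \beta, \sigma^2) = -\frac{n}{2}\log\!\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\,\lVert y - X\beta \rVert_2^2 ,$$

so for any fixed $\sigma^2$, maximizing over $\beta$ is exactly minimizing $\lVert y - X\beta \rVert_2^2$.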

Stub status

This page has a seed introduction. Expand sections on Gauss–Markov, leverage, influence, and the bias–variance decomposition.
