
Ordinary Least Squares (OLS)

OLS is the first model worth deeply understanding. It is the simplest predictor that makes a non-trivial probabilistic assumption, and almost every modern technique — ridge, logistic regression, neural network linear layers, even the final projection of an LLM — reduces to OLS in some limit.

Setup

Given a design matrix $X \in \mathbb{R}^{n \times d}$ and a response vector $y \in \mathbb{R}^n$, OLS chooses the coefficient vector $\beta \in \mathbb{R}^d$ that minimizes the residual sum of squares

$$\hat\beta = \arg\min_{\beta}\ \lVert y - X\beta \rVert_2^2 .$$
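
As a concrete anchor for the symbols, here is a minimal numerical sketch of the objective. The shapes, true coefficients, and noise level below are illustrative assumptions, not anything specified above.

```python
# Toy data for the OLS objective ||y - X beta||_2^2.
# Shapes, true coefficients, and noise level are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))                    # design matrix: n samples by d features
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)   # noisy linear response

def rss(beta):
    """Residual sum of squares ||y - X beta||_2^2."""
    r = y - X @ beta
    return float(r @ r)

print(rss(beta_true))       # small: beta_true generated the data
print(rss(np.zeros(d)))     # much larger
```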

Closed form

Setting the gradient to zero gives the normal equations $X^\top X \beta = X^\top y$, so when $X^\top X$ is invertible

$$\hat\beta = (X^\top X)^{-1} X^\top y .$$
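
Continuing the toy example above, a sketch of the closed form via the normal equations (fine at this scale; the note below gives the numerically preferred route):

```python
# Solve the normal equations X^T X beta = X^T y directly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)                           # close to beta_true, up to the noise
print(rss(beta_hat) <= rss(beta_true))    # True: the minimizer beats beta_true in-sample
```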

In practice we never form the inverse — we solve the linear system via QR or SVD for numerical stability.
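
A sketch of both routes on the same toy data; neither forms $(X^\top X)^{-1}$ explicitly:

```python
# QR route: X = QR with Q having orthonormal columns, R upper triangular.
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)       # solve R beta = Q^T y

# SVD route: NumPy's least-squares driver is SVD-based.
beta_svd, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_qr, beta_svd))       # True
```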

Geometric view

$X\hat\beta$ is the orthogonal projection of $y$ onto the column space of $X$. The residual $y - X\hat\beta$ is orthogonal to every column of $X$; that orthogonality, written out as $X^\top (y - X\hat\beta) = 0$, is exactly the normal equations.
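
A quick numerical check of the orthogonality claim, continuing the toy example:

```python
# The residual should be orthogonal to every column of X (up to round-off),
# which is the same statement as the normal equations X^T (y - X beta_hat) = 0.
residual = y - X @ beta_hat
print(X.T @ residual)                       # entries near zero
print(np.allclose(X.T @ residual, 0.0))     # True
```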

Probabilistic view

If $y = X\beta + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$, then OLS is the MLE of $\beta$. This is the bridge to ridge (Gaussian prior), Bayesian linear regression, and ultimately to GLMs.
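
One line of algebra makes the claim concrete: under this model the log-likelihood is, up to a term that does not involve $\beta$, a negative multiple of the residual sum of squares,

$$\log p(y \mid X, \beta, \sigma^2) = -\frac{n}{2}\log\!\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\,\lVert y - X\beta \rVert_2^2 ,$$

so for any fixed $\sigma^2$, maximizing over $\beta$ is exactly minimizing $\lVert y - X\beta \rVert_2^2$.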

Stub status

This page has a seed introduction. Expand sections on Gauss–Markov, leverage, influence, and the bias–variance decomposition.
