
Logistic Regression

Logistic regression is the canonical model for binary classification — a linear function of features passed through a sigmoid to produce a probability. Despite the name, it performs classification, not regression. It is the discrete-output cousin of OLS, the simplest member of the generalised-linear-model family, and the building block from which softmax regression, the perceptron, and the final layer of every classification network all descend.

The model

For binary labels $y \in \{0, 1\}$, logistic regression posits

$$P(y = 1 \mid x) = \sigma(\beta^\top x) = \frac{1}{1 + e^{-\beta^\top x}}.$$

The sigmoid maps the unbounded linear predictor $\eta = \beta^\top x$ (the log-odds) into a probability in $(0, 1)$. Two readings:

  • Probabilistic — model the conditional class probability directly.
  • Linear log-odds — $\log\bigl(P(y=1 \mid x) / P(y=0 \mid x)\bigr) = \beta^\top x$. Each unit increase in $x_j$ multiplies the odds of $y=1$ by $e^{\beta_j}$, as the sketch below illustrates.
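
To make the two readings concrete, here is a minimal NumPy sketch with a made-up coefficient vector: it evaluates $\sigma(\beta^\top x)$ and checks that a one-unit increase in a feature multiplies the odds by $e^{\beta_j}$.

```python
import numpy as np

def sigmoid(eta):
    """Map the linear predictor (log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-eta))

# Hypothetical coefficients: intercept -1.0, single feature with beta_1 = 0.7.
beta = np.array([-1.0, 0.7])
x = np.array([1.0, 2.0])            # leading 1.0 is the intercept term

p = sigmoid(beta @ x)               # P(y = 1 | x) ~ 0.599

# Odds-ratio reading: raising x_1 by one unit multiplies the odds by e^{beta_1}.
p_shifted = sigmoid(beta @ np.array([1.0, 3.0]))
odds_ratio = (p_shifted / (1 - p_shifted)) / (p / (1 - p))
print(odds_ratio, np.exp(0.7))      # both ~2.014
```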

Maximum likelihood

The log-likelihood of N i.i.d. samples is

$$\log L(\beta) = \sum_{i=1}^{N} \Bigl[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \Bigr], \qquad p_i = \sigma(\beta^\top x_i).$$

Equivalently, the binary cross-entropy loss with a flipped sign. The gradient has a remarkably clean form:

$$\nabla_\beta \log L = \sum_i (y_i - p_i)\, x_i = X^\top (y - p).$$

Just as in OLS, the gradient is the design-matrix transpose times the residual — but the residual is now in probability space. There is no closed form for $\hat\beta$; instead, optimise iteratively.
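
As a sketch, the log-likelihood and its gradient in NumPy on simulated data (the generating coefficients below are arbitrary); note the gradient is literally `X.T @ (y - p)`.

```python
import numpy as np

def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def log_likelihood(beta, X, y):
    """Bernoulli log-likelihood: sum_i y_i log p_i + (1 - y_i) log(1 - p_i)."""
    p = sigmoid(X @ beta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(beta, X, y):
    """Gradient of the log-likelihood: X^T (y - p)."""
    p = sigmoid(X @ beta)
    return X.T @ (y - p)

# Simulated data: N x d design matrix with an intercept column, labels in {0, 1}.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = (rng.random(200) < sigmoid(X @ np.array([0.5, 1.0, -2.0]))).astype(float)

beta0 = np.zeros(3)
print(log_likelihood(beta0, X, y))   # log-likelihood at the origin
print(gradient(beta0, X, y))         # direction of steepest ascent
```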

Optimisation: Newton-Raphson / IRLS

The Hessian is

$$H = -\,X^\top W X, \qquad W = \operatorname{diag}\bigl(p_i(1 - p_i)\bigr).$$

H is negative semi-definite, so the log-likelihood is concave — any stationary point is a global maximum. Newton's method updates

$$\beta^{(t+1)} = \beta^{(t)} + \bigl(X^\top W^{(t)} X\bigr)^{-1} X^\top \bigl(y - p^{(t)}\bigr).$$

This is Iteratively Reweighted Least Squares (IRLS) — at each step you solve a weighted-least-squares problem. It converges in 5–10 iterations on well-behaved data and is a robust default for small-to-medium logistic regression.
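
A minimal IRLS sketch, assuming a dense design matrix that fits in memory; each iteration forms $X^\top W X$ and solves the resulting weighted least-squares system directly. The simulated data mirror the toy example above.

```python
import numpy as np

def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def irls(X, y, n_iter=25, tol=1e-8):
    """Newton-Raphson / IRLS: beta <- beta + (X^T W X)^{-1} X^T (y - p)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        W = p * (1 - p)                                  # diagonal of the weight matrix
        step = np.linalg.solve(X.T @ (X * W[:, None]),   # X^T W X
                               X.T @ (y - p))            # gradient
        beta = beta + step
        if np.max(np.abs(step)) < tol:                   # Newton steps shrink fast near the optimum
            break
    return beta

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = (rng.random(200) < sigmoid(X @ np.array([0.5, 1.0, -2.0]))).astype(float)
print(irls(X, y))   # roughly recovers [0.5, 1.0, -2.0]
```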

For large data, gradient descent / SGD is the practical solver — same approach as deep networks. Convergence is slower but per-step cost is lower.
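
For comparison, a minimal minibatch-SGD sketch of the same gradient; the learning rate, batch size, and epoch count are arbitrary placeholders rather than tuned values.

```python
import numpy as np

def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def sgd_logistic(X, y, lr=0.1, epochs=100, batch_size=32, seed=0):
    """Minibatch gradient ascent on the log-likelihood
    (equivalently, gradient descent on binary cross-entropy)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(epochs):
        # Shuffle each epoch, then take the batch gradient X_b^T (y_b - p_b).
        for idx in np.array_split(rng.permutation(n), max(1, n // batch_size)):
            p = sigmoid(X[idx] @ beta)
            beta += lr * X[idx].T @ (y[idx] - p) / len(idx)   # ascent step on the batch
    return beta
```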

Regularised logistic regression

L1 / L2 penalties extend directly:

$$\hat\beta = \arg\min_\beta \; \Bigl[ -\log L(\beta) + \lambda\, \Omega(\beta) \Bigr].$$

The penalised problem is still strictly convex (assuming non-zero λ for L2), with a unique optimum. Glmnet (Friedman et al., 2010) is the canonical R/Python package for fitting elastic-net-penalised logistic regression at scale via coordinate descent.
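
A sketch using scikit-learn rather than glmnet (the dataset and hyperparameters below are illustrative; scikit-learn parameterises the penalty strength as C = 1/λ):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem; only a handful of the 20 features are informative.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# C is the inverse regularisation strength (C = 1 / lambda).
l2_fit = LogisticRegression(penalty="l2", C=1.0, solver="lbfgs", max_iter=5000).fit(X, y)
l1_fit = LogisticRegression(penalty="l1", C=1.0, solver="saga", max_iter=5000).fit(X, y)

print((l2_fit.coef_ != 0).sum())   # 20: L2 shrinks but keeps every coefficient
print((l1_fit.coef_ != 0).sum())   # typically fewer: L1 drives some coefficients exactly to zero
```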

Under complete linear separability, unregularised logistic regression has $\|\hat\beta\| \to \infty$ — the likelihood increases unboundedly and the MLE does not exist. Regularisation (or stopping early) is the standard fix.
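
The separation pathology is easy to reproduce: on perfectly separable data, the fitted coefficient keeps growing as the penalty weakens (larger C below).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Perfectly separable 1-D data: negative x is class 0, positive x is class 1.
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])

for C in [1.0, 100.0, 10000.0]:
    coef = LogisticRegression(C=C, max_iter=100000).fit(X, y).coef_[0, 0]
    print(C, coef)   # the coefficient diverges as regularisation vanishes
```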

Multi-class extension: softmax regression

For K-class classification, replace the sigmoid with softmax:

$$P(y = k \mid x) = \frac{\exp(\beta_k^\top x)}{\sum_{j=1}^{K} \exp(\beta_j^\top x)}.$$

Trained by maximum likelihood with cross-entropy loss. Same convex objective, same gradient form. Softmax regression is what the final layer of every classification deep network computes.
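
A small NumPy sketch of the softmax forward pass with made-up dimensions: the K class coefficient vectors sit in the columns of B, and each row of the output is a probability vector over classes.

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax; subtracting the row max keeps exp() from overflowing."""
    Z = Z - Z.max(axis=1, keepdims=True)
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

# Hypothetical sizes: N samples, d features, K classes.
rng = np.random.default_rng(0)
N, d, K = 6, 3, 4
X = rng.normal(size=(N, d))
B = rng.normal(size=(d, K))       # column k holds beta_k

P = softmax(X @ B)                # P[i, k] = P(y = k | x_i)
print(P.sum(axis=1))              # every row sums to 1
```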

Connection to deep learning

A neural-network classifier is just a stack of logistic regressions whose features are themselves learned. The final layer $\hat p = \sigma(W_L h_{L-1} + b_L)$ is binary logistic regression on learned features $h_{L-1}$. Cross-entropy + softmax + linear layer is the universal classification head.

The connection runs deeper: the gradient of cross-entropy + softmax simplifies to (predicted - target) — the same gradient as OLS, applied in probability space. This is one of the reasons cross-entropy + softmax pairs so naturally with backpropagation.
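
A quick numerical check of that simplification, using made-up logits and a one-hot target: the analytic gradient $p - t$ matches central finite differences of the softmax cross-entropy.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ce_loss(z, t):
    """Cross-entropy of softmax(z) against a one-hot target t."""
    return -np.sum(t * np.log(softmax(z)))

z = np.array([0.2, -1.3, 0.8])     # logits
t = np.array([0.0, 1.0, 0.0])      # one-hot target

analytic = softmax(z) - t          # claimed gradient: predicted - target

# Central finite differences on each logit.
eps = 1e-6
numeric = np.array([
    (ce_loss(z + eps * e_j, t) - ce_loss(z - eps * e_j, t)) / (2 * eps)
    for e_j in np.eye(len(z))
])
print(np.allclose(analytic, numeric, atol=1e-6))   # True
```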
