
Logistic Regression

Logistic regression is the canonical model for binary classification — a linear function of features passed through a sigmoid to produce a probability. Despite the name, it performs classification, not regression. It is the discrete-output cousin of OLS, the simplest member of the generalised-linear-model family, and the building block from which softmax regression, the perceptron, and the final layer of every classification network all descend.

The model

For binary labels $y \in \{0, 1\}$, logistic regression posits

$$P(y = 1 \mid x) = \sigma(\beta^\top x) = \frac{1}{1 + e^{-\beta^\top x}}.$$

The sigmoid maps the unbounded linear predictor $\eta = \beta^\top x$ (the log-odds) into a probability in $(0, 1)$. Two readings:

  • Probabilistic — model the conditional class probability directly.
  • Linear log-odds — $\log\bigl(P(y=1 \mid x) / P(y=0 \mid x)\bigr) = \beta^\top x$. Each unit increase in $x_j$ multiplies the odds of $y=1$ by $e^{\beta_j}$, as the sketch below illustrates.
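
To make the two readings concrete, here is a minimal NumPy sketch with a made-up coefficient vector: it evaluates $\sigma(\beta^\top x)$ and checks that a one-unit increase in a feature multiplies the odds by $e^{\beta_j}$.

```python
import numpy as np

def sigmoid(eta):
    """Map the linear predictor (log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-eta))

# Hypothetical coefficients: intercept -1.0, single feature with beta_1 = 0.7.
beta = np.array([-1.0, 0.7])
x = np.array([1.0, 2.0])            # leading 1.0 is the intercept term

p = sigmoid(beta @ x)               # P(y = 1 | x) ~ 0.599

# Odds-ratio reading: raising x_1 by one unit multiplies the odds by e^{beta_1}.
p_shifted = sigmoid(beta @ np.array([1.0, 3.0]))
odds_ratio = (p_shifted / (1 - p_shifted)) / (p / (1 - p))
print(odds_ratio, np.exp(0.7))      # both ~2.014
```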

Maximum likelihood

The log-likelihood of N i.i.d. samples is

$$\log L(\beta) = \sum_{i=1}^{N} \Bigl[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \Bigr], \qquad p_i = \sigma(\beta^\top x_i).$$

Equivalently, the binary cross-entropy loss with a flipped sign. The gradient has a remarkably clean form:

$$\nabla_\beta \log L = \sum_i (y_i - p_i)\, x_i = X^\top (y - p).$$

Just as in OLS, the gradient is the design-matrix transpose times the residual — but the residual is now in probability space. There is no closed form for $\hat\beta$; instead, optimise iteratively.
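
As a sketch, the log-likelihood and its gradient in NumPy on simulated data (the generating coefficients below are arbitrary); note the gradient is literally `X.T @ (y - p)`.

```python
import numpy as np

def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def log_likelihood(beta, X, y):
    """Bernoulli log-likelihood: sum_i y_i log p_i + (1 - y_i) log(1 - p_i)."""
    p = sigmoid(X @ beta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(beta, X, y):
    """Gradient of the log-likelihood: X^T (y - p)."""
    p = sigmoid(X @ beta)
    return X.T @ (y - p)

# Simulated data: N x d design matrix with an intercept column, labels in {0, 1}.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = (rng.random(200) < sigmoid(X @ np.array([0.5, 1.0, -2.0]))).astype(float)

beta0 = np.zeros(3)
print(log_likelihood(beta0, X, y))   # log-likelihood at the origin
print(gradient(beta0, X, y))         # direction of steepest ascent
```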

Optimisation: Newton-Raphson / IRLS

The Hessian is

$$H = -\,X^\top W X, \qquad W = \operatorname{diag}\bigl(p_i(1 - p_i)\bigr).$$

H is negative semi-definite, so the log-likelihood is concave — any stationary point is a global maximum. Newton's method updates

$$\beta^{(t+1)} = \beta^{(t)} + \bigl(X^\top W^{(t)} X\bigr)^{-1} X^\top \bigl(y - p^{(t)}\bigr).$$

This is Iteratively Reweighted Least Squares (IRLS) — at each step you solve a weighted-least-squares problem. It converges in 5–10 iterations on well-behaved data and is a robust default for small-to-medium logistic regression.
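
A minimal IRLS sketch, assuming a dense design matrix that fits in memory; each iteration forms $X^\top W X$ and solves the resulting weighted least-squares system directly. The simulated data mirror the toy example above.

```python
import numpy as np

def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def irls(X, y, n_iter=25, tol=1e-8):
    """Newton-Raphson / IRLS: beta <- beta + (X^T W X)^{-1} X^T (y - p)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        W = p * (1 - p)                                  # diagonal of the weight matrix
        step = np.linalg.solve(X.T @ (X * W[:, None]),   # X^T W X
                               X.T @ (y - p))            # gradient
        beta = beta + step
        if np.max(np.abs(step)) < tol:                   # Newton steps shrink fast near the optimum
            break
    return beta

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = (rng.random(200) < sigmoid(X @ np.array([0.5, 1.0, -2.0]))).astype(float)
print(irls(X, y))   # roughly recovers [0.5, 1.0, -2.0]
```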

For large data, gradient descent / SGD is the practical solver — same approach as deep networks. Convergence is slower but per-step cost is lower.
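
For comparison, a minimal minibatch-SGD sketch of the same gradient; the learning rate, batch size, and epoch count are arbitrary placeholders rather than tuned values.

```python
import numpy as np

def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def sgd_logistic(X, y, lr=0.1, epochs=100, batch_size=32, seed=0):
    """Minibatch gradient ascent on the log-likelihood
    (equivalently, gradient descent on binary cross-entropy)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(epochs):
        # Shuffle each epoch, then take the batch gradient X_b^T (y_b - p_b).
        for idx in np.array_split(rng.permutation(n), max(1, n // batch_size)):
            p = sigmoid(X[idx] @ beta)
            beta += lr * X[idx].T @ (y[idx] - p) / len(idx)   # ascent step on the batch
    return beta
```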

Regularised logistic regression

L1 / L2 penalties extend directly:

$$\hat\beta = \arg\min_\beta \; \Bigl[ -\log L(\beta) + \lambda\, \Omega(\beta) \Bigr].$$

The penalised problem is still strictly convex (assuming non-zero λ for L2), with a unique optimum. Glmnet (Friedman et al., 2010) is the canonical R/Python package for fitting elastic-net-penalised logistic regression at scale via coordinate descent.
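
A sketch using scikit-learn rather than glmnet (the dataset and hyperparameters below are illustrative; scikit-learn parameterises the penalty strength as C = 1/λ):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem; only a handful of the 20 features are informative.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# C is the inverse regularisation strength (C = 1 / lambda).
l2_fit = LogisticRegression(penalty="l2", C=1.0, solver="lbfgs", max_iter=5000).fit(X, y)
l1_fit = LogisticRegression(penalty="l1", C=1.0, solver="saga", max_iter=5000).fit(X, y)

print((l2_fit.coef_ != 0).sum())   # 20: L2 shrinks but keeps every coefficient
print((l1_fit.coef_ != 0).sum())   # typically fewer: L1 drives some coefficients exactly to zero
```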

Under complete linear separability, unregularised logistic regression has $\|\hat\beta\| \to \infty$ — the likelihood increases unboundedly and the MLE does not exist. Regularisation (or stopping early) is the standard fix.
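
The separation pathology is easy to reproduce: on perfectly separable data, the fitted coefficient keeps growing as the penalty weakens (larger C below).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Perfectly separable 1-D data: negative x is class 0, positive x is class 1.
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])

for C in [1.0, 100.0, 10000.0]:
    coef = LogisticRegression(C=C, max_iter=100000).fit(X, y).coef_[0, 0]
    print(C, coef)   # the coefficient diverges as regularisation vanishes
```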

Multi-class extension: softmax regression

For K-class classification, replace the sigmoid with softmax:

$$P(y = k \mid x) = \frac{\exp(\beta_k^\top x)}{\sum_{j=1}^{K} \exp(\beta_j^\top x)}.$$

Trained by maximum likelihood with cross-entropy loss. Same convex objective, same gradient form. Softmax regression is what the final layer of every classification deep network computes.
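
A small NumPy sketch of the softmax forward pass with made-up dimensions: the K class coefficient vectors sit in the columns of B, and each row of the output is a probability vector over classes.

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax; subtracting the row max keeps exp() from overflowing."""
    Z = Z - Z.max(axis=1, keepdims=True)
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

# Hypothetical sizes: N samples, d features, K classes.
rng = np.random.default_rng(0)
N, d, K = 6, 3, 4
X = rng.normal(size=(N, d))
B = rng.normal(size=(d, K))       # column k holds beta_k

P = softmax(X @ B)                # P[i, k] = P(y = k | x_i)
print(P.sum(axis=1))              # every row sums to 1
```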

Connection to deep learning

A neural-network classifier is just a stack of logistic regressions whose features are themselves learned. The final layer $\hat p = \sigma(W_L h_{L-1} + b_L)$ is binary logistic regression on learned features $h_{L-1}$. Cross-entropy + softmax + linear layer is the universal classification head.

The connection runs deeper: the gradient of cross-entropy + softmax simplifies to (predicted - target) — the same gradient as OLS, applied in probability space. This is one of the reasons cross-entropy + softmax pairs so naturally with backpropagation.
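
A quick numerical check of that simplification, using made-up logits and a one-hot target: the analytic gradient $p - t$ matches central finite differences of the softmax cross-entropy.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ce_loss(z, t):
    """Cross-entropy of softmax(z) against a one-hot target t."""
    return -np.sum(t * np.log(softmax(z)))

z = np.array([0.2, -1.3, 0.8])     # logits
t = np.array([0.0, 1.0, 0.0])      # one-hot target

analytic = softmax(z) - t          # claimed gradient: predicted - target

# Central finite differences on each logit.
eps = 1e-6
numeric = np.array([
    (ce_loss(z + eps * e_j, t) - ce_loss(z - eps * e_j, t)) / (2 * eps)
    for e_j in np.eye(len(z))
])
print(np.allclose(analytic, numeric, atol=1e-6))   # True
```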
