Logistic Regression
Logistic regression is the canonical model for binary classification — a linear function of features passed through a sigmoid to produce a probability. Despite the name, it is used for classification, not regression. It is the discrete-output cousin of OLS, the simplest member of the generalised-linear-model family, and the building block from which softmax regression, the perceptron, and the final layer of every classification network all descend.
The model
For binary labels $y \in \{0, 1\}$ and features $x \in \mathbb{R}^d$, the model is

$$P(y = 1 \mid x) = \sigma(w^\top x + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}.$$

The sigmoid maps the unbounded linear predictor $z = w^\top x + b$ to a probability in $(0, 1)$. Two equivalent readings:
- Probabilistic — model the conditional class probability directly.
- Linear log-odds — $\log \dfrac{P(y = 1 \mid x)}{P(y = 0 \mid x)} = w^\top x + b$. Each unit increase in $x_j$ multiplies the odds of $y = 1$ by $e^{w_j}$.
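A minimal NumPy sketch of the two readings; the weights, bias, and feature vector below are made-up numbers for illustration only:

```python
import numpy as np

def sigmoid(z):
    # Logistic function sigma(z) = 1 / (1 + exp(-z)).
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative weights and a single feature vector (not fitted values).
w = np.array([0.8, -1.2])
b = 0.5
x = np.array([1.0, 2.0])

z = w @ x + b          # linear predictor = log-odds
p = sigmoid(z)         # P(y = 1 | x)
odds = p / (1.0 - p)   # equals exp(z)

# Raising x[0] by one unit multiplies the odds by exp(w[0]).
x_plus = x + np.array([1.0, 0.0])
new_odds = np.exp(w @ x_plus + b)
print(np.isclose(new_odds / odds, np.exp(w[0])))   # True
```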
Maximum likelihood
The log-likelihood of $n$ independent observations $(x_i, y_i)$ is

$$\ell(w) = \sum_{i=1}^{n} \bigl[\, y_i \log p_i + (1 - y_i) \log(1 - p_i) \,\bigr], \qquad p_i = \sigma(w^\top x_i).$$

Equivalently, the binary cross-entropy loss with a flipped sign. The gradient has a remarkably clean form:

$$\nabla_w \ell = X^\top (y - p).$$

Just like OLS, the gradient is the design matrix transpose times the residual — but the residual is now in probability space. There is no closed form for $\hat{w}$: setting the gradient to zero gives equations that are nonlinear in $w$, so the model is fit iteratively.
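A short sketch of the loss and gradient in NumPy (the negative log-likelihood, so its gradient is $X^\top(p - y)$, the sign-flipped version of the formula above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll_and_grad(w, X, y):
    """Negative log-likelihood (binary cross-entropy) and its gradient.

    X : (n, d) design matrix, y : (n,) labels in {0, 1}, w : (d,) weights.
    """
    p = sigmoid(X @ w)
    eps = 1e-12                  # guard against log(0)
    nll = -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    grad = X.T @ (p - y)         # design matrix transpose times the residual
    return nll, grad
```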
Optimisation: Newton-Raphson / IRLS
The Hessian of the negative log-likelihood is $X^\top S X$ with $S = \mathrm{diag}\bigl(p_i(1 - p_i)\bigr)$, so the Newton update is

$$w \leftarrow w + (X^\top S X)^{-1} X^\top (y - p).$$
This is Iteratively Reweighted Least Squares (IRLS) — at each step you solve a weighted-least-squares problem. Converges in 5–10 iterations on well-behaved data; a robust default for small-to-medium logistic regression.
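A minimal IRLS sketch, assuming the design matrix already contains a column of ones for the intercept; the small ridge term is just numerical jitter, not part of the algorithm:

```python
import numpy as np

def irls_logistic(X, y, n_iter=10, ridge=1e-8):
    """Fit logistic regression by Newton-Raphson / IRLS (a minimal sketch).

    X : (n, d) design matrix (include a column of ones for the intercept),
    y : (n,) labels in {0, 1}.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        s = p * (1 - p)                                    # diagonal of S
        grad = X.T @ (y - p)                               # gradient of the log-likelihood
        H = X.T @ (X * s[:, None]) + ridge * np.eye(d)     # X^T S X, plus jitter
        w = w + np.linalg.solve(H, grad)                   # weighted-least-squares step
    return w
```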
For large data, gradient descent / SGD is the practical solver — same approach as deep networks. Convergence is slower but per-step cost is lower.
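A mini-batch gradient-descent sketch under the same setup; the learning rate, batch size, and epoch count are illustrative defaults, not recommendations:

```python
import numpy as np

def sgd_logistic(X, y, lr=0.1, n_epochs=100, batch_size=32, seed=0):
    """Mini-batch gradient descent on the binary cross-entropy loss (a sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            p = 1.0 / (1.0 + np.exp(-X[batch] @ w))
            grad = X[batch].T @ (p - y[batch]) / len(batch)
            w -= lr * grad                   # step against the gradient of the loss
    return w
```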
Regularised logistic regression
L1 / L2 penalties extend directly:

$$\min_w \; -\ell(w) + \lambda \lVert w \rVert_1 \quad \text{(lasso-style)}, \qquad \min_w \; -\ell(w) + \tfrac{\lambda}{2} \lVert w \rVert_2^2 \quad \text{(ridge-style)}.$$
The penalised problem remains convex — strictly so with a non-zero L2 term — so a unique minimiser exists.

On completely linearly separable data, unregularised logistic regression has no finite maximum-likelihood solution: the likelihood keeps improving as $\lVert w \rVert \to \infty$ and the fitted probabilities saturate at 0 and 1. Any non-zero penalty restores a finite, unique optimum.
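In practice the penalties are usually a solver flag rather than hand-written code; a sketch using scikit-learn's standard interface (where `C` is the inverse of the regularisation strength $\lambda$):

```python
from sklearn.linear_model import LogisticRegression

# L2 (ridge-style) penalty: larger C means weaker regularisation.
clf_l2 = LogisticRegression(penalty="l2", C=1.0)

# L1 (lasso-style) penalty needs a solver that supports it, e.g. liblinear or saga.
clf_l1 = LogisticRegression(penalty="l1", C=1.0, solver="liblinear")

# clf_l2.fit(X, y); clf_l1.fit(X, y)   # X: (n, d) features, y: {0, 1} labels
```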
Multi-class extension: softmax regression
For $K$ classes, replace the sigmoid with the softmax:

$$P(y = k \mid x) = \frac{\exp(w_k^\top x)}{\sum_{j=1}^{K} \exp(w_j^\top x)}, \qquad k = 1, \dots, K.$$
Trained by maximum likelihood with cross-entropy loss. Same convex objective, same gradient form. Softmax regression is what the final layer of every classification deep network computes.
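A NumPy sketch of the softmax-regression loss and gradient, assuming one-hot labels:

```python
import numpy as np

def softmax(Z):
    # Row-wise softmax with the max-subtraction trick for numerical stability.
    Z = Z - Z.max(axis=1, keepdims=True)
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

def softmax_regression_loss_grad(W, X, Y):
    """Cross-entropy loss and gradient for softmax regression (a sketch).

    X : (n, d) features, Y : (n, K) one-hot labels, W : (d, K) weights.
    """
    P = softmax(X @ W)                        # (n, K) predicted class probabilities
    loss = -np.sum(Y * np.log(P + 1e-12)) / len(X)
    grad = X.T @ (P - Y) / len(X)             # same (predicted - target) form as the binary case
    return loss, grad
```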
Connection to deep learning
A neural-network classifier is just a stack of logistic regressions whose features are themselves learned. The final layer computes $\mathrm{softmax}(W h + b)$ on the last hidden representation $h$, which is exactly softmax regression applied to learned features instead of raw inputs.
The connection runs deeper: the gradient of cross-entropy + softmax simplifies to (predicted - target) — the same gradient as OLS, applied in probability space. This is one of the reasons cross-entropy + softmax pairs so naturally with backpropagation.
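A quick numerical check of that claim — a finite-difference comparison showing that the gradient of cross-entropy with respect to the logits equals softmax(logits) minus the one-hot target (the class count and random logits are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=5)              # logits for 5 hypothetical classes
y = np.zeros(5)
y[2] = 1.0                          # one-hot target

def cross_entropy(logits):
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return -np.sum(y * np.log(p))

p = np.exp(z - z.max())
p /= p.sum()
analytic = p - y                    # the (predicted - target) gradient

eps = 1e-6
numeric = np.array([
    (cross_entropy(z + eps * e) - cross_entropy(z - eps * e)) / (2 * eps)
    for e in np.eye(5)
])
print(np.allclose(analytic, numeric, atol=1e-6))   # True
```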
What to read next
- Generalized Linear Models — the unified framework for regression-with-a-link-function.
- Ridge & Lasso Regression — the same penalties applied to OLS.
- Activation Functions — softmax in the deep-learning context.