
Generalized Linear Models

The Generalized Linear Model (GLM) is the unified framework that contains OLS, logistic regression, Poisson regression, and several others as special cases. Three ingredients (a distribution for the response, a linear predictor, and a link function) fit a wide variety of regression problems with one set of estimation tools (IRLS) and one set of theoretical guarantees.

The three components

Random component. The response Y has a distribution from the exponential family:

$$p(y;\theta,\phi) = \exp\!\left(\frac{y\theta - b(\theta)}{a(\phi)} + c(y,\phi)\right).$$

This includes Gaussian, Bernoulli, binomial, Poisson, gamma, and inverse Gaussian distributions. The function $b(\theta)$ determines the mean, $\mathbb{E}[Y] = b'(\theta) = \mu$, and the variance, $\operatorname{Var}(Y) = a(\phi)\,b''(\theta)$; $b''(\theta)$ is the variance function that reappears in IRLS below.
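As a sanity check of these definitions, here is the Bernoulli distribution rewritten in exponential-family form (a standard derivation, restated in this section's notation):

```latex
p(y;p) = p^{y}(1-p)^{1-y}
       = \exp\!\left( y\,\theta - \log\!\left(1+e^{\theta}\right) \right),
\qquad \theta = \log\frac{p}{1-p},
```

so $b(\theta) = \log(1+e^{\theta})$, $a(\phi) = 1$, $c(y,\phi) = 0$, and $b'(\theta) = e^{\theta}/(1+e^{\theta}) = p = \mathbb{E}[Y]$, confirming the mean identity. Note that $\theta$ is exactly the logit of $p$, which previews the canonical link discussed below.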

Systematic component. A linear predictor

$$\eta = \beta^\top x.$$

Link function. A monotone, differentiable function g connects mean and linear predictor:

$$g(\mu) = \eta.$$

The canonical link is the one that makes $\theta = \eta$, so the natural parameter equals the linear predictor. For Bernoulli, the canonical link is the logit, recovering logistic regression. For Gaussian it is the identity, recovering OLS. For Poisson it is the log, giving Poisson regression for count data.

Common GLMs in one table

| Response | Distribution | Canonical link | Use case |
| --- | --- | --- | --- |
| Continuous, unbounded | Gaussian | identity | linear regression |
| Binary | Bernoulli | logit | logistic regression |
| Counts | Poisson | log | event-rate modelling |
| Positive continuous | Gamma | inverse | duration, claim sizes |
| Proportions | Binomial | logit | bounded counts |

The strength of the GLM framework is that all of these fit with the same algorithm and share the same theoretical machinery.
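To make "same algorithm" concrete, here is a minimal sketch using statsmodels: three synthetic responses generated from one linear predictor, each fit by the same `GLM` estimator with only the family swapped (the data and coefficients are made up for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
X = sm.add_constant(rng.normal(size=(n, 2)))  # intercept + 2 features

# One linear predictor, three response types.
eta = X @ np.array([0.5, 1.0, -0.5])
y_gauss = eta + rng.normal(size=n)                  # continuous
y_binary = rng.binomial(1, 1 / (1 + np.exp(-eta)))  # binary
y_count = rng.poisson(np.exp(eta))                  # counts

# The same estimator handles all three; only the family changes.
for y, family in [(y_gauss, sm.families.Gaussian()),
                  (y_binary, sm.families.Binomial()),
                  (y_count, sm.families.Poisson())]:
    res = sm.GLM(y, X, family=family).fit()  # fits by IRLS
    print(type(family).__name__, np.round(res.params, 2))
```

Each fit recovers coefficients close to (0.5, 1.0, -0.5), on the scale of the respective link.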

Maximum likelihood and IRLS

For the canonical link, the log-likelihood is concave and the gradient has the OLS-like form

$$\nabla_\beta \log L = X^\top (y - \mu).$$

The negative Hessian is $X^\top W X$ for a diagonal weight matrix $W$ that depends on the variance function. Iteratively Reweighted Least Squares (IRLS) updates

$$\beta^{(t+1)} = \left(X^\top W^{(t)} X\right)^{-1} X^\top W^{(t)} z^{(t)},$$

where $z^{(t)} = \eta^{(t)} + (y - \mu^{(t)})\,g'(\mu^{(t)})$ is the working response, a linearisation of the link around the current fit. Each step is a weighted OLS solve, as the sketch below makes explicit. Convergence is fast (5–10 iterations) on well-conditioned problems.
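Here is a minimal NumPy sketch of IRLS for Poisson regression with the canonical log link, where the pieces simplify to $\mu = e^{\eta}$, $W = \operatorname{diag}(\mu)$, and $z = \eta + (y - \mu)/\mu$ (the helper name `irls_poisson` and the toy data are made up for illustration):

```python
import numpy as np

def irls_poisson(X, y, n_iter=25, tol=1e-8):
    """IRLS for Poisson regression with the canonical log link."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        W = mu                    # diagonal of the weight matrix
        z = eta + (y - mu) / mu   # working response
        # Weighted least squares step: beta = (X' W X)^{-1} X' W z
        XtW = X.T * W             # equals X.T @ diag(W)
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(1000), rng.normal(size=1000)])
y = rng.poisson(np.exp(0.3 + 0.7 * X[:, 1]))
print(irls_poisson(X, y))  # approximately [0.3, 0.7]
```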

Deviance — the GLM loss

The natural goodness-of-fit measure is the deviance:

$$D = 2\left[\log L_{\text{saturated}} - \log L_{\text{model}}\right],$$

where the saturated model fits each observation perfectly. For Gaussian responses, deviance reduces to the residual sum of squares. For Bernoulli, it is twice the binary cross-entropy. Deviance is the right "loss" to minimise within the GLM framework; minimising plain MSE on a non-Gaussian response amounts to maximising the wrong likelihood.
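The Bernoulli claim can be checked numerically. For binary $y$, the saturated model sets $\hat p_i = y_i$ and its log-likelihood is zero, so the deviance is just $-2 \log L_{\text{model}}$, i.e. twice the total binary cross-entropy (the probabilities below are made up for illustration):

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])  # fitted probabilities

ll_model = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
deviance = 2 * (0.0 - ll_model)          # saturated log-likelihood is 0

bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(deviance, 2 * len(y) * bce)        # identical values
```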

Why GLMs matter today

Three reasons:

  • Insurance, epidemiology, social science — count and rate data is everywhere; Poisson and negative-binomial GLMs are the standard.
  • Interpretability — coefficients have clean meaning (multiplicative effect on the mean for log-link models, odds ratio for logit), which matters in regulated domains.
  • Conceptual link to deep learning — the final layer of many modern networks is a GLM in disguise. Choosing the right output activation and loss is choosing the right (link, distribution) pair.

For the deep-learning practitioner, GLMs are the correct mental model for what your output head should be:

  • Real-valued target → linear output + MSE (Gaussian GLM).
  • Binary target → sigmoid + BCE (Bernoulli GLM).
  • Categorical target → softmax + cross-entropy (multinomial GLM).
  • Count target → exp output + Poisson NLL (Poisson GLM).
  • Positive continuous → exp output + gamma NLL.

Picking the right combination matters more than tweaking the network body.
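As one concrete instance of this correspondence, here is a minimal PyTorch sketch of the count case: the head emits the linear predictor $\eta = \log\mu$, and `PoissonNLLLoss` with `log_input=True` is the matching Poisson negative log-likelihood (the layer sizes and fake data are placeholders):

```python
import torch
import torch.nn as nn

body = nn.Sequential(nn.Linear(16, 32), nn.ReLU())  # any network body
head = nn.Linear(32, 1)                             # emits eta = log(mu)
loss_fn = nn.PoissonNLLLoss(log_input=True)         # Poisson GLM loss

x = torch.randn(8, 16)
y = torch.poisson(torch.full((8, 1), 3.0))          # fake count targets
loss = loss_fn(head(body(x)), y)
loss.backward()
```

Swapping in `nn.MSELoss`, `nn.BCEWithLogitsLoss`, or `nn.CrossEntropyLoss` changes the (link, distribution) pair; the training loop stays the same.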
