
Markov Chains

A Markov chain is a stochastic process where the next state depends only on the current state, not on the history that led to it. The Markov property is the structural assumption underneath HMMs, MDPs, MCMC, n-gram language models, and most of probabilistic time-series modelling. Understanding the chain's stationary behaviour and convergence rate is the prerequisite for the rest.

The Markov property

A discrete-time stochastic process $\{X_t\}_{t \ge 0}$ over a state space $S$ is a Markov chain if

$$P(X_{t+1} \mid X_t, X_{t-1}, \dots, X_0) = P(X_{t+1} \mid X_t).$$

For finite $S$, the chain is fully described by the transition matrix $P$ with $P_{ij} = P(X_{t+1} = j \mid X_t = i)$. Each row sums to 1 — $P$ is a stochastic matrix.

Probability evolution is matrix multiplication: if $\pi_t \in \mathbb{R}^{|S|}$ is the row-vector distribution of $X_t$, then $\pi_{t+1} = \pi_t P$, and $\pi_{t+k} = \pi_t P^k$.
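
As a minimal sketch (the 3-state transition matrix below is a made-up example), this evolution is just a couple of NumPy matrix products:

```python
import numpy as np

# Hypothetical 3-state chain; each row is a distribution over next states.
P = np.array([[0.8, 0.15, 0.05],
              [0.3, 0.4,  0.3 ],
              [0.2, 0.5,  0.3 ]])

pi0 = np.array([1.0, 0.0, 0.0])             # start in state 0 with certainty

pi1 = pi0 @ P                               # one step: pi_{t+1} = pi_t P
pi10 = pi0 @ np.linalg.matrix_power(P, 10)  # k steps:  pi_{t+k} = pi_t P^k
```

Starting from a point mass on state 0, one step simply reads off the first row of $P$; ten steps already lie close to the stationary distribution.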

Classifications

States have structure:

  • Recurrent — the chain returns to it with probability 1.
  • Transient — visited only finitely many times almost surely.
  • Absorbing — once entered, never left ($P_{ii} = 1$).
  • Periodic — return times share a common divisor greater than 1.

A chain is irreducible if every state can reach every other, and aperiodic if no state has period greater than 1. An irreducible, aperiodic chain on a finite state space has a unique stationary distribution $\pi$ satisfying $\pi P = \pi$, and it converges to $\pi$ from any starting distribution.

Stationary distribution

The stationary $\pi$ is the left eigenvector of $P$ with eigenvalue 1. For finite state spaces it can be found by solving the linear system $\pi(P - I) = 0$ subject to $\sum_i \pi_i = 1$.
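
A sketch of that linear-system approach in NumPy, reusing a made-up 3-state transition matrix; one of the stationarity equations is redundant, so the normalisation constraint is stacked on and the system solved in the least-squares sense:

```python
import numpy as np

P = np.array([[0.8, 0.15, 0.05],
              [0.3, 0.4,  0.3 ],
              [0.2, 0.5,  0.3 ]])
n = P.shape[0]

# pi (P - I) = 0 transposed into standard Ax = b form, with the
# normalisation sum(pi) = 1 appended as an extra equation.
A = np.vstack([(P - np.eye(n)).T, np.ones(n)])
b = np.zeros(n + 1)
b[-1] = 1.0
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
```

Because the augmented system is consistent, least squares recovers the exact stationary distribution: `pi @ P` equals `pi` up to floating-point error.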

A useful sufficient condition: if there exists $\pi$ with $\pi_i P_{ij} = \pi_j P_{ji}$ for all $i, j$ (the detailed balance equations), then $\pi$ is stationary. Detailed balance gives reversible chains — the structural building block of MCMC.
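
As a toy check (the birth-death chain below is a made-up example), detailed balance can be verified numerically by comparing $\pi_i P_{ij}$ with $\pi_j P_{ji}$ for every pair at once:

```python
import numpy as np

# A lazy random walk on {0, 1, 2}: moves only between neighbours,
# so it is reversible and detailed balance can be solved edge by edge.
P = np.array([[0.5,  0.5, 0.0 ],
              [0.25, 0.5, 0.25],
              [0.0,  0.5, 0.5 ]])

# Candidate stationary distribution from solving pi_i P_ij = pi_j P_ji
# along the chain, then normalising.
pi = np.array([1.0, 2.0, 1.0])
pi = pi / pi.sum()

flows = pi[:, None] * P        # flows[i, j] = pi_i P_ij
balanced = np.allclose(flows, flows.T)
```

A symmetric flow matrix means every edge carries equal probability flux in both directions, which in turn implies $\pi P = \pi$.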

Mixing time and the spectral gap

How fast does $\pi_t$ approach $\pi$? For irreducible, aperiodic finite chains, convergence is geometric:

$$\|\pi_t - \pi\|_{\mathrm{TV}} \le C\,|\lambda_2|^t,$$

where $\lambda_2$ is the second-largest eigenvalue of $P$ in absolute value. The spectral gap $1 - |\lambda_2|$ sets the rate of convergence; the mixing time $t_{\mathrm{mix}}$ is roughly $\log(1/\epsilon) / (1 - |\lambda_2|)$.
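
A quick numerical sketch, again on a made-up 3-state chain: compute the eigenvalues, read off $|\lambda_2|$, and form the rough mixing-time estimate:

```python
import numpy as np

P = np.array([[0.8, 0.15, 0.05],
              [0.3, 0.4,  0.3 ],
              [0.2, 0.5,  0.3 ]])

eigvals = np.linalg.eigvals(P)
mags = np.sort(np.abs(eigvals))[::-1]   # sorted by absolute value
# mags[0] is always 1 for a stochastic matrix; mags[1] is |lambda_2|.
lambda2 = mags[1]
gap = 1.0 - lambda2                     # spectral gap
eps = 1e-3
t_mix = np.log(1.0 / eps) / gap         # rough mixing-time estimate
```

A gap near 1 means the chain forgets its start almost immediately; a gap near 0 (e.g. two clusters of states joined by a low-probability bridge) means `t_mix` blows up.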

The spectral gap is what determines whether MCMC inference is fast or slow. Multimodal target distributions typically produce small gaps and slow mixing; this is one of the central practical problems with MCMC.

MCMC: building chains for inference

Markov Chain Monte Carlo runs the logic in reverse: given a target distribution $\pi$ we want to sample from, construct a chain whose stationary distribution is $\pi$.

  • Metropolis-Hastings — propose a move from a proposal distribution $q$, accept with probability $\min\bigl(1,\ \pi(x')\,q(x \mid x') / \pi(x)\,q(x' \mid x)\bigr)$. Detailed balance gives stationarity.
  • Gibbs sampling — sample each variable from its conditional given the others. Special case of MH with proposals that are always accepted.
  • Hamiltonian Monte Carlo — use gradient information to take long, low-rejection moves; fast mixing for continuous high-dimensional targets.
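
A minimal Metropolis-Hastings sketch, assuming a standard-normal toy target evaluated only up to its normalising constant; with a symmetric Gaussian proposal the $q$ terms cancel, leaving the acceptance ratio $\min(1, \pi(x')/\pi(x))$:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    # Unnormalised log-density of the toy target (standard normal).
    return -0.5 * x * x

def metropolis_hastings(n_samples, step=1.0):
    x = 0.0
    samples = np.empty(n_samples)
    for i in range(n_samples):
        x_new = x + step * rng.standard_normal()   # symmetric proposal
        # Accept with probability min(1, pi(x') / pi(x)), done in log space.
        if np.log(rng.uniform()) < log_target(x_new) - log_target(x):
            x = x_new
        samples[i] = x                             # rejected moves repeat x
    return samples

samples = metropolis_hastings(20_000)
```

The chain's stationary distribution is the target, so after burn-in the sample mean and standard deviation approach 0 and 1; note the normalising constant $1/\sqrt{2\pi}$ was never needed.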

MCMC is the workhorse of Bayesian inference, posterior sampling, and any setting where you can evaluate $\pi$ up to a normalising constant but cannot sample directly.

Where Markov chains appear in ML

  • Reinforcement learning — MDPs are Markov chains with actions and rewards.
  • Hidden Markov Models — see HMM.
  • n-gram language models — text as an order-n Markov chain over tokens.
  • MCMC inference — Bayesian posterior sampling.
  • PageRank — Google's original ranking is the stationary distribution of a Markov chain on the web graph.
  • Chain-of-thought reasoning at inference — autoregressive LM generation is a Markov chain over token sequences (the state being the prefix).

Released under the MIT License. Content imported and adapted from NoteNextra.