
Markov Chains

A Markov chain is a stochastic process where the next state depends only on the current state, not on the history that led to it. The Markov property is the structural assumption underneath HMMs, MDPs, MCMC, n-gram language models, and most of probabilistic time-series modelling. Understanding the chain's stationary behaviour and convergence rate is the prerequisite for the rest.

The Markov property

A discrete-time stochastic process $\{X_t\}_{t \ge 0}$ over a state space $S$ is a Markov chain if

$$P(X_{t+1} \mid X_t, X_{t-1}, \dots, X_0) = P(X_{t+1} \mid X_t).$$

For finite $S$, the chain is fully described by the transition matrix $P$ with $P_{ij} = P(X_{t+1} = j \mid X_t = i)$. Each row sums to 1 — $P$ is a stochastic matrix.

Probability evolution is matrix multiplication: if $\pi_t \in \mathbb{R}^{|S|}$ is the row-vector distribution of $X_t$, then $\pi_{t+1} = \pi_t P$, and $\pi_{t+k} = \pi_t P^k$.
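
As a minimal sketch (the 3-state transition matrix below is a made-up example), this evolution is just a couple of NumPy matrix products:

```python
import numpy as np

# Hypothetical 3-state chain; each row is a distribution over next states.
P = np.array([[0.8, 0.15, 0.05],
              [0.3, 0.4,  0.3 ],
              [0.2, 0.5,  0.3 ]])

pi0 = np.array([1.0, 0.0, 0.0])             # start in state 0 with certainty

pi1 = pi0 @ P                               # one step: pi_{t+1} = pi_t P
pi10 = pi0 @ np.linalg.matrix_power(P, 10)  # k steps:  pi_{t+k} = pi_t P^k
```

Starting from a point mass on state 0, one step simply reads off the first row of $P$; ten steps already lie close to the stationary distribution.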

Classifications

States have structure:

  • Recurrent — the chain returns to it with probability 1.
  • Transient — visited only finitely many times almost surely.
  • Absorbing — once entered, never left ($P_{ii} = 1$).
  • Periodic — return times share a common divisor greater than 1.

A chain is irreducible if every state can reach every other, and aperiodic if no state has period greater than 1. An irreducible, aperiodic chain on a finite state space has a unique stationary distribution $\pi$ satisfying $\pi P = \pi$, and it converges to $\pi$ from any starting distribution.

Stationary distribution

The stationary $\pi$ is the left eigenvector of $P$ with eigenvalue 1. For finite state spaces it can be found by solving the linear system $\pi(P - I) = 0$ subject to $\sum_i \pi_i = 1$.
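
A sketch of that linear-system approach in NumPy, reusing a made-up 3-state transition matrix; one of the stationarity equations is redundant, so the normalisation constraint is stacked on and the system solved in the least-squares sense:

```python
import numpy as np

P = np.array([[0.8, 0.15, 0.05],
              [0.3, 0.4,  0.3 ],
              [0.2, 0.5,  0.3 ]])
n = P.shape[0]

# pi (P - I) = 0 transposed into standard Ax = b form, with the
# normalisation sum(pi) = 1 appended as an extra equation.
A = np.vstack([(P - np.eye(n)).T, np.ones(n)])
b = np.zeros(n + 1)
b[-1] = 1.0
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
```

Because the augmented system is consistent, least squares recovers the exact stationary distribution: `pi @ P` equals `pi` up to floating-point error.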

A useful sufficient condition: if there exists $\pi$ with $\pi_i P_{ij} = \pi_j P_{ji}$ for all $i, j$ (the detailed balance equations), then $\pi$ is stationary. Detailed balance gives reversible chains — the structural building block of MCMC.
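
As a toy check (the birth-death chain below is a made-up example), detailed balance can be verified numerically by comparing $\pi_i P_{ij}$ with $\pi_j P_{ji}$ for every pair at once:

```python
import numpy as np

# A lazy random walk on {0, 1, 2}: moves only between neighbours,
# so it is reversible and detailed balance can be solved edge by edge.
P = np.array([[0.5,  0.5, 0.0 ],
              [0.25, 0.5, 0.25],
              [0.0,  0.5, 0.5 ]])

# Candidate stationary distribution from solving pi_i P_ij = pi_j P_ji
# along the chain, then normalising.
pi = np.array([1.0, 2.0, 1.0])
pi = pi / pi.sum()

flows = pi[:, None] * P        # flows[i, j] = pi_i P_ij
balanced = np.allclose(flows, flows.T)
```

A symmetric flow matrix means every edge carries equal probability flux in both directions, which in turn implies $\pi P = \pi$.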

Mixing time and the spectral gap

How fast does $\pi_t$ approach $\pi$? For irreducible, aperiodic finite chains, convergence is geometric:

$$\|\pi_t - \pi\|_{\mathrm{TV}} \le C\,|\lambda_2|^t,$$

where $\lambda_2$ is the second-largest eigenvalue of $P$ in absolute value. The spectral gap $1 - |\lambda_2|$ sets the rate of convergence; the mixing time $t_{\mathrm{mix}}$ is roughly $\log(1/\epsilon) / (1 - |\lambda_2|)$.
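
A quick numerical sketch, again on a made-up 3-state chain: compute the eigenvalues, read off $|\lambda_2|$, and form the rough mixing-time estimate:

```python
import numpy as np

P = np.array([[0.8, 0.15, 0.05],
              [0.3, 0.4,  0.3 ],
              [0.2, 0.5,  0.3 ]])

eigvals = np.linalg.eigvals(P)
mags = np.sort(np.abs(eigvals))[::-1]   # sorted by absolute value
# mags[0] is always 1 for a stochastic matrix; mags[1] is |lambda_2|.
lambda2 = mags[1]
gap = 1.0 - lambda2                     # spectral gap
eps = 1e-3
t_mix = np.log(1.0 / eps) / gap         # rough mixing-time estimate
```

A gap near 1 means the chain forgets its start almost immediately; a gap near 0 (e.g. two clusters of states joined by a low-probability bridge) means `t_mix` blows up.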

The spectral gap is what determines whether MCMC inference is fast or slow. Multimodal target distributions typically produce small gaps and slow mixing; this is one of the central practical problems with MCMC.

MCMC: building chains for inference

Markov Chain Monte Carlo runs the logic in reverse: given a target distribution $\pi$ we want to sample from, construct a chain whose stationary distribution is $\pi$.

  • Metropolis-Hastings — propose a move from a proposal distribution $q$, accept with probability $\min\bigl(1,\ \pi(x')\,q(x \mid x') / \pi(x)\,q(x' \mid x)\bigr)$. Detailed balance gives stationarity.
  • Gibbs sampling — sample each variable from its conditional given the others. Special case of MH with proposals that are always accepted.
  • Hamiltonian Monte Carlo — use gradient information to take long, low-rejection moves; fast mixing for continuous high-dimensional targets.
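
A minimal Metropolis-Hastings sketch, assuming a standard-normal toy target evaluated only up to its normalising constant; with a symmetric Gaussian proposal the $q$ terms cancel, leaving the acceptance ratio $\min(1, \pi(x')/\pi(x))$:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    # Unnormalised log-density of the toy target (standard normal).
    return -0.5 * x * x

def metropolis_hastings(n_samples, step=1.0):
    x = 0.0
    samples = np.empty(n_samples)
    for i in range(n_samples):
        x_new = x + step * rng.standard_normal()   # symmetric proposal
        # Accept with probability min(1, pi(x') / pi(x)), done in log space.
        if np.log(rng.uniform()) < log_target(x_new) - log_target(x):
            x = x_new
        samples[i] = x                             # rejected moves repeat x
    return samples

samples = metropolis_hastings(20_000)
```

The chain's stationary distribution is the target, so after burn-in the sample mean and standard deviation approach 0 and 1; note the normalising constant $1/\sqrt{2\pi}$ was never needed.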

MCMC is the workhorse of Bayesian inference, posterior sampling, and any setting where you can evaluate $\pi$ up to a normalising constant but cannot sample directly.

Where Markov chains appear in ML

  • Reinforcement learning — MDPs are Markov chains with actions and rewards.
  • Hidden Markov Models — see HMM.
  • n-gram language models — text as an order-n Markov chain over tokens.
  • MCMC inference — Bayesian posterior sampling.
  • PageRank — Google's original ranking is the stationary distribution of a Markov chain on the web graph.
  • Chain-of-thought reasoning at inference — autoregressive LM generation is a Markov chain over token sequences (the state being the prefix).

Released under the MIT License. Content imported and adapted from NoteNextra.