Hidden Markov Models

A Hidden Markov Model is a Markov chain whose states cannot be observed directly — instead, you see noisy observations conditioned on the state. HMMs are the canonical model for sequence data with discrete latent dynamics: speech recognition (1980s–2010s), part-of-speech tagging, gene finding, and time-series segmentation. They illustrate the three classical inference problems — evaluation, decoding, and learning — in their cleanest form.

The model

An HMM has three components:

  • Transition probabilities $A_{ij} = P(z_{t+1} = j \mid z_t = i)$ over $K$ hidden states.
  • Emission probabilities $B_i(x) = P(x_t = x \mid z_t = i)$. Common choices: categorical (discrete observations), Gaussian (continuous), or mixture of Gaussians.
  • Initial distribution $\pi_i = P(z_1 = i)$.

The joint over hidden states z1:T and observations x1:T factorises as

$$P(x_{1:T}, z_{1:T}) = \pi_{z_1} B_{z_1}(x_1) \prod_{t=1}^{T-1} A_{z_t, z_{t+1}} B_{z_{t+1}}(x_{t+1}).$$

The conditional independence structure: $z_{t+1} \perp z_{1:t-1} \mid z_t$, and $x_t \perp$ everything else $\mid z_t$.
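
To make the parameterisation concrete, here is a minimal NumPy sketch: a hypothetical 2-state, 3-symbol HMM (all numbers invented for illustration) with the joint probability of one path computed directly from the factorisation above.

```python
import numpy as np

A = np.array([[0.7, 0.3],       # A[i, j] = P(z_{t+1} = j | z_t = i)
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],  # B[i, k] = P(x_t = k | z_t = i)
              [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])       # pi[i] = P(z_1 = i)

def joint_prob(z, x):
    """P(x_{1:T}, z_{1:T}) = pi_{z_1} B_{z_1}(x_1) prod_t A_{z_t,z_{t+1}} B_{z_{t+1}}(x_{t+1})."""
    p = pi[z[0]] * B[z[0], x[0]]
    for t in range(len(z) - 1):
        p *= A[z[t], z[t + 1]] * B[z[t + 1], x[t + 1]]
    return p

print(joint_prob([0, 0, 1], [0, 1, 2]))  # probability of one (state, observation) path
```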

Three classical problems

Rabiner's A Tutorial on Hidden Markov Models (1989) is the canonical reference. Three computations cover all the use cases:

  1. Likelihood / Filtering — given parameters $\theta = (A, B, \pi)$ and observations $x_{1:T}$, compute $P(x_{1:T} \mid \theta)$ and $P(z_t \mid x_{1:t})$. Forward algorithm.
  2. Decoding — find the most likely hidden sequence: $\arg\max_{z_{1:T}} P(z_{1:T} \mid x_{1:T})$. Viterbi algorithm.
  3. Learning — estimate θ from observations alone. Baum-Welch (EM).

Each is $O(K^2 T)$ — quadratic in the number of states, linear in the sequence length.

Forward algorithm

Define $\alpha_t(i) = P(x_{1:t}, z_t = i)$. Recurrence:

$$\alpha_{t+1}(j) = B_j(x_{t+1}) \sum_{i=1}^{K} \alpha_t(i)\, A_{ij}.$$

Initialise $\alpha_1(i) = \pi_i B_i(x_1)$. The total likelihood is $P(x_{1:T}) = \sum_i \alpha_T(i)$. The filtering posterior is $P(z_t = i \mid x_{1:t}) \propto \alpha_t(i)$.
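
A sketch of the forward pass, reusing the toy `A`, `B`, `pi` above. The per-step rescaling is a standard trick not shown in the recurrence: it prevents underflow on long sequences, the log-likelihood is recovered from the scaling constants, and each rescaled row is exactly the filtering posterior.

```python
import numpy as np

def forward(A, B, pi, x):
    K, T = A.shape[0], len(x)
    alpha = np.zeros((T, K))
    c = np.zeros(T)                     # per-step normalisers
    alpha[0] = pi * B[:, x[0]]          # alpha_1(i) = pi_i B_i(x_1)
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = B[:, x[t]] * (alpha[t - 1] @ A)  # sum_i alpha_{t-1}(i) A_{ij}
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]                # alpha[t] is now P(z_t = i | x_{1:t})
    log_lik = np.log(c).sum()           # log P(x_{1:T}) = sum of log normalisers
    return alpha, log_lik

alpha, ll = forward(A, B, pi, [0, 1, 2, 0])
```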

Viterbi algorithm

Replace sum with max:

$$\delta_{t+1}(j) = B_j(x_{t+1}) \max_i \delta_t(i)\, A_{ij}.$$

Track back-pointers and recover the optimal sequence by following them from $\arg\max_j \delta_T(j)$ backward. Viterbi is dynamic programming on the trellis of state-time pairs — the same structure that powers CTC decoding and beam search.
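
A corresponding Viterbi sketch, done in log space to avoid underflow (the same toy parameters are assumed): `delta` holds the max log-probabilities, `psi` the back-pointers.

```python
import numpy as np

def viterbi(A, B, pi, x):
    K, T = A.shape[0], len(x)
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    delta = np.zeros((T, K))
    psi = np.zeros((T, K), dtype=int)
    delta[0] = logpi + logB[:, x[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA     # scores[i, j] = delta_{t-1}(i) + log A_{ij}
        psi[t] = scores.argmax(axis=0)            # best predecessor for each state j
        delta[t] = scores.max(axis=0) + logB[:, x[t]]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()                 # argmax_j delta_T(j)
    for t in range(T - 2, -1, -1):                # follow back-pointers
        path[t] = psi[t + 1][path[t + 1]]
    return path

print(viterbi(A, B, pi, [0, 1, 2, 0]))  # most likely hidden state sequence
```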

Baum-Welch — EM for HMMs

When $\theta$ is unknown, fit by EM. The E-step computes posterior state occupancies $\gamma_t(i) = P(z_t = i \mid x_{1:T}, \theta^{\text{old}})$ and transition counts $\xi_t(i, j) = P(z_t = i, z_{t+1} = j \mid x_{1:T}, \theta^{\text{old}})$ via forward-backward (forward $\alpha$ + backward $\beta$).

The M-step re-estimates parameters from normalised expected counts:

$$A_{ij} \propto \sum_t \xi_t(i, j), \qquad B_i(x_k) \propto \sum_{t:\, x_t = x_k} \gamma_t(i), \qquad \pi_i = \gamma_1(i).$$

Each iteration monotonically increases the data likelihood. As with any EM, multiple restarts and good initialisation (e.g., k-means on Gaussian emissions) matter.
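
A sketch of one Baum-Welch iteration for the discrete-emission case, assuming a single observation sequence and the toy parameters above. The backward pass reuses the forward scaling constants so that $\gamma_t(i)$ is simply the product of the scaled $\alpha$ and $\beta$.

```python
import numpy as np

def baum_welch_step(A, B, pi, x):
    """One EM iteration for a discrete-emission HMM (scaled forward-backward)."""
    K, T = A.shape[0], len(x)
    alpha = np.zeros((T, K)); beta = np.zeros((T, K)); c = np.zeros(T)
    alpha[0] = pi * B[:, x[0]]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):                          # scaled forward pass
        alpha[t] = B[:, x[t]] * (alpha[t - 1] @ A)
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):                 # backward pass, same scaling
        beta[t] = (A @ (B[:, x[t + 1]] * beta[t + 1])) / c[t + 1]
    gamma = alpha * beta                           # gamma_t(i) = P(z_t = i | x_{1:T})
    xi = np.zeros((T - 1, K, K))                   # xi_t(i,j) = P(z_t=i, z_{t+1}=j | x_{1:T})
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, x[t + 1]] * beta[t + 1])[None, :] / c[t + 1]
    # M-step: normalised expected counts
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[np.array(x) == k].sum(axis=0)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_A, new_B, gamma[0]                  # pi_i = gamma_1(i)

A2, B2, pi2 = baum_welch_step(A, B, pi, [0, 1, 2, 0, 1])
```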

What HMMs were used for

  • Speech recognition — phoneme HMMs were the foundation of every commercial ASR system from the 1980s through the 2010s, before end-to-end neural models took over (DeepSpeech, RNN-T, Whisper).
  • Part-of-speech tagging — Viterbi over a small POS state space, replaced by neural taggers around 2015.
  • Bioinformatics — gene finding, protein-family alignment (HMMER), CpG island detection. Still in active use.
  • Time-series segmentation — change-point detection, regime modelling in finance.

Why HMMs lost to RNNs and Transformers

HMMs assume:

  • Discrete latent state of fixed size $K$. Modelling rich context requires exponentially many states.
  • Independence of $x_t$ from history given $z_t$. Real sequences have long-range dependencies HMMs cannot capture.

RNNs replaced the discrete latent with a continuous hidden state of arbitrary expressivity; Transformers further removed the Markov assumption entirely. HMMs persist in low-data, interpretable, or tightly structured domains; the inference machinery (forward-backward, Viterbi) generalises to neural CRFs and structured-output networks.
