Hidden Markov Models

A Hidden Markov Model is a Markov chain whose states cannot be observed directly — instead, you see noisy observations conditioned on the state. HMMs are the canonical model for sequence data with discrete latent dynamics: speech recognition (1980s–2010s), part-of-speech tagging, gene finding, and time-series segmentation. They illustrate the three classical inference problems — evaluation, decoding, and learning — in their cleanest form.

The model

An HMM has three components:

  • Transition probabilities $A_{ij} = P(z_{t+1} = j \mid z_t = i)$ over $K$ hidden states.
  • Emission probabilities $B_i(x) = P(x_t = x \mid z_t = i)$. Common choices: categorical (discrete observations), Gaussian (continuous), or mixture of Gaussians.
  • Initial distribution $\pi_i = P(z_1 = i)$.

The joint over hidden states z1:T and observations x1:T factorises as

$$P(x_{1:T}, z_{1:T}) = \pi_{z_1} B_{z_1}(x_1) \prod_{t=1}^{T-1} A_{z_t, z_{t+1}} B_{z_{t+1}}(x_{t+1}).$$

The conditional independence structure: $z_{t+1} \perp z_{1:t-1} \mid z_t$, and $x_t \perp$ everything else $\mid z_t$.
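
To make the parameterisation concrete, here is a minimal NumPy sketch: a hypothetical 2-state, 3-symbol HMM (all numbers invented for illustration) with the joint probability of one path computed directly from the factorisation above.

```python
import numpy as np

A = np.array([[0.7, 0.3],       # A[i, j] = P(z_{t+1} = j | z_t = i)
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],  # B[i, k] = P(x_t = k | z_t = i)
              [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])       # pi[i] = P(z_1 = i)

def joint_prob(z, x):
    """P(x_{1:T}, z_{1:T}) = pi_{z_1} B_{z_1}(x_1) prod_t A_{z_t,z_{t+1}} B_{z_{t+1}}(x_{t+1})."""
    p = pi[z[0]] * B[z[0], x[0]]
    for t in range(len(z) - 1):
        p *= A[z[t], z[t + 1]] * B[z[t + 1], x[t + 1]]
    return p

print(joint_prob([0, 0, 1], [0, 1, 2]))  # probability of one (state, observation) path
```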

Three classical problems

Rabiner's A Tutorial on Hidden Markov Models (1989) is the canonical reference. Three computations cover all the use cases:

  1. Likelihood / Filtering — given parameters $\theta = (A, B, \pi)$ and observations $x_{1:T}$, compute $P(x_{1:T} \mid \theta)$ and $P(z_t \mid x_{1:t})$. Forward algorithm.
  2. Decoding — find the most likely hidden sequence: $\arg\max_{z_{1:T}} P(z_{1:T} \mid x_{1:T})$. Viterbi algorithm.
  3. Learning — estimate θ from observations alone. Baum-Welch (EM).

Each is $O(K^2 T)$ — quadratic in the number of states, linear in the sequence length.

Forward algorithm

Define $\alpha_t(i) = P(x_{1:t}, z_t = i)$. Recurrence:

$$\alpha_{t+1}(j) = B_j(x_{t+1}) \sum_{i=1}^{K} \alpha_t(i)\, A_{ij}.$$

Initialise $\alpha_1(i) = \pi_i B_i(x_1)$. The total likelihood is $P(x_{1:T}) = \sum_i \alpha_T(i)$. The filtering posterior is $P(z_t = i \mid x_{1:t}) \propto \alpha_t(i)$.
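
A sketch of the forward pass, reusing the toy `A`, `B`, `pi` above. The per-step rescaling is a standard trick not shown in the recurrence: it prevents underflow on long sequences, the log-likelihood is recovered from the scaling constants, and each rescaled row is exactly the filtering posterior.

```python
import numpy as np

def forward(A, B, pi, x):
    K, T = A.shape[0], len(x)
    alpha = np.zeros((T, K))
    c = np.zeros(T)                     # per-step normalisers
    alpha[0] = pi * B[:, x[0]]          # alpha_1(i) = pi_i B_i(x_1)
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = B[:, x[t]] * (alpha[t - 1] @ A)  # sum_i alpha_{t-1}(i) A_{ij}
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]                # alpha[t] is now P(z_t = i | x_{1:t})
    log_lik = np.log(c).sum()           # log P(x_{1:T}) = sum of log normalisers
    return alpha, log_lik

alpha, ll = forward(A, B, pi, [0, 1, 2, 0])
```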

Viterbi algorithm

Replace sum with max:

$$\delta_{t+1}(j) = B_j(x_{t+1}) \max_i \delta_t(i)\, A_{ij}.$$

Track back-pointers and recover the optimal sequence by following them from $\arg\max_j \delta_T(j)$ backward. Viterbi is dynamic programming on the trellis of state-time pairs — the same structure that powers CTC decoding and beam search.
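
A corresponding Viterbi sketch, done in log space to avoid underflow (the same toy parameters are assumed): `delta` holds the max log-probabilities, `psi` the back-pointers.

```python
import numpy as np

def viterbi(A, B, pi, x):
    K, T = A.shape[0], len(x)
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    delta = np.zeros((T, K))
    psi = np.zeros((T, K), dtype=int)
    delta[0] = logpi + logB[:, x[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA     # scores[i, j] = delta_{t-1}(i) + log A_{ij}
        psi[t] = scores.argmax(axis=0)            # best predecessor for each state j
        delta[t] = scores.max(axis=0) + logB[:, x[t]]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()                 # argmax_j delta_T(j)
    for t in range(T - 2, -1, -1):                # follow back-pointers
        path[t] = psi[t + 1][path[t + 1]]
    return path

print(viterbi(A, B, pi, [0, 1, 2, 0]))  # most likely hidden state sequence
```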

Baum-Welch — EM for HMMs

When $\theta$ is unknown, fit by EM. The E-step computes posterior state occupancies $\gamma_t(i) = P(z_t = i \mid x_{1:T}, \theta^{\text{old}})$ and transition counts $\xi_t(i, j) = P(z_t = i, z_{t+1} = j \mid x_{1:T}, \theta^{\text{old}})$ via forward-backward (forward $\alpha$ + backward $\beta$).

The M-step re-estimates parameters from normalised expected counts:

$$A_{ij} \propto \sum_t \xi_t(i, j), \qquad B_i(x_k) \propto \sum_{t:\, x_t = x_k} \gamma_t(i), \qquad \pi_i = \gamma_1(i).$$

Each iteration monotonically increases the data likelihood. As with any EM, multiple restarts and good initialisation (e.g., k-means on Gaussian emissions) matter.
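
A sketch of one Baum-Welch iteration for the discrete-emission case, assuming a single observation sequence and the toy parameters above. The backward pass reuses the forward scaling constants so that $\gamma_t(i)$ is simply the product of the scaled $\alpha$ and $\beta$.

```python
import numpy as np

def baum_welch_step(A, B, pi, x):
    """One EM iteration for a discrete-emission HMM (scaled forward-backward)."""
    K, T = A.shape[0], len(x)
    alpha = np.zeros((T, K)); beta = np.zeros((T, K)); c = np.zeros(T)
    alpha[0] = pi * B[:, x[0]]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):                          # scaled forward pass
        alpha[t] = B[:, x[t]] * (alpha[t - 1] @ A)
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):                 # backward pass, same scaling
        beta[t] = (A @ (B[:, x[t + 1]] * beta[t + 1])) / c[t + 1]
    gamma = alpha * beta                           # gamma_t(i) = P(z_t = i | x_{1:T})
    xi = np.zeros((T - 1, K, K))                   # xi_t(i,j) = P(z_t=i, z_{t+1}=j | x_{1:T})
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, x[t + 1]] * beta[t + 1])[None, :] / c[t + 1]
    # M-step: normalised expected counts
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[np.array(x) == k].sum(axis=0)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_A, new_B, gamma[0]                  # pi_i = gamma_1(i)

A2, B2, pi2 = baum_welch_step(A, B, pi, [0, 1, 2, 0, 1])
```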

What HMMs were used for

  • Speech recognition — phoneme HMMs were the foundation of every commercial ASR system from the 1980s through the 2010s, before end-to-end neural models took over (DeepSpeech, RNN-T, Whisper).
  • Part-of-speech tagging — Viterbi over a small POS state space, replaced by neural taggers around 2015.
  • Bioinformatics — gene finding, protein-family alignment (HMMER), CpG island detection. Still in active use.
  • Time-series segmentation — change-point detection, regime modelling in finance.

Why HMMs lost to RNNs and Transformers

HMMs assume:

  • Discrete latent state of fixed size $K$. Modelling rich context requires exponentially many states.
  • Independence of $x_t$ from history given $z_t$. Real sequences have long-range dependencies HMMs cannot capture.

RNNs replaced the discrete latent with a continuous hidden state of arbitrary expressivity; Transformers further removed the Markov assumption entirely. HMMs persist in low-data, interpretable, or tightly structured domains; the inference machinery (forward-backward, Viterbi) generalises to neural CRFs and structured-output networks.
