
Conditional Random Fields

A Conditional Random Field (CRF) is the discriminative counterpart to the HMM. Both model sequence labelling, but where HMMs model the joint P(x, y) and require strong independence assumptions on the observations, CRFs model the conditional P(y | x) directly and let observations interact arbitrarily. CRFs were the dominant sequence-labelling tool from 2001 to about 2017 and persist as the structured-output head of many modern neural taggers.

The model

For an input sequence x = (x_1, …, x_T) and a label sequence y = (y_1, …, y_T), a linear-chain CRF defines

$$P(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k\, f_k(y_{t-1}, y_t, x, t) \right),$$

with feature functions f_k and weights λ_k. The normaliser Z(x) sums the same exponentiated score over all K^T possible label sequences, but it factorises along the chain, so it can be computed in O(K²T) via the forward algorithm.
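Concretely, once the features are collapsed into an emission score matrix and a transition score matrix, log Z(x) falls out of a short forward recursion. A minimal NumPy sketch (the names `emit` and `trans` are illustrative, not from any particular library):

```python
import numpy as np
from scipy.special import logsumexp

def log_partition(emit, trans):
    """Forward algorithm: log Z(x) in O(K^2 T).

    emit  -- (T, K) array, emit[t, k] = total emission feature score for label k at position t
    trans -- (K, K) array, trans[i, j] = transition score from label i to label j
    """
    T, K = emit.shape
    alpha = emit[0].copy()                      # log-scores of all length-1 prefixes
    for t in range(1, T):
        # alpha[j] = logsumexp_i(alpha[i] + trans[i, j]) + emit[t, j]
        alpha = logsumexp(alpha[:, None] + trans, axis=0) + emit[t]
    return logsumexp(alpha)                     # log of the sum over all K^T label sequences

# Example: 4 tokens, 3 labels, random scores
rng = np.random.default_rng(0)
print(log_partition(rng.normal(size=(4, 3)), rng.normal(size=(3, 3))))
```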

The two feature types in a linear-chain CRF:

  • Transition features f(y_{t−1}, y_t) — like the HMM's transition matrix.
  • Emission features f(y_t, x, t) — may depend on the entire observation sequence at every position. This is what HMMs cannot do.
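In practice each f_k is usually a binary indicator. A hypothetical sketch of the two kinds, with an emission feature that is free to inspect any position of x:

```python
# Transition feature: fires when the label pair (DET, NOUN) occurs.
def f_trans_det_noun(y_prev, y_cur, x, t):
    return 1.0 if (y_prev, y_cur) == ("DET", "NOUN") else 0.0

# Emission feature: fires for label PER when the *next* word is "Clinton".
# An HMM emission P(x_t | y_t) could not condition on x[t + 1].
def f_next_word_clinton(y_prev, y_cur, x, t):
    return 1.0 if y_cur == "PER" and t + 1 < len(x) and x[t + 1] == "Clinton" else 0.0
```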

Why CRFs beat HMMs for tagging

HMMs assume x_t ⊥ x_{1:T ∖ t} | y_t: each observation is conditionally independent of the rest of the sequence given its own label. Real text violates this: in named-entity recognition, "Bill" is more likely a name when the next word is "Clinton", but the HMM's generative assumptions prevent the model from exploiting that cue.

CRFs model P(yx) directly, so:

  • Arbitrary observation features. f(y_t, x, t) can use the whole sentence — capitalization, suffixes, gazetteers, neighbouring words, parsing context.
  • No need to model P(x). Discriminative training spends capacity on the boundary, not on the data distribution.
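In a classical (non-neural) CRF these observation features are typically written as a per-position feature map. A sketch in the dictionary style used by toolkits such as sklearn-crfsuite (the feature names are illustrative):

```python
def token_features(x, t):
    """Observation features for position t, free to look at the whole sentence x."""
    word = x[t]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isupper": word.isupper(),
        "suffix3": word[-3:],
        "prev.lower": x[t - 1].lower() if t > 0 else "<BOS>",
        "next.lower": x[t + 1].lower() if t + 1 < len(x) else "<EOS>",
    }

sentence = ["Bill", "Clinton", "visited", "Berlin"]
features = [token_features(sentence, t) for t in range(len(sentence))]
```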

The empirical result: CRFs beat HMMs by 5–10 F1 points on named-entity recognition, POS tagging, and chunking — benchmarks that defined NLP for a decade.

Training and inference

Training. Maximise the conditional log-likelihood

$$\mathcal{L}(\lambda) = \sum_{n} \log P\big(y^{(n)} \mid x^{(n)}; \lambda\big) - \frac{C}{2}\,\lVert \lambda \rVert^2.$$

The objective is convex; use L-BFGS or SGD. Gradient computation requires the expected feature counts under the model, computed by forward-backward in O(K²T) per sequence.
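Equivalently, the per-sequence loss is log Z(x) minus the score of the gold path; differentiating log Z(x) yields exactly the expected feature counts, which is why an autodiff framework recovers the classical gradient for free. A sketch that reuses the `log_partition` helper from the forward-algorithm example above:

```python
def gold_score(emit, trans, y):
    """Unnormalised score of one label sequence y (list of label indices)."""
    s = emit[0, y[0]]
    for t in range(1, len(y)):
        s += trans[y[t - 1], y[t]] + emit[t, y[t]]
    return s

def neg_log_likelihood(emit, trans, y):
    # -log P(y | x) = log Z(x) - score(x, y); add (C/2)||lambda||^2 for regularisation
    return log_partition(emit, trans) - gold_score(emit, trans, y)
```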

Inference. Predict ŷ = argmax_y P(y | x) via Viterbi, exactly as in HMMs.
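Viterbi is the forward recursion with the sum replaced by a max plus backpointers; a sketch in the same NumPy setting:

```python
import numpy as np

def viterbi(emit, trans):
    """Most likely label sequence under linear-chain scores, in O(K^2 T)."""
    T, K = emit.shape
    score = emit[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + trans            # cand[i, j]: best path ending in i, then j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + emit[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```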

Generalisations

  • General CRFs — any factor graph, not just linear chains. Exact inference is intractable in general (NP-hard; cost grows exponentially with the graph's treewidth).
  • Tree-structured CRFs — efficient inference via dynamic programming on trees; useful for parsing.
  • Skip-chain CRFs — add long-range edges between identical word occurrences. Useful but requires approximate inference.

CRFs as a neural output head

Modern sequence-labelling pipelines often combine an LSTM or Transformer encoder with a CRF output layer:

  • The encoder produces a per-token feature vector.
  • A learned linear layer produces emission scores per (position, label).
  • A learned K×K matrix gives transition scores.
  • Forward-backward computes the log-likelihood; Viterbi decodes.
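A minimal PyTorch sketch of such a CRF head, with the encoder abstracted away as a tensor of per-token features (class and argument names are illustrative):

```python
import torch
import torch.nn as nn

class CRFHead(nn.Module):
    """Linear-chain CRF output layer over encoder features."""

    def __init__(self, feat_dim, num_labels):
        super().__init__()
        self.emit = nn.Linear(feat_dim, num_labels)                      # per-(position, label) emission scores
        self.trans = nn.Parameter(torch.zeros(num_labels, num_labels))   # K x K transition scores

    def neg_log_likelihood(self, feats, tags):
        """feats: (T, D) encoder output for one sequence; tags: (T,) gold label indices."""
        emit = self.emit(feats)                                          # (T, K)
        T, K = emit.shape
        # Forward algorithm for log Z(x).
        alpha = emit[0]
        for t in range(1, T):
            alpha = torch.logsumexp(alpha.unsqueeze(1) + self.trans, dim=0) + emit[t]
        log_z = torch.logsumexp(alpha, dim=0)
        # Score of the gold path.
        gold = emit[0, tags[0]]
        for t in range(1, T):
            gold = gold + self.trans[tags[t - 1], tags[t]] + emit[t, tags[t]]
        return log_z - gold                                              # minimise; autograd supplies expected counts
```

At test time, Viterbi over the same emission and transition scores gives the prediction; batched implementations additionally mask padding positions before summing the loss.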

This BiLSTM-CRF architecture (Lample et al., 2016) was the SOTA for NER until Transformer-based models took over. The CRF head's value: it enforces structural constraints (e.g., an I-PER tag cannot follow B-LOC) that a pure per-token softmax cannot. Toolkits such as Flair still offer a CRF head as an option (spaCy's built-in NER instead uses a transition-based model).
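One simple way to impose such constraints is to clamp the scores of disallowed transitions to a large negative value before decoding; a sketch for the BIO scheme (labels and values are illustrative):

```python
import numpy as np

# Forbid transitions that break the BIO scheme, e.g. B-LOC -> I-PER.
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]
idx = {lab: i for i, lab in enumerate(labels)}
NEG_INF = -1e4

def mask_invalid(trans):
    """Apply to a (K, K) transition score array before Viterbi decoding."""
    for src in labels:
        for dst in labels:
            # I-X may only follow B-X or I-X of the same entity type.
            if dst.startswith("I-") and src[2:] != dst[2:]:
                trans[idx[src], idx[dst]] = NEG_INF
    return trans

trans = mask_invalid(np.zeros((len(labels), len(labels))))
```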

CRFs vs Transformers

Transformer taggers (BERT-NER, RoBERTa-NER) typically use a softmax-per-token head and rely on the Transformer's contextualised representations to capture inter-label dependencies. They beat BiLSTM-CRF on most benchmarks. But:

  • For low-resource domains, BiLSTM-CRF still competes.
  • For tasks where structural constraints really matter (BIO-tagging, segment-level constraints), a CRF head on top of BERT (BERT-CRF) is sometimes a meaningful improvement.
