Conditional Random Fields
A Conditional Random Field (CRF) is the discriminative counterpart to the HMM. Both model sequence labelling, but where an HMM models the joint distribution $p(x, y)$ over observations and labels, a CRF models the conditional distribution $p(y \mid x)$ directly.
The model
For an input sequence $x = (x_1, \dots, x_n)$ and label sequence $y = (y_1, \dots, y_n)$, a linear-chain CRF defines

$$
p(y \mid x) = \frac{1}{Z(x)} \exp\!\left( \sum_{t=1}^{n} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \right),
\qquad
Z(x) = \sum_{y'} \exp\!\left( \sum_{t=1}^{n} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \right)
$$

with feature functions $f_k$, learned weights $\lambda_k$, and partition function $Z(x)$ normalising over all label sequences.
The two feature types in a linear-chain CRF:
- Transition features $f_k(y_{t-1}, y_t)$, which play the role of the HMM's transition matrix.
- Emission features $f_k(y_t, x, t)$, which can depend on the entire observation sequence at every position. This is what HMMs cannot do (a scoring sketch follows this list).
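To make the parametrisation concrete: once the feature weights are collapsed into an $n \times K$ table of emission scores and a $K \times K$ table of transition scores, the numerator and $Z(x)$ are easy to compute. A minimal NumPy sketch under that assumption (function names are illustrative):

```python
import numpy as np

def sequence_score(emissions, transitions, labels):
    """Unnormalised score of one label sequence y for one input x.

    emissions:   (n, K) array; emissions[t, k] is the summed emission-feature
                 weight for label k at position t (it may depend on all of x).
    transitions: (K, K) array; transitions[i, j] is the weight for label i -> j.
    labels:      length-n list of label indices.
    """
    score = emissions[0, labels[0]]
    for t in range(1, len(labels)):
        score += transitions[labels[t - 1], labels[t]] + emissions[t, labels[t]]
    return score

def log_partition(emissions, transitions):
    """log Z(x) via the forward recursion, O(n K^2)."""
    alpha = emissions[0]                         # log-potentials at position 0
    for t in range(1, emissions.shape[0]):
        # alpha_new[j] = logsumexp_i(alpha[i] + transitions[i, j]) + emissions[t, j]
        scores = alpha[:, None] + transitions    # (K, K)
        m = scores.max(axis=0)
        alpha = m + np.log(np.exp(scores - m).sum(axis=0)) + emissions[t]
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())

# log p(y | x) = sequence_score(...) - log_partition(...)
```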
Why CRFs beat HMMs for tagging
HMMs assume each observation $x_t$ is generated from the current state $y_t$ alone, so overlapping observation features are hard to incorporate without breaking the model's independence assumptions. CRFs model $p(y \mid x)$ directly, which buys two things:
- Arbitrary observation features. A feature $f_k(y_{t-1}, y_t, x, t)$ can use the whole sentence: capitalization, suffixes, gazetteers, neighbouring words, parsing context (see the feature sketch below).
- No need to model $p(x)$. Discriminative training spends capacity on the decision boundary, not on the data distribution.
The empirical result: CRFs beat HMMs by 5–10 F1 points on named-entity recognition, POS tagging, and chunking — benchmarks that defined NLP for a decade.
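To make "arbitrary observation features" concrete, here is a hypothetical per-token feature extractor in the dict-of-features style used by classical CRF toolkits; the feature names are illustrative, not any particular library's schema:

```python
def token_features(sentence, t):
    """Observation features for position t; the whole sentence is visible,
    which is exactly what an HMM's emission model cannot use."""
    w = sentence[t]
    return {
        "word.lower": w.lower(),
        "word.istitle": w.istitle(),
        "word.isdigit": w.isdigit(),
        "suffix3": w[-3:],
        "prev.lower": sentence[t - 1].lower() if t > 0 else "<BOS>",
        "next.lower": sentence[t + 1].lower() if t + 1 < len(sentence) else "<EOS>",
    }

print(token_features(["She", "visited", "Paris", "yesterday"], 2))
# {'word.lower': 'paris', 'word.istitle': True, 'suffix3': 'ris', 'prev.lower': 'visited', ...}
```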
Training and inference
Training. Maximise the conditional log-likelihood

$$
\mathcal{L}(\lambda) = \sum_{i} \log p\!\left(y^{(i)} \mid x^{(i)}\right)
$$

The objective is convex; use L-BFGS or SGD. Gradient computation requires the expected feature counts under the model, computed by forward-backward in $O(nK^2)$ per sequence ($n$ positions, $K$ labels), as sketched below.
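A sketch of the forward-backward pass that yields those expected counts, assuming the same collapsed emission/transition score tables as in the earlier sketch (illustrative, not a particular library's API):

```python
import numpy as np

def logsumexp(a, axis):
    m = a.max(axis=axis, keepdims=True)
    return np.squeeze(m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True)), axis=axis)

def marginals(emissions, transitions):
    """Forward-backward for a linear-chain CRF with collapsed score tables.

    Returns node[t, k] = p(y_t = k | x) and edge[t, i, j] = p(y_t = i, y_{t+1} = j | x).
    """
    n, K = emissions.shape
    alpha = np.zeros((n, K))
    beta = np.zeros((n, K))
    alpha[0] = emissions[0]
    for t in range(1, n):                          # forward pass
        alpha[t] = emissions[t] + logsumexp(alpha[t - 1][:, None] + transitions, axis=0)
    for t in range(n - 2, -1, -1):                 # backward pass
        beta[t] = logsumexp(transitions + (emissions[t + 1] + beta[t + 1])[None, :], axis=1)
    logZ = logsumexp(alpha[-1], axis=0)
    node = np.exp(alpha + beta - logZ)
    edge = np.exp(alpha[:-1, :, None] + transitions[None, :, :]
                  + (emissions[1:] + beta[1:])[:, None, :] - logZ)
    return node, edge

# Gradients of log p(y | x) with respect to the collapsed scores:
#   d / d emissions[t, k]   = 1{y_t = k}                     - node[t, k]
#   d / d transitions[i, j] = sum_t 1{y_t = i, y_{t+1} = j}  - edge[:, i, j].sum()
```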
Inference. Predict $\hat{y} = \arg\max_{y} p(y \mid x)$ with the Viterbi algorithm, again in $O(nK^2)$.
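A matching Viterbi decoder under the same assumptions:

```python
import numpy as np

def viterbi(emissions, transitions):
    """argmax_y p(y | x) for a linear-chain CRF, O(n K^2)."""
    n, K = emissions.shape
    delta = emissions[0]                       # best score of any path ending in each label
    back = np.zeros((n, K), dtype=int)         # back-pointers
    for t in range(1, n):
        scores = delta[:, None] + transitions  # (K, K): previous label -> current label
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + emissions[t]
    labels = [int(delta.argmax())]
    for t in range(n - 1, 0, -1):              # follow back-pointers
        labels.append(int(back[t, labels[-1]]))
    return labels[::-1]
```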
Generalisations
- General CRFs — any factor graph, not just linear chains. Exact inference is intractable in general; its cost grows exponentially with the graph's treewidth.
- Tree-structured CRFs — efficient inference via dynamic programming on trees; useful for parsing.
- Skip-chain CRFs — add long-range edges between identical word occurrences. Useful but requires approximate inference.
CRFs as a neural output head
Modern sequence-labelling pipelines often combine an LSTM or Transformer encoder with a CRF output layer:
- The encoder produces a per-token feature vector.
- A learned linear layer produces emission scores per (position, label).
- A learned $K \times K$ matrix gives transition scores.
- Forward-backward computes the log-likelihood; Viterbi decodes.
This BiLSTM-CRF architecture (Lample et al., 2016) was the state of the art for NER until full-attention models took over. The CRF head's value is that it enforces structural constraints (e.g., an I-PER tag cannot follow B-LOC) that a pure per-token softmax cannot. Modern implementations such as Flair and spaCy still offer a CRF head as an option.
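A minimal PyTorch sketch of such a CRF head, for a single unbatched sequence; the class and method names are hypothetical, and this is not Lample et al.'s implementation. It assumes emission scores of shape (n, K) produced by any encoder plus a linear layer:

```python
import torch
import torch.nn as nn

class CRFHead(nn.Module):
    """CRF output layer over per-token emission scores (illustrative sketch)."""

    def __init__(self, num_labels):
        super().__init__()
        # Learned K x K label-to-label transition scores.
        self.transitions = nn.Parameter(torch.zeros(num_labels, num_labels))

    def _log_partition(self, emissions):              # emissions: (n, K)
        alpha = emissions[0]
        for t in range(1, emissions.size(0)):
            alpha = emissions[t] + torch.logsumexp(
                alpha.unsqueeze(1) + self.transitions, dim=0)
        return torch.logsumexp(alpha, dim=0)

    def neg_log_likelihood(self, emissions, labels):  # labels: (n,) long tensor
        score = emissions[0, labels[0]]
        for t in range(1, labels.size(0)):
            score = score + self.transitions[labels[t - 1], labels[t]] \
                          + emissions[t, labels[t]]
        return self._log_partition(emissions) - score

    def viterbi(self, emissions):
        delta, back = emissions[0], []
        for t in range(1, emissions.size(0)):
            scores = delta.unsqueeze(1) + self.transitions
            best, idx = scores.max(dim=0)
            back.append(idx)
            delta = best + emissions[t]
        labels = [int(delta.argmax())]
        for idx in reversed(back):                    # follow back-pointers
            labels.append(int(idx[labels[-1]]))
        return labels[::-1]

# Usage: emissions come from a BiLSTM/Transformer encoder plus a linear layer, shape (n, K).
#   loss = crf.neg_log_likelihood(emissions, gold_labels); loss.backward()
#   prediction = crf.viterbi(emissions)
```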
CRFs vs Transformers
Transformer taggers (BERT-NER, RoBERTa-NER) typically use a softmax-per-token head and rely on the Transformer's contextualised representations to capture inter-label dependencies. They beat BiLSTM-CRF on most benchmarks. But:
- For low-resource domains, BiLSTM-CRF still competes.
- For tasks where structural constraints really matter (BIO-tagging, segment-level constraints), a CRF head on top of BERT (BERT-CRF) is sometimes a meaningful improvement.
What to read next
- Hidden Markov Models — the generative counterpart.
- Bayesian Networks — the directed graphical-model framework.
- Vanilla RNNs — the neural sequence model that absorbed CRF's role.