Naive Bayes
Naive Bayes is the simplest generative classifier: model $p(y)$ and $p(x \mid y)$, then use Bayes' rule to get $p(y \mid x)$.
The model
Bayes' rule for classification:

$$p(y = c \mid x) = \frac{p(y = c)\, p(x \mid y = c)}{\sum_{c'} p(y = c')\, p(x \mid y = c')}$$

The "naive" assumption is conditional independence of features given the label:

$$p(x \mid y = c) = \prod_{j=1}^{d} p(x_j \mid y = c)$$

This collapses an exponentially-large joint distribution into $d$ one-dimensional conditionals per class: for binary features, $2^d - 1$ parameters per class become just $d$.
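To make the classification step concrete, here is a minimal NumPy sketch of Bayes' rule in log space. The function name and toy numbers are illustrative, not from any library:

```python
import numpy as np

def posterior(log_likelihood, log_prior):
    """log_likelihood: shape (n_classes,), log p(x | y=c) for one input x.
    log_prior: shape (n_classes,), log p(y=c).
    Returns p(y=c | x) for each class c."""
    joint = log_likelihood + log_prior  # log p(x, y=c)
    joint -= joint.max()                # stabilise before exponentiating
    unnorm = np.exp(joint)
    return unnorm / unnorm.sum()        # Bayes' rule denominator

# Toy example: likelihoods favour class 0, prior favours class 1.
print(posterior(np.log([0.09, 0.03]), np.log([0.4, 0.6])))
# -> [0.6667, 0.3333]
```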
Three flavours
The variant depends on the assumed form of the per-feature conditional $p(x_j \mid y)$:
- Gaussian Naive Bayes — $p(x_j \mid y = c) = \mathcal{N}(x_j;\, \mu_{jc}, \sigma_{jc}^2)$. Continuous features, two parameters per (feature, class). Good baseline for low-dimensional continuous data.
- Multinomial Naive Bayes — $p(x \mid y)$ is multinomial over feature counts. Standard for text classification with bag-of-words counts.
- Bernoulli Naive Bayes — features are 0/1 indicators. For binary text features (word present/absent).
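All three ship in scikit-learn's `sklearn.naive_bayes` module; a quick sketch, with toy random inputs standing in for real data of each type:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])

X_cont = np.random.randn(4, 3)               # continuous -> Gaussian
X_counts = np.random.randint(0, 5, (4, 3))   # word counts -> Multinomial
X_bin = (X_counts > 0).astype(int)           # present/absent -> Bernoulli

for model, X in [(GaussianNB(), X_cont),
                 (MultinomialNB(), X_counts),
                 (BernoulliNB(), X_bin)]:
    print(type(model).__name__, model.fit(X, y).predict(X))
```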
Estimation: closed-form MLE
For a multinomial model with class-conditional word probabilities $\theta_{cw}$, the maximum-likelihood estimate is

$$\hat{\theta}_{cw} = \frac{N_{cw}}{\sum_{w'} N_{cw'}}$$

with $N_{cw}$ the total count of word $w$ in training documents of class $c$. In practice the counts are Laplace-smoothed, $\hat{\theta}_{cw} = (N_{cw} + \alpha) / (\sum_{w'} N_{cw'} + \alpha V)$ with vocabulary size $V$; $\alpha = 1$ gives classic add-one smoothing, so unseen words don't zero out the whole product.

Class prior: $\hat{\pi}_c = N_c / N$, the fraction of training documents with label $c$. Everything reduces to counting; no iterative optimisation is needed.
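A minimal sketch of these estimates in NumPy. The function names are mine, and it assumes integer labels $0 \dots K-1$ and a document-by-vocabulary count matrix:

```python
import numpy as np

def fit_multinomial_nb(X, y, alpha=1.0):
    """X: (n_docs, vocab_size) count matrix; y: int labels 0..K-1."""
    classes = np.unique(y)
    log_prior = np.log(np.bincount(y) / len(y))  # pi_c = N_c / N
    # N_cw: total count of word w over documents of class c
    counts = np.stack([X[y == c].sum(axis=0) for c in classes])
    smoothed = counts + alpha                    # Laplace smoothing
    theta = smoothed / smoothed.sum(axis=1, keepdims=True)
    return log_prior, np.log(theta)

def predict(X, log_prior, log_theta):
    # log p(y=c) + sum_w x_w * log theta_cw, argmax over classes
    return np.argmax(X @ log_theta.T + log_prior, axis=1)
```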
Why does it work despite the bad assumption?
Features in real data are rarely independent given the class. Words in a document are correlated by topic; pixel values are correlated by neighbourhood. The independence assumption is wrong.
But the decision boundary depends only on the ranking of the class scores $p(y = c)\,p(x \mid y = c)$, not on their absolute values. The probability estimates are typically miscalibrated (over-confident, with values pushed to 0 or 1, because correlated features get double-counted as independent evidence), but the argmax is often right.
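A small sanity-check experiment (my own illustration, not from the original text): duplicating every feature violates independence as badly as possible and sharpens the estimated posteriors, yet with balanced class priors the argmax is unchanged:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_dup = np.hstack([X, X])  # each feature now appears twice

p1 = GaussianNB().fit(X, y).predict_proba(X)
p2 = GaussianNB().fit(X_dup, y).predict_proba(X_dup)

print("mean max-prob, original:  ", p1.max(axis=1).mean())
print("mean max-prob, duplicated:", p2.max(axis=1).mean())  # pushed toward 1
print("fraction of argmax flips: ", (p1.argmax(1) != p2.argmax(1)).mean())
```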
When Naive Bayes wins
- Text classification, small data. The original Naive Bayes spam filter (Sahami et al., 1998) and most early email/SMS spam systems were Naive Bayes. Even in 2025, Multinomial Naive Bayes is a competitive baseline on small text corpora and is the right starting point before throwing transformers at the problem.
- High-dimensional, sparse features. When the feature dimension $d$ far exceeds the sample size $n$, complex models overfit; Naive Bayes' simple parameterisation is robust.
- Streaming / incremental settings. Counts are easy to update online; no retraining required when new data arrives (see the sketch after this list).
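For the streaming case, scikit-learn's `MultinomialNB.partial_fit` adds each new batch's counts to the sufficient statistics; a sketch with random count matrices standing in for a document stream:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
classes = np.array([0, 1])
rng = np.random.default_rng(0)

for step in range(10):  # ten incoming mini-batches
    X_batch = rng.integers(0, 5, size=(32, 100))
    y_batch = rng.integers(0, 2, size=32)
    # classes= is required on the first call so the label set is known
    clf.partial_fit(X_batch, y_batch, classes=classes)

print(clf.predict(rng.integers(0, 5, size=(3, 100))))
```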
When it fails
- Strongly correlated features. Redundant features get double-counted as independent evidence, and no choice of per-feature class marginals can undo this.
- Continuous data with non-Gaussian marginals — the Gaussian variant breaks. Use kernel density estimates for $p(x_j \mid y)$ instead (a sketch follows this list).
- When you need calibrated probabilities — Naive Bayes is famously over-confident.
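A sketch of that substitution: one `KernelDensity` per (feature, class) pair, dropped into the same log-prior-plus-sum-of-log-marginals scoring rule. The class name and bandwidth value are my own choices, not a standard estimator:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.neighbors import KernelDensity

class KDENaiveBayes(BaseEstimator, ClassifierMixin):
    def __init__(self, bandwidth=0.5):
        self.bandwidth = bandwidth  # a free hyperparameter; tune by CV

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.log_prior_ = np.log([np.mean(y == c) for c in self.classes_])
        # One univariate KDE per (class, feature) pair
        self.kdes_ = [[KernelDensity(bandwidth=self.bandwidth)
                       .fit(X[y == c][:, [j]])
                       for j in range(X.shape[1])]
                      for c in self.classes_]
        return self

    def predict(self, X):
        # log p(y=c) + sum_j log p(x_j | y=c): same independence as before
        scores = np.stack([lp + sum(kde.score_samples(X[:, [j]])
                                    for j, kde in enumerate(kdes))
                           for lp, kdes in zip(self.log_prior_, self.kdes_)],
                          axis=1)
        return self.classes_[np.argmax(scores, axis=1)]
```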
What to read next
- Logistic Regression — the discriminative cousin; the comparison "Naive Bayes vs Logistic Regression" (Ng & Jordan, NIPS 2001) is a classic.
- Bayesian Networks — the generalisation that drops the independence assumption.
- LDA & QDA — Gaussian class-conditional models with full covariance (shared across classes in LDA, per-class in QDA); Gaussian Naive Bayes is the diagonal-covariance special case.