
Naive Bayes

Naive Bayes is the simplest generative classifier: model $P(x \mid y)$ as a product of feature-wise distributions and apply Bayes' rule. Despite the conspicuously wrong independence assumption, it works surprisingly well — particularly for text classification — and remains a useful baseline whenever feature dimensions are high and labelled data is scarce.

The model

Bayes' rule for classification:

$$P(y \mid x) = \frac{P(x \mid y)\,P(y)}{P(x)} \propto P(x \mid y)\,P(y).$$

The "naive" assumption is conditional independence of features given the label:

$$P(x \mid y) = \prod_{j=1}^{d} P(x_j \mid y).$$

This collapses an exponentially-large joint distribution into $d$ small marginals, making both estimation and inference tractable. Predict $\hat{y} = \arg\max_y P(y) \prod_j P(x_j \mid y)$.
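As a minimal sketch (the function and array names are assumptions for illustration, not from the notes), here is the factorised decision rule evaluated in log space; summing log-probabilities avoids the underflow that comes from multiplying many small factors:

```python
import numpy as np

def predict(x, log_prior, log_cond):
    """Naive Bayes argmax in log space (illustrative names, assumed shapes).

    x         : (d,) vector of discrete feature values (integers in [0, V))
    log_prior : (K,) array of log P(y)
    log_cond  : (K, d, V) array of log P(x_j = v | y)
    """
    d = log_cond.shape[1]
    # score[k] = log P(y=k) + sum_j log P(x_j | y=k): the factorised joint, in logs
    scores = log_prior + log_cond[:, np.arange(d), x].sum(axis=1)
    return int(np.argmax(scores))
```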

Three flavours

The variant depends on the assumed marginal $P(x_j \mid y)$; a short scikit-learn sketch of all three follows the list:

  • Gaussian Naive Bayes — $P(x_j \mid y) = \mathcal{N}(\mu_{jy}, \sigma_{jy}^2)$. Continuous features, two parameters per (feature, class). Good baseline for low-dimensional continuous data.
  • Multinomial Naive Bayes — $P(x \mid y)$ is a multinomial over feature counts. Standard for text classification with bag-of-words counts.
  • Bernoulli Naive Bayes — features are 0/1 indicators. For binary text features (word present/absent).
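A minimal sketch, assuming scikit-learn and a tiny synthetic dataset (the counts and labels below are made up for illustration); the three classes map one-to-one onto the variants above:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Toy data: 6 documents over a 4-word vocabulary, two classes (made-up example)
X_counts = np.array([[2, 1, 0, 0],
                     [3, 0, 1, 0],
                     [1, 2, 0, 0],
                     [0, 0, 2, 3],
                     [0, 1, 1, 2],
                     [0, 0, 3, 1]])
y = np.array([0, 0, 0, 1, 1, 1])

multinomial = MultinomialNB(alpha=1.0).fit(X_counts, y)                # bag-of-words counts
bernoulli = BernoulliNB(alpha=1.0).fit((X_counts > 0).astype(int), y)  # word present / absent
gaussian = GaussianNB().fit(X_counts.astype(float), y)                 # columns treated as continuous
```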

Estimation: closed-form MLE

For a multinomial model with class $y$ and count $n_{jy}$ of word $j$ in class-$y$ documents:

$$\hat{P}(x_j \mid y) = \frac{n_{jy} + \alpha}{\sum_k n_{ky} + \alpha V},$$

with α>0 a Laplace smoothing parameter (typically 1) and V the vocabulary size. Smoothing is critical: a word never seen in class y at training time would otherwise zero out the entire product, regardless of all the evidence from other features.

Class prior: $\hat{P}(y) = N_y / N$, where $N_y$ is the number of class-$y$ training examples. Total training cost: one pass over the data to count.
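A from-scratch sketch of that counting step for the multinomial variant, assuming a document-term count matrix `X` of shape (n, V) and integer labels `y` (the function name is an assumption for illustration):

```python
import numpy as np

def fit_multinomial_nb(X, y, alpha=1.0):
    """One pass over the data: count words per class, then smooth and normalise.

    X : (n, V) matrix of word counts, y : (n,) integer class labels.
    Returns log class priors (K,) and log word probabilities (K, V).
    """
    classes = np.unique(y)
    K, V = len(classes), X.shape[1]
    log_prior = np.empty(K)
    log_theta = np.empty((K, V))
    for k, c in enumerate(classes):
        Xc = X[y == c]
        log_prior[k] = np.log(len(Xc) / len(X))              # P(y) = N_y / N
        counts = Xc.sum(axis=0)                              # n_{jy} for every word j
        log_theta[k] = np.log((counts + alpha) /             # Laplace-smoothed estimate
                              (counts.sum() + alpha * V))
    return log_prior, log_theta
```

Prediction then reuses the log-space rule from earlier: for count vectors, the class scores are `log_prior + X_new @ log_theta.T`, and the argmax over classes gives $\hat{y}$.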

Why does it work despite the bad assumption?

Features in real data are rarely independent given the class. Words in a document are correlated by topic; pixel values are correlated by neighbourhood. The independence assumption is wrong.

But the decision boundary depends only on the ranking of $P(y \mid x)$ across classes, not on the absolute values. As long as the relative ordering survives the misspecification, the classification is correct. Domingos & Pazzani (1997) gave the canonical analysis of why Naive Bayes is "optimal under zero-one loss for a much broader class of distributions than the one it represents".

The probability estimates are typically miscalibrated (over-confident, with values pushed to 0 or 1), but the argmax is often right.
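If calibrated probabilities matter, one common remedy (an assumption here, not something the notes prescribe) is to wrap the classifier in scikit-learn's `CalibratedClassifierCV`, which fits a monotone map from the over-confident NB scores to better-calibrated probabilities:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(200, 50))                         # synthetic word counts (assumption)
y = (X[:, :25].sum(axis=1) > X[:, 25:].sum(axis=1)).astype(int)

# Sigmoid (Platt) calibration on top of Multinomial NB, via 5-fold cross-validation
calibrated = CalibratedClassifierCV(MultinomialNB(alpha=1.0), method="sigmoid", cv=5)
calibrated.fit(X, y)
probs = calibrated.predict_proba(X)                          # less extreme than raw NB posteriors
```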

When Naive Bayes wins

  • Text classification, small data. The original Naive Bayes spam filter (Sahami et al., 1998) and most early email/SMS spam systems were Naive Bayes. Even in 2025, Multinomial Naive Bayes is a competitive baseline on small text corpora and is the right starting point before throwing transformers at the problem.
  • High-dimensional, sparse features. When $d \gg N$, complex models overfit; Naive Bayes' simple parameterisation is robust.
  • Streaming / incremental settings. Counts are easy to update online — no retraining required when new data arrives (see the sketch below).
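A minimal sketch of the streaming case, assuming scikit-learn's `partial_fit` and a made-up stream of mini-batches; each call simply adds the new batch to the stored counts:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
clf = MultinomialNB(alpha=1.0)
all_classes = np.array([0, 1])                    # must be declared on the first partial_fit call

for _ in range(10):                               # stand-in for an incoming stream of mini-batches
    X_batch = rng.poisson(2.0, size=(32, 50))     # synthetic word counts (assumption)
    y_batch = rng.integers(0, 2, size=32)
    clf.partial_fit(X_batch, y_batch, classes=all_classes)   # just updates the stored counts
```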

When it fails

  • Strongly correlated features. Redundant features each contribute their evidence separately, and the per-feature marginals cannot correct for the double counting, so the posterior is pushed too far towards one class.
  • Continuous data with non-Gaussian marginals — the Gaussian variant breaks. Use kernel density estimation for $P(x_j \mid y)$ instead.
  • When you need calibrated probabilities — Naive Bayes is famously over-confident.

Related models

  • Logistic Regression — the discriminative cousin; the comparison "Naive Bayes vs Logistic Regression" (Ng & Jordan, NIPS 2001) is a classic.
  • Bayesian Networks — the generalisation that drops the independence assumption.
  • LDA & QDA — Gaussian class-conditional models with full covariance (shared for LDA, per-class for QDA); Gaussian Naive Bayes is the diagonal-covariance special case.
