Decision Trees
A decision tree is a recursive partitioning of feature space — at each internal node, split on a single feature with a threshold; each leaf has a predicted label or value. Single trees are interpretable but high-variance; ensembled into random forests and gradient-boosted trees they become some of the strongest classical-ML models, especially on tabular data.
How a tree predicts
Each leaf $\ell$ stores a prediction computed from the training points that reach it:
- Classification — the majority class of the training points in $\ell$, or class probabilities proportional to the class counts.
- Regression — the mean of the target values $y_i$ in $\ell$.
To predict for a new point $x$, start at the root and follow the split at each internal node until $x$ lands in a leaf; output that leaf's value.
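A minimal sketch of that routing, assuming a hand-rolled Node structure (the class and field names below are illustrative, not any library's internals):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    feature: Optional[int] = None   # feature index to split on; None marks a leaf
    threshold: float = 0.0          # go left if x[feature] <= threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    value: float = 0.0              # leaf output: majority class or mean target


def predict_one(node: Node, x) -> float:
    """Route one sample from the root to a leaf and return the leaf's value."""
    while node.feature is not None:  # internal node: follow its split
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value                # leaf reached
```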
How a tree is built (greedy recursion)
The algorithm is depth-first greedy:
- At the current node with training set $S$, pick the (feature, threshold) pair that maximises a split-quality measure.
- Split $S$ into $S_{\text{left}}$ and $S_{\text{right}}$ accordingly.
- Recurse on each subset.
- Stop when a stopping criterion is met (max depth, min samples per leaf, no improvement).
The choice of split-quality measure defines the algorithm.
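A sketch of that recursion for a regression tree with the squared-error criterion, reusing the illustrative Node structure from above; the brute-force threshold search and the default stopping rules are for clarity, not efficiency:

```python
import numpy as np


def build(X, y, depth=0, max_depth=5, min_samples_leaf=2):
    # Stopping criteria: depth limit, too few samples, or an already-pure node.
    if depth >= max_depth or len(y) < 2 * min_samples_leaf or np.all(y == y[0]):
        return Node(value=float(np.mean(y)))

    best = None  # (total squared error, feature index, threshold)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:  # candidate thresholds for feature j
            left = X[:, j] <= t
            if left.sum() < min_samples_leaf or (~left).sum() < min_samples_leaf:
                continue
            # Split quality: summed squared error of the two children (lower is better).
            sse = ((y[left] - y[left].mean()) ** 2).sum() \
                + ((y[~left] - y[~left].mean()) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, j, float(t))

    if best is None:  # no admissible split: make this node a leaf
        return Node(value=float(np.mean(y)))

    _, j, t = best
    mask = X[:, j] <= t
    return Node(feature=j, threshold=t,
                left=build(X[mask], y[mask], depth + 1, max_depth, min_samples_leaf),
                right=build(X[~mask], y[~mask], depth + 1, max_depth, min_samples_leaf))
```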
Split criteria
For classification, the standard measures are impurity reductions:
- Gini impurity — $G = 1 - \sum_k p_k^2$, where $p_k$ is the fraction of class $k$ at the node. Minimised when one class dominates.
- Entropy — $H = -\sum_k p_k \log_2 p_k$. The information-theoretic measure.
A split's quality is the weighted-average impurity of children, subtracted from the parent's impurity (the information gain). Gini and entropy give very similar trees — Gini is slightly cheaper to compute and is CART's default.
For regression, the standard split is variance reduction: minimise the weighted average of the children's variances, equivalently the summed squared error $\sum_{i \in S_{\text{left}}} (y_i - \bar{y}_{\text{left}})^2 + \sum_{i \in S_{\text{right}}} (y_i - \bar{y}_{\text{right}})^2$.
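A minimal sketch of these measures; labels here is any 1-D array of non-negative integer class labels, and the function names are illustrative rather than a library's API:

```python
import numpy as np


def gini(labels):
    p = np.bincount(labels) / len(labels)
    return 1.0 - np.sum(p ** 2)        # 1 - sum_k p_k^2


def entropy(labels):
    p = np.bincount(labels) / len(labels)
    p = p[p > 0]                       # drop empty classes to avoid log(0)
    return -np.sum(p * np.log2(p))     # -sum_k p_k log2 p_k


def split_gain(parent, left, right, impurity=gini):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = len(left) / n * impurity(left) + len(right) / n * impurity(right)
    return impurity(parent) - weighted
```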
Three classical algorithms
- ID3 (Quinlan, 1986) — entropy-based, categorical features only, no pruning.
- C4.5 (Quinlan, 1993) — handles continuous features, missing values, post-hoc pruning. Information-gain-ratio normalises against split fan-out.
- CART (Breiman et al., 1984) — Classification and Regression Trees. Binary splits, Gini impurity for classification, squared error for regression. The basis of most modern implementations including scikit-learn's.
Overfitting and pruning
A fully-grown tree can perfectly memorise the training set — every leaf is a single training point. This generalises terribly. Two control strategies:
- Pre-pruning — stop splitting early. Limit max depth, min samples per leaf, or minimum impurity decrease.
- Post-pruning — grow the tree fully, then collapse subtrees that don't help on a validation set. Cost-complexity pruning (CART's α-pruning) selects the subtree $T$ minimising $R(T) + \alpha\,|T|$, where $R(T)$ is the subtree's training error and $|T|$ its number of leaves.
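In scikit-learn this is exposed through the ccp_alpha parameter; the sketch below picks the best α on a held-out split (the dataset and the simple search over the pruning path are illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Candidate alphas come from the fully grown tree's cost-complexity pruning path.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

# Refit at each alpha and keep the tree that scores best on the validation split.
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr)
     for a in path.ccp_alphas),
    key=lambda tree: tree.score(X_val, y_val),
)
print(best.get_n_leaves(), best.score(X_val, y_val))
```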
Even with pruning, single trees have high variance — small changes in the training set produce very different trees. This is the central motivation for ensembles.
Strengths and weaknesses
Strengths:
- Interpretable — easy to read, even for non-experts. Each prediction has a clear path of "if-then" rules.
- No feature scaling needed — splits are scale-invariant.
- Handles mixed types — continuous, categorical, ordinal features all work.
- Handles missing values — surrogate splits in CART, or just sending NaN down a chosen branch in modern implementations.
- Captures non-linear interactions automatically.
Weaknesses:
- High variance as a single tree. Mostly fixed by ensembling.
- Greedy — no global optimisation; bad early splits can't be undone.
- Axis-aligned only — diagonal decision boundaries require many splits.
- Extrapolation fails — predictions are constant within each leaf, so trees don't extrapolate beyond their training range (see the short demo below).
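A quick demonstration of the extrapolation failure on synthetic data (the linear toy target and the depth limit are arbitrary choices):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.linspace(0, 10, 200).reshape(-1, 1)
y = 2.0 * X.ravel()                    # a simple increasing linear trend

tree = DecisionTreeRegressor(max_depth=4).fit(X, y)
print(tree.predict([[10.0], [50.0], [1000.0]]))
# All three predictions are identical: anything at or beyond the right edge of the
# training range falls into the same rightmost leaf, whose stored mean is the output.
```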
Why trees ensemble so well
Trees are perfect base learners for ensembles because they are high-variance, low-bias — averaging or boosting many trees reduces variance without reintroducing bias. The next three pages — random forests, AdaBoost, gradient boosting — each operationalise this differently.
What to read next
- Random Forests — bagging trees + feature subsampling.
- Gradient Boosting — sequentially fit trees to residuals.
- AdaBoost — historically the first boosted ensemble of stumps.