Decision Trees
A decision tree is a recursive partitioning of feature space — at each internal node, split on a single feature with a threshold; each leaf has a predicted label or value. Single trees are interpretable but high-variance; ensembled into random forests and gradient-boosted trees they become some of the strongest classical-ML models, especially on tabular data.
How a tree predicts
Each leaf $\ell$ stores a prediction computed from the training points that reach it:
- Classification — the majority class of the training points in $\ell$, or class probabilities proportional to the class counts.
- Regression — the mean of the target values $y_i$ in $\ell$.
To predict for a new point $x$, start at the root and follow the split at each internal node until $x$ lands in a leaf; output that leaf's value.
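A minimal sketch of that routing, assuming a hand-rolled Node structure (the class and field names below are illustrative, not any library's internals):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    feature: Optional[int] = None   # feature index to split on; None marks a leaf
    threshold: float = 0.0          # go left if x[feature] <= threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    value: float = 0.0              # leaf output: majority class or mean target


def predict_one(node: Node, x) -> float:
    """Route one sample from the root to a leaf and return the leaf's value."""
    while node.feature is not None:  # internal node: follow its split
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value                # leaf reached
```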
How a tree is built (greedy recursion)
The algorithm is depth-first greedy:
- At the current node with training set $S$, pick the (feature, threshold) pair that maximises a split-quality measure.
- Split $S$ into $S_{\text{left}}$ and $S_{\text{right}}$ accordingly.
- Recurse on each subset.
- Stop when a stopping criterion is met (max depth, min samples per leaf, no improvement).
The choice of split-quality measure defines the algorithm.
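A sketch of that recursion for a regression tree with the squared-error criterion, reusing the illustrative Node structure from above; the brute-force threshold search and the default stopping rules are for clarity, not efficiency:

```python
import numpy as np


def build(X, y, depth=0, max_depth=5, min_samples_leaf=2):
    # Stopping criteria: depth limit, too few samples, or an already-pure node.
    if depth >= max_depth or len(y) < 2 * min_samples_leaf or np.all(y == y[0]):
        return Node(value=float(np.mean(y)))

    best = None  # (total squared error, feature index, threshold)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:  # candidate thresholds for feature j
            left = X[:, j] <= t
            if left.sum() < min_samples_leaf or (~left).sum() < min_samples_leaf:
                continue
            # Split quality: summed squared error of the two children (lower is better).
            sse = ((y[left] - y[left].mean()) ** 2).sum() \
                + ((y[~left] - y[~left].mean()) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, j, float(t))

    if best is None:  # no admissible split: make this node a leaf
        return Node(value=float(np.mean(y)))

    _, j, t = best
    mask = X[:, j] <= t
    return Node(feature=j, threshold=t,
                left=build(X[mask], y[mask], depth + 1, max_depth, min_samples_leaf),
                right=build(X[~mask], y[~mask], depth + 1, max_depth, min_samples_leaf))
```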
Split criteria
For classification, the standard measures are impurity reductions:
- Gini impurity — $G = 1 - \sum_k p_k^2$, where $p_k$ is the fraction of class $k$ at the node. Minimised when one class dominates.
- Entropy — $H = -\sum_k p_k \log_2 p_k$. The information-theoretic measure.
A split's quality is the weighted-average impurity of children, subtracted from the parent's impurity (the information gain). Gini and entropy give very similar trees — Gini is slightly cheaper to compute and is CART's default.
For regression, the standard split is variance reduction: minimise the weighted average of the children's variances, equivalently the summed squared error $\sum_{i \in S_{\text{left}}} (y_i - \bar{y}_{\text{left}})^2 + \sum_{i \in S_{\text{right}}} (y_i - \bar{y}_{\text{right}})^2$.
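A minimal sketch of these measures; labels here is any 1-D array of non-negative integer class labels, and the function names are illustrative rather than a library's API:

```python
import numpy as np


def gini(labels):
    p = np.bincount(labels) / len(labels)
    return 1.0 - np.sum(p ** 2)        # 1 - sum_k p_k^2


def entropy(labels):
    p = np.bincount(labels) / len(labels)
    p = p[p > 0]                       # drop empty classes to avoid log(0)
    return -np.sum(p * np.log2(p))     # -sum_k p_k log2 p_k


def split_gain(parent, left, right, impurity=gini):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = len(left) / n * impurity(left) + len(right) / n * impurity(right)
    return impurity(parent) - weighted
```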
Three classical algorithms
- ID3 (Quinlan, 1986) — entropy-based, categorical features only, no pruning.
- C4.5 (Quinlan, 1993) — handles continuous features, missing values, post-hoc pruning. Information-gain-ratio normalises against split fan-out.
- CART (Breiman et al., 1984) — Classification and Regression Trees. Binary splits, Gini impurity for classification, squared error for regression. The basis of most modern implementations including scikit-learn's.
Overfitting and pruning
A fully-grown tree can perfectly memorise the training set — every leaf is a single training point. This generalises terribly. Two control strategies:
- Pre-pruning — stop splitting early. Limit max depth, min samples per leaf, or minimum impurity decrease.
- Post-pruning — grow the tree fully, then collapse subtrees that don't help on a validation set. Cost-complexity pruning (CART's α-pruning) selects the subtree $T$ minimising $R(T) + \alpha\,|T|$, where $R(T)$ is the subtree's training error and $|T|$ its number of leaves.
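In scikit-learn this is exposed through the ccp_alpha parameter; the sketch below picks the best α on a held-out split (the dataset and the simple search over the pruning path are illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Candidate alphas come from the fully grown tree's cost-complexity pruning path.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

# Refit at each alpha and keep the tree that scores best on the validation split.
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr)
     for a in path.ccp_alphas),
    key=lambda tree: tree.score(X_val, y_val),
)
print(best.get_n_leaves(), best.score(X_val, y_val))
```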
Even with pruning, single trees have high variance — small changes in the training set produce very different trees. This is the central motivation for ensembles.
Strengths and weaknesses
Strengths:
- Interpretable — easy to read, even for non-experts. Each prediction has a clear path of "if-then" rules.
- No feature scaling needed — splits are scale-invariant.
- Handles mixed types — continuous, categorical, ordinal features all work.
- Handles missing values — surrogate splits in CART, or just sending NaN down a chosen branch in modern implementations.
- Captures non-linear interactions automatically.
Weaknesses:
- High variance as a single tree. Mostly fixed by ensembling.
- Greedy — no global optimisation; bad early splits can't be undone.
- Axis-aligned only — diagonal decision boundaries require many splits.
- Extrapolation fails — predictions are constant within each leaf, so trees don't extrapolate beyond their training range (see the short demo below).
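A quick demonstration of the extrapolation failure on synthetic data (the linear toy target and the depth limit are arbitrary choices):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.linspace(0, 10, 200).reshape(-1, 1)
y = 2.0 * X.ravel()                    # a simple increasing linear trend

tree = DecisionTreeRegressor(max_depth=4).fit(X, y)
print(tree.predict([[10.0], [50.0], [1000.0]]))
# All three predictions are identical: anything at or beyond the right edge of the
# training range falls into the same rightmost leaf, whose stored mean is the output.
```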
Why trees ensemble so well
Trees are perfect base learners for ensembles because they are high-variance, low-bias — averaging or boosting many trees reduces variance without reintroducing bias. The next three pages — random forests, AdaBoost, gradient boosting — each operationalise this differently.
What to read next
- Random Forests — bagging trees + feature subsampling.
- Gradient Boosting — sequentially fit trees to residuals.
- AdaBoost — historically the first boosted ensemble of stumps.