Bagging & Random Forests
A single decision tree has high variance — small changes in the training set produce wildly different trees. Bagging (Bootstrap AGGregatING; Breiman, 1996) reduces this variance by averaging many trees fit on bootstrap resamples. Random forests (Breiman, 2001) extend bagging with random feature subsampling at each split, decorrelating the individual trees and giving the strongest variance reduction. Random forests have remained one of the most reliable tabular-data baselines for two decades.
Bagging
Given a training set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$, draw $B$ bootstrap resamples $\mathcal{D}^{*1}, \dots, \mathcal{D}^{*B}$ (each of size $n$, sampled with replacement), fit a tree $\hat{f}_b$ to each, and average the predictions:

$$\hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(x).$$

For classification, take a majority vote instead of an average. If the trees are identically distributed with variance $\sigma^2$ and pairwise correlation $\rho$, the variance of the average is

$$\operatorname{Var}\!\left[\hat{f}_{\text{bag}}(x)\right] = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2,$$

where the second term vanishes as $B$ grows but the first does not: the residual correlation $\rho$ between trees puts a floor on the achievable variance reduction.
Bagging works well only if the base learner is unstable (high-variance). Decision trees fit the bill perfectly. Bagging stable learners (linear models) gives little improvement.
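To make the averaging concrete, here is a minimal bagging sketch for regression. It assumes scikit-learn's DecisionTreeRegressor as the unstable base learner and uses a toy noisy-sine dataset purely for illustration; the function and variable names are ours, not from any library.

```python
# Minimal bagging sketch: fit B unpruned trees on bootstrap resamples, average predictions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def fit_bagged_trees(X, y, n_trees=100):
    """Fit n_trees deep trees, each on a bootstrap resample of (X, y)."""
    trees = []
    n = len(X)
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)        # draw n rows with replacement
        tree = DecisionTreeRegressor()          # unpruned: low bias, high variance
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_bagged(trees, X):
    """Average the individual tree predictions (use a majority vote for classification)."""
    return np.mean([t.predict(X) for t in trees], axis=0)

# Toy usage: noisy sine curve.
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)
trees = fit_bagged_trees(X, y)
print(predict_bagged(trees, X[:5]))
```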
Out-of-bag estimation
Each bootstrap sample omits about 37% of the training points: the probability that point $i$ is never drawn in $n$ draws with replacement is $(1 - 1/n)^n \to e^{-1} \approx 0.368$. Each tree can therefore be scored on the points it never saw, and averaging these predictions over all trees gives the out-of-bag (OOB) error.
OOB error tracks test error closely, eliminating the need for a separate held-out validation set in many practical settings.
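A hedged sketch of OOB estimation in practice, assuming scikit-learn ≥ 1.2 (where the base-learner keyword is `estimator`) and using the bundled breast-cancer dataset purely as a stand-in:

```python
# OOB accuracy from a bagged ensemble: each point is scored only by trees that never saw it.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # unstable base learner
    n_estimators=200,
    oob_score=True,       # compute the out-of-bag score during fitting
    random_state=0,
).fit(X, y)

print(f"OOB accuracy: {bag.oob_score_:.3f}")  # typically close to held-out test accuracy
```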
Random forests
Bagged trees are correlated through their shared training distribution. Random forests reduce this correlation $\rho$ by restricting each split to a random subset of $m$ of the $p$ features.
Standard choices:
- Classification: $m = \lfloor \sqrt{p} \rfloor$.
- Regression: $m = \lfloor p/3 \rfloor$.
This injects independent randomness into every split, sharply reducing $\rho$ and with it the variance floor $\rho\,\sigma^2$ in the formula above.
Other practical defaults:
- Grow each tree fully (no pruning): an unpruned tree keeps bias low, and the averaging already handles the variance.
- Use a few hundred trees; performance saturates around 500 for most problems (see the sketch below).
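A sketch of these defaults with scikit-learn's RandomForestClassifier; the dataset and the 5-fold cross-validation are just illustrative choices (note that scikit-learn's own regression default for `max_features` differs from the classical $p/3$ rule):

```python
# Random forest with the standard defaults: ~500 fully grown trees, m = sqrt(p) features per split.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(
    n_estimators=500,      # performance typically saturates around here
    max_features="sqrt",   # try floor(sqrt(p)) features at each split
    max_depth=None,        # grow each tree fully; no pruning
    n_jobs=-1,
    random_state=0,
)
print(f"CV accuracy: {cross_val_score(rf, X, y, cv=5).mean():.3f}")
```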
Feature importance
Random forests provide two natural importance measures:
- Mean decrease in impurity (MDI / Gini importance) — sum of impurity decreases over all splits using a feature, averaged across trees. Cheap but biased toward high-cardinality features.
- Permutation importance — measure OOB accuracy, then permute one column and re-measure; the drop is that feature's importance. Slower, but far less biased toward high-cardinality features.
Both are diagnostic, not formal hypothesis tests. Random forest feature importances are the standard quick-look tool when you want to know "what matters?" in a tabular problem.
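A sketch contrasting the two measures on a fitted forest, assuming scikit-learn; here a held-out split stands in for the OOB data when computing permutation importance, and the dataset is again just an example:

```python
# Compare MDI (Gini) importance with permutation importance on the same forest.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(data.data, data.target, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

mdi = rf.feature_importances_                   # mean decrease in impurity, from training splits
perm = permutation_importance(rf, X_te, y_te,   # accuracy drop after shuffling each column
                              n_repeats=10, random_state=0)

for name, m, p in zip(data.feature_names, mdi, perm.importances_mean):
    print(f"{name:25s}  MDI={m:.3f}  perm={p:.3f}")
```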
When random forests win
- Tabular data, mixed types — RFs are typically the strongest single-method baseline.
- Limited data — OOB estimation gives clean cross-validation almost for free.
- You want a quick, defensible read on feature importance.
- No tuning budget — RFs work well out-of-the-box; the hyperparameters are forgiving.
When gradient boosting (XGBoost, LightGBM, CatBoost) wins:
- Slightly higher accuracy ceiling on most tabular benchmarks at the cost of more careful hyperparameter tuning.
Both belong in the toolkit: RFs as the no-tune baseline, GBMs as the tuned competitor.
What to read next
- Decision Trees — the base learner.
- Gradient Boosting — the sequential alternative to bagging.
- AdaBoost — the historical predecessor of gradient boosting.