AdaBoost
AdaBoost — Adaptive Boosting — was the first practical boosting algorithm and the conceptual breakthrough that "weak learners can be combined into a strong learner". It paved the way for gradient boosting and the entire family of sequentially-trained ensembles. Despite its 1995 vintage, it remains one of the cleanest illustrations of how reweighting hard examples produces strong classifiers.
The algorithm
Given training data $(x_1, y_1), \dots, (x_n, y_n)$ with labels $y_i \in \{-1, +1\}$.
Initialise weights $w_i = 1/n$ and repeat for $t = 1, \dots, T$:
- Fit weak classifier $h_t$ to the weighted training set, minimising the weighted error $\epsilon_t = \sum_i w_i \, \mathbf{1}[h_t(x_i) \ne y_i]$.
- Compute the classifier weight $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$.
- Update sample weights: $w_i \leftarrow w_i \exp(-\alpha_t y_i h_t(x_i))$, then renormalise so $\sum_i w_i = 1$.
The final classifier is $H(x) = \operatorname{sign}\!\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$.
Misclassified examples have their weights increased; correctly-classified examples have them decreased. The next weak learner is forced to focus on the points its predecessors got wrong.
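The loop above can be sketched in plain Python with exhaustive decision stumps as the weak learner. This is a minimal illustration, not the paper's reference implementation: the function names and the toy interval dataset are invented for the example.

```python
import math

def stump_predict(feature, threshold, polarity, x):
    # A decision stump: threshold a single feature, output +1 or -1.
    return polarity if x[feature] >= threshold else -polarity

def fit_stump(X, y, w):
    # Exhaustive search over features, thresholds, and polarities for
    # the stump with the lowest weighted error on distribution w.
    best = None
    for f in range(len(X[0])):
        for thr in sorted({x[f] for x in X}):
            for pol in (1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict(f, thr, pol, xi) != yi)
                if best is None or err < best[0]:
                    best = (err, f, thr, pol)
    return best

def adaboost(X, y, rounds=10):
    n = len(X)
    w = [1.0 / n] * n                       # start from uniform weights
    ensemble = []
    for _ in range(rounds):
        err, f, thr, pol = fit_stump(X, y, w)
        err = max(err, 1e-12)               # avoid log-of-zero on a perfect stump
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, f, thr, pol))
        # Misclassified points are up-weighted, correct ones down-weighted.
        w = [wi * math.exp(-alpha * yi * stump_predict(f, thr, pol, xi))
             for xi, yi, wi in zip(X, y, w)]
        z = sum(w)
        w = [wi / z for wi in w]            # renormalise to a distribution
    return ensemble

def predict(ensemble, x):
    score = sum(a * stump_predict(f, thr, pol, x)
                for a, f, thr, pol in ensemble)
    return 1 if score >= 0 else -1

# Toy 1-D "interval" problem: no single stump separates it,
# but a handful of boosted stumps classify it perfectly.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0]]
y = [-1, -1, 1, 1, 1, 1, -1, -1]
model = adaboost(X, y, rounds=10)
print([predict(model, xi) for xi in X])  # matches y exactly
```

Note how the interval dataset forces cooperation: the first stump cuts on one side, the next rounds up-weight the points it got wrong and cut on the other side.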
The weak learner
The canonical weak learner is a decision stump, a depth-one tree that thresholds a single feature. A stump is about the simplest possible classifier; AdaBoost combines hundreds of them to produce a strong one. This is the surprising part: weighted sums of stumps form a much richer hypothesis class than any single stump, and AdaBoost finds an effective combination greedily.
The weak learner condition is mild: each $h_t$ need only achieve weighted error $\epsilon_t < 1/2$ on the current distribution, i.e. do slightly better than random guessing.
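The condition maps directly onto the classifier weight: $\alpha_t$ is positive exactly when $\epsilon_t < 1/2$, and shrinks to zero as a learner approaches chance. A quick numeric check (the function name is illustrative):

```python
import math

def classifier_weight(eps):
    # alpha_t = 0.5 * ln((1 - eps) / eps): positive iff eps < 1/2.
    return 0.5 * math.log((1.0 - eps) / eps)

print(classifier_weight(0.10))  # strong learner: large positive weight
print(classifier_weight(0.49))  # barely better than chance: tiny weight
print(classifier_weight(0.50))  # exactly chance: weight 0, no contribution
```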
Margin theory
AdaBoost dramatically reduces training error — exponentially fast, in fact: the training error of $H$ is at most $\prod_{t=1}^{T} 2\sqrt{\epsilon_t (1 - \epsilon_t)}$,
which is $\le \exp\!\left(-2 \sum_{t=1}^{T} \gamma_t^2\right)$, where $\gamma_t = 1/2 - \epsilon_t$ is round $t$'s edge over random guessing.
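The exponential training-error bound is easy to check numerically. The sketch below assumes a constant per-round edge of $\gamma = 0.1$ (i.e. $\epsilon_t = 0.4$), chosen purely for illustration:

```python
import math

def error_bound(epsilons):
    # prod_t 2*sqrt(eps_t * (1 - eps_t)) upper-bounds the training error.
    b = 1.0
    for e in epsilons:
        b *= 2.0 * math.sqrt(e * (1.0 - e))
    return b

print(error_bound([0.4] * 10))   # shrinks after 10 rounds
print(error_bound([0.4] * 100))  # much smaller after 100 rounds
# Each factor 2*sqrt(eps(1-eps)) = sqrt(1 - 4*gamma^2) <= exp(-2*gamma^2),
# so the product stays below exp(-2 * sum_t gamma_t^2):
print(math.exp(-2 * 100 * 0.1 ** 2))
```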
Yet test error often keeps decreasing even after training error reaches zero. Schapire et al.'s margin-based explanation (1998): AdaBoost continues to enlarge the classification margin of training examples even past zero training error, and large margins generalise. This was an early hint at the implicit regularisation that over-parameterised deep networks later exhibited.
Exponential loss
AdaBoost is equivalent to greedy, stagewise minimisation of the exponential loss $L(f) = \sum_{i=1}^{n} \exp(-y_i f(x_i))$ over additive models $f(x) = \sum_t \alpha_t h_t(x)$.
In each round, the algorithm picks the weak classifier $h_t$ and weight $\alpha_t$ that most reduce this loss; the exponential weight update and the closed-form $\alpha_t$ are exactly the resulting coordinate-descent step.
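One consequence can be verified directly: for a fixed weak classifier with weighted error $\epsilon$, the closed-form $\alpha = \frac{1}{2}\ln\frac{1-\epsilon}{\epsilon}$ minimises the round's exponential loss. A small sketch, with illustrative names and an arbitrary $\epsilon = 0.2$:

```python
import math

def weighted_exp_loss(alpha, eps):
    # One round's exponential loss under the current weight distribution:
    # correct mass (1 - eps) pays e^{-alpha}, misclassified mass eps pays e^{+alpha}.
    return (1.0 - eps) * math.exp(-alpha) + eps * math.exp(alpha)

eps = 0.2
closed_form = 0.5 * math.log((1.0 - eps) / eps)
# Scan a grid around the closed form: nothing on it does better.
grid = [closed_form + d / 100.0 for d in range(-50, 51)]
best = min(grid, key=lambda a: weighted_exp_loss(a, eps))
print(abs(best - closed_form))  # 0.0: the closed form is the minimiser
```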
Strengths and weaknesses
Strengths:
- Few hyperparameters — the number of rounds $T$ is essentially the only knob.
- Strong margin guarantees — performs well even past zero training error.
- Provable — clean PAC-style learning bounds.
Weaknesses:
- Sensitive to label noise — exponential loss heavily penalises misclassified points; a mislabelled outlier gets reweighted to dominate the training distribution. Real-world noisy data hurts AdaBoost more than random forests.
- Sequential — cannot be parallelised across rounds (unlike bagging).
- Mostly displaced by gradient boosting — XGBoost / LightGBM use the same sequential idea with smarter losses (logistic, squared) and tree structure.
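The label-noise weakness is visible in the weight dynamics alone. A stylised simulation, assuming one mislabelled outlier that every round's weak learner gets "wrong" and a fixed classifier weight of $\alpha = 0.5$ (both assumptions are for illustration):

```python
import math

# One outlier vs. 99 clean points; the outlier is up-weighted every round.
n, rounds, alpha = 100, 20, 0.5
w_outlier = w_other = 1.0 / n
for _ in range(rounds):
    w_outlier *= math.exp(alpha)        # misclassified: up-weighted
    w_other *= math.exp(-alpha)         # correct: down-weighted
    z = w_outlier + (n - 1) * w_other   # renormalise to a distribution
    w_outlier, w_other = w_outlier / z, w_other / z
print(w_outlier)  # close to 1.0: the outlier dominates the distribution
```

After 20 rounds the single mislabelled point carries nearly all of the training distribution, which is why exponential-loss boosting degrades on noisy labels where bagging does not.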
Legacy
AdaBoost is rarely the right modern choice but is the conceptual ancestor of every modern boosting algorithm. Reading the 1995 paper alongside an XGBoost tutorial makes the lineage clean: same sequential structure, same focus-on-residual mechanic, but generalised loss and tree-specific optimisations.
What to read next
- Gradient Boosting — the modern descendant.
- Decision Trees — the typical weak learner.
- Random Forests — the bagging-based contrast.