Cross-Validation & Model Selection
Training error is biased downward — the model has seen the training data and gets to fit it. To estimate true generalisation error, you need data the model has not seen. Cross-validation is the standard technique for estimating this from a limited dataset, and the foundation of essentially all model selection in classical ML.
The train/validation/test split
The cleanest setup, when data is plentiful:
- Training set — fit model parameters.
- Validation set — tune hyperparameters and pick architectures.
- Test set — final evaluation; touched at most once.
Touching the test set during model development biases the estimate (you implicitly select for noise patterns specific to that set). The 70/15/15 or 80/10/10 split is conventional but not load-bearing — what matters is that the test set is large enough for a reliable estimate and is treated as a one-shot resource.
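A minimal sketch of such a split with scikit-learn, on a synthetic dataset for illustration (two calls to `train_test_split` produce a 70/15/15 partition):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First carve off 70% for training...
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, random_state=0)
# ...then split the remaining 30% evenly into validation and test.
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, random_state=0)
```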
k-fold cross-validation
When the dataset is too small for a fixed validation split, reuse the data:
- Split the data into $k$ disjoint folds.
- For each fold $i$: train on the other $k - 1$ folds, evaluate on fold $i$.
- Average the $k$ validation scores.
Common choices are $k = 5$ or $k = 10$, which balance compute against the variance of the estimate.
Leave-one-out CV ($k = n$) trains $n$ models, each on all but one data point. The estimate is nearly unbiased, but it is expensive and its variance can be high, so it is mostly reserved for very small datasets.
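A k-fold sketch using scikit-learn's `cross_val_score`, again on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, random_state=0)

# 5-fold CV: fits five models, each scored on its held-out fold.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())  # average score and its spread
```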
Stratified and time-series CV
Two common pitfalls:
- Class imbalance. Random folds can produce splits with unrepresentative class proportions. Stratified CV keeps each fold's class distribution close to the dataset's.
- Temporal data. Random splits leak the future into the past. For time series, use forward-chaining CV: fold $i$ trains on everything before time $t_i$ and validates on the interval starting at $t_i$. This is what's needed for any deployment context where you predict the future from the past.
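Both variants are one import away in scikit-learn; a sketch on toy data:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 15 + [1] * 5)  # imbalanced labels: 25% positive

# Stratified folds: every validation fold keeps the 25% positive rate.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for _, val_idx in skf.split(X, y):
    print(y[val_idx].mean())  # 0.25 in each fold

# Forward-chaining folds: validation indices always follow training indices.
for train_idx, val_idx in TimeSeriesSplit(n_splits=4).split(X):
    assert train_idx.max() < val_idx.min()
```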
Hyperparameter selection: nested CV
If you use CV for both hyperparameter selection and final evaluation on the same folds, you've validated hyperparameters on the very data used to score the model — biased upward. Nested CV addresses this:
- Outer loop: $K$-fold for evaluation.
- Inner loop: $K'$-fold within each outer training set for hyperparameter selection.
Quadratic compute cost ($K \times K'$ model fits per candidate configuration) means nested CV is usually reserved for small datasets, which is exactly where the optimistic bias it corrects matters most.
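In scikit-learn, the nesting falls out of composing `GridSearchCV` (the inner loop) with `cross_val_score` (the outer loop); the grid and fold counts below are illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

# Inner loop: 3-fold grid search over C, rerun inside each outer training set.
inner = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=3)

# Outer loop: 5-fold evaluation of the whole tuning procedure.
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())  # estimate of the tuned model's generalisation
```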
Information criteria as an alternative
When CV is too expensive (huge models, many candidate configurations), information criteria estimate generalisation analytically from the fitted model on the training set alone:
- AIC $= 2k - 2\ln\hat{L}$ — Akaike Information Criterion. Approximates the KL divergence from the true distribution.
- BIC $= k\ln n - 2\ln\hat{L}$ — Bayesian Information Criterion. Approximates the (log) marginal likelihood.
Here $\hat{L}$ is the maximised likelihood, $k$ the number of fitted parameters, and $n$ the number of data points; lower is better. Both penalise model size through $k$, but BIC's $\ln n$ factor exceeds AIC's factor of 2 once $n \ge 8$, so BIC tends to select smaller models on large datasets.
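Both criteria are a two-line computation once you have the maximised log-likelihood. The sketch below uses ordinary least squares with Gaussian errors, where that log-likelihood has a closed form; counting the noise variance in $k$ is one common convention, assumed here:

```python
import numpy as np

def gaussian_loglik(resid):
    """Maximised Gaussian log-likelihood given OLS residuals."""
    n = len(resid)
    sigma2 = (resid ** 2).mean()  # MLE of the noise variance
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=100)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
ll = gaussian_loglik(y - X @ beta)
k, n = X.shape[1] + 1, len(y)  # 3 coefficients + noise variance

aic = 2 * k - 2 * ll
bic = k * np.log(n) - 2 * ll   # heavier penalty once ln(n) > 2
```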
Modern caveats
- Deep learning — full $k$-fold CV on million-parameter networks is prohibitively expensive. The default is a single train/val/test split, with the validation set used for early stopping and architecture choices (see the sketch after this list).
- Large pretraining — for foundation-model pretraining, "validation" is often a small held-out slice plus a portfolio of downstream evaluations. Cross-validation in the classical sense is rarely used.
- Test-set leakage on the web — modern test benchmarks risk being scraped into pretraining corpora. See test-set contamination.
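A runnable toy version of patience-based early stopping, using plain gradient descent on a linear model so the loop is self-contained; the patience threshold and learning rate are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=200)
X_tr, y_tr, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

w = np.zeros(10)
best_loss, best_w, patience, bad = np.inf, w.copy(), 10, 0
for epoch in range(1000):
    grad = X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)  # full-batch gradient
    w -= 0.05 * grad
    val_loss = ((X_val @ w - y_val) ** 2).mean()
    if val_loss < best_loss - 1e-6:
        best_loss, best_w, bad = val_loss, w.copy(), 0  # checkpoint best weights
    else:
        bad += 1
        if bad >= patience:  # validation stopped improving: stop training
            break
w = best_w  # restore the best checkpoint
```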
What to read next
- Bias-Variance Tradeoff — what the validation curve is measuring.
- Generalization & VC Dimension — the theoretical complement to empirical validation.
- Regularization — the most common hyperparameter cross-validation tunes.