Probability & Statistics Primer
Machine learning is mostly applied probability. Models are families of probability distributions; training is selecting one from the family; evaluation is comparing the selected distribution to held-out data. This page is the minimum probability/statistics vocabulary the rest of the curriculum assumes.
Random variables and distributions
A random variable $X$ maps random outcomes to values; its distribution assigns a probability mass $P(X = x)$ (discrete case) or a density $p(x)$ (continuous case) to each value.
Three distributions to know cold:
- Bernoulli($p$) — single coin flip; $P(X = 1) = p$, $P(X = 0) = 1 - p$.
- Gaussian ($\mathcal{N}(\mu, \sigma^2)$) — density $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$. The default noise model and the limit of every well-behaved CLT-style sum.
- Categorical($\pi_1, \dots, \pi_K$) — multi-class generalisation of Bernoulli; $P(X = k) = \pi_k$ with $\sum_k \pi_k = 1$. Softmax outputs are categorical parameters.
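To make the parameter conventions concrete, here is a minimal NumPy sketch (seed and sample sizes arbitrary) that draws from each distribution and checks the empirical statistics against the parameters.

```python
# A minimal sketch of the three distributions using NumPy's random module.
import numpy as np

rng = np.random.default_rng(0)

# Bernoulli(p): a single coin flip with P(X=1) = p.
p = 0.3
bernoulli_draws = rng.binomial(n=1, p=p, size=10_000)
print(bernoulli_draws.mean())  # ~0.3, the empirical estimate of p

# Gaussian N(mu, sigma^2): the default noise model.
mu, sigma = 1.0, 2.0
gaussian_draws = rng.normal(loc=mu, scale=sigma, size=10_000)
print(gaussian_draws.mean(), gaussian_draws.std())  # ~1.0, ~2.0

# Categorical(pi): each draw picks class k with probability pi_k.
pi = np.array([0.2, 0.5, 0.3])
categorical_draws = rng.choice(len(pi), p=pi, size=10_000)
print(np.bincount(categorical_draws) / len(categorical_draws))  # ~pi
```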
Expectation, variance, covariance
For a function $f$ of a random variable $X \sim p$, the expectation is $\mathbb{E}[f(X)] = \sum_x p(x)\, f(x)$ (an integral in the continuous case). The variance measures spread: $\mathrm{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$.
Linearity of expectation: $\mathbb{E}[aX + bY] = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y]$ for any random variables $X$ and $Y$, whether or not they are independent.
For two random variables, covariance measures how they vary together: $\mathrm{Cov}(X, Y) = \mathbb{E}\big[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\big]$. Independence implies zero covariance, but the converse does not hold.
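These identities are easy to sanity-check numerically. A small NumPy sketch, on arbitrary synthetic data, verifying linearity of expectation, the variance shortcut, and near-zero covariance for independent draws:

```python
# A small NumPy check of the identities above on simulated data.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(loc=2.0, scale=1.0, size=n)
y = rng.normal(loc=-1.0, scale=3.0, size=n)

# Linearity of expectation: holds for any X, Y, independent or not.
a, b = 4.0, -0.5
print(np.mean(a * x + b * y), a * np.mean(x) + b * np.mean(y))  # nearly equal

# Var(X) = E[X^2] - E[X]^2
print(np.var(x), np.mean(x**2) - np.mean(x)**2)  # equal up to float error

# Independent draws -> empirical covariance near zero.
print(np.cov(x, y)[0, 1])  # ~0
```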
Conditional probability and Bayes' rule
The conditional probability of $A$ given $B$ is $P(A \mid B) = \frac{P(A, B)}{P(B)}$. Factoring the joint both ways gives Bayes' rule: $P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$.
In machine-learning terms, $P(\theta \mid \mathcal{D}) \propto P(\mathcal{D} \mid \theta)\, P(\theta)$: the posterior over parameters is the likelihood times the prior, up to the normalising evidence $P(\mathcal{D})$.
Conditional independence: $X \perp Y \mid Z$ means $P(X, Y \mid Z) = P(X \mid Z)\, P(Y \mid Z)$; it is the structural assumption behind naive Bayes and the factorisations in Bayes nets.
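A worked example makes the prior's pull concrete. The sketch below applies Bayes' rule to a hypothetical diagnostic test; every number is made up for illustration.

```python
# A made-up diagnostic-test example of Bayes' rule; all numbers are illustrative.
prior = 0.01           # P(disease): base rate in the population
sensitivity = 0.95     # P(positive | disease)
false_positive = 0.05  # P(positive | no disease)

# Evidence: P(positive) marginalises over both hypotheses.
p_positive = sensitivity * prior + false_positive * (1 - prior)

# Posterior: P(disease | positive) = P(positive | disease) P(disease) / P(positive)
posterior = sensitivity * prior / p_positive
print(posterior)  # ~0.16: a positive test is far from conclusive at this base rate
```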
Maximum likelihood estimation
Given a parametric model $p_\theta(x)$ and i.i.d. data $x_1, \dots, x_n$, the maximum-likelihood estimate is $\hat{\theta} = \arg\max_\theta \sum_{i=1}^{n} \log p_\theta(x_i)$: the parameters under which the observed data are most probable. Taking logs turns the product of likelihoods into a sum without changing the maximiser.
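As a concrete instance, the Bernoulli MLE has the closed form $\hat{p} = \frac{1}{n}\sum_i x_i$; a quick numerical check (NumPy, synthetic coin flips) shows the log-likelihood peaking there.

```python
# Checking the Bernoulli MLE numerically: the log-likelihood over a grid of
# p values peaks at the sample mean, matching the closed form p_hat = mean(x).
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(n=1, p=0.3, size=1000)  # synthetic coin flips, true p = 0.3

p_grid = np.linspace(0.001, 0.999, 999)
log_lik = x.sum() * np.log(p_grid) + (len(x) - x.sum()) * np.log(1 - p_grid)

print(p_grid[np.argmax(log_lik)], x.mean())  # grid maximiser ~ sample mean
```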
Almost every loss function in ML is a negative log-likelihood under some probabilistic model:
- MSE = NLL of Gaussian noise with fixed variance.
- Cross-entropy = NLL of a categorical model.
- Binary cross-entropy = NLL of a Bernoulli.
This is why training with gradient descent on these losses is simply MLE, carried out by SGD.
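The Gaussian case is easy to verify directly. The sketch below (NumPy, synthetic data) shows that the per-example Gaussian NLL with fixed variance equals the MSE up to a positive scale and an additive constant, so the two losses share a minimiser.

```python
# Numerical check that Gaussian NLL with fixed variance is MSE up to an
# affine transform, so minimising one minimises the other.
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(size=1000)
y_pred = y_true + rng.normal(scale=0.5, size=1000)

sigma2 = 1.0  # fixed noise variance assumed by the Gaussian model
mse = np.mean((y_true - y_pred) ** 2)
nll = np.mean(0.5 * np.log(2 * np.pi * sigma2)
              + (y_true - y_pred) ** 2 / (2 * sigma2))

# nll == mse / (2 * sigma2) + constant
print(mse, nll, mse / (2 * sigma2) + 0.5 * np.log(2 * np.pi * sigma2))
```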
Concentration inequalities
How far does a sample mean stray from the true mean? Three answers in increasing strength:
- Markov: for non-negative $X$, $P(X \ge a) \le \mathbb{E}[X]/a$.
- Chebyshev: $P(|X - \mathbb{E}[X]| \ge a) \le \mathrm{Var}(X)/a^2$.
- Hoeffding: for bounded i.i.d. $X_i \in [a, b]$ with sample mean $\bar{X}_n$, $P\big(|\bar{X}_n - \mathbb{E}[X]| \ge t\big) \le 2\exp\!\big(-2nt^2/(b-a)^2\big)$.
These bounds are the analytical foundation of PAC learning and the generalisation guarantees in classical ML theory.
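To get a feel for how conservative Hoeffding is in practice, here is a small simulation (NumPy; the choices of $n$, $t$, and trial count are arbitrary) comparing the bound to the empirical frequency of large deviations for fair coin flips.

```python
# Comparing the Hoeffding bound to the empirical deviation of sample means
# for fair coin flips (X_i in [0, 1]); a sketch, not a proof.
import numpy as np

rng = np.random.default_rng(0)
n, trials, t = 100, 20_000, 0.1

# trials sample means, each from n Bernoulli(0.5) flips
means = rng.binomial(n=1, p=0.5, size=(trials, n)).mean(axis=1)

empirical = np.mean(np.abs(means - 0.5) >= t)
hoeffding = 2 * np.exp(-2 * n * t**2)  # (b - a)^2 = 1 for coin flips

print(empirical, hoeffding)  # the bound holds, usually with room to spare
```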
Statistical tests and confidence intervals
Hypothesis testing — fix a null hypothesis $H_0$, compute a test statistic on the data, and report the p-value: the probability of a result at least as extreme if $H_0$ were true. The mechanics are simple; the pitfalls are not:
- Multiple testing — running many comparisons inflates the false-positive rate; correct with Bonferroni or Benjamini–Hochberg (BH).
- Optional stopping — peeking at the test set during model selection invalidates the test.
- Confidence interval misinterpretation — a 95% CI is not "95% probability the true value is here"; it is "the procedure produces an interval covering the truth in 95% of repetitions".
For ML practitioners, the most useful tool is the bootstrap: resample with replacement to estimate sampling distributions of any statistic. It avoids most of the above pitfalls and works whenever you have enough data to resample.
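A minimal percentile-bootstrap sketch for a classifier's accuracy follows; the per-example correctness array is synthetic, and 10,000 resamples is an arbitrary but typical choice.

```python
# A minimal bootstrap sketch: a 95% confidence interval for a model's
# accuracy from per-example correctness indicators; data here is synthetic.
import numpy as np

rng = np.random.default_rng(0)
correct = rng.binomial(n=1, p=0.8, size=500)  # stand-in for real eval results

n_boot = 10_000
idx = rng.integers(0, len(correct), size=(n_boot, len(correct)))
boot_accs = correct[idx].mean(axis=1)  # resample with replacement, re-score

lo, hi = np.percentile(boot_accs, [2.5, 97.5])
print(f"accuracy = {correct.mean():.3f}, 95% bootstrap CI = [{lo:.3f}, {hi:.3f}]")
```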
What to read next
- Linear Algebra Recap — covariance matrices, multivariate Gaussians.
- Information Theory — entropy, KL, and mutual information build on probability.
- Bayes Nets — structured probability models.