
Probability & Statistics Primer

Machine learning is mostly applied probability. Models are families of probability distributions; training is selecting one from the family; evaluation is comparing the selected distribution to held-out data. This page is the minimum probability/statistics vocabulary the rest of the curriculum assumes.

Random variables and distributions

A random variable $X$ is a function from a sample space $\Omega$ to $\mathbb{R}$ (or $\mathbb{R}^n$ for vectors). It has a distribution described by its CDF $F_X(x) = P(X \le x)$ and (for continuous variables) a PDF $p_X(x) = \frac{d}{dx} F_X(x)$.

Three distributions to know cold:

  • Bernoulli($p$) — single coin flip; $\mathrm{Var} = p(1-p)$.
  • Gaussian $\mathcal{N}(\mu, \sigma^2)$ — $p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$. The default noise model and the limit of every well-behaved CLT-style sum.
  • Categorical($\pi$) — multi-class generalisation of Bernoulli; $P(X = k) = \pi_k$. Softmax outputs are categorical parameters (see the sketch after this list).
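
A quick numerical sketch of all three, using NumPy and SciPy with arbitrary parameter values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Bernoulli(p): variance p(1 - p)
p = 0.3
flips = rng.random(100_000) < p                     # simulated coin flips
print(flips.mean(), p)                              # empirical mean ~ p
print(flips.var(), p * (1 - p))                     # empirical variance ~ p(1 - p)

# Gaussian N(mu, sigma^2): density at a point, by hand vs SciPy
mu, sigma = 1.0, 2.0
x = 0.5
pdf = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
print(pdf, stats.norm.pdf(x, loc=mu, scale=sigma))  # should agree

# Categorical(pi): softmax outputs are valid categorical parameters
logits = np.array([2.0, 0.5, -1.0])
pi = np.exp(logits - logits.max())
pi /= pi.sum()                                      # probabilities sum to 1
samples = rng.choice(len(pi), size=100_000, p=pi)
print(np.bincount(samples) / len(samples), pi)      # empirical frequencies ~ pi
```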

Expectation, variance, covariance

For a function f of a random variable X,

$$E[f(X)] = \int f(x)\,p(x)\,dx, \qquad \mathrm{Var}(X) = E\big[(X - E[X])^2\big].$$

Linearity of expectation: $E[aX + bY] = aE[X] + bE[Y]$ — even when $X, Y$ are dependent.

For two random variables, covariance $\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])]$ measures linear dependence; correlation $\rho = \mathrm{Cov}(X, Y) / (\sigma_X \sigma_Y)$ scales it to $[-1, 1]$. Independence implies $\mathrm{Cov} = 0$ but not vice versa.
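
These identities are easy to check numerically. The sketch below uses an arbitrary dependent pair $(X, Y)$ to illustrate linearity of expectation, and a symmetric $X$ with $Y = X^2$ to show that zero covariance does not imply independence:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Linearity of expectation holds even for dependent X, Y
x = rng.normal(size=n)
y = 2 * x + rng.normal(size=n)                   # y depends on x
a, b = 3.0, -1.5
print(np.mean(a * x + b * y), a * x.mean() + b * y.mean())

# Covariance and correlation
cov = np.mean((x - x.mean()) * (y - y.mean()))
rho = cov / (x.std() * y.std())
print(cov, np.cov(x, y, bias=True)[0, 1])        # matches NumPy's estimate
print(rho)                                       # lies in [-1, 1]

# Cov = 0 does not imply independence: Z = X^2 with symmetric X
z = x ** 2                                       # fully determined by x
print(np.mean((x - x.mean()) * (z - z.mean())))  # ~ 0 despite dependence
```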

Conditional probability and Bayes' rule

The conditional probability of $A$ given $B$ is $P(A \mid B) = P(A \cap B) / P(B)$. Bayes' rule rearranges it:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}.$$

In machine-learning terms, $P(\theta \mid D) \propto P(D \mid \theta)\,P(\theta)$ — posterior $\propto$ likelihood × prior. This is the central equation of Bayesian inference.
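
A worked example with hypothetical numbers (a screening test with 1% prevalence, 95% sensitivity, 90% specificity, all made up for illustration) shows why the prior matters:

```python
# Hypothetical numbers, chosen only for illustration
prior = 0.01            # P(disease)
sensitivity = 0.95      # P(positive | disease)
false_positive = 0.10   # P(positive | no disease) = 1 - specificity

# Bayes' rule: P(disease | positive) = P(positive | disease) P(disease) / P(positive)
evidence = sensitivity * prior + false_positive * (1 - prior)
posterior = sensitivity * prior / evidence
print(posterior)        # ~0.088: even after a positive test, under 9% probability
```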

Conditional independence: $X \perp Y \mid Z$ iff $P(X, Y \mid Z) = P(X \mid Z)\,P(Y \mid Z)$. Conditional-independence structure is what graphical models (Bayes nets, HMMs, CRFs) exploit to factorise high-dimensional joints into tractable products.
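
A small numerical check of that definition, using a hypothetical binary chain X → Z → Y in which X ⊥ Y given Z holds by construction:

```python
import numpy as np

# Hypothetical chain X -> Z -> Y over binary variables:
# the joint factorises as P(x, z, y) = P(x) P(z | x) P(y | z)
p_x = np.array([0.6, 0.4])                       # P(X = x)
p_z_given_x = np.array([[0.7, 0.3],              # P(Z = z | X = x), rows index x
                        [0.2, 0.8]])
p_y_given_z = np.array([[0.9, 0.1],              # P(Y = y | Z = z), rows index z
                        [0.4, 0.6]])

joint = np.einsum('x,xz,zy->xzy', p_x, p_z_given_x, p_y_given_z)

for z in (0, 1):
    p_xy = joint[:, z, :] / joint[:, z, :].sum()  # P(X, Y | Z = z)
    px = p_xy.sum(axis=1)                         # P(X | Z = z)
    py = p_xy.sum(axis=0)                         # P(Y | Z = z)
    # definition of conditional independence: the joint factorises
    assert np.allclose(p_xy, np.outer(px, py))
```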

Maximum likelihood estimation

Given a parametric model $p_\theta(x)$ and i.i.d. data $D = \{x_1, \dots, x_N\}$, the maximum-likelihood estimator is

$$\hat{\theta}_{\mathrm{MLE}} = \arg\max_\theta \prod_i p_\theta(x_i) = \arg\max_\theta \sum_i \log p_\theta(x_i).$$
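
As a concrete sketch, assume the data are Gaussian with unknown mean and variance: the MLE is the sample mean and the (biased) sample variance, which a numerical maximisation of the log-likelihood recovers:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=5_000)    # arbitrary "true" parameters

def neg_log_likelihood(params):
    mu, log_sigma = params                           # optimise log(sigma) to keep sigma > 0
    sigma = np.exp(log_sigma)
    return np.sum(0.5 * ((data - mu) / sigma) ** 2 + np.log(sigma) + 0.5 * np.log(2 * np.pi))

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])

# For the Gaussian the MLE is also available in closed form:
print(mu_hat, data.mean())                           # sample mean
print(sigma_hat, data.std())                         # biased (MLE) standard deviation
```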

Almost every loss function in ML is a negative log-likelihood under some probabilistic model:

  • MSE = NLL of Gaussian noise with fixed variance.
  • Cross-entropy = NLL of a categorical model.
  • Binary cross-entropy = NLL of a Bernoulli.

This is why training with (stochastic) gradient descent on these losses is just maximum-likelihood estimation.
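
A minimal sketch of the first two correspondences, with a fixed noise variance of 1 for the Gaussian case and softmax outputs for the categorical case:

```python
import numpy as np

rng = np.random.default_rng(0)

# MSE vs Gaussian NLL with fixed variance sigma^2 = 1:
# -log N(y | y_hat, 1) = 0.5 (y - y_hat)^2 + 0.5 log(2 pi),
# so the NLL is 0.5 * MSE plus a constant: same minimiser.
y, y_hat = rng.normal(size=10), rng.normal(size=10)
mse = np.mean((y - y_hat) ** 2)
gaussian_nll = np.mean(0.5 * (y - y_hat) ** 2 + 0.5 * np.log(2 * np.pi))
print(gaussian_nll - 0.5 * mse)                # the constant 0.5 * log(2 * pi)

# Cross-entropy vs categorical NLL: they are the same quantity.
logits = rng.normal(size=(4, 3))               # 4 examples, 3 classes
labels = np.array([0, 2, 1, 1])
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)      # softmax -> categorical parameters
cross_entropy = -np.mean(np.log(probs[np.arange(4), labels]))
print(cross_entropy)
```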

Concentration inequalities

How far does a sample mean stray from the true mean? Three answers, in order of increasing strength (and increasingly strong assumptions):

  • Markov: $P(|X| \ge t) \le E[|X|]/t$.
  • Chebyshev: $P(|X - \mu| \ge k\sigma) \le 1/k^2$.
  • Hoeffding: for bounded i.i.d. $X_i \in [a, b]$, $P(|\bar{X}_n - \mu| \ge t) \le 2\exp\!\left(-2nt^2/(b-a)^2\right)$ (checked numerically in the sketch below).
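
A simulation check of the Hoeffding bound for Bernoulli(0.5) samples, which are bounded in [0, 1]; the empirical tail probability should sit well below the bound:

```python
import numpy as np

rng = np.random.default_rng(0)
n, t, p = 100, 0.1, 0.5                          # sample size, deviation, true mean
trials = 100_000

samples = rng.random((trials, n)) < p            # Bernoulli(0.5), bounded in [0, 1]
deviations = np.abs(samples.mean(axis=1) - p)

empirical = np.mean(deviations >= t)             # P(|mean - mu| >= t), estimated
hoeffding = 2 * np.exp(-2 * n * t ** 2)          # bound with (b - a) = 1
print(empirical, hoeffding)                      # empirical tail prob <= bound
```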

These bounds are the analytical foundation of PAC learning and the generalisation guarantees in classical ML theory.

Statistical tests and confidence intervals

Hypothesis testing — null hypothesis $H_0$, test statistic, p-value — formalises "did this model do better than chance?" The most common ML mistakes here are:

  • Multiple testing — running many comparisons inflates the false-positive rate; correct with Bonferroni or Benjamini–Hochberg (BH).
  • Optional stopping — peeking at the test set during model selection invalidates the test.
  • Confidence interval misinterpretation — a 95% CI is not "95% probability the true value is here"; it is "the procedure produces an interval covering the truth in 95% of repetitions".

For ML practitioners, the most useful tool is the bootstrap: resample with replacement to estimate sampling distributions of any statistic. It avoids most of the above pitfalls and works whenever you have enough data to resample.
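
A minimal bootstrap sketch, assuming the statistic of interest is test-set accuracy; the per-example correctness indicators are simulated here purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are per-example correctness indicators on a held-out test set
correct = rng.random(500) < 0.83                 # simulated: true accuracy 0.83
point_estimate = correct.mean()

# Bootstrap: resample with replacement, recompute the statistic each time
boot_stats = np.array([
    rng.choice(correct, size=len(correct), replace=True).mean()
    for _ in range(10_000)
])
lo, hi = np.percentile(boot_stats, [2.5, 97.5])  # percentile 95% interval
print(point_estimate, (lo, hi))
```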
