
Probability & Statistics Primer

Machine learning is mostly applied probability. Models are families of probability distributions; training is selecting one from the family; evaluation is comparing the selected distribution to held-out data. This page is the minimum probability/statistics vocabulary the rest of the curriculum assumes.

Random variables and distributions

A random variable $X$ is a function from a sample space $\Omega$ to $\mathbb{R}$ (or $\mathbb{R}^n$ for vectors). It has a distribution described by its CDF $F_X(x) = P(X \le x)$ and (for continuous variables) a PDF $p_X(x) = \frac{d}{dx} F_X(x)$.

Three distributions to know cold:

  • Bernoulli($p$) — single coin flip; $\mathrm{Var} = p(1-p)$.
  • Gaussian $\mathcal{N}(\mu, \sigma^2)$ — $p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$. The default noise model and the limit of every well-behaved CLT-style sum.
  • Categorical($\pi$) — multi-class generalisation of Bernoulli; $P(X = k) = \pi_k$. Softmax outputs are categorical parameters (see the sketch after this list).
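
A quick numerical sketch of all three, using NumPy and SciPy with arbitrary parameter values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Bernoulli(p): variance p(1 - p)
p = 0.3
flips = rng.random(100_000) < p                     # simulated coin flips
print(flips.mean(), p)                              # empirical mean ~ p
print(flips.var(), p * (1 - p))                     # empirical variance ~ p(1 - p)

# Gaussian N(mu, sigma^2): density at a point, by hand vs SciPy
mu, sigma = 1.0, 2.0
x = 0.5
pdf = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
print(pdf, stats.norm.pdf(x, loc=mu, scale=sigma))  # should agree

# Categorical(pi): softmax outputs are valid categorical parameters
logits = np.array([2.0, 0.5, -1.0])
pi = np.exp(logits - logits.max())
pi /= pi.sum()                                      # probabilities sum to 1
samples = rng.choice(len(pi), size=100_000, p=pi)
print(np.bincount(samples) / len(samples), pi)      # empirical frequencies ~ pi
```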

Expectation, variance, covariance

For a function f of a random variable X,

$$E[f(X)] = \int f(x)\,p(x)\,dx, \qquad \mathrm{Var}(X) = E\big[(X - E[X])^2\big].$$

Linearity of expectation: $E[aX + bY] = aE[X] + bE[Y]$ — even when $X, Y$ are dependent.

For two random variables, covariance $\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])]$ measures linear dependence; correlation $\rho = \mathrm{Cov}(X, Y) / (\sigma_X \sigma_Y)$ scales it to $[-1, 1]$. Independence implies $\mathrm{Cov} = 0$ but not vice versa.
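
These identities are easy to check numerically. The sketch below uses an arbitrary dependent pair $(X, Y)$ to illustrate linearity of expectation, and a symmetric $X$ with $Y = X^2$ to show that zero covariance does not imply independence:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Linearity of expectation holds even for dependent X, Y
x = rng.normal(size=n)
y = 2 * x + rng.normal(size=n)                   # y depends on x
a, b = 3.0, -1.5
print(np.mean(a * x + b * y), a * x.mean() + b * y.mean())

# Covariance and correlation
cov = np.mean((x - x.mean()) * (y - y.mean()))
rho = cov / (x.std() * y.std())
print(cov, np.cov(x, y, bias=True)[0, 1])        # matches NumPy's estimate
print(rho)                                       # lies in [-1, 1]

# Cov = 0 does not imply independence: Z = X^2 with symmetric X
z = x ** 2                                       # fully determined by x
print(np.mean((x - x.mean()) * (z - z.mean())))  # ~ 0 despite dependence
```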

Conditional probability and Bayes' rule

The conditional probability of $A$ given $B$ is $P(A \mid B) = P(A \cap B) / P(B)$. Bayes' rule rearranges it:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}.$$

In machine-learning terms, $P(\theta \mid D) \propto P(D \mid \theta)\,P(\theta)$ — posterior $\propto$ likelihood × prior. This is the central equation of Bayesian inference.
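
A worked example with hypothetical numbers (a screening test with 1% prevalence, 95% sensitivity, 90% specificity, all made up for illustration) shows why the prior matters:

```python
# Hypothetical numbers, chosen only for illustration
prior = 0.01            # P(disease)
sensitivity = 0.95      # P(positive | disease)
false_positive = 0.10   # P(positive | no disease) = 1 - specificity

# Bayes' rule: P(disease | positive) = P(positive | disease) P(disease) / P(positive)
evidence = sensitivity * prior + false_positive * (1 - prior)
posterior = sensitivity * prior / evidence
print(posterior)        # ~0.088: even after a positive test, under 9% probability
```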

Conditional independence: $X \perp Y \mid Z$ iff $P(X, Y \mid Z) = P(X \mid Z)\,P(Y \mid Z)$. Conditional-independence structure is what graphical models (Bayes nets, HMMs, CRFs) exploit to factorise high-dimensional joints into tractable products.
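
A small numerical check of that definition, using a hypothetical binary chain X → Z → Y in which X ⊥ Y given Z holds by construction:

```python
import numpy as np

# Hypothetical chain X -> Z -> Y over binary variables:
# the joint factorises as P(x, z, y) = P(x) P(z | x) P(y | z)
p_x = np.array([0.6, 0.4])                       # P(X = x)
p_z_given_x = np.array([[0.7, 0.3],              # P(Z = z | X = x), rows index x
                        [0.2, 0.8]])
p_y_given_z = np.array([[0.9, 0.1],              # P(Y = y | Z = z), rows index z
                        [0.4, 0.6]])

joint = np.einsum('x,xz,zy->xzy', p_x, p_z_given_x, p_y_given_z)

for z in (0, 1):
    p_xy = joint[:, z, :] / joint[:, z, :].sum()  # P(X, Y | Z = z)
    px = p_xy.sum(axis=1)                         # P(X | Z = z)
    py = p_xy.sum(axis=0)                         # P(Y | Z = z)
    # definition of conditional independence: the joint factorises
    assert np.allclose(p_xy, np.outer(px, py))
```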

Maximum likelihood estimation

Given a parametric model $p_\theta(x)$ and i.i.d. data $D = \{x_1, \dots, x_N\}$, the maximum-likelihood estimator is

$$\hat{\theta}_{\mathrm{MLE}} = \arg\max_\theta \prod_i p_\theta(x_i) = \arg\max_\theta \sum_i \log p_\theta(x_i).$$
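
As a concrete sketch, assume the data are Gaussian with unknown mean and variance: the MLE is the sample mean and the (biased) sample variance, which a numerical maximisation of the log-likelihood recovers:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=5_000)    # arbitrary "true" parameters

def neg_log_likelihood(params):
    mu, log_sigma = params                           # optimise log(sigma) to keep sigma > 0
    sigma = np.exp(log_sigma)
    return np.sum(0.5 * ((data - mu) / sigma) ** 2 + np.log(sigma) + 0.5 * np.log(2 * np.pi))

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])

# For the Gaussian the MLE is also available in closed form:
print(mu_hat, data.mean())                           # sample mean
print(sigma_hat, data.std())                         # biased (MLE) standard deviation
```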

Almost every loss function in ML is a negative log-likelihood under some probabilistic model:

  • MSE = NLL of Gaussian noise with fixed variance.
  • Cross-entropy = NLL of a categorical model.
  • Binary cross-entropy = NLL of a Bernoulli.

This is why training with (stochastic) gradient descent on these losses is just maximum-likelihood estimation.
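
A minimal sketch of the first two correspondences, with a fixed noise variance of 1 for the Gaussian case and softmax outputs for the categorical case:

```python
import numpy as np

rng = np.random.default_rng(0)

# MSE vs Gaussian NLL with fixed variance sigma^2 = 1:
# -log N(y | y_hat, 1) = 0.5 (y - y_hat)^2 + 0.5 log(2 pi),
# so the NLL is 0.5 * MSE plus a constant: same minimiser.
y, y_hat = rng.normal(size=10), rng.normal(size=10)
mse = np.mean((y - y_hat) ** 2)
gaussian_nll = np.mean(0.5 * (y - y_hat) ** 2 + 0.5 * np.log(2 * np.pi))
print(gaussian_nll - 0.5 * mse)                # the constant 0.5 * log(2 * pi)

# Cross-entropy vs categorical NLL: they are the same quantity.
logits = rng.normal(size=(4, 3))               # 4 examples, 3 classes
labels = np.array([0, 2, 1, 1])
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)      # softmax -> categorical parameters
cross_entropy = -np.mean(np.log(probs[np.arange(4), labels]))
print(cross_entropy)
```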

Concentration inequalities

How far does a sample mean stray from the true mean? Three answers, in order of increasing strength (and increasingly strong assumptions):

  • Markov: $P(|X| \ge t) \le E[|X|]/t$.
  • Chebyshev: $P(|X - \mu| \ge k\sigma) \le 1/k^2$.
  • Hoeffding: for bounded i.i.d. $X_i \in [a, b]$, $P(|\bar{X}_n - \mu| \ge t) \le 2\exp\!\left(-2nt^2/(b-a)^2\right)$ (checked numerically in the sketch below).
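
A simulation check of the Hoeffding bound for Bernoulli(0.5) samples, which are bounded in [0, 1]; the empirical tail probability should sit well below the bound:

```python
import numpy as np

rng = np.random.default_rng(0)
n, t, p = 100, 0.1, 0.5                          # sample size, deviation, true mean
trials = 100_000

samples = rng.random((trials, n)) < p            # Bernoulli(0.5), bounded in [0, 1]
deviations = np.abs(samples.mean(axis=1) - p)

empirical = np.mean(deviations >= t)             # P(|mean - mu| >= t), estimated
hoeffding = 2 * np.exp(-2 * n * t ** 2)          # bound with (b - a) = 1
print(empirical, hoeffding)                      # empirical tail prob <= bound
```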

These bounds are the analytical foundation of PAC learning and the generalisation guarantees in classical ML theory.

Statistical tests and confidence intervals

Hypothesis testing — null hypothesis $H_0$, test statistic, p-value — formalises "did this model do better than chance?" The most common ML mistakes here are:

  • Multiple testing — running many comparisons inflates the false-positive rate; correct with Bonferroni or Benjamini–Hochberg (BH).
  • Optional stopping — peeking at the test set during model selection invalidates the test.
  • Confidence interval misinterpretation — a 95% CI is not "95% probability the true value is here"; it is "the procedure produces an interval covering the truth in 95% of repetitions".

For ML practitioners, the most useful tool is the bootstrap: resample with replacement to estimate sampling distributions of any statistic. It avoids most of the above pitfalls and works whenever you have enough data to resample.
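
A minimal bootstrap sketch, assuming the statistic of interest is test-set accuracy; the per-example correctness indicators are simulated here purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are per-example correctness indicators on a held-out test set
correct = rng.random(500) < 0.83                 # simulated: true accuracy 0.83
point_estimate = correct.mean()

# Bootstrap: resample with replacement, recompute the statistic each time
boot_stats = np.array([
    rng.choice(correct, size=len(correct), replace=True).mean()
    for _ in range(10_000)
])
lo, hi = np.percentile(boot_stats, [2.5, 97.5])  # percentile 95% interval
print(point_estimate, (lo, hi))
```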
