
Information Theory

Information theory shows up everywhere in modern ML: the cross-entropy loss, the KL term in a VAE's ELBO, the mutual-information bound behind contrastive learning, the channel-capacity intuition behind compression-as-intelligence. This article is the minimum vocabulary needed before any of those make sense.

Shannon entropy

For a discrete random variable X with distribution p,

H(X) = -\sum_x p(x) \log p(x).

It is the expected number of bits (when the log is base 2) needed to encode a sample of X under an optimal code.
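
A minimal NumPy sketch of this definition (the function name `entropy` and the example distributions are illustrative, not from the source):

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy of a discrete distribution p (array of probabilities)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # terms with p(x) = 0 contribute nothing
    return -np.sum(p * np.log(p)) / np.log(base)

print(entropy([0.5, 0.5]))             # 1.0 bit: a fair coin
print(entropy([0.9, 0.1]))             # ~0.47 bits: a biased coin is more predictable
print(entropy([0.25] * 4))             # 2.0 bits: uniform over four outcomes
```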

Cross-entropy

For two distributions p (true) and q (model),

H(p, q) = -\sum_x p(x) \log q(x).

This is the cross-entropy loss that nearly every classifier minimises.
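
A small sketch of the same quantity, assuming a one-hot "true" distribution as in classification (names and numbers are illustrative):

```python
import numpy as np

def cross_entropy(p, q, base=2):
    """Cross-entropy H(p, q): true distribution p, model distribution q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask])) / np.log(base)

# With a one-hot label, H(p, q) collapses to -log q(correct class),
# which is exactly the per-example classification loss.
p = [0.0, 1.0, 0.0]
q = [0.1, 0.7, 0.2]
print(cross_entropy(p, q))             # -log2(0.7) ≈ 0.515 bits
```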

KL divergence

D_{\mathrm{KL}}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = H(p, q) - H(p).

It measures the extra bits paid by encoding p with a code optimised for q. KL is non-negative, asymmetric, and vanishes iff p=q.
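
A short sketch illustrating those three properties (example distributions are made up for illustration):

```python
import numpy as np

def kl_divergence(p, q, base=2):
    """D_KL(p || q), assuming q(x) > 0 wherever p(x) > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) / np.log(base)

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q))             # ~0.74 bits: extra cost of coding p with q's code
print(kl_divergence(q, p))             # ~0.53 bits: a different value, KL is asymmetric
print(kl_divergence(p, p))             # 0.0: vanishes iff p = q
```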

Mutual information

I(X; Y) = D_{\mathrm{KL}}\big(p(x, y) \,\|\, p(x)\,p(y)\big).

It measures how much knowing Y reduces uncertainty about X. Modern self-supervised methods (InfoNCE, SimCLR) maximise a tractable lower bound on I(view1; view2).
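
A sketch of the definition applied to a joint probability table (the table values are illustrative):

```python
import numpy as np

def mutual_information(joint, base=2):
    """I(X; Y) from a joint table joint[x, y], via D_KL(p(x,y) || p(x) p(y))."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)    # marginal p(x)
    py = joint.sum(axis=0, keepdims=True)    # marginal p(y)
    mask = joint > 0
    return np.sum(joint[mask] * np.log(joint[mask] / (px * py)[mask])) / np.log(base)

# Perfectly correlated: knowing Y removes all 1 bit of uncertainty about X.
print(mutual_information([[0.5, 0.0],
                          [0.0, 0.5]]))      # 1.0
# Independent: the joint factorises, so I(X; Y) = 0.
print(mutual_information([[0.25, 0.25],
                          [0.25, 0.25]]))    # 0.0
```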

Channel capacity (preview)

For a noisy channel with input X and output Y, the capacity is

C = \max_{p(x)} I(X; Y),

the maximum reliable communication rate. Shannon's noisy-channel coding theorem shows that communication with arbitrarily small error is possible at any rate below C, and this result is the entry point into the lecture sequence in information-theory-notes/.
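
As a concrete instance, for a binary symmetric channel with crossover probability f the maximising input distribution is uniform and the capacity has the closed form C = 1 - H_b(f). A small sketch (the numeric values of f are illustrative):

```python
import numpy as np

def binary_entropy(f):
    """H_b(f) in bits."""
    if f in (0.0, 1.0):
        return 0.0
    return -(f * np.log2(f) + (1 - f) * np.log2(1 - f))

# Capacity of a binary symmetric channel: C = 1 - H_b(f) bits per channel use.
for f in (0.0, 0.11, 0.5):
    print(f, 1 - binary_entropy(f))    # 1.0, ~0.5, 0.0
```

At f = 0.5 the output is independent of the input and the capacity drops to zero, matching the intuition that a purely noisy channel carries no information.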

Stub status

Seed introduction. Expand with differential entropy, Jensen's inequality proofs, data-processing inequality, and Fano's inequality.

Released under the MIT License. Content imported and adapted from NoteNextra.