Information Theory
Information theory shows up everywhere in modern ML: the cross-entropy loss, the KL term in a VAE's ELBO, the mutual-information bound behind contrastive learning, the channel-capacity intuition behind compression-as-intelligence. This article covers the minimum vocabulary needed before any of those make sense.
Shannon entropy
For a discrete random variable $X$ with probability mass function $p(x)$, the Shannon entropy is

$$H(X) = -\sum_x p(x) \log p(x).$$

It is the expected number of bits (if the logarithm is base 2) needed to encode a single sample drawn from $p$.
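A minimal NumPy sketch of this definition (the helper name `entropy` and the base-2 default are choices made here for illustration, not taken from the lecture notes):

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy H(X) = -sum_x p(x) log p(x); bits when base=2."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                          # convention: 0 * log 0 = 0
    return -np.sum(p * np.log(p)) / np.log(base)

print(entropy([0.5, 0.5]))   # 1.0   - a fair coin carries exactly one bit
print(entropy([0.9, 0.1]))   # ~0.47 - a biased coin carries less
```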
Cross-entropy
For two distributions $p$ and $q$ over the same alphabet,

$$H(p, q) = -\sum_x p(x) \log q(x).$$

This is the cross-entropy loss that nearly every classifier minimises, with $p$ the one-hot label distribution and $q$ the model's predicted distribution.
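A sketch of the formula used as a loss on a hypothetical one-hot target and softmax output (real frameworks compute this from logits for numerical stability):

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_x p(x) log q(x); natural log, as in most ML losses."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q + eps))

target = [0.0, 0.0, 1.0]     # one-hot label: the true class is the third symbol
probs  = [0.1, 0.2, 0.7]     # model's predicted distribution
print(cross_entropy(target, probs))   # = -log(0.7) ~ 0.357
```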
KL divergence
$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = H(p, q) - H(p).$$

It measures the extra bits paid by encoding samples from $p$ with a code optimised for $q$; it is non-negative and zero exactly when $p = q$.
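A short sketch, reusing the base-2 convention so the result reads in bits:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) log2( p(x) / q(x) ), in bits."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                          # terms with p(x) = 0 contribute nothing
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p, q = [0.5, 0.5], [0.9, 0.1]
print(kl_divergence(p, q))   # ~0.737 bits: fair-coin data through a code built for the biased coin
print(kl_divergence(q, p))   # ~0.531 bits: note that KL is not symmetric
```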
Mutual information
$$I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} = H(X) - H(X \mid Y).$$

It measures how much knowing $Y$ reduces the uncertainty about $X$, and vice versa, since the definition is symmetric in $X$ and $Y$.
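A sketch that computes it directly from a joint probability table (the 2×2 examples are illustrative):

```python
import numpy as np

def mutual_information(joint):
    """I(X; Y) = sum_{x,y} p(x,y) log2[ p(x,y) / (p(x) p(y)) ], in bits."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)      # marginal p(x), shape (|X|, 1)
    py = joint.sum(axis=0, keepdims=True)      # marginal p(y), shape (1, |Y|)
    mask = joint > 0
    return np.sum(joint[mask] * np.log2(joint[mask] / (px * py)[mask]))

# X = Y on two symbols: knowing Y pins down X, so I(X; Y) = H(X) = 1 bit.
print(mutual_information([[0.5, 0.0],
                          [0.0, 0.5]]))    # 1.0
# Independent variables share no information.
print(mutual_information([[0.25, 0.25],
                          [0.25, 0.25]]))  # 0.0
```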
Channel capacity (preview)
For a noisy channel with input $X$ and output $Y$, the capacity

$$C = \max_{p(x)} I(X; Y)$$

is the maximum reliable communication rate in bits per channel use. This is Shannon's celebrated channel-coding theorem and the entry point into the lecture sequence in information-theory-notes/.
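As a concrete preview, the binary symmetric channel (each bit flipped with probability $f$) has the standard closed-form capacity $C = 1 - H_b(f)$; a small sketch using that known result:

```python
import numpy as np

def binary_entropy(p):
    """H_b(p) = -p log2 p - (1 - p) log2 (1 - p)."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def bsc_capacity(flip_prob):
    """Capacity of a binary symmetric channel, in bits per channel use."""
    return 1.0 - binary_entropy(flip_prob)

print(bsc_capacity(0.0))    # 1.0 - a noiseless channel carries one full bit per use
print(bsc_capacity(0.11))   # ~0.5
print(bsc_capacity(0.5))    # 0.0 - the output is independent of the input
```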
What to read next
- The full lecture sequence imported from NoteNextra · CSE5313.
- Probability & Statistics Primer — assumed background.
- VAE — KL appears as the regulariser in the ELBO.
Stub status
Seed introduction. Expand with differential entropy, Jensen's inequality proofs, data-processing inequality, and Fano's inequality.