Word Embeddings
The first idea that made deep learning workable for language: represent each word as a dense vector in a low-dimensional continuous space, learned from the contexts the word appears in.
The skip-gram intuition
Given a corpus, slide a window of size $c$ over it; for every center word $w_I$, the model is trained to predict each context word $w_O$ inside the window with a softmax over output embeddings:

$$p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^\top v_{w_I}\right)}{\sum_{w=1}^{W}\exp\left({v'_{w}}^\top v_{w_I}\right)}$$
Maximising the log-likelihood over the corpus pulls together the embeddings of words that share contexts ("you shall know a word by the company it keeps").
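To make the windowing concrete, here is a minimal sketch of how (center, context) training pairs are generated; the function name, window size, and toy corpus are illustrative, not from the original papers.

```python
# Minimal sketch: generate (center, context) pairs from a symmetric window.
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs for every position in the token list."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

corpus = "the cat sat on the mat".split()
print(list(skipgram_pairs(corpus, window=2))[:5])
# [('the', 'cat'), ('the', 'sat'), ('cat', 'the'), ('cat', 'sat'), ('cat', 'on')]
```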
Negative sampling
The denominator above sums over the entire vocabulary — too expensive. Word2Vec replaces the full softmax with a binary classifier that distinguishes a true context word from $k$ sampled noise words, maximising

$$\log \sigma\left({v'_{w_O}}^\top v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma\left(-{v'_{w_i}}^\top v_{w_I}\right)\right]$$
This is the actual loss in Mikolov et al. and what made training tractable on billion-token corpora.
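As a rough numpy sketch of that objective for a single (center, context) pair: the vocabulary size, dimension, and initialisation are placeholders, and negatives are drawn uniformly here, whereas the paper draws them from the unigram distribution raised to the 3/4 power.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 10_000, 100, 5             # vocab size, embedding dim, negatives per pair
W_in = rng.normal(0, 0.01, (V, d))   # "input" (center-word) embeddings v_w
W_out = rng.normal(0, 0.01, (V, d))  # "output" (context-word) embeddings v'_w

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(center_id, context_id, negative_ids):
    """Negative of the objective: -log σ(v'_c · v_w) - Σ log σ(-v'_n · v_w)."""
    v = W_in[center_id]
    pos = np.log(sigmoid(W_out[context_id] @ v))
    neg = np.log(sigmoid(-W_out[negative_ids] @ v)).sum()
    return -(pos + neg)

negatives = rng.integers(0, V, size=k)   # uniform here; unigram^0.75 in the paper
print(sgns_loss(center_id=42, context_id=7, negative_ids=negatives))
```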
Subword embeddings (fastText)
A serious limitation of Word2Vec: each word is opaque. playing and plays have unrelated embeddings if either is rare. fastText decomposes each word into character n-grams (typically of length 3 to 6, plus the whole word) and represents it as the sum of the n-gram vectors.
This generalises to out-of-vocabulary words and dramatically improves results on morphologically rich languages.
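A toy sketch of the decomposition follows; real fastText hashes n-grams into a fixed number of buckets rather than keeping an explicit dictionary, and the function names and sizes here are illustrative.

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Boundary-padded character n-grams, plus the whole word itself."""
    padded = f"<{word}>"
    grams = [padded]
    for n in range(n_min, n_max + 1):
        grams += [padded[i:i + n] for i in range(len(padded) - n + 1)]
    return grams

rng = np.random.default_rng(0)
d = 50
table = {}  # hypothetical n-gram -> vector store (fastText uses hashed buckets)

def word_vector(word):
    """Word vector as the sum of its n-gram vectors."""
    grams = char_ngrams(word)
    for g in grams:
        table.setdefault(g, rng.normal(0, 0.1, d))
    return sum(table[g] for g in grams)

# "playing" and "plays" share n-grams such as "<pl", "pla", and "play",
# so their vectors are correlated even if one of the words is rare.
shared = set(char_ngrams("playing")) & set(char_ngrams("plays"))
print(sorted(shared))
```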
Linear analogies
A famous side-effect: the learned space supports linear analogies, so vec(king) − vec(man) + vec(woman) lands closest to vec(queen).
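As a sketch, assuming a dict of unit-norm word vectors loaded from some pretrained Word2Vec or fastText model, the analogy query is just vector arithmetic followed by a nearest-neighbour search:

```python
import numpy as np

def analogy(vectors, a, b, c, topn=1):
    """Words whose vectors are closest (by cosine) to vec(b) - vec(a) + vec(c)."""
    target = vectors[b] - vectors[a] + vectors[c]
    target /= np.linalg.norm(target)
    scores = {
        w: float(v @ target)
        for w, v in vectors.items()
        if w not in (a, b, c)          # exclude the query words themselves
    }
    return sorted(scores, key=scores.get, reverse=True)[:topn]

# With good pretrained embeddings:
# analogy(vectors, "man", "king", "woman")  ->  ["queen"]
```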
Why this still matters
Modern subword tokenizers (BPE, WordPiece) and the embedding tables they index are direct descendants of fastText's subword idea. Every LLM today still starts with embedding[token_id].
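A minimal PyTorch sketch of that first step; the vocabulary size, model dimension, and token ids below are placeholders, not taken from any particular model.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 32_000, 512
embedding = nn.Embedding(vocab_size, d_model)   # the table indexed by token id

token_ids = torch.tensor([[15, 2047, 991, 7]])  # hypothetical BPE ids for one sentence
x = embedding(token_ids)                        # shape: (1, 4, 512)
print(x.shape)
```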
Reading list
- Distributed Representations of Words and Phrases and their Compositionality — Mikolov et al., 2013 (Word2Vec, skip-gram + negative sampling).
- Enriching Word Vectors with Subword Information — Bojanowski et al., 2017 (fastText).
- Attention Is All You Need — Vaswani et al., 2017 — covered next under The Transformer.
What to read next
- The Transformer — the architecture that replaced bag-of-context with learned attention.