From Perceptron to MLP
The multi-layer perceptron (MLP) is the simplest deep neural network: a stack of linear layers separated by element-wise non-linearities. Every architecture in this track — CNNs, RNNs, Transformers — replaces only the linear part of an MLP with something structured (convolution, attention) while keeping the same overall recipe. Understanding why an MLP can express anything, and why a single-layer perceptron cannot, is the foundational lesson.
The perceptron and what it can't compute
Rosenblatt's perceptron (1958) computes a thresholded affine function of its input:

$$\hat{y} = \begin{cases} 1 & \text{if } w^\top x + b > 0, \\ 0 & \text{otherwise.} \end{cases}$$

Geometrically, the decision boundary is a single hyperplane, so the perceptron can only represent linearly separable functions. The canonical counterexample is XOR: no choice of $w$ and $b$ produces outputs $0, 1, 1, 0$ on the four binary inputs (Minsky and Papert, 1969).
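
To make the limitation concrete, here is a small NumPy sketch (my own illustration, not Rosenblatt's training procedure): a brute-force search over a coarse weight grid finds a perceptron that computes AND but none that computes XOR.

```python
import numpy as np

def perceptron(x, w, b):
    """Rosenblatt's decision rule: threshold a single affine function."""
    return 1 if np.dot(w, x) + b > 0 else 0

X = [np.array(p, dtype=float) for p in [(0, 0), (0, 1), (1, 0), (1, 1)]]
AND = [0, 0, 0, 1]
XOR = [0, 1, 1, 0]

def fits(target, grid=np.linspace(-2, 2, 21)):
    """Brute-force a coarse weight grid for an exact fit on all four inputs."""
    for w1 in grid:
        for w2 in grid:
            for b in grid:
                if all(perceptron(x, np.array([w1, w2]), b) == t
                       for x, t in zip(X, target)):
                    return True
    return False

print("AND fits a perceptron:", fits(AND))  # True: linearly separable
print("XOR fits a perceptron:", fits(XOR))  # False: no hyperplane separates XOR
```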
The MLP: composition of affine + non-linearity
An $L$-layer MLP composes affine maps with an element-wise non-linearity $\sigma$ (ReLU, tanh, sigmoid):

$$h^{(0)} = x, \qquad h^{(\ell)} = \sigma\!\left(W^{(\ell)} h^{(\ell-1)} + b^{(\ell)}\right) \quad \text{for } \ell = 1, \dots, L,$$

with the activation usually omitted on the final layer so the output can be raw logits or an unbounded regression value. The non-linearity is essential: without $\sigma$, the composition of affine maps collapses to a single affine map, and the stack is no more expressive than one perceptron layer.
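
A minimal NumPy sketch of the forward recursion, with hand-picked weights (an illustrative construction, not a learned solution) that let a two-layer MLP compute the XOR the perceptron could not:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def mlp_forward(x, W1, b1, W2, b2):
    """Two-layer MLP: affine -> ReLU -> affine."""
    h = relu(W1 @ x + b1)   # hidden layer, element-wise non-linearity
    return W2 @ h + b2      # output layer, no activation

# Hand-picked weights: h1 = x1 + x2, h2 = relu(x1 + x2 - 1),
# output = h1 - 2*h2, which is 1 exactly when one input is 1 and the other is 0.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
W2 = np.array([[1.0, -2.0]])
b2 = np.array([0.0])

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    y = mlp_forward(np.array(x, dtype=float), W1, b1, W2, b2)
    print(x, "->", float(y[0]))   # 0, 1, 1, 0 -- XOR
```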
Universal approximation
Cybenko's theorem (1989) and the closely related Hornik–Stinchcombe–White (1989) result state that a single-hidden-layer MLP with a sigmoidal activation can approximate any continuous function on a compact set to arbitrary accuracy, given enough hidden units; Leshno et al. (1993) extended this to any non-polynomial activation. So expressivity is not the bottleneck for the MLP family. What universal approximation does not tell you is how wide the layer must be (often exponentially large in the input dimension) or how easy the function is to learn from data. In practice, for many function families depth gives far more compact representations than width does, which is the empirical motivation for deep, not just wide, networks.
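
To see width buying accuracy in the spirit of the theorem, here is a hedged sketch of my own (not taken from the cited papers): a single hidden layer of frozen random tanh units with a least-squares readout, fit to sin(x). The approximation error generally shrinks as the layer widens.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 512)
y = np.sin(x)

def hidden_features(x, width):
    """One hidden layer with frozen random weights and tanh units."""
    W = rng.normal(scale=2.0, size=(width, 1))
    b = rng.uniform(-np.pi, np.pi, size=width)
    return np.tanh(x[:, None] * W.T + b)        # shape (n_points, width)

for width in (2, 8, 32, 128):
    H = hidden_features(x, width)
    # Fit only the output weights by least squares -- the simplest "training".
    coef, *_ = np.linalg.lstsq(H, y, rcond=None)
    err = np.max(np.abs(H @ coef - y))
    print(f"width={width:4d}  max |error| = {err:.4f}")
```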
Loss and the learning signal
Training an MLP is just minimising a loss function of the parameters. Choose a loss $L(\theta)$ that scores the network's predictions against the targets (mean squared error for regression, cross-entropy for classification), compute its gradient with respect to every weight and bias, and step downhill: $\theta \leftarrow \theta - \eta\,\nabla_\theta L(\theta)$. The gradient is the learning signal; backpropagation (next in the track) is simply an efficient way to compute it.
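
A sketch of that loop in NumPy, with the gradient estimated by finite differences purely to keep the example self-contained (real training uses backpropagation, covered in the linked article):

```python
import numpy as np

rng = np.random.default_rng(1)

# Tiny regression task: learn y = 2x - 1 with a one-hidden-unit network.
X = rng.uniform(-1, 1, size=20)
Y = 2 * X - 1

def predict(theta, x):
    w1, b1, w2, b2 = theta
    return w2 * np.tanh(w1 * x + b1) + b2

def loss(theta):
    return np.mean((predict(theta, X) - Y) ** 2)   # mean squared error

theta = rng.normal(size=4)
lr, eps = 0.1, 1e-5
for step in range(2001):
    # Estimate the gradient by perturbing each parameter in turn.
    grad = np.array([(loss(theta + eps * np.eye(4)[i]) - loss(theta)) / eps
                     for i in range(4)])
    theta = theta - lr * grad                      # the gradient-descent update
    if step % 500 == 0:
        print(f"step {step:4d}  loss {loss(theta):.4f}")   # loss should fall
```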
What an MLP is and isn't
An MLP is permutation-invariant to its input dimensions only if you tie weights; by default, each input feature has its own column of $W^{(1)}$, so shuffling the features changes the output unless the weights are shuffled to match. More generally, an MLP encodes no structural prior at all: no locality, no translation invariance, no notion of order. CNNs, RNNs, and Transformers add exactly those priors by sharing and structuring the linear maps, which is why the plain MLP is the right baseline to understand first.
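
A quick illustrative check of the permutation claim: permuting the inputs of a randomly initialised MLP changes its output unless the first-layer weight columns are permuted to match.

```python
import numpy as np

rng = np.random.default_rng(2)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)

def mlp(x, W1):
    return W2 @ np.tanh(W1 @ x + b1) + b2

x = rng.normal(size=3)
perm = np.array([2, 0, 1])                     # shuffle the input features

print(mlp(x, W1), mlp(x[perm], W1))            # differ: the MLP is order-sensitive
print(mlp(x, W1), mlp(x[perm], W1[:, perm]))   # match: weights permuted to compensate
```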
What to read next
- Activation Functions — the non-linearity choice and its consequences.
- Backpropagation — the algorithm that makes MLP training possible.
- Loss Functions — the objective whose gradient drives learning.