Activation Functions
The activation function is the element-wise nonlinearity applied after each layer's linear transformation; without it, a stack of linear layers collapses into a single linear map, so the choice of nonlinearity shapes both expressiveness and trainability.
Sigmoid and tanh — the saturating era
The classical activations are smooth and bounded:
- Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$, squashing inputs into $(0, 1)$.
- Tanh: $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$, squashing inputs into $(-1, 1)$.

Both are differentiable everywhere, and the sigmoid's $(0, 1)$ output gives a nice probabilistic interpretation for binary targets. Both also saturate at the extremes, where their gradients shrink toward zero, which is what eventually pushed deep networks toward non-saturating units.
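A minimal NumPy sketch (function names are mine) that shows the saturation numerically: both derivatives collapse toward zero once $|x|$ grows.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # peaks at 0.25 when x = 0

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2  # peaks at 1.0 when x = 0

# Gradients collapse toward zero in the saturated regions.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.5f}  tanh'={tanh_grad(x):.5f}")
```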
ReLU — the workhorse
Rectified Linear Units Improve Restricted Boltzmann Machines (Nair, Hinton, ICML 2010) and Deep Sparse Rectifier Neural Networks (Glorot, Bordes, Bengio, AISTATS 2011) made the ReLU the default: $\mathrm{ReLU}(x) = \max(0, x)$.
Its derivative is exactly 0 or 1, so gradients propagate cleanly through deep stacks, and the forward pass costs one comparison plus one masked write. The trade-off is dying ReLUs: units whose pre-activations are pushed permanently negative (for example by a large negative bias), so they never fire again and contribute zero gradient from then on.
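A small NumPy sketch of the forward pass and the matching backward mask (function names are mine):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_backward(grad_out, x):
    # The local derivative is 1 where x > 0 and 0 elsewhere,
    # so backprop reduces to masking the incoming gradient.
    return grad_out * (x > 0)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))                            # [0.  0.  0.  0.5 2. ]
print(relu_backward(np.ones_like(x), x))  # [0. 0. 0. 1. 1.]
```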
Variants address the dying-unit failure (a short sketch of all three follows the list):

- Leaky ReLU — $f(x) = x$ for $x > 0$ and $f(x) = \alpha x$ otherwise, with a small fixed $\alpha$ (~0.01).
- PReLU — the same form, but with $\alpha$ learnable per channel.
- ELU — smooth negative tail, $f(x) = \alpha(e^{x} - 1)$ for $x \le 0$ and $f(x) = x$ for $x > 0$.
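A NumPy sketch of the three variants (the $\alpha$ defaults and function names are illustrative):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Fixed small slope on the negative side keeps a nonzero gradient.
    return np.where(x > 0, x, alpha * x)

def prelu(x, alpha):
    # Same form as leaky ReLU, but alpha is a learned parameter
    # (typically one value per channel).
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # Smooth exponential tail for x <= 0; saturates at -alpha.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.linspace(-3, 3, 7)
print(leaky_relu(x))
print(elu(x))
```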
Smooth modern activations: GELU, SiLU/Swish
Gaussian Error Linear Units (Hendrycks, Gimpel, 2016) defines GELU as $\mathrm{GELU}(x) = x \cdot \Phi(x)$, where $\Phi$ is the CDF of the standard normal distribution.
SiLU / Swish (Ramachandran, Zoph, Le, 2017) is the closely related sigmoid-weighted unit $\mathrm{SiLU}(x) = x \cdot \sigma(x)$; Swish generalises this to $x \cdot \sigma(\beta x)$ with $\beta$ fixed or learned.
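A sketch using NumPy and SciPy (function names are mine) comparing the exact GELU with the widely used tanh approximation, alongside SiLU:

```python
import numpy as np
from scipy.special import erf  # exact normal CDF via the error function

def gelu_exact(x):
    # GELU(x) = x * Phi(x), Phi = standard normal CDF
    return x * 0.5 * (1.0 + erf(x / np.sqrt(2.0)))

def gelu_tanh(x):
    # Tanh approximation from the GELU paper, common in practice.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def silu(x):
    # SiLU / Swish with beta = 1: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

x = np.linspace(-3, 3, 7)
print(gelu_exact(x))
print(gelu_tanh(x))   # close to the exact values
print(silu(x))
```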
Softmax for classification
The output activation for multi-class classification is softmax: $\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$.
Softmax converts logits to a categorical distribution and pairs naturally with cross-entropy loss: the gradient of cross-entropy + softmax with respect to the logits simplifies to $p_i - y_i$, the predicted probability minus the one-hot target.
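A minimal NumPy sketch (names are mine) of a numerically stable softmax and the simplified gradient $p - y$:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy_grad(logits, target_index):
    # Gradient of cross-entropy(softmax(logits)) w.r.t. the logits: p - y.
    p = softmax(logits)
    y = np.zeros_like(p)
    y[target_index] = 1.0
    return p - y

logits = np.array([2.0, 0.5, -1.0])
print(softmax(logits))                # sums to 1
print(cross_entropy_grad(logits, 0))  # p - one_hot(0)
```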
Choosing an activation
Practical defaults (a minimal PyTorch-style sketch follows the list):
- Hidden layers of CNN/MLP — ReLU; for residual blocks consider GELU at large scale.
- Hidden layers of Transformers — GELU (BERT, GPT) or SiLU (LLaMA).
- Output for classification — softmax for multi-class, sigmoid for multi-label.
- Output for regression — linear (no activation), unless you need a bounded range.
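As one concrete reading of these defaults, a hypothetical PyTorch classifier might look like the sketch below (layer sizes are arbitrary); the output layer stays linear because nn.CrossEntropyLoss applies the softmax internally.

```python
import torch
import torch.nn as nn

# Hidden layers: ReLU (swap in nn.GELU() for transformer-style blocks).
# Output: raw logits; nn.CrossEntropyLoss applies log-softmax internally.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Linear(256, 10),   # no activation on the output layer
)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 784)              # dummy batch
targets = torch.randint(0, 10, (32,))
loss = loss_fn(model(x), targets)
loss.backward()
print(loss.item())
```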
What to read next
- Backpropagation — what activation derivatives feed into.
- Weight Initialization — initialisation depends on activation choice (He init for ReLU, Xavier for tanh).
- Loss Functions — softmax + cross-entropy is one closely-coupled pair.