Loss Functions
The loss is the scalar a network is trained to minimise, and its choice determines what the model learns to do: different losses on the same architecture lead to different solutions, even with identical optimisers and data. This page covers the four canonical families: squared error, cross-entropy, margin losses, and ranking/contrastive losses.
Squared error: regression
Mean squared error (MSE) is the default for real-valued targets:

$$\mathcal{L}_{\text{MSE}} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$$

It corresponds to the negative log-likelihood of a Gaussian noise model: the maximum-likelihood loss when the target is assumed to be the predicted mean plus i.i.d. Gaussian noise. The clean linear gradient $\partial \mathcal{L}/\partial \hat{y}_i = \tfrac{2}{N}(\hat{y}_i - y_i)$ makes optimisation straightforward: each example's update is proportional to its residual.
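A minimal PyTorch sketch of that gradient identity (the toy tensors are invented for illustration):

```python
import torch
import torch.nn.functional as F

# Toy predictions and targets (illustrative values only).
pred = torch.tensor([2.0, 0.5, -1.0], requires_grad=True)
target = torch.tensor([1.0, 0.0, 0.0])

loss = F.mse_loss(pred, target)   # mean over the 3 elements
loss.backward()

# With mean reduction the gradient is 2/N * (pred - target).
print(pred.grad)
print(2.0 / pred.numel() * (pred - target).detach())   # same values
```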
Robust variants: MSE is dominated by outliers because the loss grows as the square of the residual. Mean absolute error (MAE) uses the absolute residual $|\hat{y}_i - y_i|$, which grows only linearly, at the cost of a constant-magnitude gradient that ignores how close the prediction already is. Huber loss interpolates between the two: quadratic for residuals below a threshold $\delta$, linear beyond it.
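A sketch comparing how the three regression losses weight a single large residual, using PyTorch's built-in mse_loss, l1_loss, and huber_loss (the residual values are illustrative):

```python
import torch
import torch.nn.functional as F

residuals = torch.tensor([0.1, 0.5, 5.0])   # last value is an outlier
zeros = torch.zeros_like(residuals)

# Per-element losses (reduction="none") to see how each treats the outlier.
mse = F.mse_loss(residuals, zeros, reduction="none")     # 0.01, 0.25, 25.0
mae = F.l1_loss(residuals, zeros, reduction="none")      # 0.1, 0.5, 5.0
huber = F.huber_loss(residuals, zeros, reduction="none", delta=1.0)
# Huber: quadratic below delta, linear above -> 0.005, 0.125, 4.5
print(mse, mae, huber)
```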
Cross-entropy: classification
For a $K$-class problem with one-hot target $y$ and predicted distribution $\hat{p}$, cross-entropy is

$$\mathcal{L}_{\text{CE}} = -\sum_{k=1}^{K} y_k \log \hat{p}_k = -\log \hat{p}_{c},$$

where $c$ is the true class. Cross-entropy is the negative log-likelihood of the target under the predicted categorical distribution. Combined with a softmax output (see activations), the gradient with respect to the logits simplifies to $\hat{p} - y$. In practice the two steps are fused into a single numerically stable op (log_softmax + nll_loss in PyTorch, softmax_cross_entropy_with_logits in TF) to avoid computing $\log(\mathrm{softmax}(z))$ as two separate, overflow-prone operations.
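A short PyTorch sketch of the fused loss and of the $\hat{p} - y$ gradient identity (logits and labels are made up):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, -1.0, 0.5], [0.1, 3.0, -2.0]], requires_grad=True)
labels = torch.tensor([0, 1])

# Fused, numerically stable path (log_softmax + nll_loss under the hood).
loss = F.cross_entropy(logits, labels)
loss.backward()

# Gradient w.r.t. the logits is (softmax(logits) - one_hot(labels)) / batch_size.
probs = F.softmax(logits.detach(), dim=-1)
one_hot = F.one_hot(labels, num_classes=3).float()
print(logits.grad)
print((probs - one_hot) / len(labels))   # matches the autograd result
```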
Binary cross-entropy (BCE) is the binary special case, paired with a sigmoid output. Multi-label classification applies BCE to each output independently.
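A multi-label sketch, assuming each example carries three independent binary tags (the tensors are invented); BCEWithLogitsLoss folds the sigmoid into the loss for numerical stability:

```python
import torch
import torch.nn as nn

# Two examples, three independent binary tags each (illustrative data).
logits = torch.tensor([[1.2, -0.7, 0.3], [-2.0, 0.4, 1.5]])
targets = torch.tensor([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])

# Sigmoid + BCE fused into one numerically stable op, applied per output.
criterion = nn.BCEWithLogitsLoss()
loss = criterion(logits, targets)
print(loss)
```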
Label smoothing (Szegedy et al., 2016): replace the hard one-hot target with the mixture $(1-\epsilon)\,y + \epsilon/K$ over the $K$ classes (typically $\epsilon = 0.1$). The softened target stops the network from driving the correct-class logit arbitrarily far above the others, which tends to regularise large models and improve calibration.
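A sketch of label smoothing done two ways: via the label_smoothing argument of F.cross_entropy (available in recent PyTorch releases) and via a smoothed target built by hand (the tensors and the $\epsilon$ value are illustrative):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, -1.0, 0.5]])
labels = torch.tensor([0])
eps, K = 0.1, 3

# Built-in: cross_entropy accepts label_smoothing directly.
smoothed_builtin = F.cross_entropy(logits, labels, label_smoothing=eps)

# By hand: mix the one-hot target with the uniform distribution over K classes.
one_hot = F.one_hot(labels, num_classes=K).float()
target = (1 - eps) * one_hot + eps / K
smoothed_manual = -(target * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

print(smoothed_builtin, smoothed_manual)   # should agree
```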
Margin losses: SVM-style classification
Hinge loss is the original max-margin loss. For a label $y \in \{-1, +1\}$ and a raw score $s$:

$$\mathcal{L}_{\text{hinge}} = \max(0,\; 1 - y \cdot s)$$
Hinge produces sparse gradients — zero whenever the margin is satisfied — so optimisation only touches misclassified or boundary examples. Used in classical SVMs and for some structured prediction tasks; rarely the default for deep classification, where cross-entropy almost always wins.
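A sketch of the binary hinge loss with labels in $\{-1, +1\}$ (scores and labels invented); torch.clamp plays the role of the $\max(0, \cdot)$:

```python
import torch

scores = torch.tensor([2.3, 0.4, -1.1])     # raw classifier scores
labels = torch.tensor([1.0, -1.0, -1.0])    # targets in {-1, +1}

# max(0, 1 - y * s): zero once the example is on the correct side by margin >= 1.
hinge = torch.clamp(1 - labels * scores, min=0)
print(hinge)         # only the violating middle example contributes
print(hinge.mean())
```

PyTorch's nn.MultiMarginLoss is the multi-class analogue of this per-example form.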
Ranking and contrastive losses
When the supervision is "these items belong together, those do not" rather than an explicit class label, the loss is defined over pairs or triplets of embeddings. The triplet loss takes an anchor $a$, a positive $p$, and a negative $n$:

$$\mathcal{L}_{\text{triplet}} = \max(0,\; d(a, p) - d(a, n) + m)$$

with $m$ a margin hyperparameter and $d$ a distance in embedding space (usually Euclidean or cosine). The loss is zero once the positive sits closer to the anchor than the negative does by at least the margin, so only violating triplets produce gradient.
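A sketch with random embeddings (batch size, dimensionality, and margin are arbitrary); PyTorch's nn.TripletMarginLoss implements the formula above with Euclidean distance:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
anchor = torch.randn(8, 128)     # batch of 8 embeddings, 128-d (arbitrary sizes)
positive = torch.randn(8, 128)
negative = torch.randn(8, 128)

criterion = nn.TripletMarginLoss(margin=0.2, p=2)   # Euclidean distance, margin m = 0.2
loss = criterion(anchor, positive, negative)
print(loss)
```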
InfoNCE is the multi-class generalisation of the triplet loss: one positive scored against a batch of negatives. It lower-bounds the mutual information between the matched representations.
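A minimal InfoNCE sketch for a batch of paired embeddings, such as two augmented views of the same images; the other rows of the batch serve as negatives, and the temperature value is a common but not mandated choice:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
z_a = F.normalize(torch.randn(16, 64), dim=-1)   # view A embeddings, unit-norm
z_b = F.normalize(torch.randn(16, 64), dim=-1)   # view B embeddings
temperature = 0.07                               # illustrative value

# Similarity of every A row against every B row; the diagonal holds the positives.
logits = z_a @ z_b.T / temperature
targets = torch.arange(len(z_a))                 # positive for row i is column i

# InfoNCE is cross-entropy over (1 positive + in-batch negatives).
loss = F.cross_entropy(logits, targets)
print(loss)
```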
Choosing a loss
Quick guide:
- Regression with Gaussian-ish noise — MSE.
- Regression with outliers — Huber or MAE.
- Classification (multi-class) — softmax + cross-entropy, with label smoothing for large models.
- Multi-label classification — sigmoid + BCE.
- Imbalanced classification (esp. dense prediction) — focal loss (see object detection) or Dice (see semantic segmentation).
- Metric learning / retrieval / contrastive pretraining — InfoNCE.
What to read next
- Activation Functions — softmax/sigmoid couple with cross-entropy.
- Backpropagation — what the loss gradient feeds into.
- Calibration & Uncertainty — the post-hoc assessment of how well loss-trained probabilities match reality.