
Loss Functions

The loss is the scalar a network is trained to minimise. Choosing it correctly determines what the model learns to do — different losses on the same architecture lead to different solutions even with identical optimisers and data. This page covers the four canonical families: squared error, cross-entropy, margin losses, and ranking/contrastive losses.

Squared error: regression

Mean squared error (MSE) is the default for real-valued targets:

$$L_{\mathrm{MSE}}(\hat{y}, y) = \tfrac{1}{2}(\hat{y} - y)^2, \qquad \frac{\partial L}{\partial \hat{y}} = \hat{y} - y.$$

It corresponds to the negative log-likelihood of a Gaussian noise model: the maximum-likelihood loss when the target is assumed to be the predicted mean plus i.i.d. Gaussian noise. The clean linear gradient $(\hat{y} - y)$ is part of why MSE is so convenient.
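
A quick autograd check of that gradient, using the $\tfrac{1}{2}(\hat{y}-y)^2$ form above (the values are illustrative):

```python
import torch

# Verify that d/d(y_hat) of 0.5 * (y_hat - y)^2 is the residual y_hat - y.
y_hat = torch.tensor([2.5, -0.3, 1.0], requires_grad=True)
y = torch.tensor([2.0, 0.0, 1.5])

loss = 0.5 * (y_hat - y).pow(2).sum()   # sum keeps each component's gradient separate
loss.backward()

print(y_hat.grad)             # tensor([ 0.5000, -0.3000, -0.5000])
print((y_hat - y).detach())   # same values: the gradient is just the residual
```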

Robust variants: MSE is dominated by outliers because the loss grows as the square of the residual. Mean absolute error (MAE) uses |y^y| — robust but non-differentiable at zero. Huber loss smooths the transition: quadratic for small residuals, linear for large ones, with a tunable threshold δ. Huber is the default for tasks where outliers are common (object-detection box regression, robust regression).
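
A small comparison of the three behaviours on a batch containing one gross outlier, sketched with PyTorch's functional losses (a recent PyTorch with `F.huber_loss` is assumed; the numbers are illustrative):

```python
import torch
import torch.nn.functional as F

# One badly wrong prediction dominates MSE but not MAE or Huber.
pred   = torch.tensor([1.1, 0.2, -0.5, 9.0])   # last prediction is wildly off
target = torch.tensor([1.0, 0.0, -0.4, 0.0])

mse   = F.mse_loss(pred, target)                # mean of squared residuals, dominated by 9.0^2
mae   = F.l1_loss(pred, target)                 # mean absolute residual, linear in the outlier
huber = F.huber_loss(pred, target, delta=1.0)   # quadratic below delta, linear above

print(f"MSE:   {mse:.3f}")    # ~20.27
print(f"MAE:   {mae:.3f}")    # ~2.35
print(f"Huber: {huber:.3f}")  # ~2.13
```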

Cross-entropy: classification

For an N-way classification problem with one-hot target $y$ and predicted probability vector $\hat{p}$,

$$L_{\mathrm{CE}}(\hat{p}, y) = -\sum_i y_i \log \hat{p}_i.$$

Cross-entropy is the negative log-likelihood of the target under the predicted categorical distribution. Combined with a softmax output (see activations), the gradient simplifies to $\hat{p} - y$: the prediction error, with no Jacobian explosion. Numerically stable implementations combine softmax and cross-entropy into one fused operation (log_softmax + nll_loss in PyTorch, softmax_cross_entropy_with_logits in TF) to avoid computing $\log\bigl(\exp(z_i)/\sum_j \exp(z_j)\bigr)$ directly.
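
A small demonstration of why the fused form matters (the logit values are exaggerated on purpose):

```python
import torch
import torch.nn.functional as F

# With an extreme logit, softmax-then-log overflows to nan;
# the fused cross-entropy works in log-space and stays finite.
logits = torch.tensor([[1000.0, -5.0, 2.0]])
target = torch.tensor([0])

naive = -torch.log(torch.softmax(logits, dim=-1))[0, target]   # exp(1000) overflows
fused = F.cross_entropy(logits, target)                        # log_softmax + nll_loss

print(naive)   # tensor([nan])
print(fused)   # tensor(0.) -- the correct loss for a near-certain correct prediction
```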

Binary cross-entropy (BCE) is the binary special case, paired with sigmoid output. Multi-label classification uses BCE per output independently.
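
A minimal multi-label sketch, with the label count and values chosen for illustration:

```python
import torch
import torch.nn.functional as F

# Each of the 4 labels is an independent yes/no decision, so each output
# gets its own sigmoid + BCE instead of a single softmax over labels.
logits  = torch.tensor([[2.0, -1.0, 0.5, -3.0]])   # one example, 4 labels
targets = torch.tensor([[1.0,  0.0, 1.0,  0.0]])   # labels 0 and 2 are present

loss = F.binary_cross_entropy_with_logits(logits, targets)   # fused sigmoid + BCE, numerically stable
print(loss)   # mean BCE over the 4 independent label decisions
```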

Label smoothing (Szegedy et al., 2016): replace the hard one-hot target with $\tilde{y}_i = (1-\epsilon)\,y_i + \epsilon/N$. This regularises the model against over-confident predictions and improves calibration; standard at $\epsilon = 0.1$ in modern training recipes.
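
In recent PyTorch versions (1.10 or later is assumed here) label smoothing is a one-argument change; the manual construction below writes out the same smoothed-target formula:

```python
import torch
import torch.nn as nn

logits  = torch.randn(8, 10)             # batch of 8, N = 10 classes
targets = torch.randint(0, 10, (8,))

# Built-in: CrossEntropyLoss accepts a label_smoothing argument directly.
smooth = nn.CrossEntropyLoss(label_smoothing=0.1)(logits, targets)

# Manual: build the smoothed target (1 - eps) * one_hot + eps / N and use soft targets.
eps = 0.1
one_hot = torch.zeros(8, 10).scatter_(1, targets.unsqueeze(1), 1.0)
soft_targets = (1 - eps) * one_hot + eps / 10
manual = nn.CrossEntropyLoss()(logits, soft_targets)

print(smooth, manual)   # the two values should agree
```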

Margin losses: SVM-style classification

Hinge loss is the original max-margin loss:

$$L_{\mathrm{hinge}}(\hat{y}, y) = \max(0,\, 1 - y\hat{y}), \qquad y \in \{-1, +1\}.$$

Hinge produces sparse gradients — zero whenever the margin is satisfied — so optimisation only touches misclassified or boundary examples. Used in classical SVMs and for some structured prediction tasks; rarely the default for deep classification, where cross-entropy almost always wins.
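
A minimal hinge-loss sketch (the scores and targets are illustrative):

```python
import torch

# Hinge loss with targets in {-1, +1}: only examples that violate the margin
# (y * y_hat < 1) contribute a nonzero loss, so the gradient is sparse.
y_hat = torch.tensor([2.3, 0.4, -1.5, 0.8], requires_grad=True)   # raw scores
y     = torch.tensor([1.0, 1.0,  1.0, -1.0])                      # targets in {-1, +1}

loss = torch.clamp(1 - y * y_hat, min=0).mean()
loss.backward()

print(loss)        # only the three margin violations contribute
print(y_hat.grad)  # exactly zero where the margin is already satisfied
```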

Ranking and contrastive losses

When the supervision is "a is more similar to b than to c", standard classification losses don't apply. Triplet loss (FaceNet, Schroff et al., 2015):

$$L_{\mathrm{triplet}} = \max\bigl(0,\, \|f(a) - f(b)\|^2 - \|f(a) - f(c)\|^2 + \alpha\bigr),$$

with margin $\alpha$. A more general form is InfoNCE (van den Oord et al., 2018), used in SimCLR, CLIP, and contrastive RAG retriever training:

$$L_{\mathrm{InfoNCE}} = -\log \frac{\exp\bigl(\langle f(a), f(b)\rangle / \tau\bigr)}{\sum_k \exp\bigl(\langle f(a), f(c_k)\rangle / \tau\bigr)}.$$

InfoNCE is the multi-class generalisation of triplet (one positive plus a batch of negatives); minimising it maximises a lower bound on the mutual information between the matched representations.
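
A minimal in-batch-negatives sketch of InfoNCE (the shapes, temperature, and function name are illustrative, not a specific library API):

```python
import torch
import torch.nn.functional as F

# InfoNCE as a cross-entropy over similarities: for each anchor, its matched
# positive is the "correct class" among all candidates in the batch.
def info_nce(anchors, positives, temperature=0.07):
    a = F.normalize(anchors, dim=-1)     # unit-norm embeddings -> cosine similarity
    p = F.normalize(positives, dim=-1)
    logits = a @ p.t() / temperature     # (B, B): entry [i, j] compares anchor i with positive j
    labels = torch.arange(a.size(0), device=a.device)   # the diagonal holds the true pairs
    return F.cross_entropy(logits, labels)   # one positive + B-1 in-batch negatives per anchor

anchors   = torch.randn(32, 128)   # e.g. image embeddings
positives = torch.randn(32, 128)   # e.g. matching text embeddings
print(info_nce(anchors, positives))
```

With a single negative per anchor the softmax reduces to a two-way comparison, the soft analogue of the triplet margin above.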

Choosing a loss

Quick guide:

  • Regression with Gaussian-ish noise — MSE.
  • Regression with outliers — Huber or MAE.
  • Classification (multi-class) — softmax + cross-entropy, with label smoothing for large models.
  • Multi-label classification — sigmoid + BCE.
  • Imbalanced classification (esp. dense prediction) — focal loss (see object detection) or Dice (see semantic segmentation).
  • Metric learning / retrieval / contrastive pretraining — InfoNCE.

Released under the MIT License. Content imported and adapted from NoteNextra.