Loss Functions

A loss function measures how wrong a model’s predictions are. During training, the model produces predictions, the loss function converts prediction error into a scalar, and the optimizer changes the parameters to reduce that scalar.

In PyTorch, a loss function usually receives two tensors:

loss = loss_fn(prediction, target)

The result is usually a scalar tensor:

torch.Size([])

This scalar is the objective used by backpropagation.
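As a minimal sketch of this call pattern (the tensor values here are illustrative, not from the text above):

```python
import torch
from torch import nn

prediction = torch.tensor([0.2, 1.5, -0.3])
target = torch.tensor([0.0, 1.0, 0.0])

# Any loss module follows the same pattern: loss_fn(prediction, target).
loss = nn.MSELoss()(prediction, target)

print(loss.shape)   # torch.Size([]) -- a scalar tensor
print(loss.item())  # the plain Python float inside it
```

Calling `.item()` extracts the scalar for logging; the tensor itself is what `backward()` operates on.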

Loss as a Training Objective

A neural network defines a function

f_\theta(x),

where \theta denotes all trainable parameters. For input x_i, the prediction is

\hat{y}_i = f_\theta(x_i).

A loss function compares \hat{y}_i with the target y_i:

\ell(\hat{y}_i, y_i).

For a dataset of N examples, the training objective is usually the average loss:

L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell(f_\theta(x_i), y_i).

Training tries to find parameters with low loss:

\theta^\star = \arg\min_{\theta} L(\theta).

In minibatch training, we approximate the full objective using a batch:

L_B(\theta) = \frac{1}{B} \sum_{i=1}^{B} \ell(f_\theta(x_i), y_i).

This batch loss is the scalar passed to backward().
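One training step can be sketched as follows. The model, optimizer, and data here are illustrative assumptions (a toy linear regression with SGD), not prescribed by the text:

```python
import torch
from torch import nn

# Toy setup: a linear model trained with SGD on random data.
model = nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

x = torch.randn(8, 3)  # one minibatch, B = 8
y = torch.randn(8, 1)

optimizer.zero_grad()
prediction = model(x)
loss = loss_fn(prediction, y)  # the scalar batch loss L_B(theta)
loss.backward()                # gradients of the scalar w.r.t. all parameters
optimizer.step()               # parameter update that reduces the scalar
```

The only value that flows into `backward()` is the scalar batch loss; everything else is bookkeeping around it.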

Mean Squared Error

Mean squared error is used for regression. It compares real-valued predictions with real-valued targets:

L = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2.

In PyTorch:

import torch
from torch import nn

prediction = torch.tensor([2.5, 0.0, 4.0])
target = torch.tensor([3.0, -1.0, 2.0])

loss_fn = nn.MSELoss()
loss = loss_fn(prediction, target)

print(loss)  # tensor(1.7500)

MSE penalizes large errors strongly because the error is squared. This makes it useful when large deviations are especially undesirable. It can also make the model sensitive to outliers.

For vector-valued outputs:

prediction = torch.randn(32, 10)
target = torch.randn(32, 10)

loss = nn.MSELoss()(prediction, target)

By default, PyTorch averages over all elements.
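The averaging can be checked by hand against the formula above. A small sketch, reusing the scalar example from earlier:

```python
import torch
from torch import nn

prediction = torch.tensor([2.5, 0.0, 4.0])
target = torch.tensor([3.0, -1.0, 2.0])

auto = nn.MSELoss()(prediction, target)

# Same computation written out: mean of squared differences.
manual = ((prediction - target) ** 2).mean()

print(auto.item(), manual.item())  # both 1.75
```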

Mean Absolute Error

Mean absolute error uses absolute differences:

L = \frac{1}{N} \sum_{i=1}^{N} |\hat{y}_i - y_i|.

In PyTorch, this is nn.L1Loss:

loss_fn = nn.L1Loss()
loss = loss_fn(prediction, target)

MAE is less sensitive to outliers than MSE. The penalty grows linearly rather than quadratically.

However, the absolute value has a kink at zero. This rarely prevents training in practice, but it gives a less smooth objective than MSE.
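The difference in outlier sensitivity is easy to see numerically. A sketch with made-up targets, where one target is replaced by a large outlier:

```python
import torch
from torch import nn

prediction = torch.zeros(5)
clean = torch.tensor([0.1, -0.1, 0.2, -0.2, 0.1])
outlier = clean.clone()
outlier[-1] = 10.0  # one large deviation

mse_clean = nn.MSELoss()(prediction, clean).item()
mae_clean = nn.L1Loss()(prediction, clean).item()
mse_out = nn.MSELoss()(prediction, outlier).item()
mae_out = nn.L1Loss()(prediction, outlier).item()

print(f"clean:   MSE={mse_clean:.3f}  MAE={mae_clean:.3f}")
print(f"outlier: MSE={mse_out:.3f}  MAE={mae_out:.3f}")
```

A single outlier inflates MSE by orders of magnitude more than MAE, because the squared term dominates the average.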

Huber Loss

Huber loss combines properties of MSE and MAE. Small errors are penalized quadratically. Large errors are penalized linearly.

For error e = \hat{y} - y, the Huber loss is

\ell(e) = \begin{cases} \frac{1}{2}e^2, & |e| \le \delta, \\ \delta\left(|e| - \frac{1}{2}\delta\right), & |e| > \delta. \end{cases}

In PyTorch:

loss_fn = nn.HuberLoss(delta=1.0)
loss = loss_fn(prediction, target)

Huber loss is useful when regression targets contain occasional large outliers.
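The piecewise definition above can be verified directly against nn.HuberLoss. This sketch feeds the errors in as predictions against a zero target, so the loss argument equals e:

```python
import torch
from torch import nn

delta = 1.0
e = torch.tensor([-3.0, -0.5, 0.0, 0.5, 3.0])  # errors e = prediction - target

auto = nn.HuberLoss(delta=delta, reduction="none")(e, torch.zeros_like(e))

# Manual piecewise form: quadratic inside |e| <= delta, linear outside.
quad = 0.5 * e**2
lin = delta * (e.abs() - 0.5 * delta)
manual = torch.where(e.abs() <= delta, quad, lin)

print(torch.allclose(auto, manual))  # True
```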

Binary Cross-Entropy

Binary cross-entropy is used for binary classification. The target is

y \in \{0, 1\}.

The model predicts a probability

\hat{p} \in (0, 1).

The loss is

\ell(\hat{p}, y) = -\left[ y \log(\hat{p}) + (1 - y)\log(1 - \hat{p}) \right].

In PyTorch, the numerically stable form is nn.BCEWithLogitsLoss. It expects logits, not probabilities.

logits = torch.randn(32)
targets = torch.randint(0, 2, (32,)).float()

loss_fn = nn.BCEWithLogitsLoss()
loss = loss_fn(logits, targets)

A logit is the raw output before sigmoid. During inference, we convert logits to probabilities:

probs = torch.sigmoid(logits)
preds = (probs >= 0.5).long()

For multi-label classification, use one binary cross-entropy loss per label:

logits = torch.randn(32, 10)
targets = torch.randint(0, 2, (32, 10)).float()

loss = nn.BCEWithLogitsLoss()(logits, targets)

Each class is treated independently.
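As a sanity check (a sketch, not something needed in practice), BCEWithLogitsLoss agrees with applying sigmoid followed by nn.BCELoss when the logits are moderate; the fused version only differs in that it stays stable for extreme logits:

```python
import torch
from torch import nn

torch.manual_seed(0)
logits = torch.randn(4, 3)
targets = torch.randint(0, 2, (4, 3)).float()

stable = nn.BCEWithLogitsLoss()(logits, targets)

# Equivalent for moderate logits; can lose precision for extreme ones.
naive = nn.BCELoss()(torch.sigmoid(logits), targets)

print(torch.allclose(stable, naive, atol=1e-5))  # True
```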

Cross-Entropy for Multi-Class Classification

Multi-class classification assumes that each example belongs to exactly one class. If there are K classes, the target is an integer:

y \in \{0, 1, \ldots, K-1\}.

The model outputs logits:

z \in \mathbb{R}^{K}.

The softmax function converts logits into probabilities:

p_k = \frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)}.

Cross-entropy loss is

\ell(z, y) = -\log p_y.

In PyTorch, nn.CrossEntropyLoss expects raw logits and integer class labels:

logits = torch.randn(32, 10)
targets = torch.randint(0, 10, (32,))

loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits, targets)

Do not apply softmax before CrossEntropyLoss:

# Avoid this during training
probs = torch.softmax(logits, dim=-1)
loss = nn.CrossEntropyLoss()(probs, targets)

CrossEntropyLoss already applies a stable log-softmax internally.
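The definition -log p_y can be reproduced by hand as a check. A sketch with illustrative shapes (batch of 4, 3 classes):

```python
import torch
from torch import nn

torch.manual_seed(0)
logits = torch.randn(4, 3)
targets = torch.tensor([0, 2, 1, 2])

auto = nn.CrossEntropyLoss()(logits, targets)

# Manual form: pick out log p_y for each example, then average.
log_probs = torch.log_softmax(logits, dim=-1)
manual = -log_probs[torch.arange(4), targets].mean()

print(torch.allclose(auto, manual))  # True
```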

Negative Log-Likelihood

Negative log-likelihood is closely related to cross-entropy. It expects log-probabilities as input:

log_probs = torch.log_softmax(logits, dim=-1)
loss = nn.NLLLoss()(log_probs, targets)

The following two forms compute the same loss:

loss = nn.CrossEntropyLoss()(logits, targets)

log_probs = torch.log_softmax(logits, dim=-1)
loss = nn.NLLLoss()(log_probs, targets)

CrossEntropyLoss is usually preferred because it is simpler and numerically stable.

Loss Reduction

Most PyTorch loss functions support a reduction argument.

nn.MSELoss(reduction="mean")
nn.MSELoss(reduction="sum")
nn.MSELoss(reduction="none")

The default is usually "mean".

With reduction="none", the loss is returned per element or per example:

loss_fn = nn.CrossEntropyLoss(reduction="none")

logits = torch.randn(4, 3)
targets = torch.tensor([0, 2, 1, 2])

losses = loss_fn(logits, targets)
print(losses.shape)  # torch.Size([4])

This is useful for weighted training, masking, curriculum learning, or debugging.

For example:

weights = torch.tensor([1.0, 0.5, 2.0, 1.0])

loss = (losses * weights).mean()
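Masking works the same way: zero out the masked entries and normalize by the number of kept examples. A minimal sketch with an illustrative mask:

```python
import torch
from torch import nn

logits = torch.randn(4, 3)
targets = torch.tensor([0, 2, 1, 2])
mask = torch.tensor([1.0, 1.0, 0.0, 1.0])  # drop the third example

losses = nn.CrossEntropyLoss(reduction="none")(logits, targets)

# Average only over the unmasked examples.
loss = (losses * mask).sum() / mask.sum()
```

Dividing by `mask.sum()` rather than the batch size keeps the loss scale independent of how many examples were masked.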

Class Imbalance

In classification, some classes may appear much more often than others. A model trained with ordinary cross-entropy may favor frequent classes.

PyTorch allows class weights:

class_weights = torch.tensor([1.0, 3.0, 8.0])

loss_fn = nn.CrossEntropyLoss(weight=class_weights)

For binary classification, BCEWithLogitsLoss supports pos_weight:

pos_weight = torch.tensor([5.0])

loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

This increases the loss contribution from positive examples.
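The effect can be seen on a toy batch (values chosen for illustration): with positives present, the weighted loss exceeds the unweighted one because each positive term is multiplied by `pos_weight`.

```python
import torch
from torch import nn

logits = torch.tensor([2.0, -1.0, 0.5])
targets = torch.tensor([1.0, 0.0, 1.0])

plain = nn.BCEWithLogitsLoss()(logits, targets)
weighted = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([5.0]))(logits, targets)

print(plain.item() < weighted.item())  # True: positive terms count 5x
```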

Class weighting changes the training objective. It may improve recall for rare classes, but it can also change calibration. A model trained with weights may require careful threshold selection.

Masked Losses

Some tasks contain padding or invalid targets. Sequence models often process batches where sequences have different lengths. Shorter sequences are padded so that tensors have a rectangular shape.

Suppose token logits have shape

[B, T, V],

where B is batch size, T is sequence length, and V is vocabulary size. Targets have shape

[B, T].

Padding tokens should not contribute to the loss.

PyTorch’s CrossEntropyLoss supports ignore_index:

B, T, V = 4, 8, 1000

logits = torch.randn(B, T, V)
targets = torch.randint(0, V, (B, T))

pad_id = 0
targets[0, 5:] = pad_id

loss_fn = nn.CrossEntropyLoss(ignore_index=pad_id)

loss = loss_fn(
    logits.reshape(B * T, V),
    targets.reshape(B * T),
)

The padding index is ignored when computing the average loss.
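This matches an explicit mask-and-renormalize computation, which can serve as a sanity check. A sketch with small illustrative shapes:

```python
import torch
from torch import nn

torch.manual_seed(0)
B, T, V = 2, 4, 7
pad_id = 0

logits = torch.randn(B, T, V)
targets = torch.randint(1, V, (B, T))  # real targets avoid the pad id
targets[0, 2:] = pad_id                # pad the tail of the first sequence

auto = nn.CrossEntropyLoss(ignore_index=pad_id)(
    logits.reshape(B * T, V), targets.reshape(B * T)
)

# Manual check: average per-token loss over non-pad positions only.
per_tok = nn.CrossEntropyLoss(reduction="none")(
    logits.reshape(B * T, V), targets.reshape(B * T)
)
mask = (targets.reshape(B * T) != pad_id).float()
manual = (per_tok * mask).sum() / mask.sum()

print(torch.allclose(auto, manual))  # True
```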

Choosing a Loss Function

The loss should match the statistical structure of the task.

Task                       | Model output     | Target            | Common loss
Regression                 | Real values      | Real values       | MSELoss, L1Loss, HuberLoss
Binary classification      | Logits           | 0 or 1 floats     | BCEWithLogitsLoss
Multi-class classification | Logits per class | Integer class IDs | CrossEntropyLoss
Multi-label classification | Logits per label | 0 or 1 matrix     | BCEWithLogitsLoss
Language modeling          | Token logits     | Token IDs         | CrossEntropyLoss
Segmentation               | Pixel logits     | Pixel class IDs   | CrossEntropyLoss

A common source of bugs is using the wrong target type or shape. CrossEntropyLoss expects integer class labels. BCEWithLogitsLoss expects floating-point binary targets.

Loss Values and Model Quality

A lower training loss means the model fits the training data better. It does not guarantee better generalization.

A model can reduce training loss while becoming worse on unseen data. This is overfitting. For this reason, training loss should be monitored together with validation loss and task metrics.

For classification, accuracy, precision, recall, F1 score, ROC-AUC, and calibration may matter more than loss alone. For regression, mean absolute error or domain-specific metrics may be easier to interpret than MSE.

The loss function is used for optimization. The evaluation metric is used for judgment. Sometimes they are the same. Often they differ.
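The distinction is easy to make concrete: the loss and a metric are computed from the same predictions but answer different questions. A sketch with illustrative tensors:

```python
import torch
from torch import nn

logits = torch.tensor([[2.0, 0.1], [0.2, 1.5], [3.0, -1.0]])
targets = torch.tensor([0, 1, 0])

loss = nn.CrossEntropyLoss()(logits, targets)                 # optimization objective
accuracy = (logits.argmax(dim=-1) == targets).float().mean()  # evaluation metric

print(loss.item(), accuracy.item())  # accuracy is 1.0 here, loss is still nonzero
```

All three predictions are correct, so accuracy is perfect, yet the loss stays positive because the predicted probabilities are not exactly 1.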

Summary

Loss functions convert prediction errors into scalar objectives for training. Regression commonly uses MSE, MAE, or Huber loss. Binary classification uses binary cross-entropy. Multi-class classification uses softmax cross-entropy.

In PyTorch, use logits directly with BCEWithLogitsLoss and CrossEntropyLoss. This gives better numerical stability and simpler code. The selected loss must match the task, target shape, target type, and model output.