A loss function measures how wrong a model’s predictions are. During training, the model produces predictions, the loss function converts prediction error into a scalar, and the optimizer changes the parameters to reduce that scalar.
In PyTorch, a loss function usually receives two tensors:
loss = loss_fn(prediction, target)

The result is usually a scalar tensor:

torch.Size([])

This scalar is the objective used by backpropagation.
Loss as a Training Objective
A neural network defines a function $f_\theta$,
where $\theta$ denotes all trainable parameters. For input $x$, the prediction is
$$\hat{y} = f_\theta(x)$$
A loss function $\ell$ compares the prediction $\hat{y}$ with the target $y$:
$$\ell(\hat{y}, y)$$
For a dataset of $N$ examples, the training objective is usually the average loss:
$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell\big(f_\theta(x_i), y_i\big)$$
Training tries to find parameters with low loss:
$$\theta^\star = \arg\min_\theta \mathcal{L}(\theta)$$
In minibatch training, we approximate the full objective using a batch of $B$ examples:
$$\mathcal{L}_{\text{batch}}(\theta) = \frac{1}{B} \sum_{i=1}^{B} \ell\big(f_\theta(x_i), y_i\big)$$
This batch loss is the scalar passed to backward().
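As a concrete sketch of how the batch loss drives parameter updates, the short loop below uses a hypothetical linear regression model and synthetic data; the model, optimizer, and learning rate are illustrative choices, not requirements.

import torch
from torch import nn

# Hypothetical setup: a tiny regression model and one synthetic minibatch.
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

x = torch.randn(16, 4)
y = torch.randn(16, 1)

prediction = model(x)
loss = loss_fn(prediction, y)   # scalar batch loss

optimizer.zero_grad()
loss.backward()                 # gradients of the batch loss w.r.t. parameters
optimizer.step()                # update that reduces the batch loss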
Mean Squared Error
Mean squared error is used for regression. It compares real-valued predictions with real-valued targets:
$$\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$$
In PyTorch:
import torch
from torch import nn
prediction = torch.tensor([2.5, 0.0, 4.0])
target = torch.tensor([3.0, -1.0, 2.0])
loss_fn = nn.MSELoss()
loss = loss_fn(prediction, target)
print(loss)

MSE penalizes large errors strongly because the error is squared. This makes it useful when large deviations are especially undesirable. It can also make the model sensitive to outliers.
For vector-valued outputs:
prediction = torch.randn(32, 10)
target = torch.randn(32, 10)
loss = nn.MSELoss()(prediction, target)

By default, PyTorch averages over all elements.
Mean Absolute Error
Mean absolute error uses absolute differences:
$$\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} \lvert \hat{y}_i - y_i \rvert$$
In PyTorch, this is nn.L1Loss:
loss_fn = nn.L1Loss()
loss = loss_fn(prediction, target)

MAE is less sensitive to outliers than MSE. The penalty grows linearly rather than quadratically.
However, the absolute value has a kink at zero. This rarely prevents training in practice, but it gives a less smooth objective than MSE.
Huber Loss
Huber loss combines properties of MSE and MAE. Small errors are penalized quadratically. Large errors are penalized linearly.
For error $e = \hat{y} - y$ and threshold $\delta$, the Huber loss is
$$L_\delta(e) = \begin{cases} \frac{1}{2} e^2 & \text{if } \lvert e \rvert \le \delta \\ \delta \left( \lvert e \rvert - \frac{1}{2}\delta \right) & \text{otherwise} \end{cases}$$
In PyTorch:
loss_fn = nn.HuberLoss(delta=1.0)
loss = loss_fn(prediction, target)

Huber loss is useful when regression targets contain occasional large outliers.
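A quick comparison on a made-up target vector with one large outlier illustrates how the three regression losses respond to it:

prediction = torch.tensor([1.0, 2.0, 3.0, 4.0])
target = torch.tensor([1.1, 2.1, 2.9, 14.0])  # last target is an outlier

print(nn.MSELoss()(prediction, target))             # ~25.0, dominated by the squared outlier error
print(nn.L1Loss()(prediction, target))              # ~2.6, grows only linearly with the outlier
print(nn.HuberLoss(delta=1.0)(prediction, target))  # ~2.4, quadratic near zero, linear beyond delta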
Binary Cross-Entropy
Binary cross-entropy is used for binary classification. The target is
$$y \in \{0, 1\}$$
The model predicts a probability
$$\hat{p} \in (0, 1)$$
The loss is
$$\text{BCE} = -\big[\, y \log \hat{p} + (1 - y) \log(1 - \hat{p}) \,\big]$$
In PyTorch, the numerically stable form is nn.BCEWithLogitsLoss. It expects logits, not probabilities.
logits = torch.randn(32)
targets = torch.randint(0, 2, (32,)).float()
loss_fn = nn.BCEWithLogitsLoss()
loss = loss_fn(logits, targets)

A logit is the raw model output before the sigmoid is applied. During inference, we convert logits to probabilities:
probs = torch.sigmoid(logits)
preds = (probs >= 0.5).long()

For multi-label classification, use one binary cross-entropy loss per label:
logits = torch.randn(32, 10)
targets = torch.randint(0, 2, (32, 10)).float()
loss = nn.BCEWithLogitsLoss()(logits, targets)

Each label is treated as an independent binary classification problem.
Cross-Entropy for Multi-Class Classification
Multi-class classification assumes that each example belongs to exactly one class. If there are $C$ classes, the target is an integer:
$$y \in \{0, 1, \dots, C - 1\}$$
The model outputs logits:
$$z \in \mathbb{R}^C$$
The softmax function converts logits into probabilities:
$$\hat{p}_c = \frac{e^{z_c}}{\sum_{j=1}^{C} e^{z_j}}$$
Cross-entropy loss is
$$\text{CE} = -\log \hat{p}_y$$
In PyTorch, nn.CrossEntropyLoss expects raw logits and integer class labels:
logits = torch.randn(32, 10)
targets = torch.randint(0, 10, (32,))
loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits, targets)

Do not apply softmax before CrossEntropyLoss:
# Avoid this during training
probs = torch.softmax(logits, dim=-1)
loss = nn.CrossEntropyLoss()(probs, targets)

CrossEntropyLoss already applies a stable log-softmax internally. Passing probabilities pushes them through softmax a second time, which runs without error but silently produces the wrong loss.
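A small check with made-up logits shows that the two calls do not agree:

logits = torch.randn(4, 3)
targets = torch.tensor([0, 2, 1, 2])

correct = nn.CrossEntropyLoss()(logits, targets)
double_softmax = nn.CrossEntropyLoss()(torch.softmax(logits, dim=-1), targets)
print(correct, double_softmax)  # the two values differ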
Negative Log-Likelihood
Negative log-likelihood is closely related to cross-entropy. It expects log-probabilities as input:
log_probs = torch.log_softmax(logits, dim=-1)
loss = nn.NLLLoss()(log_probs, targets)

The following two forms compute the same loss:
loss = nn.CrossEntropyLoss()(logits, targets)
log_probs = torch.log_softmax(logits, dim=-1)
loss = nn.NLLLoss()(log_probs, targets)

CrossEntropyLoss is usually preferred because it is simpler and applies the log-softmax internally in a numerically stable way.
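A quick numerical check with made-up logits confirms that the two forms agree:

logits = torch.randn(8, 5)
targets = torch.randint(0, 5, (8,))

ce = nn.CrossEntropyLoss()(logits, targets)
nll = nn.NLLLoss()(torch.log_softmax(logits, dim=-1), targets)
print(torch.allclose(ce, nll))  # True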
Loss Reduction
Most PyTorch loss functions support a reduction argument.
nn.MSELoss(reduction="mean")
nn.MSELoss(reduction="sum")
nn.MSELoss(reduction="none")

The default is usually "mean".
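The "sum" and "mean" reductions differ only by a factor of the number of elements, as a quick check shows:

prediction = torch.randn(6)
target = torch.randn(6)

mean_loss = nn.MSELoss(reduction="mean")(prediction, target)
sum_loss = nn.MSELoss(reduction="sum")(prediction, target)
print(torch.allclose(sum_loss, mean_loss * prediction.numel()))  # True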
With reduction="none", the loss is returned per element or per example:
loss_fn = nn.CrossEntropyLoss(reduction="none")
logits = torch.randn(4, 3)
targets = torch.tensor([0, 2, 1, 2])
losses = loss_fn(logits, targets)
print(losses.shape)  # torch.Size([4])

This is useful for weighted training, masking, curriculum learning, or debugging.
For example:
weights = torch.tensor([1.0, 0.5, 2.0, 1.0])
loss = (losses * weights).mean()

Class Imbalance
In classification, some classes may appear much more often than others. A model trained with ordinary cross-entropy may favor frequent classes.
PyTorch allows class weights:
class_weights = torch.tensor([1.0, 3.0, 8.0])
loss_fn = nn.CrossEntropyLoss(weight=class_weights)

For binary classification, BCEWithLogitsLoss supports pos_weight:
pos_weight = torch.tensor([5.0])
loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

This increases the loss contribution from positive examples.
Class weighting changes the training objective. It may improve recall for rare classes, but it can also change calibration. A model trained with weights may require careful threshold selection.
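One common heuristic, sketched below with made-up label counts, sets each weight to the inverse of the class frequency; the exact weighting scheme is a modeling choice, not a PyTorch requirement.

# Hypothetical label counts for three classes.
class_counts = torch.tensor([900.0, 300.0, 100.0])
class_weights = class_counts.sum() / (len(class_counts) * class_counts)
print(class_weights)  # rare classes receive larger weights

loss_fn = nn.CrossEntropyLoss(weight=class_weights)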
Masked Losses
Some tasks contain padding or invalid targets. Sequence models often process batches where sequences have different lengths. Shorter sequences are padded so that tensors have a rectangular shape.
Suppose token logits have shape $(B, T, V)$, where $B$ is the batch size, $T$ is the sequence length, and $V$ is the vocabulary size. Targets have shape $(B, T)$.
Padding tokens should not contribute to the loss.
PyTorch’s CrossEntropyLoss supports ignore_index:
B, T, V = 4, 8, 1000
logits = torch.randn(B, T, V)
targets = torch.randint(0, V, (B, T))
pad_id = 0  # assume token id 0 is reserved for padding
targets[0, 5:] = pad_id  # pad out the tail of the first sequence
loss_fn = nn.CrossEntropyLoss(ignore_index=pad_id)
loss = loss_fn(
logits.reshape(B * T, V),
targets.reshape(B * T),
)

Positions whose target equals the padding index are excluded from the loss average.
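An equivalent manual approach, useful when a loss has no ignore_index argument, combines reduction="none" with an explicit mask. This sketch reuses the tensors defined above:

per_token = nn.CrossEntropyLoss(reduction="none")(
    logits.reshape(B * T, V),
    targets.reshape(B * T),
)
mask = (targets.reshape(B * T) != pad_id).float()
loss = (per_token * mask).sum() / mask.sum()  # average over non-padding tokens only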
Choosing a Loss Function
The loss should match the statistical structure of the task.
| Task | Model output | Target | Common loss |
|---|---|---|---|
| Regression | Real values | Real values | MSELoss, L1Loss, HuberLoss |
| Binary classification | Logits | 0 or 1 floats | BCEWithLogitsLoss |
| Multi-class classification | Logits per class | Integer class IDs | CrossEntropyLoss |
| Multi-label classification | Logits per label | 0 or 1 matrix | BCEWithLogitsLoss |
| Language modeling | Token logits | Token IDs | CrossEntropyLoss |
| Segmentation | Pixel logits | Pixel class IDs | CrossEntropyLoss |
A common source of bugs is using the wrong target type or shape. CrossEntropyLoss expects integer class labels. BCEWithLogitsLoss expects floating-point binary targets.
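The shapes and dtypes below, for a hypothetical batch of 32 examples and 10 classes or labels, summarize the two conventions:

# CrossEntropyLoss: float logits of shape (N, C), integer class indices of shape (N,)
ce_logits = torch.randn(32, 10)
ce_targets = torch.randint(0, 10, (32,))             # dtype torch.int64
ce_loss = nn.CrossEntropyLoss()(ce_logits, ce_targets)

# BCEWithLogitsLoss: float logits and float 0/1 targets with the same shape
bce_logits = torch.randn(32, 10)
bce_targets = torch.randint(0, 2, (32, 10)).float()  # dtype torch.float32
bce_loss = nn.BCEWithLogitsLoss()(bce_logits, bce_targets)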
Loss Values and Model Quality
A lower training loss means the model fits the training data better. It does not guarantee better generalization.
A model can reduce training loss while becoming worse on unseen data. This is overfitting. For this reason, training loss should be monitored together with validation loss and task metrics.
For classification, accuracy, precision, recall, F1 score, ROC-AUC, and calibration may matter more than loss alone. For regression, mean absolute error or domain-specific metrics may be easier to interpret than MSE.
The loss function is used for optimization. The evaluation metric is used for judgment. Sometimes they are the same. Often they differ.
Summary
Loss functions convert prediction errors into scalar objectives for training. Regression commonly uses MSE, MAE, or Huber loss. Binary classification uses binary cross-entropy. Multi-class classification uses softmax cross-entropy.
In PyTorch, use logits directly with BCEWithLogitsLoss and CrossEntropyLoss. This gives better numerical stability and simpler code. The selected loss must match the task, target shape, target type, and model output.