Label smoothing is a regularization method for classification. It replaces hard target labels with softened target distributions. Instead of telling the model that the correct class has probability $1$ and every other class has probability $0$, label smoothing assigns most probability mass to the correct class and a small amount to the other classes.
For a classification problem with $K$ classes, a one-hot target for class $y$ is

$$
t_k = \begin{cases} 1 & \text{if } k = y \\ 0 & \text{otherwise} \end{cases}
$$

With label smoothing, the target becomes

$$
t_k^{\mathrm{LS}} = \begin{cases} 1 - \varepsilon & \text{if } k = y \\ \dfrac{\varepsilon}{K - 1} & \text{otherwise} \end{cases}
$$

Here $\varepsilon \in [0, 1]$ is the smoothing strength.
Motivation
Standard cross-entropy training encourages the model to assign probability close to $1$ to the correct class. On clean and easy datasets, this can produce highly confident predictions.
Excessive confidence can be harmful. A model may assign near-certain probability to predictions even when the input is ambiguous, mislabeled, out of distribution, or adversarially perturbed.
Label smoothing discourages this behavior. It tells the model that even the target distribution should have some uncertainty. This often improves calibration, reduces overfitting, and makes the classifier less brittle.
Cross-Entropy with Hard Labels
Let the model output logits $z_1, \dots, z_K$.

The softmax probability for class $k$ is

$$
p_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}
$$

For a hard target class $y$, the cross-entropy loss is

$$
\mathcal{L} = -\log p_y
$$

This objective rewards increasing $p_y$ as much as possible. If the training example is labeled as class $y$, the optimal probability under hard-label cross-entropy is $p_y = 1$.
In finite data settings, this can push the model toward overconfident predictions.
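This pressure toward ever-larger confidence can be seen numerically. The sketch below (plain Python, with a hypothetical `hard_ce` helper) shows that the hard-label loss keeps shrinking as the correct-class logit grows, so training keeps pushing toward $p_y = 1$:

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def hard_ce(logits, target):
    # Cross-entropy with a one-hot target: -log p_y.
    return -math.log(softmax(logits)[target])

# The loss keeps decreasing as the correct-class logit grows,
# so gradient descent keeps pushing confidence toward 1.
for margin in (1.0, 5.0, 10.0):
    logits = [margin, 0.0, 0.0]
    print(f"margin={margin}: loss={hard_ce(logits, 0):.4f}")
```

With a margin of 1 the loss is still sizable, but by a margin of 10 it is nearly zero, and only ever-larger logits can reduce it further.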
Cross-Entropy with Smoothed Labels
With label smoothing, the target distribution is no longer one-hot. Cross-entropy becomes

$$
\mathcal{L} = -\sum_{k=1}^{K} t_k^{\mathrm{LS}} \log p_k
$$

Substituting the smoothed target gives

$$
\mathcal{L} = -(1 - \varepsilon) \log p_y - \frac{\varepsilon}{K - 1} \sum_{k \neq y} \log p_k
$$
The model is still encouraged to place high probability on the correct class, but it is also discouraged from assigning zero probability to any other class.
The result is a softer classifier.
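The change in behavior is easy to check numerically. The sketch below evaluates the smoothed loss formula above (with a hypothetical `smoothed_ce` helper) and shows that, unlike the hard-label loss, pushing the correct logit to extremes eventually increases the loss:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def smoothed_ce(logits, target, epsilon):
    # -(1 - eps) * log p_y  -  eps/(K-1) * sum over k != y of log p_k
    p = softmax(logits)
    k = len(logits)
    loss = -(1 - epsilon) * math.log(p[target])
    loss -= epsilon / (k - 1) * sum(
        math.log(p[j]) for j in range(k) if j != target
    )
    return loss

# Extreme confidence is now penalized: a very large margin
# drives the incorrect-class terms toward -log(0).
for margin in (2.0, 5.0, 20.0):
    logits = [margin, 0.0, 0.0]
    print(f"margin={margin}: loss={smoothed_ce(logits, 0, 0.1):.4f}")
```

The loss is smallest at a moderate margin and grows again for extreme margins, which is exactly the regularizing effect described above.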
Label Smoothing in PyTorch
PyTorch supports label smoothing directly in nn.CrossEntropyLoss.
```python
import torch
from torch import nn

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

logits = torch.randn(32, 10)
targets = torch.randint(0, 10, (32,))
loss = criterion(logits, targets)
```

Here label_smoothing=0.1 uses $\varepsilon = 0.1$.

The target labels remain integer class IDs. PyTorch applies the smoothing internally.
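What the built-in option computes can be checked against an explicit soft-target loss. PyTorch's documentation describes the smoothed target as a mixture of the one-hot label and a uniform distribution over all classes; the sketch below assumes that reading and verifies it numerically:

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(32, 10)
targets = torch.randint(0, 10, (32,))
epsilon = 0.1

builtin = nn.CrossEntropyLoss(label_smoothing=epsilon)(logits, targets)

# Explicit mixture: (1 - eps) * one_hot + eps / K uniform.
num_classes = logits.size(-1)
one_hot = F.one_hot(targets, num_classes).float()
soft = (1 - epsilon) * one_hot + epsilon / num_classes
manual = -(soft * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

print(builtin.item(), manual.item())  # the two values should agree
```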
Manual Implementation
A manual implementation helps clarify the computation.
```python
import torch
import torch.nn.functional as F

def cross_entropy_with_label_smoothing(logits, targets, epsilon):
    num_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    with torch.no_grad():
        # Start with eps / (K - 1) in every position ...
        smooth_targets = torch.full_like(
            log_probs,
            fill_value=epsilon / (num_classes - 1),
        )
        # ... then place 1 - eps on each example's true class.
        smooth_targets.scatter_(
            dim=1,
            index=targets.unsqueeze(1),
            value=1.0 - epsilon,
        )
    loss = -(smooth_targets * log_probs).sum(dim=1)
    return loss.mean()
```

This version constructs the smoothed target distribution explicitly.

For a batch of 3 examples and 5 classes, the targets might look like this when $\varepsilon = 0.1$:
| True class | Smoothed target distribution |
|---|---|
| 0 | [0.900, 0.025, 0.025, 0.025, 0.025] |
| 2 | [0.025, 0.025, 0.900, 0.025, 0.025] |
| 4 | [0.025, 0.025, 0.025, 0.025, 0.900] |
Uniform Smoothing Variant
Some implementations distribute $\varepsilon$ over all $K$ classes, including the correct class:

$$
t_k^{\mathrm{LS}} = (1 - \varepsilon)\, t_k + \frac{\varepsilon}{K}
$$

Under this convention, the correct class receives

$$
1 - \varepsilon + \frac{\varepsilon}{K}
$$

and each incorrect class receives

$$
\frac{\varepsilon}{K}
$$

This differs slightly from the earlier formulation, where the incorrect classes share $\varepsilon$ (each receiving $\varepsilon / (K - 1)$) and the correct class receives exactly $1 - \varepsilon$.
Both forms are common; PyTorch's nn.CrossEntropyLoss uses the uniform mixture over all $K$ classes. When comparing results across libraries or papers, check the convention.
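The numerical difference between the two conventions is small but real. A quick comparison for $K = 5$ and $\varepsilon = 0.1$:

```python
K, eps = 5, 0.1

# Convention A: incorrect classes share eps, correct class gets exactly 1 - eps.
correct_a = 1 - eps                      # 0.9
incorrect_a = eps / (K - 1)              # 0.025

# Convention B: eps spread over all classes, including the correct one.
correct_b = 1 - eps + eps / K            # 0.92
incorrect_b = eps / K                    # 0.02

print(correct_a, incorrect_a)
print(correct_b, incorrect_b)

# Both are valid probability distributions over the K classes.
assert abs(correct_a + (K - 1) * incorrect_a - 1.0) < 1e-12
assert abs(correct_b + (K - 1) * incorrect_b - 1.0) < 1e-12
```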
Effect on Gradients
For softmax cross-entropy, the gradient with respect to each logit is

$$
\frac{\partial \mathcal{L}}{\partial z_k} = p_k - t_k
$$

With hard labels, $t_y = 1$ and $t_k = 0$ for $k \neq y$. The model receives a strong gradient pushing the correct class probability upward and all other probabilities downward.

With label smoothing, $t_y = 1 - \varepsilon$ and $t_k = \varepsilon / (K - 1)$ for incorrect classes. The gradient still favors the correct class, but it no longer demands absolute certainty: the gradient on the correct logit vanishes at $p_y = 1 - \varepsilon$ rather than at $p_y = 1$.
This reduces logit magnitudes. In practice, label smoothing often prevents the final-layer weights from growing too aggressively.
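The gradient formula can be checked with autograd. This sketch assumes the $\varepsilon / (K - 1)$ convention for the smoothed target and compares the autograd gradient against the analytic form $p_k - t_k$:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
K = 5
logits = torch.randn(K, requires_grad=True)
y, eps = 2, 0.1

# Smoothed target: 1 - eps on the true class, eps/(K-1) elsewhere.
t = torch.full((K,), eps / (K - 1))
t[y] = 1 - eps

# Soft-target cross-entropy.
loss = -(t * F.log_softmax(logits, dim=0)).sum()
loss.backward()

# Analytic gradient: p_k - t_k.
p = F.softmax(logits.detach(), dim=0)
print(torch.allclose(logits.grad, p - t, atol=1e-6))  # True
```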
Label Smoothing and Calibration
Calibration measures whether predicted probabilities match empirical correctness. If a classifier says it is 90 percent confident, then among many such predictions, about 90 percent should be correct.
Hard-label training often produces overconfident models. Label smoothing can improve calibration by reducing extreme probabilities.
However, label smoothing does not guarantee perfect calibration. Temperature scaling, validation-based calibration, and uncertainty estimation may still be needed in high-stakes systems.
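For contrast, here is a minimal sketch of the temperature-scaling idea mentioned above: divide the logits by a temperature $T$ before the softmax, where $T$ would normally be fitted on held-out validation data (the function name is illustrative):

```python
import torch
import torch.nn.functional as F

def scale_confidence(logits, temperature):
    # Temperature > 1 softens the predicted distribution;
    # T is normally fitted on held-out validation data.
    return F.softmax(logits / temperature, dim=-1)

logits = torch.tensor([[4.0, 1.0, 0.0]])
print(scale_confidence(logits, 1.0).max().item())  # sharper
print(scale_confidence(logits, 2.0).max().item())  # softer
```

Unlike label smoothing, temperature scaling is applied after training and does not change the ranking of the classes.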
Label Noise
Label smoothing is useful when labels may contain noise. If some training examples are mislabeled, hard-label cross-entropy forces the model to fit those incorrect labels with high confidence.
Smoothed labels reduce this pressure. They make the training objective less brittle by allowing some uncertainty in the target.
This does not solve severe label noise. If labels are highly corrupted, dedicated methods such as noise-robust losses, relabeling, filtering, or semi-supervised learning may be needed.
Relation to Knowledge Distillation
Label smoothing is related to knowledge distillation.
In knowledge distillation, a student model is trained on soft targets produced by a teacher model. These soft targets may assign meaningful probability mass to similar classes. For example, an image of a wolf may receive some probability for dog-like classes.
Label smoothing uses a simpler target distribution. It assigns the same small probability to all incorrect classes. Thus it does not encode class similarity unless the smoothing distribution is modified.
A more general form is

$$
t^{\mathrm{LS}} = (1 - \varepsilon)\, t + \varepsilon\, u
$$

where $u$ is a chosen smoothing distribution over the classes. Standard label smoothing uses a uniform $u$. Distillation uses a teacher-derived distribution.
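The general form is a small change in code. This sketch (with a hypothetical `smooth_toward` helper) mixes one-hot targets with an arbitrary smoothing distribution `u`; a uniform `u` recovers standard label smoothing, and a teacher's output gives a distillation-style soft target:

```python
import torch
import torch.nn.functional as F

def smooth_toward(targets, u, epsilon):
    # u is assumed to be a probability vector over the classes.
    num_classes = u.numel()
    one_hot = F.one_hot(targets, num_classes).float()
    return (1 - epsilon) * one_hot + epsilon * u

targets = torch.tensor([0, 2])
uniform = torch.full((5,), 1 / 5)
soft = smooth_toward(targets, uniform, 0.1)
print(soft)  # each row sums to 1, true class gets 0.92
```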
Label Smoothing in Multi-Class Versus Multi-Label Tasks
Label smoothing is most common in single-label multi-class classification, where each example belongs to exactly one class.
For multi-label classification, each example may belong to several classes. The usual loss is binary cross-entropy applied independently to each class. Smoothing can still be used, but the formulation differs.
For a binary target $t \in \{0, 1\}$, one may use

$$
t^{\mathrm{LS}} = t\,(1 - \varepsilon) + (1 - t)\,\varepsilon
$$

This changes positive labels from $1$ to $1 - \varepsilon$ and negative labels from $0$ to $\varepsilon$.
In PyTorch, this is usually implemented manually with BCEWithLogitsLoss.
```python
targets = targets.float()
epsilon = 0.05
smoothed_targets = targets * (1 - epsilon) + (1 - targets) * epsilon
loss = torch.nn.functional.binary_cross_entropy_with_logits(
    logits,
    smoothed_targets,
)
```

For multi-label tasks, smoothing should be used carefully because negative labels may mean "unknown" rather than truly absent.
Label Smoothing and Class Imbalance
With class imbalance, uniform smoothing may place probability mass on many rare or irrelevant classes. This can interact poorly with the learning objective.
Possible adjustments include:
| Strategy | Description |
|---|---|
| Smaller $\varepsilon$ | Reduce smoothing strength |
| Class-prior smoothing | Smooth toward empirical class frequencies |
| Hierarchical smoothing | Smooth toward related classes |
| No smoothing | Prefer weighted losses or resampling |
For example, in a fine-grained taxonomy, smoothing from “Siberian husky” toward all classes uniformly may be less meaningful than smoothing toward related dog breeds.
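The class-prior variant from the table can be sketched by replacing the uniform distribution with empirical class frequencies. The function name and `class_counts` values below are hypothetical:

```python
import torch
import torch.nn.functional as F

def prior_smoothed_targets(targets, class_counts, epsilon):
    # Smooth toward the empirical class distribution instead of uniform.
    prior = class_counts.float() / class_counts.sum()
    one_hot = F.one_hot(targets, class_counts.numel()).float()
    return (1 - epsilon) * one_hot + epsilon * prior

# Hypothetical counts for an imbalanced 4-class problem.
class_counts = torch.tensor([1000, 500, 50, 5])
targets = torch.tensor([0, 3])
soft = prior_smoothed_targets(targets, class_counts, 0.1)
print(soft)  # rare classes receive very little smoothing mass
```

This avoids placing as much probability mass on rare classes as on frequent ones, at the cost of making the target depend on the training distribution.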
Choosing the Smoothing Strength
Common values are:
| $\varepsilon$ | Typical effect |
|---|---|
| 0.0 | No smoothing |
| 0.05 | Mild smoothing |
| 0.1 | Common default |
| 0.2 | Strong smoothing |
| > 0.2 | Often excessive |
The best value depends on dataset size, label quality, number of classes, and model capacity.
If label smoothing is too strong, the model may become underconfident and accuracy may fall. If it is too weak, it may have little effect.
When Label Smoothing Helps
Label smoothing often helps when:
| Condition | Reason |
|---|---|
| Labels contain mild noise | Reduces pressure to fit noisy labels |
| Model is overconfident | Softens predicted distributions |
| Dataset is small or moderate | Reduces overfitting |
| Number of classes is large | Prevents extreme class separation |
| Calibration matters | Can reduce confidence errors |
It is widely used in image classification, machine translation, and transformer training.
When Label Smoothing May Hurt
Label smoothing may hurt when exact confidence is important or when the training labels are already soft and informative.
It can also interfere with methods that rely on sharp teacher signals, hard negative mining, or margin maximization.
For some retrieval and contrastive systems, label smoothing may reduce the strength of useful class separation. For knowledge distillation, teacher probabilities often provide better soft targets than uniform smoothing.
Practical Guidelines
Use nn.CrossEntropyLoss(label_smoothing=0.05) or 0.1 as a starting point for multi-class classification.
Do not apply label smoothing blindly to every task. Check whether labels are single-label, multi-label, soft, noisy, hierarchical, or class-imbalanced.
Monitor both accuracy and calibration. Label smoothing may improve calibration while slightly changing accuracy.
For very noisy labels, combine label smoothing with data cleaning, robust training, or semi-supervised methods.
Summary
Label smoothing replaces one-hot labels with softened target distributions. It regularizes classification models by discouraging extreme confidence.
In PyTorch, it can be enabled directly through nn.CrossEntropyLoss(label_smoothing=...).
Label smoothing is simple and often effective, especially for overconfident classifiers and mildly noisy labels. Its main risk is excessive underconfidence when the smoothing strength is too large.