
Label Smoothing


Label smoothing is a regularization method for classification. It replaces hard target labels with softened target distributions. Instead of telling the model that the correct class has probability 1 and every other class has probability 0, label smoothing assigns most of the probability mass to the correct class and spreads a small amount over the other classes.

For a classification problem with K classes, a one-hot target for class y is

q_k = \begin{cases} 1, & k = y, \\ 0, & k \neq y. \end{cases}

With label smoothing, the target becomes

q_k^{(\epsilon)} = \begin{cases} 1 - \epsilon, & k = y, \\ \frac{\epsilon}{K-1}, & k \neq y. \end{cases}

Here ϵ ∈ [0, 1] is the smoothing strength.
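
For example, with K = 5 classes and ϵ = 0.1, the correct class receives 1 − 0.1 = 0.9 and each of the four incorrect classes receives 0.1 / 4 = 0.025.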

Motivation

Standard cross-entropy training encourages the model to assign probability close to 1 to the correct class. On clean and easy datasets, this can produce highly confident predictions.

Excessive confidence can be harmful. A model may assign near-certain probability to predictions even when the input is ambiguous, mislabeled, out of distribution, or adversarially perturbed.

Label smoothing discourages this behavior. It tells the model that even the target distribution should have some uncertainty. This often improves calibration, reduces overfitting, and makes the classifier less brittle.

Cross-Entropy with Hard Labels

Let the model output logits

z \in \mathbb{R}^{K}.

The softmax probability for class k is

p_k = \frac{\exp(z_k)}{\sum_{j=1}^{K}\exp(z_j)}.

For a hard target class y, cross-entropy loss is

\mathcal{L} = -\log p_y.

This objective rewards increasing p_y as much as possible. If the training example is labeled as class y, the optimal probability under hard-label cross-entropy is p_y = 1.

In finite data settings, this can push the model toward overconfident predictions.
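
As a minimal sketch of this computation, with an arbitrary three-class example, the softmax probability and hard-label loss can be written out directly:

import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])   # one example, K = 3 classes
target = torch.tensor([0])                  # hard label: class y = 0

probs = F.softmax(logits, dim=-1)           # p_k = exp(z_k) / sum_j exp(z_j)
loss = -probs[0, 0].log()                   # -log p_y

# Should match the built-in hard-label cross-entropy
assert torch.allclose(loss, F.cross_entropy(logits, target))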

Cross-Entropy with Smoothed Labels

With label smoothing, the target distribution is no longer one-hot. Cross-entropy becomes

\mathcal{L} = -\sum_{k=1}^{K} q_k^{(\epsilon)} \log p_k.

Substituting the smoothed target gives

\mathcal{L} = -(1-\epsilon)\log p_y - \sum_{k\neq y} \frac{\epsilon}{K-1}\log p_k.

The model is still encouraged to place high probability on the correct class, but it is also discouraged from assigning zero probability to every other class.

The result is a softer classifier.
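
A small numerical check (with arbitrary logits, K = 5, and ϵ = 0.1) confirms that the full sum and the two-term form agree:

import torch
import torch.nn.functional as F

K, epsilon, y = 5, 0.1, 2
logits = torch.randn(K)
log_probs = F.log_softmax(logits, dim=-1)

# Full sum over the smoothed target distribution
q = torch.full((K,), epsilon / (K - 1))
q[y] = 1 - epsilon
loss_full = -(q * log_probs).sum()

# Split form: correct-class term plus shared incorrect-class term
others = torch.cat([log_probs[:y], log_probs[y + 1:]])
loss_split = -(1 - epsilon) * log_probs[y] - (epsilon / (K - 1)) * others.sum()

assert torch.allclose(loss_full, loss_split)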

Label Smoothing in PyTorch

PyTorch supports label smoothing directly in nn.CrossEntropyLoss.

import torch
from torch import nn

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

logits = torch.randn(32, 10)
targets = torch.randint(0, 10, (32,))

loss = criterion(logits, targets)

Here label_smoothing=0.1 uses ϵ = 0.1.

The target labels remain integer class IDs. PyTorch applies the smoothing internally.

Manual Implementation

A manual implementation helps clarify the computation.

import torch
import torch.nn.functional as F

def cross_entropy_with_label_smoothing(logits, targets, epsilon):
    num_classes = logits.size(-1)

    log_probs = F.log_softmax(logits, dim=-1)

    # Build the smoothed target distribution: epsilon / (K - 1) everywhere,
    # then 1 - epsilon at the true class index of each example.
    with torch.no_grad():
        smooth_targets = torch.full_like(
            log_probs,
            fill_value=epsilon / (num_classes - 1),
        )
        smooth_targets.scatter_(
            dim=1,
            index=targets.unsqueeze(1),
            value=1.0 - epsilon,
        )

    # Cross-entropy against the smoothed targets, averaged over the batch
    loss = -(smooth_targets * log_probs).sum(dim=1)
    return loss.mean()

This version constructs the smoothed target distribution explicitly.
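
The function can then be called like the built-in criterion from the previous section. As a caveat, the result may differ slightly from nn.CrossEntropyLoss(label_smoothing=...), since implementations can follow the uniform convention described in the next section rather than the ϵ/(K−1) form used here.

logits = torch.randn(32, 10)
targets = torch.randint(0, 10, (32,))

loss = cross_entropy_with_label_smoothing(logits, targets, epsilon=0.1)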

For a batch of 3 examples and 5 classes, the targets might look like this when ϵ = 0.1:

True class | Smoothed target distribution
---------- | ------------------------------------
0          | [0.900, 0.025, 0.025, 0.025, 0.025]
2          | [0.025, 0.025, 0.900, 0.025, 0.025]
4          | [0.025, 0.025, 0.025, 0.025, 0.900]

Uniform Smoothing Variant

Some implementations distribute ϵ over all classes, including the correct class:

q_k^{(\epsilon)} = (1-\epsilon)\mathbf{1}_{k=y} + \frac{\epsilon}{K}.

Under this convention, the correct class receives

1 - \epsilon + \frac{\epsilon}{K},

and each incorrect class receives

\frac{\epsilon}{K}.

This differs slightly from the earlier formulation, where the incorrect classes share ϵ and the correct class receives exactly 1 − ϵ.

Both forms are common. When comparing results across libraries or papers, check the convention.
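
A short sketch of the two conventions side by side (K = 5, ϵ = 0.1, true class 2, all chosen for illustration):

import torch

K, epsilon, y = 5, 0.1, 2

one_hot = torch.zeros(K)
one_hot[y] = 1.0

# Uniform variant: spread epsilon over all K classes, including the correct one
uniform_targets = (1 - epsilon) * one_hot + epsilon / K
# -> [0.020, 0.020, 0.920, 0.020, 0.020]

# Earlier variant: spread epsilon over the K - 1 incorrect classes only
exclusive_targets = torch.full((K,), epsilon / (K - 1))
exclusive_targets[y] = 1 - epsilon
# -> [0.025, 0.025, 0.900, 0.025, 0.025]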

Effect on Gradients

For softmax cross-entropy, the gradient with respect to each logit is

\frac{\partial \mathcal{L}}{\partial z_k} = p_k - q_k.

With hard labels, q_y = 1 and q_k = 0 for k ≠ y. The model receives a strong gradient pushing the correct class probability upward and all other probabilities downward.

With label smoothing, q_y < 1 and q_k > 0 for incorrect classes. The gradient still favors the correct class, but it no longer demands absolute certainty.

This reduces logit magnitudes. In practice, label smoothing often prevents the final-layer weights from growing too aggressively.
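
The gradient identity can be verified numerically with autograd; a minimal sketch using the ϵ/(K−1) targets:

import torch
import torch.nn.functional as F

K, epsilon, y = 5, 0.1, 2
logits = torch.randn(K, requires_grad=True)

# Smoothed target with epsilon shared by the K - 1 incorrect classes
q = torch.full((K,), epsilon / (K - 1))
q[y] = 1 - epsilon

loss = -(q * F.log_softmax(logits, dim=-1)).sum()
loss.backward()

# Gradient of the loss with respect to the logits is p - q
p = F.softmax(logits.detach(), dim=-1)
assert torch.allclose(logits.grad, p - q, atol=1e-6)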

Label Smoothing and Calibration

Calibration measures whether predicted probabilities match empirical correctness. If a classifier says it is 90 percent confident, then among many such predictions, about 90 percent should be correct.

Hard-label training often produces overconfident models. Label smoothing can improve calibration by reducing extreme probabilities.

However, label smoothing does not guarantee perfect calibration. Temperature scaling, validation-based calibration, and uncertainty estimation may still be needed in high-stakes systems.
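
For reference, temperature scaling divides the logits by a scalar temperature before the softmax; a minimal sketch with a fixed temperature (in practice the value is fitted on held-out data):

import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)   # validation logits, arbitrary values for illustration
T = 1.5                       # a temperature above 1 softens the predicted distribution

calibrated_probs = F.softmax(logits / T, dim=-1)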

Label Noise

Label smoothing is useful when labels may contain noise. If some training examples are mislabeled, hard-label cross-entropy forces the model to fit those incorrect labels with high confidence.

Smoothed labels reduce this pressure. They make the training objective less brittle by allowing some uncertainty in the target.

This does not solve severe label noise. If labels are highly corrupted, dedicated methods such as noise-robust losses, relabeling, filtering, or semi-supervised learning may be needed.

Relation to Knowledge Distillation

Label smoothing is related to knowledge distillation.

In knowledge distillation, a student model is trained on soft targets produced by a teacher model. These soft targets may assign meaningful probability mass to similar classes. For example, an image of a wolf may receive some probability for dog-like classes.

Label smoothing uses a simpler target distribution. It assigns the same small probability to all incorrect classes. Thus it does not encode class similarity unless the smoothing distribution is modified.

A more general form is

q^{(\epsilon)} = (1-\epsilon) q_{\text{hard}} + \epsilon r,

where r is a chosen smoothing distribution. Standard label smoothing uses a uniform r. Distillation uses a teacher-derived distribution.
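
A sketch of the generalized form with a non-uniform r (the distribution below is hand-picked purely for illustration):

import torch

epsilon, y = 0.1, 2

q_hard = torch.zeros(5)
q_hard[y] = 1.0

# r can encode prior knowledge about class similarity;
# a uniform r recovers standard label smoothing
r = torch.tensor([0.05, 0.15, 0.40, 0.30, 0.10])

q = (1 - epsilon) * q_hard + epsilon * r   # still sums to 1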

Label Smoothing in Multi-Class Versus Multi-Label Tasks

Label smoothing is most common in single-label multi-class classification, where each example belongs to exactly one class.

For multi-label classification, each example may belong to several classes. The usual loss is binary cross-entropy applied independently to each class. Smoothing can still be used, but the formulation differs.

For a binary target y_k ∈ {0, 1}, one may use

\tilde{y}_k = y_k(1-\epsilon) + (1-y_k)\epsilon.

This changes positive labels from 1 to 1 − ϵ and negative labels from 0 to ϵ.

In PyTorch, this is usually implemented manually on top of BCEWithLogitsLoss or its functional equivalent.

import torch

# Illustrative shapes: 8 examples, 5 independent binary labels
logits = torch.randn(8, 5)
targets = torch.randint(0, 2, (8, 5)).float()
epsilon = 0.05

# Positives move from 1 to 1 - epsilon, negatives from 0 to epsilon
smoothed_targets = targets * (1 - epsilon) + (1 - targets) * epsilon

loss = torch.nn.functional.binary_cross_entropy_with_logits(
    logits,
    smoothed_targets,
)

For multi-label tasks, smoothing should be used carefully because negative labels may mean “unknown” rather than truly absent.

Label Smoothing and Class Imbalance

With class imbalance, uniform smoothing may place probability mass on many rare or irrelevant classes. This can interact poorly with the learning objective.

Possible adjustments include:

Strategy               | Description
---------------------- | ------------------------------------------
Smaller ϵ              | Reduce smoothing strength
Class-prior smoothing  | Smooth toward empirical class frequencies
Hierarchical smoothing | Smooth toward related classes
No smoothing           | Prefer weighted losses or resampling

For example, in a fine-grained taxonomy, smoothing from “Siberian husky” toward all classes uniformly may be less meaningful than smoothing toward related dog breeds.
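
For instance, class-prior smoothing can reuse the generalized form q = (1 − ϵ) q_hard + ϵ r with r set to the empirical class frequencies; a sketch with made-up counts:

import torch

epsilon = 0.05
class_counts = torch.tensor([5000.0, 1200.0, 300.0, 40.0, 10.0])  # illustrative counts
r = class_counts / class_counts.sum()      # empirical class priors

y = 3
q_hard = torch.zeros_like(r)
q_hard[y] = 1.0

q = (1 - epsilon) * q_hard + epsilon * r   # smoothed toward the class priors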

Choosing the Smoothing Strength

Common values are:

ϵ     | Typical effect
----- | -----------------
0.0   | No smoothing
0.05  | Mild smoothing
0.1   | Common default
0.2   | Strong smoothing
> 0.2 | Often excessive

The best value depends on dataset size, label quality, number of classes, and model capacity.

If label smoothing is too strong, the model may become underconfident and accuracy may fall. If it is too weak, it may have little effect.

When Label Smoothing Helps

Label smoothing often helps when:

Condition                    | Reason
---------------------------- | -------------------------------------
Labels contain mild noise    | Reduces pressure to fit noisy labels
Model is overconfident       | Softens predicted distributions
Dataset is small or moderate | Reduces overfitting
Number of classes is large   | Prevents extreme class separation
Calibration matters          | Can reduce confidence errors

It is widely used in image classification, machine translation, and transformer training.

When Label Smoothing May Hurt

Label smoothing may hurt when exact confidence is important or when the training labels are already soft and informative.

It can also interfere with methods that rely on sharp teacher signals, hard negative mining, or margin maximization.

For some retrieval and contrastive systems, label smoothing may reduce the strength of useful class separation. For knowledge distillation, teacher probabilities often provide better soft targets than uniform smoothing.

Practical Guidelines

Use nn.CrossEntropyLoss(label_smoothing=0.05) or 0.1 as a starting point for multi-class classification.

Do not apply label smoothing blindly to every task. Check whether labels are single-label, multi-label, soft, noisy, hierarchical, or class-imbalanced.

Monitor both accuracy and calibration. Label smoothing may improve calibration while slightly changing accuracy.

For very noisy labels, combine label smoothing with data cleaning, robust training, or semi-supervised methods.

Summary

Label smoothing replaces one-hot labels with softened target distributions. It regularizes classification models by discouraging extreme confidence.

In PyTorch, it can be enabled directly through nn.CrossEntropyLoss(label_smoothing=...).

Label smoothing is simple and often effective, especially for overconfident classifiers and mildly noisy labels. Its main risk is excessive underconfidence when the smoothing strength is too large.