# Label Smoothing

Label smoothing is a regularization method for classification. It replaces hard target labels with softened target distributions. Instead of telling the model that the correct class has probability $1$ and every other class has probability $0$, label smoothing assigns most probability mass to the correct class and a small amount to the other classes.

For a classification problem with $K$ classes, a one-hot target for class $y$ is

$$
q_k =
\begin{cases}
1, & k=y, \\
0, & k\neq y.
\end{cases}
$$

With label smoothing, the target becomes

$$
q_k^{(\epsilon)} =
\begin{cases}
1-\epsilon, & k=y, \\
\frac{\epsilon}{K-1}, & k\neq y.
\end{cases}
$$

Here $\epsilon\in[0,1]$ is the smoothing strength.

### Motivation

Standard cross-entropy training encourages the model to assign probability close to $1$ to the correct class. On clean and easy datasets, this can produce highly confident predictions.

Excessive confidence can be harmful. A model may assign near-certain probability to predictions even when the input is ambiguous, mislabeled, out of distribution, or adversarially perturbed.

Label smoothing discourages this behavior. It tells the model that even the target distribution should have some uncertainty. This often improves calibration, reduces overfitting, and makes the classifier less brittle.

### Cross-Entropy with Hard Labels

Let the model output logits

$$
z\in\mathbb{R}^{K}.
$$

The softmax probability for class $k$ is

$$
p_k = \frac{\exp(z_k)}{\sum_{j=1}^{K}\exp(z_j)}.
$$

For a hard target class $y$, cross-entropy loss is

$$
\mathcal{L} =
-\log p_y.
$$

This objective rewards increasing $p_y$ as much as possible. If the training example is labeled as class $y$, the optimal probability under hard-label cross-entropy is $p_y=1$.

In finite data settings, this can push the model toward overconfident predictions.
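As a quick check, the hard-label loss can be computed directly from these definitions. A minimal sketch with arbitrary example tensors, comparing the manual computation against `F.cross_entropy`:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)           # batch of 4, K = 10 classes
targets = torch.randint(0, 10, (4,))  # integer class IDs

# Manual computation: L = -log p_y for each example, averaged.
log_probs = F.log_softmax(logits, dim=-1)
manual = -log_probs[torch.arange(4), targets].mean()

# Matches PyTorch's built-in cross-entropy (default mean reduction).
assert torch.allclose(manual, F.cross_entropy(logits, targets))
```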

### Cross-Entropy with Smoothed Labels

With label smoothing, the target distribution is no longer one-hot. Cross-entropy becomes

$$
\mathcal{L} =
-\sum_{k=1}^{K} q_k^{(\epsilon)} \log p_k.
$$

Substituting the smoothed target gives

$$
\mathcal{L} =
-(1-\epsilon)\log p_y -
\sum_{k\neq y}
\frac{\epsilon}{K-1}\log p_k.
$$

The model is still encouraged to place high probability on the correct class, but it is also discouraged from assigning zero probability to every other class.

The result is a softer classifier.
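Concretely, because cross-entropy $-\sum_k q_k \log p_k$ is minimized when $p=q$, the optimum under smoothing is no longer $p_y=1$. For $K=10$ and $\epsilon=0.1$, the loss

$$
\mathcal{L} = -0.9\log p_y - \frac{0.1}{9}\sum_{k\neq y}\log p_k
$$

is minimized at $p_y=0.9$ and $p_k=0.1/9\approx 0.011$ for each incorrect class.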

### Label Smoothing in PyTorch

PyTorch supports label smoothing directly in `nn.CrossEntropyLoss`.

```python
import torch
from torch import nn

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

logits = torch.randn(32, 10)
targets = torch.randint(0, 10, (32,))

loss = criterion(logits, targets)
```

Here `label_smoothing=0.1` uses $\epsilon=0.1$.

The target labels remain integer class IDs; PyTorch applies the smoothing internally, following the uniform variant described below.

### Manual Implementation

A manual implementation helps clarify the computation.

```python
import torch
import torch.nn.functional as F

def cross_entropy_with_label_smoothing(logits, targets, epsilon):
    """Cross-entropy against (1 - eps) on the true class and
    eps / (K - 1) on each incorrect class. Expects logits of shape
    (batch, num_classes) and integer targets of shape (batch,)."""
    num_classes = logits.size(-1)

    log_probs = F.log_softmax(logits, dim=-1)

    # Build the smoothed targets without tracking gradients.
    with torch.no_grad():
        # Every entry starts with the off-class mass eps / (K - 1).
        smooth_targets = torch.full_like(
            log_probs,
            fill_value=epsilon / (num_classes - 1),
        )
        # Overwrite the true-class entry in each row with 1 - eps.
        smooth_targets.scatter_(
            dim=1,
            index=targets.unsqueeze(1),
            value=1.0 - epsilon,
        )

    # Cross-entropy: -sum_k q_k log p_k, averaged over the batch.
    loss = -(smooth_targets * log_probs).sum(dim=1)
    return loss.mean()
```

This version constructs the smoothed target distribution explicitly.

For a batch of 3 examples and 5 classes, the targets might look like this when $\epsilon=0.1$:

| True class | Smoothed target distribution |
|---:|---|
| 0 | `[0.900, 0.025, 0.025, 0.025, 0.025]` |
| 2 | `[0.025, 0.025, 0.900, 0.025, 0.025]` |
| 4 | `[0.025, 0.025, 0.025, 0.025, 0.900]` |
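The table can be reproduced directly with the same construction used in the function above (assuming $\epsilon=0.1$ and $K=5$):

```python
import torch

epsilon, num_classes = 0.1, 5
targets = torch.tensor([0, 2, 4])

smooth_targets = torch.full((3, num_classes), epsilon / (num_classes - 1))
smooth_targets.scatter_(dim=1, index=targets.unsqueeze(1), value=1.0 - epsilon)

print(smooth_targets)
# tensor([[0.9000, 0.0250, 0.0250, 0.0250, 0.0250],
#         [0.0250, 0.0250, 0.9000, 0.0250, 0.0250],
#         [0.0250, 0.0250, 0.0250, 0.0250, 0.9000]])
```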

### Uniform Smoothing Variant

Some implementations distribute $\epsilon$ over all classes, including the correct class:

$$
q_k^{(\epsilon)} =
(1-\epsilon)\mathbf{1}_{k=y} + \frac{\epsilon}{K}.
$$

Under this convention, the correct class receives

$$
1-\epsilon+\frac{\epsilon}{K},
$$

and each incorrect class receives

$$
\frac{\epsilon}{K}.
$$

This differs slightly from the earlier formulation, where the incorrect classes share $\epsilon$ and the correct class receives exactly $1-\epsilon$.

Both forms are common. PyTorch's `nn.CrossEntropyLoss(label_smoothing=...)` uses this uniform convention, while the manual implementation above uses the $\epsilon/(K-1)$ convention, so the two do not produce identical losses. When comparing results across libraries or papers, check which convention is in use.
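To see the difference numerically, this sketch (with arbitrary tensors) builds the uniform-variant targets by hand and checks them against the built-in loss:

```python
import torch
import torch.nn.functional as F

epsilon, K = 0.1, 5
logits = torch.randn(8, K)
targets = torch.randint(0, K, (8,))

log_probs = F.log_softmax(logits, dim=-1)
one_hot = F.one_hot(targets, K).float()

# Uniform variant: spread epsilon over all K classes.
q_uniform = (1 - epsilon) * one_hot + epsilon / K
loss_uniform = -(q_uniform * log_probs).sum(dim=1).mean()

# Agrees with PyTorch's built-in label smoothing.
builtin = F.cross_entropy(logits, targets, label_smoothing=epsilon)
assert torch.allclose(loss_uniform, builtin)
```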

### Effect on Gradients

For softmax cross-entropy, the gradient with respect to each logit is

$$
\frac{\partial \mathcal{L}}{\partial z_k} =
p_k - q_k.
$$

With hard labels, $q_y=1$ and $q_k=0$ for $k\neq y$. The model receives a strong gradient pushing the correct class probability upward and all other probabilities downward.

With label smoothing, $q_y<1$ and $q_k>0$ for incorrect classes. The gradient still favors the correct class, but it no longer demands absolute certainty.

This tends to shrink logit magnitudes: the smoothed optimum is reached at finite logit gaps, whereas a hard one-hot target is approached only as the gap between the correct-class logit and the others grows without bound. In practice, label smoothing often prevents the final-layer weights from growing too aggressively.
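The gradient identity can be verified with autograd. A minimal sketch, using the uniform convention that PyTorch's built-in follows and a summed loss so that no batch-size scaling enters:

```python
import torch
import torch.nn.functional as F

epsilon, K = 0.1, 5
logits = torch.randn(4, K, requires_grad=True)
targets = torch.randint(0, K, (4,))

loss = F.cross_entropy(logits, targets, label_smoothing=epsilon, reduction="sum")
loss.backward()

# Smoothed targets under the uniform convention.
q = (1 - epsilon) * F.one_hot(targets, K).float() + epsilon / K
expected = torch.softmax(logits.detach(), dim=-1) - q

# The logit gradient is p - q, row by row.
assert torch.allclose(logits.grad, expected, atol=1e-6)
```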

### Label Smoothing and Calibration

Calibration measures whether predicted probabilities match empirical correctness. If a classifier says it is 90 percent confident, then among many such predictions, about 90 percent should be correct.

Hard-label training often produces overconfident models. Label smoothing can improve calibration by reducing extreme probabilities.

However, label smoothing does not guarantee perfect calibration. Temperature scaling, validation-based calibration, and uncertainty estimation may still be needed in high-stakes systems.
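For reference, temperature scaling itself is a one-parameter post-hoc fix: logits are divided by a scalar $T>0$ fitted on held-out data. A minimal sketch, assuming `val_logits` and `val_targets` are hypothetical validation outputs:

```python
import torch
import torch.nn.functional as F

val_logits = torch.randn(100, 5)           # hypothetical validation logits
val_targets = torch.randint(0, 5, (100,))  # hypothetical validation labels

# Fit a single temperature T > 0 by minimizing validation NLL.
log_T = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.LBFGS([log_T], lr=0.1, max_iter=50)

def closure():
    optimizer.zero_grad()
    loss = F.cross_entropy(val_logits / log_T.exp(), val_targets)
    loss.backward()
    return loss

optimizer.step(closure)
T = log_T.exp().item()  # divide test-time logits by T before softmax
```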

### Label Noise

Label smoothing is useful when labels may contain noise. If some training examples are mislabeled, hard-label cross-entropy forces the model to fit those incorrect labels with high confidence.

Smoothed labels reduce this pressure. They make the training objective less brittle by allowing some uncertainty in the target.

This does not solve severe label noise. If labels are highly corrupted, dedicated methods such as noise-robust losses, relabeling, filtering, or semi-supervised learning may be needed.

### Relation to Knowledge Distillation

Label smoothing is related to knowledge distillation.

In knowledge distillation, a student model is trained on soft targets produced by a teacher model. These soft targets may assign meaningful probability mass to similar classes. For example, an image of a wolf may receive some probability for dog-like classes.

Label smoothing uses a simpler target distribution. It assigns the same small probability to all incorrect classes. Thus it does not encode class similarity unless the smoothing distribution is modified.

A more general form is

$$
q^{(\epsilon)} = (1-\epsilon)q_{\text{hard}} + \epsilon r,
$$

where $r$ is a chosen smoothing distribution. Standard label smoothing uses a uniform $r$. Distillation uses a teacher-derived distribution.
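This general form is easy to implement as a small helper. A minimal sketch, where `r` is any probability vector over the classes: a uniform `r` recovers standard label smoothing, and a teacher's softmax output yields a distillation-style target.

```python
import torch
import torch.nn.functional as F

def smoothed_cross_entropy(logits, targets, epsilon, r):
    """Cross-entropy against q = (1 - eps) * one_hot + eps * r,
    where r is a probability vector of shape (num_classes,)."""
    log_probs = F.log_softmax(logits, dim=-1)
    one_hot = F.one_hot(targets, logits.size(-1)).float()
    q = (1 - epsilon) * one_hot + epsilon * r
    return -(q * log_probs).sum(dim=-1).mean()

# Uniform r recovers standard (uniform-variant) label smoothing.
K = 5
logits = torch.randn(8, K)
targets = torch.randint(0, K, (8,))
r_uniform = torch.full((K,), 1.0 / K)
loss = smoothed_cross_entropy(logits, targets, 0.1, r_uniform)
```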

### Label Smoothing in Multi-Class Versus Multi-Label Tasks

Label smoothing is most common in single-label multi-class classification, where each example belongs to exactly one class.

For multi-label classification, each example may belong to several classes. The usual loss is binary cross-entropy applied independently to each class. Smoothing can still be used, but the formulation differs.

For a binary target $y_k\in\{0,1\}$, one may use

$$
\tilde{y}_k = y_k(1-\epsilon) + (1-y_k)\epsilon.
$$

This changes positive labels from $1$ to $1-\epsilon$ and negative labels from $0$ to $\epsilon$.

In PyTorch, this is usually implemented manually and passed to `BCEWithLogitsLoss` or its functional equivalent.

```python
import torch
import torch.nn.functional as F

epsilon = 0.05

# Example multi-label batch: 4 examples, 3 classes,
# with independent binary targets per class.
logits = torch.randn(4, 3)
targets = torch.randint(0, 2, (4, 3)).float()

# Positives move from 1 to 1 - epsilon, negatives from 0 to epsilon.
smoothed_targets = targets * (1 - epsilon) + (1 - targets) * epsilon

loss = F.binary_cross_entropy_with_logits(logits, smoothed_targets)
```

For multi-label tasks, smoothing should be used carefully because negative labels may mean “unknown” rather than truly absent.

### Label Smoothing and Class Imbalance

With class imbalance, uniform smoothing places the same target mass on rare or irrelevant classes as on common ones. This can distort the training signal, since the model is asked to reserve probability for classes that almost never occur.

Possible adjustments include:

| Strategy | Description |
|---|---|
| Smaller $\epsilon$ | Reduce smoothing strength |
| Class-prior smoothing | Smooth toward empirical class frequencies (see the sketch below) |
| Hierarchical smoothing | Smooth toward related classes |
| No smoothing | Prefer weighted losses or resampling |

For example, in a fine-grained taxonomy, smoothing from “Siberian husky” toward all classes uniformly may be less meaningful than smoothing toward related dog breeds.
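As an example of class-prior smoothing from the table above, the smoothing distribution $r$ can be set to the empirical class frequencies and plugged into the general form from the previous section. A minimal sketch, assuming `train_labels` holds the integer labels of a hypothetical training set:

```python
import torch
import torch.nn.functional as F

K = 5
train_labels = torch.randint(0, K, (1000,))  # hypothetical training labels

# Empirical class frequencies as the smoothing distribution r.
r = torch.bincount(train_labels, minlength=K).float()
r = r / r.sum()

# Smoothed targets: frequent classes absorb more of the epsilon mass.
epsilon = 0.1
targets = torch.tensor([0, 2, 4])
q = (1 - epsilon) * F.one_hot(targets, K).float() + epsilon * r
```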

### Choosing the Smoothing Strength

Common values are:

| $\epsilon$ | Typical effect |
|---:|---|
| 0.0 | No smoothing |
| 0.05 | Mild smoothing |
| 0.1 | Common default |
| 0.2 | Strong smoothing |
| $>0.2$ | Often excessive |

The best value depends on dataset size, label quality, number of classes, and model capacity.

If label smoothing is too strong, the model may become underconfident and accuracy may fall. If it is too weak, it may have little effect.

### When Label Smoothing Helps

Label smoothing often helps when:

| Condition | Reason |
|---|---|
| Labels contain mild noise | Reduces pressure to fit noisy labels |
| Model is overconfident | Softens predicted distributions |
| Dataset is small or moderate | Reduces overfitting |
| Number of classes is large | Prevents extreme class separation |
| Calibration matters | Can reduce confidence errors |

It is widely used in image classification, machine translation, and transformer training.

### When Label Smoothing May Hurt

Label smoothing may hurt when exact confidence is important or when the training labels are already soft and informative.

It can also interfere with methods that rely on sharp teacher signals, hard negative mining, or margin maximization.

For some retrieval and contrastive systems, label smoothing may reduce the strength of useful class separation. For knowledge distillation, teacher probabilities often provide better soft targets than uniform smoothing.

### Practical Guidelines

Use `nn.CrossEntropyLoss(label_smoothing=0.05)` or `0.1` as a starting point for multi-class classification.

Do not apply label smoothing blindly to every task. Check whether labels are single-label, multi-label, soft, noisy, hierarchical, or class-imbalanced.

Monitor both accuracy and calibration. Label smoothing may improve calibration while slightly changing accuracy.

For very noisy labels, combine label smoothing with data cleaning, robust training, or semi-supervised methods.

### Summary

Label smoothing replaces one-hot labels with softened target distributions. It regularizes classification models by discouraging extreme confidence.

In PyTorch, it can be enabled directly through `nn.CrossEntropyLoss(label_smoothing=...)`.

Label smoothing is simple and often effective, especially for overconfident classifiers and mildly noisy labels. Its main risk is excessive underconfidence when the smoothing strength is too large.

