Cross-Entropy Loss

Cross-entropy loss is the standard loss function for classification. It measures how well a model’s predicted class distribution matches the true class label.

In regression, the target is usually a real number. In classification, the target is a class. For example, an image classifier may choose one label from

\{\text{cat}, \text{dog}, \text{car}, \text{tree}\}.

A neural network does not usually output the class directly. It outputs a vector of scores called logits. The logits are then converted into probabilities.

If there are K classes, the model produces

z \in \mathbb{R}^{K},

where z_j is the logit for class j. The softmax function converts logits into probabilities:

p_j = \frac{\exp(z_j)}{\sum_{k=1}^{K} \exp(z_k)}.

The probabilities satisfy

p_j \geq 0, \qquad \sum_{j=1}^{K} p_j = 1.
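
For instance, these two properties can be checked directly in PyTorch; the logit values below are arbitrary:

import torch

z = torch.tensor([2.0, 0.5, -1.0])   # example logits for K = 3 classes
p = torch.softmax(z, dim=-1)         # convert logits to probabilities

print(p)            # each entry is non-negative
print(p.sum())      # the probabilities sum to 1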

Cross-entropy penalizes the model when it assigns low probability to the correct class.

Classification as Probability Prediction

In a K-class classification problem, the model represents a conditional distribution

p_\theta(y \mid x).

Given an input x, the model estimates the probability of each class. If the true class is c, the ideal model assigns high probability to class c:

p_\theta(y = c \mid x) \approx 1.

For one example, the cross-entropy loss is

L = -\log p_\theta(y = c \mid x).

If the model assigns probability 0.9 to the correct class, the loss is small:

-\log(0.9) \approx 0.105.

If the model assigns probability 0.01 to the correct class, the loss is large:

-\log(0.01) \approx 4.605.

The loss therefore rewards confidence when the model is correct and strongly penalizes confidence in the wrong classes.
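
Both values are easy to reproduce:

import torch

print(-torch.log(torch.tensor(0.9)))    # ≈ 0.105
print(-torch.log(torch.tensor(0.01)))   # ≈ 4.605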

One-Hot Targets

A class label can be represented as a one-hot vector. If there are K classes and the correct class is c, the target vector y has

y_c = 1, \qquad y_j = 0 \quad \text{for } j \neq c.

For example, with four classes and correct class 2, using zero-based indexing:

y = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}.

If the model predicts probabilities

p = \begin{bmatrix} 0.05 \\ 0.10 \\ 0.80 \\ 0.05 \end{bmatrix},

then the cross-entropy is

L = -\sum_{j=1}^{K} y_j \log p_j.

Because only the correct class has y_j = 1, this reduces to

L = -\log p_c.

This is why cross-entropy with one-hot labels is often described as the negative log probability of the true class.
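
A small sketch with the vectors above shows the reduction numerically; the probabilities are the illustrative ones from this section:

import torch

y = torch.tensor([0.0, 0.0, 1.0, 0.0])       # one-hot target, correct class 2
p = torch.tensor([0.05, 0.10, 0.80, 0.05])   # predicted probabilities

full_sum = -(y * torch.log(p)).sum()         # -sum_j y_j log p_j
reduced = -torch.log(p[2])                   # -log p_c

print(full_sum, reduced)                     # both ≈ 0.223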

Cross-Entropy for Batches

For a batch of B examples, the logits have shape

Z \in \mathbb{R}^{B \times K}.

The target labels have shape

y \in \{0, 1, \ldots, K-1\}^{B}.

Each row of ZZ contains the logits for one example. Each entry of yy contains the correct class index for one example.

The batch cross-entropy loss is

L = -\frac{1}{B} \sum_{i=1}^{B} \log p_{i, y_i}.

Here p_{i, y_i} is the predicted probability assigned to the correct class for example i.

In PyTorch:

import torch
import torch.nn as nn

# Raw logits for B = 3 examples and K = 3 classes, shape [B, K].
logits = torch.tensor([
    [2.0, 0.5, -1.0],
    [0.1, 1.5, 0.3],
    [-0.5, 0.2, 2.0],
])

# Correct class index for each example, shape [B].
targets = torch.tensor([0, 1, 2])

loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits, targets)

print(loss)

nn.CrossEntropyLoss expects raw logits, not softmax probabilities. It applies log_softmax internally and then computes negative log-likelihood.

Why PyTorch Uses Logits

A common mistake is to apply softmax before passing outputs into nn.CrossEntropyLoss:

probabilities = torch.softmax(logits, dim=-1)
loss = nn.CrossEntropyLoss()(probabilities, targets)  # Wrong

This applies softmax twice: nn.CrossEntropyLoss runs log_softmax on inputs that are already probabilities, so the resulting value is no longer the intended cross-entropy and the gradients are weakened.

The correct form is:

loss = nn.CrossEntropyLoss()(logits, targets)

PyTorch combines softmax and logarithm in a numerically stable operation. Computing softmax first can produce very small probabilities. Taking the logarithm of those values can cause numerical instability.

The stable computation uses

\log \mathrm{softmax}(z)_j = z_j - \log \sum_{k=1}^{K} \exp(z_k).

In practice, implementations subtract the maximum logit before exponentiation to avoid overflow.
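
A minimal sketch of this stabilization, assuming a single logit vector z with deliberately large values:

import torch

z = torch.tensor([1000.0, 999.0, 998.0])   # large logits: exp(z) overflows in float32

# Naive computation overflows and produces nan.
naive = torch.log(torch.exp(z) / torch.exp(z).sum())

# Subtracting the maximum logit first keeps exp() in a safe range.
shifted = z - z.max()
stable = shifted - torch.log(torch.exp(shifted).sum())

print(naive)                               # nan values
print(stable)                              # finite log-probabilities
print(torch.log_softmax(z, dim=-1))        # matches the stable computation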

Gradient of Cross-Entropy with Softmax

Cross-entropy has a particularly simple gradient when combined with softmax.

Let

p = \mathrm{softmax}(z),

and let y be a one-hot target vector. The loss is

L = -\sum_{j=1}^{K} y_j \log p_j.

The gradient with respect to the logit z_j is

\frac{\partial L}{\partial z_j} = p_j - y_j.

For the correct class, this gradient is p_j - 1. For every incorrect class, it is p_j.

This has a useful interpretation. A gradient descent step increases the logit of the correct class and decreases the logits of incorrect classes in proportion to their predicted probabilities.

For a batch of size B, the averaged gradient is

\frac{1}{B}(p_j - y_j)

for each example and class.
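
The identity can be verified with autograd; the logits below are arbitrary and the batch contains a single example:

import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]], requires_grad=True)
target = torch.tensor([0])

loss = F.cross_entropy(logits, target)
loss.backward()

p = torch.softmax(logits.detach(), dim=-1)
y = F.one_hot(target, num_classes=3).float()

print(logits.grad)   # gradient from autograd
print(p - y)         # analytic gradient p - y, identical up to floating point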

Binary Cross-Entropy

For binary classification, there are two common formulations.

The first uses two logits and nn.CrossEntropyLoss:

logits = model(x)          # shape [B, 2]
targets = targets.long()   # shape [B]

loss = nn.CrossEntropyLoss()(logits, targets)

The second uses one logit and binary cross-entropy:

logits = model(x)             # shape [B]
targets = targets.float()     # shape [B]

loss = nn.BCEWithLogitsLoss()(logits, targets)

nn.BCEWithLogitsLoss combines a sigmoid function and binary cross-entropy in one numerically stable operation.

For one logit z, the sigmoid probability is

p = \sigma(z) = \frac{1}{1 + \exp(-z)}.

The binary cross-entropy loss is

L = -\left[ y \log p + (1-y) \log(1-p) \right].
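
A quick check that nn.BCEWithLogitsLoss matches this formula, using arbitrary logits and targets:

import torch
import torch.nn as nn

z = torch.tensor([0.7, -1.2, 2.3])   # one logit per example
y = torch.tensor([1.0, 0.0, 1.0])    # binary targets

p = torch.sigmoid(z)
manual = -(y * torch.log(p) + (1 - y) * torch.log(1 - p)).mean()
builtin = nn.BCEWithLogitsLoss()(z, y)

print(manual, builtin)               # the two values agree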

Use BCEWithLogitsLoss for binary classification with one output logit. Use CrossEntropyLoss for multiclass classification with mutually exclusive classes.

Multiclass Versus Multilabel Classification

Multiclass classification means each example belongs to exactly one class. For example, an image may be classified as one of ten digits. The model output shape is

[B, K],

and the target shape is

[B].

Use:

loss = nn.CrossEntropyLoss()(logits, targets)

Multilabel classification means each example may belong to multiple classes at the same time. For example, an image may contain both “car” and “person.” The model output shape is

[B, K],

and the target shape is also

[B, K],

where each target entry is 0 or 1.

Use:

loss = nn.BCEWithLogitsLoss()(logits, targets.float())

The difference is important. Softmax forces probabilities across classes to compete. Sigmoid treats each class independently.

Task type           Output shape      Target shape      Loss
Binary, one logit   [B] or [B, 1]     [B] or [B, 1]     BCEWithLogitsLoss
Multiclass          [B, K]            [B]               CrossEntropyLoss
Multilabel          [B, K]            [B, K]            BCEWithLogitsLoss

Class Imbalance

Classification datasets often have imbalanced classes. For example, in medical screening, normal cases may be much more common than positive cases. A model trained with ordinary cross-entropy may learn to favor the majority class.

PyTorch allows class weighting:

class_weights = torch.tensor([1.0, 3.0, 10.0])

loss_fn = nn.CrossEntropyLoss(weight=class_weights)
loss = loss_fn(logits, targets)

The weight for each class increases or decreases the loss contribution for examples of that class.
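
One common heuristic, though not the only option, is to weight each class by the inverse of its frequency in the training labels. A sketch, assuming labels is a 1-D tensor of class indices from a hypothetical dataset:

import torch
import torch.nn as nn

labels = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1, 2])      # hypothetical imbalanced labels

counts = torch.bincount(labels, minlength=3).float()    # examples per class
class_weights = counts.sum() / (counts * len(counts))   # inverse-frequency weights

loss_fn = nn.CrossEntropyLoss(weight=class_weights)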

For binary or multilabel classification, BCEWithLogitsLoss provides pos_weight:

pos_weight = torch.tensor([5.0])
loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
loss = loss_fn(logits, targets.float())

Class weights should be used carefully. They change the optimization objective. They may improve recall for rare classes, but they can also reduce calibration or increase false positives.

Label Smoothing

Standard cross-entropy treats the target label as fully certain. For a correct class c, the one-hot target assigns probability 1 to class c and 0 to every other class.

Label smoothing softens the target distribution. Instead of

y_c = 1,

it uses

y_c = 1 - \epsilon,

and distributes the remaining probability across other classes.

For K classes, a common form is

y_j^{\text{smooth}} = \begin{cases} 1-\epsilon, & j = c, \\ \epsilon/(K-1), & j \neq c. \end{cases}

This discourages the model from becoming overly confident.

In PyTorch:

loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)
loss = loss_fn(logits, targets)

Label smoothing can improve generalization and calibration, especially in large classification models. It may also reduce the maximum confidence of the model, so it should be evaluated against the needs of the application.
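
For reference, the smoothed target from the formula above can also be built by hand and combined with a log-softmax; this sketch uses \epsilon = 0.1 and arbitrary logits, while the built-in label_smoothing argument shown earlier remains the practical choice:

import torch
import torch.nn.functional as F

K = 4
eps = 0.1
c = 2                                        # correct class index

logits = torch.tensor([[1.0, 0.2, 2.5, -0.3]])

# Smoothed target: 1 - eps on the correct class, eps / (K - 1) elsewhere.
y_smooth = torch.full((1, K), eps / (K - 1))
y_smooth[0, c] = 1 - eps

loss = -(y_smooth * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
print(loss)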

Cross-Entropy and Maximum Likelihood

Cross-entropy has a probabilistic interpretation. Suppose each target label is sampled from a categorical distribution predicted by the model:

y_i \sim \mathrm{Categorical}(p_\theta(\cdot \mid x_i)).

The likelihood of the observed label y_i is

p_\theta(y_i \mid x_i).

The negative log-likelihood over a dataset is

-\sum_{i=1}^{n} \log p_\theta(y_i \mid x_i).

Dividing by n gives the mean cross-entropy loss:

L = -\frac{1}{n} \sum_{i=1}^{n} \log p_\theta(y_i \mid x_i).

Thus, minimizing cross-entropy is equivalent to maximizing the likelihood of the observed labels under a categorical model.

Cross-Entropy and Information Theory

Cross-entropy can also be viewed as an information-theoretic quantity. If q is the true data distribution and p is the model distribution, the cross-entropy is

H(q, p) = -\sum_{j=1}^{K} q_j \log p_j.

It is related to entropy and KL divergence:

H(q, p) = H(q) + D_{\mathrm{KL}}(q \| p).

Since H(q) does not depend on the model, minimizing cross-entropy also minimizes the KL divergence from the true distribution to the model distribution.
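
A small numerical check of this decomposition, using an arbitrary pair of distributions:

import torch

q = torch.tensor([0.7, 0.2, 0.1])     # "true" distribution
p = torch.tensor([0.5, 0.3, 0.2])     # model distribution

cross_entropy = -(q * torch.log(p)).sum()
entropy = -(q * torch.log(q)).sum()
kl = (q * torch.log(q / p)).sum()

print(cross_entropy, entropy + kl)    # both sides of H(q, p) = H(q) + KL(q || p)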

In ordinary supervised classification with one-hot targets, q places all probability mass on the observed class. Cross-entropy then becomes the negative log probability of that class.

Cross-Entropy for Segmentation

In image segmentation, the model predicts a class for each pixel. The logits often have shape

[B, K, H, W],

where B is the batch size, K is the number of classes, and H, W are the image dimensions.

The targets usually have shape

[B, H, W],

where each pixel stores a class index.

In PyTorch:

logits = torch.randn(4, 21, 256, 256)
targets = torch.randint(0, 21, (4, 256, 256))

loss = nn.CrossEntropyLoss()(logits, targets)

The same loss is applied at every pixel and then averaged across the batch and spatial dimensions.

For binary segmentation, one may instead use BCEWithLogitsLoss with output shape [B, 1, H, W] and target shape [B, 1, H, W].
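
A minimal sketch of that binary case, with made-up shapes:

import torch
import torch.nn as nn

logits = torch.randn(4, 1, 256, 256)                      # one logit per pixel
targets = torch.randint(0, 2, (4, 1, 256, 256)).float()   # 0/1 mask per pixel

loss = nn.BCEWithLogitsLoss()(logits, targets)
print(loss)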

Ignore Index

Some classification targets should not contribute to the loss. This is common in segmentation and sequence modeling.

For example, padded tokens in a text batch should usually be ignored.

PyTorch provides ignore_index:

loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

logits = torch.randn(8, 128, 50257)    # [B, T, V]
targets = torch.randint(0, 50257, (8, 128))
targets[:, :16] = -100                 # e.g., mark padded positions so they are ignored

loss = loss_fn(
    logits.reshape(-1, 50257),
    targets.reshape(-1),
)

Any target equal to -100 is excluded from the loss.

For language modeling, padded positions or masked positions can be set to -100 so they do not affect training.

Cross-Entropy for Language Modeling

In autoregressive language modeling, the model predicts the next token. If the vocabulary size is V, the model produces logits over the vocabulary.

For a batch of token sequences, logits may have shape

[B, T, V],

and targets may have shape

[B, T].

Each position predicts the next token. The loss is cross-entropy over vocabulary classes:

L = -\frac{1}{BT} \sum_{i=1}^{B} \sum_{t=1}^{T} \log p_\theta(x_{i,t+1} \mid x_{i,\leq t}).

In PyTorch, CrossEntropyLoss expects the class dimension immediately after the batch dimension (here [B, V, T]), or a flattened shape such as [B*T, V]. A common implementation is:

B, T, V = logits.shape

loss = nn.CrossEntropyLoss()(
    logits.reshape(B * T, V),
    targets.reshape(B * T),
)

This is the central training objective for GPT-style language models.

Perplexity

For language models, cross-entropy is often reported as perplexity.

If the average cross-entropy loss is measured using natural logarithms, perplexity is

\mathrm{PPL} = \exp(L).

Lower perplexity means the model assigns higher probability to the observed tokens.

For example:

loss = torch.tensor(2.0)
perplexity = torch.exp(loss)

print(perplexity)

Perplexity is easier to interpret than raw cross-entropy in some language modeling settings, but it should be compared only under the same tokenization and evaluation setup.

Common PyTorch Mistakes

The most common error is passing probabilities into nn.CrossEntropyLoss. Pass raw logits.

Another common error is using floating-point one-hot labels when CrossEntropyLoss expects integer class indices. For ordinary multiclass classification, targets should usually have dtype torch.long:

targets = targets.long()

A third error is using CrossEntropyLoss for multilabel classification. Multilabel classification should usually use BCEWithLogitsLoss.

A fourth error is putting the class dimension in the wrong place. For image classification, logits should be [B, K]. For segmentation, logits should be [B, K, H, W]. For sequence models, flattening to [B*T, V] is often the simplest approach.

Practical Guidelines

Use nn.CrossEntropyLoss for multiclass classification with mutually exclusive classes. The model should output raw logits. The targets should be integer class indices.

Use nn.BCEWithLogitsLoss for binary or multilabel classification. The model should output raw logits. The targets should be floating-point values with entries usually equal to 0 or 1.

For imbalanced datasets, consider class weights, positive-class weights, resampling, or specialized losses. For large classification models, consider label smoothing. For sequence models, use ignore_index to remove padding or irrelevant positions from the loss.

Cross-entropy is the default loss for classification because it is both statistically principled and computationally convenient. It corresponds to maximum likelihood training for categorical targets, works directly with probabilistic predictions, and gives a simple gradient when combined with softmax.