Many neural networks produce raw scores. These scores are called logits. A logit can be any real number. It may be negative, positive, small, or large. For classification, we usually need to convert logits into probabilities. Output activation functions perform this conversion.
The most important output activation for multi-class classification is softmax. It maps a vector of real-valued scores into a vector of positive values that sum to 1.
Logits
A classification model usually ends with a linear layer:

$$z = W h + b$$

Here $h$ is the final hidden representation, $W$ is a weight matrix, and $b$ is a bias vector. The output $z$ is a vector of logits.

If the task has $K$ classes, then $z \in \mathbb{R}^K$.

Each component $z_k$ is the score for class $k$. Larger logits indicate stronger model preference, but logits are not probabilities. They do not need to be positive, and they do not need to sum to 1.

For a batch of $B$ examples, the model usually produces a matrix of logits $Z \in \mathbb{R}^{B \times K}$.
In PyTorch:
import torch
import torch.nn as nn
model = nn.Linear(128, 10)
h = torch.randn(32, 128)
logits = model(h)
print(logits.shape)  # torch.Size([32, 10])

This output represents 32 examples and 10 class scores per example.
The Softmax Function
The softmax function takes a vector of logits $z \in \mathbb{R}^K$ and returns a probability vector $p$:

$$p_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$

Each output is positive: $p_k > 0$.

The outputs sum to 1: $\sum_{k=1}^{K} p_k = 1$.

Thus $p_k$ can be interpreted as the model's predicted probability for class $k$.
In PyTorch:
import torch.nn.functional as F
logits = torch.tensor([2.0, 1.0, 0.1])
probs = F.softmax(logits, dim=0)
print(probs)
print(probs.sum())

The largest logit receives the largest probability.
Softmax for Batches
For a batch of logits $Z \in \mathbb{R}^{B \times K}$, softmax should usually be applied over the class axis:

$$p_{i,k} = \frac{e^{z_{i,k}}}{\sum_{j=1}^{K} e^{z_{i,j}}}$$
Each row becomes one probability distribution over classes.
In PyTorch:
logits = torch.randn(32, 10)
probs = F.softmax(logits, dim=1)
print(probs.shape) # torch.Size([32, 10])
print(probs.sum(dim=1))  # approximately all ones

The argument dim=1 means softmax is applied across the class dimension. Choosing the wrong dimension is a common error.
Softmax as Competition
Softmax creates competition among classes. Increasing one logit increases its probability and decreases the probabilities of the others.
This differs from sigmoid. Sigmoid treats each output independently:

$$\sigma(z_k) = \frac{1}{1 + e^{-z_k}}$$

Softmax couples all outputs through the denominator:

$$p_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$
This coupling makes softmax suitable when exactly one class should be selected, such as digit classification, object category classification, or single-label text classification.
For multi-label classification, where several labels can be true at once, independent sigmoid outputs are usually more appropriate.
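The coupling is easy to see numerically. The following small check (a sketch using the same `torch` setup as above) raises one logit and watches every other probability fall:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([1.0, 2.0, 3.0])
probs = F.softmax(logits, dim=0)

# Raise only the first logit; the others are untouched.
bumped = logits.clone()
bumped[0] += 2.0
bumped_probs = F.softmax(bumped, dim=0)

print(probs)         # probability of class 0 before the bump
print(bumped_probs)  # class 0 rises, classes 1 and 2 both fall
```

With independent sigmoids, by contrast, changing one logit would leave the other outputs unchanged.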
Numerical Stability
The naive softmax formula can overflow because exponentials grow quickly. For example, $e^{1000}$ cannot be represented in ordinary floating-point formats.

Softmax is usually computed with the logit maximum subtracted:

$$p_k = \frac{e^{z_k - \max_j z_j}}{\sum_{i=1}^{K} e^{z_i - \max_j z_j}}$$
This transformation does not change the result because subtracting the same constant from every logit leaves softmax unchanged.
In practice, PyTorch implements stable softmax kernels. Still, when computing losses, one should usually avoid manually applying softmax before cross-entropy.
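The overflow and its fix can be demonstrated directly. This sketch computes softmax naively on large logits, then with the maximum subtracted, and checks that the stable version matches `F.softmax`:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([1000.0, 999.0, 998.0])

# Naive softmax: exp(1000) overflows to inf, and inf/inf is nan.
naive = torch.exp(logits) / torch.exp(logits).sum()
print(naive)  # tensor([nan, nan, nan])

# Subtracting the maximum keeps every exponent in a safe range.
shifted = logits - logits.max()
stable = torch.exp(shifted) / torch.exp(shifted).sum()
print(stable)

# PyTorch's built-in softmax uses a stable kernel and agrees.
print(torch.allclose(stable, F.softmax(logits, dim=0)))
```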
Softmax Cross-Entropy
For multi-class classification, the standard loss is cross-entropy.
If the true class is $y$, and the predicted probability for class $y$ is $p_y$, then the loss is

$$L = -\log p_y$$

Using softmax,

$$L = -\log \frac{e^{z_y}}{\sum_{j=1}^{K} e^{z_j}}$$

This can be rewritten as

$$L = -z_y + \log \sum_{j=1}^{K} e^{z_j}$$

This form is called log-softmax plus negative log likelihood. It is numerically more stable than computing softmax probabilities first.
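The rewritten form can be checked numerically. This sketch computes the loss as a negative logit plus a log-sum-exp, and compares it against PyTorch's cross-entropy on the raw logits:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1])
y = 0  # true class index

# Loss via the rewritten form: -z_y + log-sum-exp of all logits.
manual = -logits[y] + torch.logsumexp(logits, dim=0)

# Loss via PyTorch's cross-entropy, which expects a batch dimension.
reference = F.cross_entropy(logits.unsqueeze(0), torch.tensor([y]))

print(manual.item(), reference.item())  # the two values agree
```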
In PyTorch, use nn.CrossEntropyLoss with raw logits:
loss_fn = nn.CrossEntropyLoss()
logits = torch.randn(32, 10)
targets = torch.randint(0, 10, (32,))
loss = loss_fn(logits, targets)
print(loss)

Do not write:

probs = F.softmax(logits, dim=1)
loss = loss_fn(probs, targets)  # wrong usage

nn.CrossEntropyLoss expects logits. It internally applies a stable log-softmax operation.
Target Format
For nn.CrossEntropyLoss, targets are usually integer class indices.
If logits have shape [B, K], then targets should have shape [B].

Each target value is an integer between $0$ and $K - 1$.
Example:
logits = torch.randn(4, 3)
targets = torch.tensor([0, 2, 1, 2])
loss = nn.CrossEntropyLoss()(logits, targets)

Here the batch contains 4 examples and 3 classes.
For soft targets, such as label smoothing or distillation, targets may be probability distributions. In that case, the target shape is often also [B, K], depending on the loss and PyTorch version.
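A minimal sketch of the soft-target case, assuming a recent PyTorch version (1.10 or later, where `F.cross_entropy` accepts class probabilities as targets and exposes a `label_smoothing` argument):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 3)

# Soft targets: each row is a probability distribution over 3 classes.
soft_targets = F.softmax(torch.randn(4, 3), dim=1)

# Recent PyTorch versions accept probability targets directly.
loss = F.cross_entropy(logits, soft_targets)
print(loss)

# Label smoothing can instead be requested with hard integer targets.
hard_targets = torch.tensor([0, 2, 1, 2])
smoothed = F.cross_entropy(logits, hard_targets, label_smoothing=0.1)
print(smoothed)
```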
Binary Classification: Sigmoid Output
For binary classification, the model often produces a single logit $z \in \mathbb{R}$.

The probability of class 1 is

$$p = \sigma(z) = \frac{1}{1 + e^{-z}}$$

The probability of class 0 is $1 - p$.
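A quick check of the two probabilities for a single example (the logit value 0.8 is arbitrary, chosen for illustration):

```python
import torch

z = torch.tensor(0.8)  # one logit for a single example

p1 = torch.sigmoid(z)  # probability of class 1
p0 = 1.0 - p1          # probability of class 0
print(p1, p0)

# The two probabilities always sum to 1.
print((p1 + p0).item())
```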
In PyTorch, the stable loss is nn.BCEWithLogitsLoss:
loss_fn = nn.BCEWithLogitsLoss()
logits = torch.randn(32)
targets = torch.randint(0, 2, (32,)).float()
loss = loss_fn(logits, targets)

Again, do not manually apply sigmoid before this loss:

probs = torch.sigmoid(logits)
loss = loss_fn(probs, targets)  # wrong usage

BCEWithLogitsLoss expects raw logits.
Multi-Label Classification
In multi-label classification, each example may belong to multiple classes. For example, an image may contain both “person” and “bicycle.”
Here the model produces a vector of $K$ logits, $z \in \mathbb{R}^K$.

Each class receives an independent sigmoid:

$$p_k = \sigma(z_k) = \frac{1}{1 + e^{-z_k}}$$
The outputs do not need to sum to 1.
In PyTorch:
logits = torch.randn(32, 20)
targets = torch.randint(0, 2, (32, 20)).float()
loss = nn.BCEWithLogitsLoss()(logits, targets)

Use softmax for mutually exclusive classes. Use sigmoid for independent labels.
Log-Softmax
The log-softmax function computes the logarithm of softmax probabilities:

$$\log p_k = z_k - \log \sum_{j=1}^{K} e^{z_j}$$
This is useful because many losses require log probabilities.
In PyTorch:
log_probs = F.log_softmax(logits, dim=1)

For classification, nn.CrossEntropyLoss combines F.log_softmax and negative log likelihood loss.
The explicit version is:
log_probs = F.log_softmax(logits, dim=1)
loss = F.nll_loss(log_probs, targets)

This is equivalent to:

loss = F.cross_entropy(logits, targets)

The second version is simpler and preferred in ordinary classification code.
Temperature Scaling
Softmax can be adjusted with a temperature parameter $T > 0$:

$$p_k = \frac{e^{z_k / T}}{\sum_{j=1}^{K} e^{z_j / T}}$$

A small temperature makes the distribution sharper. A large temperature makes the distribution flatter.

If $T < 1$, the largest logit becomes more dominant. If $T > 1$, probabilities spread more evenly across classes.
Temperature is used in calibration, knowledge distillation, sampling from language models, and contrastive learning.
Example:
def softmax_with_temperature(logits, temperature: float):
    return F.softmax(logits / temperature, dim=-1)
logits = torch.tensor([3.0, 1.0, 0.5])
print(softmax_with_temperature(logits, 0.5))
print(softmax_with_temperature(logits, 1.0))
print(softmax_with_temperature(logits, 2.0))

Softmax in Attention
Softmax is also used inside attention mechanisms.
Given a matrix of attention scores $S$, where $S_{ij}$ scores how much query position $i$ attends to key position $j$, softmax converts scores into attention weights:

$$A_{ij} = \frac{e^{S_{ij}}}{\sum_{j'} e^{S_{ij'}}}$$

Each row of $A$ sums to 1. This means each token distributes its attention across other tokens.
In PyTorch:
scores = torch.randn(8, 128, 128) # batch, query positions, key positions
weights = F.softmax(scores, dim=-1)
print(weights.sum(dim=-1).shape)  # torch.Size([8, 128])

Softmax appears in both output layers and internal model operations.
Masked Softmax
Attention often needs masks. For example, a decoder should not attend to future tokens. Padding tokens should also be ignored.
Masked softmax sets invalid positions to a very negative value before applying softmax:
scores = torch.randn(2, 4, 4)
mask = torch.tensor([
[True, True, False, False],
[True, True, True, False],
])
# Expand mask to match scores shape.
mask = mask[:, None, :]
scores = scores.masked_fill(~mask, float("-inf"))
weights = F.softmax(scores, dim=-1)

Positions with score $-\infty$ receive probability 0 after softmax.

This pattern is central to transformer implementations.
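The causal case mentioned above, where a decoder must not attend to future tokens, uses the same masked-fill pattern with a lower-triangular mask. A sketch:

```python
import torch
import torch.nn.functional as F

n = 4
scores = torch.randn(n, n)  # query positions x key positions

# Lower-triangular mask: position i may attend only to positions <= i.
causal = torch.tril(torch.ones(n, n, dtype=torch.bool))

scores = scores.masked_fill(~causal, float("-inf"))
weights = F.softmax(scores, dim=-1)
print(weights)  # strictly upper-triangular entries are exactly 0
```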
Argmax and Prediction
To get the predicted class from softmax probabilities, choose the class with maximum probability:

$$\hat{y} = \arg\max_k p_k$$

Because softmax preserves ordering, this is equivalent to choosing the maximum logit:

$$\hat{y} = \arg\max_k z_k$$
Therefore, during inference, one often avoids computing softmax when only the predicted class is needed.
In PyTorch:
logits = torch.randn(32, 10)
preds = logits.argmax(dim=1)

This is enough for classification accuracy.
Compute softmax only when probabilities are needed.
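That softmax preserves ordering can be verified directly. This check compares argmax on raw logits against argmax on softmax probabilities:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(32, 10)

preds_from_logits = logits.argmax(dim=1)
preds_from_probs = F.softmax(logits, dim=1).argmax(dim=1)

# exp is strictly increasing and the denominator is a positive
# per-row constant, so both routes pick the same class.
print(torch.equal(preds_from_logits, preds_from_probs))  # True
```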
Calibration
Softmax probabilities are often interpreted as confidence values. For example, if a model assigns probability 0.9 to a class, we may hope it is correct about 90 percent of the time among similar predictions.
This property is called calibration.
Deep networks are often miscalibrated. A model may be overconfident or underconfident. Temperature scaling is a common post-training calibration method.
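A minimal sketch of post-hoc temperature scaling, assuming held-out validation logits and labels are available; `val_logits` and `val_targets` here are random stand-ins for illustration, and optimizing $\log T$ (rather than $T$) is one common way to keep the temperature positive:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical held-out validation set (random stand-ins here).
val_logits = torch.randn(100, 10)
val_targets = torch.randint(0, 10, (100,))

# Learn a single temperature T by minimizing NLL on the validation set.
log_T = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.LBFGS([log_T], lr=0.1, max_iter=50)

def closure():
    optimizer.zero_grad()
    loss = F.cross_entropy(val_logits / log_T.exp(), val_targets)
    loss.backward()
    return loss

optimizer.step(closure)
T = log_T.exp().item()
print(T)
```

At inference time, calibrated probabilities are then `F.softmax(logits / T, dim=-1)`; the temperature rescales confidence without changing the predicted class.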
A calibrated model is especially important in medicine, finance, autonomous systems, and other high-risk settings.
Common Mistakes
The most common mistakes are:
| Mistake | Correct approach |
|---|---|
| Applying softmax before CrossEntropyLoss | Pass raw logits |
| Applying sigmoid before BCEWithLogitsLoss | Pass raw logits |
| Using softmax for multi-label classification | Use independent sigmoid outputs |
| Applying softmax over the batch axis | Apply over the class axis |
| Treating logits as probabilities | Convert only when needed |
| Computing softmax before argmax | Use logits directly |
These mistakes often do not cause immediate runtime errors, but they can damage training.
Practical Guidance
Use no activation on the final layer when training with nn.CrossEntropyLoss. Pass raw logits.
Use no sigmoid on the final layer when training with nn.BCEWithLogitsLoss. Pass raw logits.
Use softmax only for inspection, probability output, sampling, calibration, or attention weights.
Use argmax on logits when only the predicted class is needed.
For mutually exclusive classes, use softmax. For independent labels, use sigmoid.
Exercises
1. Given logits $(2.0, 1.0, 0.1)$, compute the softmax probabilities.
2. Explain why subtracting the maximum logit does not change the softmax output.
3. Implement multi-class classification with nn.CrossEntropyLoss and verify that no softmax layer is needed.
4. Implement multi-label classification with nn.BCEWithLogitsLoss.
5. Compare softmax outputs with temperatures $T = 0.5$, $T = 1.0$, and $T = 2.0$.