Softmax and Output Activations

Many neural networks produce raw scores. These scores are called logits. A logit can be any real number. It may be negative, positive, small, or large. For classification, we usually need to convert logits into probabilities. Output activation functions perform this conversion.

The most important output activation for multi-class classification is softmax. It maps a vector of real-valued scores into a vector of positive values that sum to 1.

Logits

A classification model usually ends with a linear layer:

z = Wh + b.

Here h is the final hidden representation, W is a weight matrix, and b is a bias vector. The output z is a vector of logits.

If the task has K classes, then

z \in \mathbb{R}^K.

Each component z_k is the score for class k. Larger logits indicate stronger model preference, but logits are not probabilities. They do not need to be positive, and they do not need to sum to 1.

For a batch of B examples, the model usually produces

Z \in \mathbb{R}^{B \times K}.

In PyTorch:

import torch
import torch.nn as nn

model = nn.Linear(128, 10)

h = torch.randn(32, 128)
logits = model(h)

print(logits.shape)  # torch.Size([32, 10])

This output represents 32 examples and 10 class scores per example.

The Softmax Function

The softmax function takes a vector z \in \mathbb{R}^K and returns a probability vector p \in \mathbb{R}^K:

p_k = \frac{\exp(z_k)}{\sum_{j=1}^{K}\exp(z_j)}.

Each output is positive:

p_k > 0.

The outputs sum to 1:

\sum_{k=1}^{K} p_k = 1.

Thus p_k can be interpreted as the model’s predicted probability for class k.

In PyTorch:

import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1])
probs = F.softmax(logits, dim=0)

print(probs)
print(probs.sum())

The largest logit receives the largest probability.

Softmax for Batches

For a batch of logits

Z \in \mathbb{R}^{B \times K},

softmax should usually be applied over the class axis:

P_{bk} = \frac{\exp(Z_{bk})}{\sum_{j=1}^{K}\exp(Z_{bj})}.

Each row becomes one probability distribution over classes.

In PyTorch:

logits = torch.randn(32, 10)
probs = F.softmax(logits, dim=1)

print(probs.shape)        # torch.Size([32, 10])
print(probs.sum(dim=1))   # approximately all ones

The argument dim=1 means softmax is applied across the class dimension. Choosing the wrong dimension is a common error.

Softmax as Competition

Softmax creates competition among classes. Increasing one logit increases its probability and decreases the probabilities of the others.
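This competition can be observed directly. The sketch below (plain PyTorch, illustrative values) raises one logit and shows that every other probability drops:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([1.0, 0.5, 0.0])
bumped = logits.clone()
bumped[0] += 1.0  # raise only the first logit

p_before = F.softmax(logits, dim=0)
p_after = F.softmax(bumped, dim=0)

# Class 0 gains probability; classes 1 and 2 both lose probability,
# even though their logits did not change.
print(p_before)
print(p_after)
```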

This differs from sigmoid. Sigmoid treats each output independently:

p_k = \sigma(z_k).

Softmax couples all outputs through the denominator:

\sum_{j=1}^{K}\exp(z_j).

This coupling makes softmax suitable when exactly one class should be selected, such as digit classification, object category classification, or single-label text classification.

For multi-label classification, where several labels can be true at once, independent sigmoid outputs are usually more appropriate.

Numerical Stability

The naive softmax formula can overflow because exponentials grow quickly. For example,

\exp(1000)

cannot be represented in ordinary floating-point formats.

Softmax is usually computed with the logit maximum subtracted:

p_k = \frac{\exp(z_k - m)}{\sum_{j=1}^{K}\exp(z_j - m)}, \quad m = \max_j z_j.

This transformation does not change the result because subtracting the same constant from every logit leaves softmax unchanged.
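A minimal sketch of the stabilized computation, checked against the library implementation:

```python
import torch
import torch.nn.functional as F

z = torch.tensor([1000.0, 999.0, 0.0])

# Naive softmax overflows: exp(1000) is inf in float32,
# so the division produces nan entries.
naive = torch.exp(z) / torch.exp(z).sum()
print(naive)

# Stable version: subtract the maximum logit first.
m = z.max()
stable = torch.exp(z - m) / torch.exp(z - m).sum()
print(stable)

# The stable computation matches F.softmax.
print(torch.allclose(stable, F.softmax(z, dim=0)))
```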

In practice, PyTorch implements stable softmax kernels. Still, when computing losses, one should usually avoid manually applying softmax before cross-entropy.

Softmax Cross-Entropy

For multi-class classification, the standard loss is cross-entropy.

If the true class is y and the predicted probability for class y is p_y, then the loss is

L = -\log p_y.

Using softmax,

L = -\log \frac{\exp(z_y)}{\sum_{j=1}^{K}\exp(z_j)}.

This can be rewritten as

L = -z_y + \log \sum_{j=1}^{K}\exp(z_j).

This form is called log-softmax plus negative log likelihood. It is numerically more stable than computing softmax probabilities first.
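The rewritten form can be verified against F.cross_entropy. A small sketch with illustrative values, using the numerically stable torch.logsumexp:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 1.0, 0.1]])
target = torch.tensor([0])  # true class y = 0

# L = -z_y + log sum_j exp(z_j)
manual = -logits[0, target[0]] + torch.logsumexp(logits[0], dim=0)

# Library computation from raw logits.
library = F.cross_entropy(logits, target)

print(manual, library)  # the two losses agree
```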

In PyTorch, use nn.CrossEntropyLoss with raw logits:

loss_fn = nn.CrossEntropyLoss()

logits = torch.randn(32, 10)
targets = torch.randint(0, 10, (32,))

loss = loss_fn(logits, targets)
print(loss)

Do not write:

probs = F.softmax(logits, dim=1)
loss = loss_fn(probs, targets)  # wrong usage

nn.CrossEntropyLoss expects logits. It internally applies a stable log-softmax operation.

Target Format

For nn.CrossEntropyLoss, targets are usually integer class indices.

If

\text{logits} \in \mathbb{R}^{B \times K},

then targets should have shape

[B].

Each target value is an integer between 0 and K-1.

Example:

logits = torch.randn(4, 3)

targets = torch.tensor([0, 2, 1, 2])

loss = nn.CrossEntropyLoss()(logits, targets)

Here the batch contains 4 examples and 3 classes.

For soft targets, such as label smoothing or distillation, targets may be probability distributions. In that case, the target shape is often also [B, K], depending on the loss and PyTorch version.
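As a sketch of the soft-target case, recent PyTorch versions (1.10 and later) accept probability distributions as targets in F.cross_entropy directly; the random targets here are purely illustrative:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 3)

# Soft targets: each row is a probability distribution over the 3 classes,
# as produced by label smoothing or a teacher model in distillation.
soft_targets = F.softmax(torch.randn(4, 3), dim=1)

# Same shape for logits and targets: [B, K].
loss = F.cross_entropy(logits, soft_targets)
print(loss)
```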

Binary Classification: Sigmoid Output

For binary classification, the model often produces one logit:

z \in \mathbb{R}.

The probability of class 1 is

p = \sigma(z).

The probability of class 0 is

1 - p.

In PyTorch, the stable loss is nn.BCEWithLogitsLoss:

loss_fn = nn.BCEWithLogitsLoss()

logits = torch.randn(32)
targets = torch.randint(0, 2, (32,)).float()

loss = loss_fn(logits, targets)

Again, do not manually apply sigmoid before this loss:

probs = torch.sigmoid(logits)
loss = loss_fn(probs, targets)  # wrong usage

BCEWithLogitsLoss expects raw logits.

Multi-Label Classification

In multi-label classification, each example may belong to multiple classes. For example, an image may contain both “person” and “bicycle.”

Here the model produces K logits:

z \in \mathbb{R}^K.

Each class receives an independent sigmoid:

p_k = \sigma(z_k).

The outputs do not need to sum to 1.

In PyTorch:

logits = torch.randn(32, 20)
targets = torch.randint(0, 2, (32, 20)).float()

loss = nn.BCEWithLogitsLoss()(logits, targets)

Use softmax for mutually exclusive classes. Use sigmoid for independent labels.

Log-Softmax

The log-softmax function computes the logarithm of softmax probabilities:

\log p_k = z_k - \log\sum_{j=1}^{K}\exp(z_j).

This is useful because many losses require log probabilities.

In PyTorch:

log_probs = F.log_softmax(logits, dim=1)

For classification, nn.CrossEntropyLoss combines F.log_softmax and negative log likelihood loss.

The explicit version is:

log_probs = F.log_softmax(logits, dim=1)
loss = F.nll_loss(log_probs, targets)

This is equivalent to:

loss = F.cross_entropy(logits, targets)

The second version is simpler and preferred in ordinary classification code.

Temperature Scaling

Softmax can be adjusted with a temperature parameter T > 0:

p_k = \frac{\exp(z_k/T)}{\sum_{j=1}^{K}\exp(z_j/T)}.

A small temperature makes the distribution sharper. A large temperature makes the distribution flatter.

If T < 1, the largest logit becomes more dominant. If T > 1, probabilities spread more evenly across classes.

Temperature is used in calibration, knowledge distillation, sampling from language models, and contrastive learning.

Example:

def softmax_with_temperature(logits, temperature: float):
    return F.softmax(logits / temperature, dim=-1)

logits = torch.tensor([3.0, 1.0, 0.5])

print(softmax_with_temperature(logits, 0.5))
print(softmax_with_temperature(logits, 1.0))
print(softmax_with_temperature(logits, 2.0))

Softmax in Attention

Softmax is also used inside attention mechanisms.

Given attention scores

S \in \mathbb{R}^{T \times T},

softmax converts scores into attention weights:

A_{ij} = \frac{\exp(S_{ij})}{\sum_{k=1}^{T}\exp(S_{ik})}.

Each row of A sums to 1. This means each token distributes its attention across other tokens.

In PyTorch:

scores = torch.randn(8, 128, 128)  # batch, query positions, key positions
weights = F.softmax(scores, dim=-1)

print(weights.sum(dim=-1).shape)  # torch.Size([8, 128])

Softmax appears in both output layers and internal model operations.

Masked Softmax

Attention often needs masks. For example, a decoder should not attend to future tokens. Padding tokens should also be ignored.

Masked softmax sets invalid positions to a very negative value before applying softmax:

scores = torch.randn(2, 4, 4)

mask = torch.tensor([
    [True, True, False, False],
    [True, True, True, False],
])

# Expand mask to match scores shape.
mask = mask[:, None, :]

scores = scores.masked_fill(~mask, float("-inf"))
weights = F.softmax(scores, dim=-1)

Positions with -\infty receive probability 0 after softmax.

This pattern is central to transformer implementation.
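The future-token (causal) mask mentioned above can be built with the same masked-softmax pattern. A sketch with an illustrative sequence length:

```python
import torch
import torch.nn.functional as F

T = 4
scores = torch.randn(T, T)

# Lower-triangular mask: position i may attend only to positions j <= i.
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))

scores = scores.masked_fill(~causal, float("-inf"))
weights = F.softmax(scores, dim=-1)

# Upper triangle is exactly zero; every row still sums to 1.
print(weights)
```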

Argmax and Prediction

To get the predicted class from softmax probabilities, choose the class with maximum probability:

\hat{y} = \arg\max_k p_k.

Because softmax preserves ordering, this is equivalent to choosing the maximum logit:

\hat{y} = \arg\max_k z_k.

Therefore, during inference, one often avoids computing softmax when only the predicted class is needed.

In PyTorch:

logits = torch.randn(32, 10)

preds = logits.argmax(dim=1)

This is enough for classification accuracy.

Compute softmax only when probabilities are needed.
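Because softmax is strictly increasing in each logit, the two argmax computations agree. A quick sketch confirming this:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(32, 10)

preds_from_logits = logits.argmax(dim=1)
preds_from_probs = F.softmax(logits, dim=1).argmax(dim=1)

# Softmax preserves ordering, so the predictions are identical.
print(torch.equal(preds_from_logits, preds_from_probs))
```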

Calibration

Softmax probabilities are often interpreted as confidence values. For example, if a model assigns probability 0.9 to a class, we may hope it is correct about 90 percent of the time among similar predictions.

This property is called calibration.

Deep networks are often miscalibrated. A model may be overconfident or underconfident. Temperature scaling is a common post-training calibration method.

A calibrated model is especially important in medicine, finance, autonomous systems, and other high-risk settings.

Common Mistakes

The most common mistakes, and the correct approach for each:

  - Applying softmax before CrossEntropyLoss: pass raw logits instead.
  - Applying sigmoid before BCEWithLogitsLoss: pass raw logits instead.
  - Using softmax for multi-label classification: use independent sigmoid outputs.
  - Applying softmax over the batch axis: apply it over the class axis.
  - Treating logits as probabilities: convert only when probabilities are needed.
  - Computing softmax before argmax: use logits directly.

These mistakes often do not cause immediate runtime errors, but they can damage training.

Practical Guidance

Use no activation on the final layer when training with nn.CrossEntropyLoss. Pass raw logits.

Use no sigmoid on the final layer when training with nn.BCEWithLogitsLoss. Pass raw logits.

Use softmax only for inspection, probability output, sampling, calibration, or attention weights.

Use argmax on logits when only the predicted class is needed.

For mutually exclusive classes, use softmax. For independent labels, use sigmoid.

Exercises

  1. Given logits [2, 1, 0], compute the softmax probabilities.

  2. Explain why subtracting the maximum logit does not change the softmax output.

  3. Implement multi-class classification with nn.CrossEntropyLoss and verify that no softmax layer is needed.

  4. Implement multi-label classification with nn.BCEWithLogitsLoss.

  5. Compare softmax outputs with temperatures T = 0.5, T = 1, and T = 2.