# Softmax and Output Activations

Many neural networks produce raw scores called logits. A logit can be any real number: negative, positive, small, or large. For classification, we usually need to convert logits into probabilities. Output activation functions perform this conversion.

The most important output activation for multi-class classification is softmax. It maps a vector of real-valued scores into a vector of positive values that sum to 1.

### Logits

A classification model usually ends with a linear layer:

$$
z = Wh + b.
$$

Here \(h\) is the final hidden representation, \(W\) is a weight matrix, and \(b\) is a bias vector. The output \(z\) is a vector of logits.

If the task has \(K\) classes, then

$$
z\in\mathbb{R}^K.
$$

Each component \(z_k\) is the score for class \(k\). Larger logits indicate stronger model preference, but logits are not probabilities. They do not need to be positive, and they do not need to sum to 1.

For a batch of \(B\) examples, the model usually produces

$$
Z\in\mathbb{R}^{B\times K}.
$$

In PyTorch:

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)

h = torch.randn(32, 128)
logits = model(h)

print(logits.shape)  # torch.Size([32, 10])
```

This output represents 32 examples and 10 class scores per example.

### The Softmax Function

The softmax function takes a vector \(z\in\mathbb{R}^K\) and returns a probability vector \(p\in\mathbb{R}^K\):

$$
p_k =
\frac{\exp(z_k)}
{\sum_{j=1}^{K}\exp(z_j)}.
$$

Each output is positive:

$$
p_k > 0.
$$

The outputs sum to 1:

$$
\sum_{k=1}^{K}p_k=1.
$$

Thus \(p_k\) can be interpreted as the model’s predicted probability for class \(k\).

In PyTorch:

```python
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1])
probs = F.softmax(logits, dim=0)

print(probs)
print(probs.sum())
```

The largest logit receives the largest probability.

### Softmax for Batches

For a batch of logits

$$
Z\in\mathbb{R}^{B\times K},
$$

softmax should usually be applied over the class axis:

$$
P_{bk} =
\frac{\exp(Z_{bk})}
{\sum_{j=1}^{K}\exp(Z_{bj})}.
$$

Each row becomes one probability distribution over classes.

In PyTorch:

```python
logits = torch.randn(32, 10)
probs = F.softmax(logits, dim=1)

print(probs.shape)        # torch.Size([32, 10])
print(probs.sum(dim=1))   # approximately all ones
```

The argument `dim=1` means softmax is applied across the class dimension. Choosing the wrong dimension is a common error.

### Softmax as Competition

Softmax creates competition among classes. Increasing one logit increases its probability and decreases the probabilities of the others.
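
To see this competition directly, raise one logit and compare the resulting distributions (a minimal check, reusing the `torch` and `F` imports from earlier):

```python
logits = torch.tensor([2.0, 1.0, 0.1])
print(F.softmax(logits, dim=0))

# Raise only the first logit; the other probabilities shrink.
raised = torch.tensor([4.0, 1.0, 0.1])
print(F.softmax(raised, dim=0))
```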

This differs from sigmoid. Sigmoid treats each output independently:

$$
p_k = \sigma(z_k).
$$

Softmax couples all outputs through the denominator:

$$
\sum_{j=1}^{K}\exp(z_j).
$$

This coupling makes softmax suitable when exactly one class should be selected, such as digit classification, object category classification, or single-label text classification.

For multi-label classification, where several labels can be true at once, independent sigmoid outputs are usually more appropriate.

### Numerical Stability

The naive softmax formula can overflow because exponentials grow quickly. For example,

$$
\exp(1000)
$$

cannot be represented in ordinary floating-point formats.

Softmax is usually computed with the logit maximum subtracted:

$$
p_k =
\frac{\exp(z_k - m)}
{\sum_{j=1}^{K}\exp(z_j - m)},
\quad
m=\max_j z_j.
$$

This transformation does not change the result: subtracting the same constant from every logit rescales the numerator and denominator by the same factor \(\exp(-m)\), so softmax is unchanged.

In practice, PyTorch implements stable softmax kernels. Still, when computing losses, one should usually avoid manually applying softmax before cross-entropy.
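
As an illustration of the trick (in practice `F.softmax` already does this), a manually stabilized version might look like the following sketch:

```python
def stable_softmax(z: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # Subtract the maximum logit along `dim` before exponentiating.
    z = z - z.max(dim=dim, keepdim=True).values
    exp_z = torch.exp(z)
    return exp_z / exp_z.sum(dim=dim, keepdim=True)

big_logits = torch.tensor([1000.0, 999.0, 0.0])
print(stable_softmax(big_logits))     # finite values, no overflow
print(F.softmax(big_logits, dim=-1))  # PyTorch's built-in version handles this too
```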

### Softmax Cross-Entropy

For multi-class classification, the standard loss is cross-entropy.

If the true class is \(y\), and the predicted probability for class \(y\) is \(p_y\), then the loss is

$$
L = -\log p_y.
$$

Using softmax,

$$
L =
-\log
\frac{\exp(z_y)}
{\sum_{j=1}^{K}\exp(z_j)}.
$$

This can be rewritten as

$$
L =
-z_y
+
\log
\sum_{j=1}^{K}\exp(z_j).
$$

This form is called log-softmax plus negative log likelihood. It is numerically more stable than computing softmax probabilities first.
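
As a quick sanity check of this identity, one can compare the manual formula against `F.cross_entropy`; the tensors below are illustrative:

```python
logits = torch.randn(4, 5)
targets = torch.tensor([0, 3, 1, 4])

# L = -z_y + logsumexp(z), averaged over the batch.
z_y = logits.gather(1, targets[:, None]).squeeze(1)
manual = (-z_y + torch.logsumexp(logits, dim=1)).mean()

print(torch.allclose(manual, F.cross_entropy(logits, targets)))  # True
```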

In PyTorch, use `nn.CrossEntropyLoss` with raw logits:

```python
loss_fn = nn.CrossEntropyLoss()

logits = torch.randn(32, 10)
targets = torch.randint(0, 10, (32,))

loss = loss_fn(logits, targets)
print(loss)
```

Do not write:

```python
probs = F.softmax(logits, dim=1)
loss = loss_fn(probs, targets)  # wrong usage
```

`nn.CrossEntropyLoss` expects logits. It internally applies a stable log-softmax operation.

### Target Format

For `nn.CrossEntropyLoss`, targets are usually integer class indices.

If

$$
\text{logits}\in\mathbb{R}^{B\times K},
$$

then targets should have shape

$$
[B].
$$

Each target value is an integer between \(0\) and \(K-1\).

Example:

```python
logits = torch.randn(4, 3)

targets = torch.tensor([0, 2, 1, 2])

loss = nn.CrossEntropyLoss()(logits, targets)
```

Here the batch contains 4 examples and 3 classes.

For soft targets, such as label smoothing or distillation, targets may be probability distributions. In that case, the target shape is often also `[B, K]`, depending on the loss and PyTorch version.
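
For example, recent PyTorch versions (1.10 and later) accept probability targets in `nn.CrossEntropyLoss` directly; the sketch below assumes such a version:

```python
logits = torch.randn(4, 3)

# Soft targets: each row is a probability distribution over the 3 classes.
soft_targets = F.softmax(torch.randn(4, 3), dim=1)

loss = nn.CrossEntropyLoss()(logits, soft_targets)
```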

### Binary Classification: Sigmoid Output

For binary classification, the model often produces one logit:

$$
z\in\mathbb{R}.
$$

The probability of class 1 is

$$
p = \sigma(z).
$$

The probability of class 0 is

$$
1-p.
$$

In PyTorch, the stable loss is `nn.BCEWithLogitsLoss`:

```python
loss_fn = nn.BCEWithLogitsLoss()

logits = torch.randn(32)
targets = torch.randint(0, 2, (32,)).float()

loss = loss_fn(logits, targets)
```

Again, do not manually apply sigmoid before this loss:

```python
probs = torch.sigmoid(logits)
loss = loss_fn(probs, targets)  # wrong usage
```

`BCEWithLogitsLoss` expects raw logits.

### Multi-Label Classification

In multi-label classification, each example may belong to multiple classes. For example, an image may contain both “person” and “bicycle.”

Here the model produces \(K\) logits:

$$
z\in\mathbb{R}^K.
$$

Each class receives an independent sigmoid:

$$
p_k=\sigma(z_k).
$$

The outputs do not need to sum to 1.

In PyTorch:

```python
logits = torch.randn(32, 20)
targets = torch.randint(0, 2, (32, 20)).float()

loss = nn.BCEWithLogitsLoss()(logits, targets)
```

Use softmax for mutually exclusive classes. Use sigmoid for independent labels.

### Log-Softmax

The log-softmax function computes the logarithm of softmax probabilities:

$$
\log p_k =
z_k -
\log\sum_{j=1}^{K}\exp(z_j).
$$

This is useful because many losses require log probabilities.

In PyTorch:

```python
log_probs = F.log_softmax(logits, dim=1)
```

For classification, `nn.CrossEntropyLoss` combines `F.log_softmax` and negative log likelihood loss.

The explicit version is:

```python
log_probs = F.log_softmax(logits, dim=1)
loss = F.nll_loss(log_probs, targets)
```

This is equivalent to:

```python
loss = F.cross_entropy(logits, targets)
```

The second version is simpler and preferred in ordinary classification code.

### Temperature Scaling

Softmax can be adjusted with a temperature parameter \(T>0\):

$$
p_k =
\frac{\exp(z_k/T)}
{\sum_{j=1}^{K}\exp(z_j/T)}.
$$

A small temperature makes the distribution sharper, and a large temperature makes it flatter: with \(T<1\) the largest logit becomes more dominant, while with \(T>1\) probability spreads more evenly across classes.

Temperature is used in calibration, knowledge distillation, sampling from language models, and contrastive learning.

Example:

```python
def softmax_with_temperature(logits, temperature: float):
    return F.softmax(logits / temperature, dim=-1)

logits = torch.tensor([3.0, 1.0, 0.5])

print(softmax_with_temperature(logits, 0.5))
print(softmax_with_temperature(logits, 1.0))
print(softmax_with_temperature(logits, 2.0))
```

### Softmax in Attention

Softmax is also used inside attention mechanisms.

Given attention scores

$$
S\in\mathbb{R}^{T\times T},
$$

softmax converts scores into attention weights:

$$
A_{ij} =
\frac{\exp(S_{ij})}
{\sum_{k=1}^{T}\exp(S_{ik})}.
$$

Each row of \(A\) sums to 1. This means each token distributes its attention across other tokens.

In PyTorch:

```python
scores = torch.randn(8, 128, 128)  # batch, query positions, key positions
weights = F.softmax(scores, dim=-1)

print(weights.sum(dim=-1).shape)  # torch.Size([8, 128])
```

Softmax appears in both output layers and internal model operations.

### Masked Softmax

Attention often needs masks. For example, a decoder should not attend to future tokens. Padding tokens should also be ignored.

Masked softmax sets invalid positions to a very negative value before applying softmax:

```python
scores = torch.randn(2, 4, 4)

mask = torch.tensor([
    [True, True, False, False],
    [True, True, True, False],
])

# Expand mask to match scores shape.
mask = mask[:, None, :]

scores = scores.masked_fill(~mask, float("-inf"))
weights = F.softmax(scores, dim=-1)
```

Positions with \(-\infty\) receive probability 0 after softmax.

This pattern is central to transformer implementations.

### Argmax and Prediction

To get the predicted class from softmax probabilities, choose the class with maximum probability:

$$
\hat{y}=\arg\max_k p_k.
$$

Because softmax preserves ordering, this is equivalent to choosing the maximum logit:

$$
\hat{y}=\arg\max_k z_k.
$$

Therefore, during inference, one often avoids computing softmax when only the predicted class is needed.

In PyTorch:

```python
logits = torch.randn(32, 10)

preds = logits.argmax(dim=1)
```

This is enough for classification accuracy.
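
If desired, the ordering claim can be verified by comparing against the argmax of the probabilities (reusing `logits` and `preds` from above):

```python
probs = F.softmax(logits, dim=1)
print(torch.equal(preds, probs.argmax(dim=1)))  # True: softmax preserves the argmax
```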

Compute softmax only when probabilities are needed.

### Calibration

Softmax probabilities are often interpreted as confidence values. For example, if a model assigns probability 0.9 to a class, we would like it to be correct about 90 percent of the time on predictions made with that confidence.

This property is called calibration.

Deep networks are often miscalibrated. A model may be overconfident or underconfident. Temperature scaling is a common post-training calibration method.
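
A minimal sketch of temperature scaling, assuming held-out validation logits and labels: fit a single scalar \(T\) by minimizing cross-entropy, then divide logits by \(T\) at prediction time. The random tensors and the optimizer choice below are illustrative, not a fixed recipe.

```python
# Illustrative stand-ins for held-out validation logits and labels.
val_logits = torch.randn(256, 10)
val_targets = torch.randint(0, 10, (256,))

log_T = torch.zeros(1, requires_grad=True)  # optimize log T so T stays positive
optimizer = torch.optim.Adam([log_T], lr=0.01)

for _ in range(200):
    optimizer.zero_grad()
    loss = F.cross_entropy(val_logits / log_T.exp(), val_targets)
    loss.backward()
    optimizer.step()

T = log_T.exp().item()
calibrated_probs = F.softmax(val_logits / T, dim=1)
```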

A calibrated model is especially important in medicine, finance, autonomous systems, and other high-risk settings.

### Common Mistakes

The most common mistakes are:

| Mistake | Correct approach |
|---|---|
| Applying softmax before `CrossEntropyLoss` | Pass raw logits |
| Applying sigmoid before `BCEWithLogitsLoss` | Pass raw logits |
| Using softmax for multi-label classification | Use independent sigmoid outputs |
| Applying softmax over the batch axis | Apply over the class axis |
| Treating logits as probabilities | Convert only when needed |
| Computing softmax before `argmax` | Use logits directly |

These mistakes often do not cause immediate runtime errors, but they can damage training.

### Practical Guidance

Use no activation on the final layer when training with `nn.CrossEntropyLoss`. Pass raw logits.

Use no sigmoid on the final layer when training with `nn.BCEWithLogitsLoss`. Pass raw logits.

Use `softmax` only for inspection, probability output, sampling, calibration, or attention weights.

Use `argmax` on logits when only the predicted class is needed.

For mutually exclusive classes, use softmax. For independent labels, use sigmoid.

### Exercises

1. Given logits \([2, 1, 0]\), compute the softmax probabilities.

2. Explain why subtracting the maximum logit does not change the softmax output.

3. Implement multi-class classification with `nn.CrossEntropyLoss` and verify that no softmax layer is needed.

4. Implement multi-label classification with `nn.BCEWithLogitsLoss`.

5. Compare softmax outputs with temperatures \(T=0.5\), \(T=1\), and \(T=2\).

