Softmax regression extends logistic regression from two classes to many classes. It is the standard linear model for multiclass classification.
Suppose an input example has a feature vector $\mathbf{x} \in \mathbb{R}^d$ and a label $y \in \{1, 2, \dots, K\}$. There are $K$ possible classes. The model predicts a probability distribution over those classes.
Class Scores
Softmax regression computes one score for each class. These scores are called logits.
For class $k$, the logit is
$$z_k = \mathbf{w}_k^\top \mathbf{x} + b_k.$$
Each class has its own weight vector $\mathbf{w}_k \in \mathbb{R}^d$ and bias $b_k$.
Stacking all class weights into a matrix gives $\mathbf{W} \in \mathbb{R}^{K \times d}$, and stacking all class biases into a vector gives $\mathbf{b} \in \mathbb{R}^K$. The full logit vector is
$$\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}.$$
For a batch of inputs $\mathbf{X} \in \mathbb{R}^{B \times d}$, the usual PyTorch convention computes
$$\mathbf{Z} = \mathbf{X}\mathbf{W}^\top + \mathbf{b},$$
where $\mathbf{Z} \in \mathbb{R}^{B \times K}$. Each row of $\mathbf{Z}$ contains the class logits for one example.
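To make the shapes concrete, here is a minimal sketch (with arbitrary values of $B$, $d$, and $K$) that computes the logits both with an explicit matrix product and with `nn.Linear`; the layer stores $\mathbf{W}$ as `weight` with shape `[K, d]` and $\mathbf{b}$ as `bias`:

```python
import torch
from torch import nn

B, d, K = 4, 3, 5                 # arbitrary batch size, feature count, class count
X = torch.randn(B, d)             # input batch

linear = nn.Linear(d, K)          # weight has shape [K, d], bias has shape [K]
W, b = linear.weight, linear.bias

Z_manual = X @ W.T + b            # Z = X W^T + b, shape [B, K]
Z_layer = linear(X)               # the layer computes the same thing

print(torch.allclose(Z_manual, Z_layer))  # True
```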
The Softmax Function
The softmax function converts logits into probabilities:
$$\hat{y}_k = \mathrm{softmax}(\mathbf{z})_k = \frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)}.$$
The output satisfies two properties: $\hat{y}_k > 0$ for every class $k$, and $\sum_{k=1}^{K} \hat{y}_k = 1$. Thus the output is a categorical probability distribution.
If one logit is much larger than the others, softmax assigns most of the probability to that class. If all logits are equal, softmax assigns equal probability to every class: $\hat{y}_k = 1/K$ for all $k$.
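A small sketch with made-up logit values illustrates both behaviors, along with the two properties above:

```python
import torch

# One logit dominates: most of the probability mass goes to that class.
dominant = torch.tensor([5.0, 1.0, 0.0])
print(torch.softmax(dominant, dim=-1))        # approximately [0.976, 0.018, 0.007]

# All logits equal: the distribution is uniform over the classes.
equal = torch.tensor([2.0, 2.0, 2.0])
print(torch.softmax(equal, dim=-1))           # [1/3, 1/3, 1/3]

# The outputs are positive and sum to one.
print(torch.softmax(dominant, dim=-1).sum())  # 1.0
```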
Decision Rule
The predicted class is the class with the largest probability:
$$\hat{y} = \operatorname*{arg\,max}_k \hat{y}_k.$$
Since softmax preserves order, this is equivalent to choosing the largest logit:
$$\hat{y} = \operatorname*{arg\,max}_k z_k.$$
This is why we often do not need to compute softmax during inference when we only need the class label.
In PyTorch:
```python
logits = model(X)
preds = logits.argmax(dim=-1)
```

The argument `dim=-1` means that the maximum is taken over the class dimension.
Cross-Entropy Loss
For one example with true class $y$, the multiclass cross-entropy loss is
$$\ell = -\log \hat{y}_y.$$
Here $\hat{y}_y$ is the probability assigned to the correct class.
Using the softmax definition:
$$\ell = -\log \frac{\exp(z_y)}{\sum_{j=1}^{K} \exp(z_j)}.$$
This simplifies to
$$\ell = \log \sum_{j=1}^{K} \exp(z_j) - z_y.$$
For a batch of $B$ examples, the mean loss is
$$L = \frac{1}{B} \sum_{i=1}^{B} -\log \hat{y}^{(i)}_{y^{(i)}}.$$
Cross-entropy rewards the model for assigning high probability to the correct class. It penalizes confident wrong predictions heavily.
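The formula can be checked against PyTorch directly. The following sketch (with random logits and labels) computes the mean negative log-probability of the true classes by hand and compares it with `torch.nn.functional.cross_entropy`:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, K = 4, 3
logits = torch.randn(B, K)
labels = torch.randint(0, K, (B,))

# By hand: softmax, then the negative log-probability of each true class.
probs = torch.softmax(logits, dim=-1)
manual = -torch.log(probs[torch.arange(B), labels]).mean()

# Library version, which works directly on raw logits.
library = F.cross_entropy(logits, labels)

print(torch.allclose(manual, library))  # True
```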
Numerical Stability
The exponential function grows quickly. Directly computing `torch.exp(logits)` can overflow when logits are large.
A stable implementation subtracts the maximum logit before applying exponentials:
$$\mathrm{softmax}(\mathbf{z})_k = \frac{\exp(z_k - m)}{\sum_{j=1}^{K} \exp(z_j - m)},$$
where $m = \max_j z_j$.
This does not change the softmax probabilities because adding or subtracting the same constant from every logit leaves softmax unchanged.
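The following sketch shows both the naive and the shifted computation on deliberately large logits; the built-in `torch.softmax` already performs the shift internally:

```python
import torch

logits = torch.tensor([1000.0, 1001.0, 1002.0])  # large enough that exp() overflows

naive = torch.exp(logits) / torch.exp(logits).sum()
print(naive)                                      # tensor([nan, nan, nan])

m = logits.max()
shifted = torch.exp(logits - m)                   # subtract the maximum logit first
stable = shifted / shifted.sum()
print(stable)                                     # tensor([0.0900, 0.2447, 0.6652])

print(torch.allclose(stable, torch.softmax(logits, dim=-1)))  # True
```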
PyTorch handles this internally in nn.CrossEntropyLoss.
For this reason, training code should pass raw logits to the loss:
```python
loss_fn = torch.nn.CrossEntropyLoss()
logits = model(X)
loss = loss_fn(logits, labels)
```

Do not apply softmax before `CrossEntropyLoss`.
Maximum Likelihood View
Softmax regression defines a categorical distribution:
$$P(y = k \mid \mathbf{x}) = \mathrm{softmax}(\mathbf{W}\mathbf{x} + \mathbf{b})_k.$$
For one labeled example $(\mathbf{x}^{(i)}, y^{(i)})$, the probability assigned to the observed label is $P(y^{(i)} \mid \mathbf{x}^{(i)})$.
The likelihood of a dataset of $n$ examples is
$$\prod_{i=1}^{n} P(y^{(i)} \mid \mathbf{x}^{(i)}).$$
Maximizing this likelihood is equivalent to minimizing the negative log-likelihood:
$$-\sum_{i=1}^{n} \log P(y^{(i)} \mid \mathbf{x}^{(i)}).$$
This is exactly the cross-entropy loss, up to averaging over the dataset.
Thus softmax regression is a probabilistic linear classifier. It learns parameters that assign high probability to the observed class labels.
Gradients
Let $\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{z})$ be the vector of predicted probabilities, and let $\mathbf{y}$ be the one-hot vector for the true class. For cross-entropy loss, the derivative with respect to the logits is
$$\frac{\partial \ell}{\partial \mathbf{z}} = \hat{\mathbf{y}} - \mathbf{y}.$$
This is one of the most important formulas in deep learning.
For a batch, let $\hat{\mathbf{Y}} \in \mathbb{R}^{B \times K}$ be the predicted probabilities and $\mathbf{Y} \in \mathbb{R}^{B \times K}$ be the one-hot label matrix. Then
$$\frac{\partial L}{\partial \mathbf{Z}} = \frac{1}{B}\left(\hat{\mathbf{Y}} - \mathbf{Y}\right).$$
Since $\mathbf{Z} = \mathbf{X}\mathbf{W}^\top + \mathbf{b}$, the gradients are
$$\frac{\partial L}{\partial \mathbf{W}} = \frac{1}{B}\left(\hat{\mathbf{Y}} - \mathbf{Y}\right)^\top \mathbf{X}$$
and
$$\frac{\partial L}{\partial \mathbf{b}} = \frac{1}{B}\sum_{i=1}^{B}\left(\hat{\mathbf{y}}^{(i)} - \mathbf{y}^{(i)}\right).$$
The structure is the same as logistic regression: prediction minus target, propagated through the linear layer.
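The prediction-minus-target formula can be verified with autograd. This sketch (random data, arbitrary sizes) compares the analytic gradient with respect to the logits against the gradient that backpropagation computes:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, K = 4, 3
logits = torch.randn(B, K, requires_grad=True)
labels = torch.randint(0, K, (B,))

loss = F.cross_entropy(logits, labels)              # mean cross-entropy over the batch
loss.backward()

probs = torch.softmax(logits.detach(), dim=-1)      # predicted probabilities
one_hot = F.one_hot(labels, num_classes=K).float()  # one-hot label matrix
analytic = (probs - one_hot) / B                    # (Y_hat - Y) / B

print(torch.allclose(logits.grad, analytic))        # True
```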
Softmax Regression in PyTorch
A softmax regression model is a single linear layer with $K$ outputs.
```python
import torch
from torch import nn

class SoftmaxRegression(nn.Module):
    def __init__(self, in_features, num_classes):
        super().__init__()
        self.linear = nn.Linear(in_features, num_classes)

    def forward(self, x):
        return self.linear(x)
```

The model returns logits, not probabilities.
For a batch input with shape `[B, d]`, the output has shape `[B, K]`.

Example training loop:
```python
torch.manual_seed(0)

B = 256
d = 20
K = 5

X = torch.randn(B, d)
labels = torch.randint(0, K, (B,))

model = SoftmaxRegression(d, K)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(200):
    logits = model(X)
    loss = loss_fn(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The labels must have shape `[B]` and contain integer class indices `0, 1, ..., K - 1`. They should not be one-hot encoded when using `nn.CrossEntropyLoss` in its standard form.
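After training, a quick sanity check on the same synthetic data (not a proper evaluation, since there is no held-out set here) computes the accuracy implied by the decision rule:

```python
with torch.no_grad():
    preds = model(X).argmax(dim=-1)
    accuracy = (preds == labels).float().mean().item()

print(f"Training accuracy: {accuracy:.3f}")
```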
Probabilities During Inference
During inference, logits can be converted to probabilities with softmax:
```python
with torch.no_grad():
    logits = model(X)
    probs = torch.softmax(logits, dim=-1)
    preds = probs.argmax(dim=-1)
```

The probabilities can be useful when we need confidence estimates, ranking, thresholding, or calibration.
However, raw softmax probabilities are often overconfident. A model may assign high probability to a wrong class, especially on inputs far from the training distribution. Calibration methods are discussed later in the book.
Shape Conventions
For multiclass classification, the standard shapes are:
| Quantity | Shape | Meaning |
|---|---|---|
| Input batch | [B, d] | B examples, d features |
| Logits | [B, K] | K class scores per example |
| Labels | [B] | One integer class index per example |
| Probabilities | [B, K] | Class probabilities |
| Predictions | [B] | Predicted class indices |
A common mistake is passing probabilities to CrossEntropyLoss. Another common mistake is passing one-hot labels where integer labels are expected.
Correct:
```python
logits = model(X)
loss = nn.CrossEntropyLoss()(logits, labels)
```

Usually incorrect:

```python
probs = torch.softmax(model(X), dim=-1)
loss = nn.CrossEntropyLoss()(probs, labels)
```

The first version is numerically stable and mathematically correct for PyTorch's built-in loss.
Linear Decision Boundaries
Softmax regression is still a linear classifier. The decision between class $j$ and class $k$ depends on which of the two logits $z_j$ and $z_k$ is larger.
Since $z_j = \mathbf{w}_j^\top \mathbf{x} + b_j$ and $z_k = \mathbf{w}_k^\top \mathbf{x} + b_k$, the boundary between these two classes is
$$\left(\mathbf{w}_j - \mathbf{w}_k\right)^\top \mathbf{x} + \left(b_j - b_k\right) = 0.$$
This is a hyperplane.
Softmax regression can separate classes using linear boundaries, but it cannot learn nonlinear feature interactions unless such features are already present in the input.
Neural networks extend softmax regression by learning a nonlinear feature map before the final classifier:
$$\mathbf{z} = \mathbf{W}\,\phi(\mathbf{x}) + \mathbf{b}.$$
The final layer of many classifiers is exactly softmax regression applied to learned features.
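As an illustration, the following sketch (reusing `d`, `K`, and `X` from the training example above, with an arbitrary hidden width) builds a small two-layer network whose final layer is exactly the linear classifier described in this chapter:

```python
from torch import nn

hidden_dim = 64                 # arbitrary hidden width for illustration

mlp = nn.Sequential(
    nn.Linear(d, hidden_dim),   # learned feature map phi(x)
    nn.ReLU(),                  # nonlinearity
    nn.Linear(hidden_dim, K),   # softmax regression on the learned features
)

logits = mlp(X)                 # shape [B, K]; trained with nn.CrossEntropyLoss as before
```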
Connection to Neural Networks
Most neural classifiers end with a linear layer that produces logits:
```python
classifier = nn.Linear(hidden_dim, num_classes)
```

The full network may be deep and nonlinear, but the final classification step is the same as softmax regression.
For example:
```python
features = backbone(images)
logits = classifier(features)
loss = nn.CrossEntropyLoss()(logits, labels)
```

This pattern appears in image classification, text classification, speech recognition, graph classification, and many other tasks.
Softmax regression is therefore not merely a historical model. It remains the output layer of many modern deep learning systems.
Summary
Softmax regression is the multiclass extension of logistic regression. It computes one logit per class, converts logits to probabilities with softmax, and trains using cross-entropy loss.
In PyTorch, the model should return logits. nn.CrossEntropyLoss combines log-softmax and negative log-likelihood internally, so it is both stable and convenient.
Softmax regression is linear in its input, but when placed on top of learned features, it becomes the standard classifier head for deep neural networks.