Softmax Regression

Softmax regression extends logistic regression from two classes to many classes. It is the standard linear model for multiclass classification.

Suppose an input example has feature vector

x \in \mathbb{R}^d

and a label

y \in \{0, 1, \ldots, K-1\}.

There are K possible classes. The model predicts a probability distribution over those classes.

Class Scores

Softmax regression computes one score for each class. These scores are called logits.

For class k, the logit is

z_k = w_k^\top x + b_k.

Each class has its own weight vector w_k \in \mathbb{R}^d and bias b_k \in \mathbb{R}.

Stacking all class weights into a matrix gives

W \in \mathbb{R}^{K \times d}.

Stacking all class biases into a vector gives

b \in \mathbb{R}^K.

The full logit vector is

z = Wx + b.

For a batch of inputs

X \in \mathbb{R}^{B \times d},

the usual PyTorch convention (as in nn.Linear) is to compute

Z = XW^\top + b,

where

Z \in \mathbb{R}^{B \times K}.

Each row of Z contains the class logits for one example.
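
As a quick shape check (a standalone sketch; the dimensions B = 2, d = 4, K = 3 are arbitrary), nn.Linear stores exactly this W and b and produces one row of logits per example:

import torch
from torch import nn

B, d, K = 2, 4, 3
linear = nn.Linear(d, K)    # stores W with shape [K, d] and b with shape [K]

X = torch.randn(B, d)
Z = linear(X)               # computes X @ W.T + b

print(linear.weight.shape)  # torch.Size([3, 4])
print(Z.shape)              # torch.Size([2, 3]): one row of logits per example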

The Softmax Function

The softmax function converts logits into probabilities:

p_k = \frac{\exp(z_k)}{\sum_{j=0}^{K-1} \exp(z_j)}.

The output satisfies two properties:

0 \le p_k \le 1

and

\sum_{k=0}^{K-1} p_k = 1.

Thus the output is a categorical probability distribution.

If one logit is much larger than the others, softmax assigns most of the probability to that class. If all logits are equal, softmax assigns equal probability to every class:

p_k = \frac{1}{K}.
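
The formula can be checked directly against PyTorch's built-in implementation. This is an illustrative sketch with arbitrary logit values:

import torch

z = torch.tensor([2.0, 1.0, 0.1])

manual = torch.exp(z) / torch.exp(z).sum()   # p_k = exp(z_k) / sum_j exp(z_j)
builtin = torch.softmax(z, dim=-1)

print(manual)        # tensor([0.6590, 0.2424, 0.0986])
print(builtin)       # same values
print(manual.sum())  # tensor(1.): a valid probability distribution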

Decision Rule

The predicted class is the class with the largest probability:

\hat{y} = \arg\max_k p_k.

Since softmax preserves order, this is equivalent to choosing the largest logit:

\hat{y} = \arg\max_k z_k.

For this reason, softmax often does not need to be computed at inference time when only the class label is required.

In PyTorch:

logits = model(X)
preds = logits.argmax(dim=-1)

The argument dim=-1 means that the maximum is taken over the class dimension.

Cross-Entropy Loss

For one example with true class y, the multiclass cross-entropy loss is

\ell(z, y) = -\log p_y.

Here p_y is the probability assigned to the correct class.

Using the softmax definition:

\ell(z, y) = -\log \frac{\exp(z_y)}{\sum_{j=0}^{K-1} \exp(z_j)}.

This simplifies to

\ell(z, y) = -z_y + \log \sum_{j=0}^{K-1} \exp(z_j).

For a batch of B examples, the mean loss is

L = \frac{1}{B} \sum_{i=1}^{B} \ell(z_i, y_i).

Cross-entropy rewards the model for assigning high probability to the correct class. It penalizes confident wrong predictions heavily.
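The simplified form is convenient to verify in code. The sketch below (arbitrary logits and labels) computes -z_y + \log \sum_j \exp(z_j) by hand, using torch.logsumexp, and compares the batch mean with nn.CrossEntropyLoss:

import torch
from torch import nn

logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 0.3]])
labels = torch.tensor([0, 2])

# Per-example loss: -z_y + logsumexp(z), then mean over the batch
z_y = logits[torch.arange(len(labels)), labels]
manual = (-z_y + torch.logsumexp(logits, dim=-1)).mean()

builtin = nn.CrossEntropyLoss()(logits, labels)
print(manual, builtin)  # both print the same value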

Numerical Stability

The exponential function grows quickly. Directly computing

torch.exp(logits)

can overflow when logits are large.

A stable implementation subtracts the maximum logit before applying exponentials:

p_k = \frac{\exp(z_k - m)}{\sum_{j=0}^{K-1} \exp(z_j - m)},

where

m = \max_j z_j.

This does not change the probabilities: the common factor \exp(-m) cancels between numerator and denominator, so softmax is invariant to shifting every logit by the same constant.

\operatorname{softmax}(z) = \operatorname{softmax}(z + c\mathbf{1}).
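
Both the overflow and the fix are easy to demonstrate (a sketch with deliberately large logits):

import torch

z = torch.tensor([1000.0, 999.0, 998.0])

naive = torch.exp(z) / torch.exp(z).sum()  # exp(1000.) overflows to inf
print(naive)                               # tensor([nan, nan, nan])

m = z.max()
stable = torch.exp(z - m) / torch.exp(z - m).sum()
print(stable)                              # tensor([0.6652, 0.2447, 0.0900])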

PyTorch handles this internally in nn.CrossEntropyLoss.

For this reason, training code should pass raw logits to the loss:

loss_fn = torch.nn.CrossEntropyLoss()

logits = model(X)
loss = loss_fn(logits, labels)

Do not apply softmax before CrossEntropyLoss.

Maximum Likelihood View

Softmax regression defines a categorical distribution:

P(y = k \mid x) = p_k.

For one labeled example (x_i, y_i), the probability assigned to the observed label is

P(y_i \mid x_i) = p_{i, y_i}.

Assuming independent examples, the likelihood of a dataset of B examples is

\prod_{i=1}^{B} p_{i, y_i}.

Maximizing this likelihood is equivalent to minimizing the negative log-likelihood:

-\sum_{i=1}^{B} \log p_{i, y_i}.

This is exactly cross-entropy loss.
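
A quick numerical check of this equivalence (arbitrary values): summing -\log p_{i, y_i} over a batch matches nn.CrossEntropyLoss with sum reduction.

import torch
from torch import nn

logits = torch.tensor([[1.0, -0.5, 0.2],
                       [0.3, 2.0, -1.0]])
labels = torch.tensor([2, 1])

probs = torch.softmax(logits, dim=-1)
nll = -torch.log(probs[torch.arange(2), labels]).sum()  # -sum_i log p_{i, y_i}

ce = nn.CrossEntropyLoss(reduction="sum")(logits, labels)
print(nll, ce)  # equal up to floating-point error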

Thus softmax regression is a probabilistic linear classifier. It learns parameters that assign high probability to the observed class labels.

Gradients

Let

p = \operatorname{softmax}(z).

Let e_y be the one-hot vector for the true class. For cross-entropy loss, the derivative with respect to the logits is

\frac{\partial \ell}{\partial z} = p - e_y.

This is one of the most important formulas in deep learning.

For a batch, let

P \in \mathbb{R}^{B \times K}

be the predicted probabilities and

Y \in \mathbb{R}^{B \times K}

be the one-hot label matrix. Then

\frac{\partial L}{\partial Z} = \frac{1}{B}(P - Y).

Since

Z = XW^\top + b,

the gradients are

\nabla_W L = \frac{1}{B}(P - Y)^\top X,

and

\nabla_b L = \frac{1}{B} \sum_{i=1}^{B} (P_i - Y_i).

The structure is the same as logistic regression: prediction minus target, propagated through the linear layer.
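
The batch formula can be verified against autograd. This sketch (arbitrary shapes) compares the analytic gradient (P - Y)/B with what backward() computes:

import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
B, K = 4, 3
logits = torch.randn(B, K, requires_grad=True)
labels = torch.randint(0, K, (B,))

loss = nn.CrossEntropyLoss()(logits, labels)  # mean loss over the batch
loss.backward()

P = torch.softmax(logits, dim=-1)
Y = F.one_hot(labels, K).float()
analytic = (P - Y) / B                        # the formula above

print(torch.allclose(logits.grad, analytic, atol=1e-6))  # True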

Softmax Regression in PyTorch

A softmax regression model is a single linear layer with K outputs.

import torch
from torch import nn

class SoftmaxRegression(nn.Module):
    def __init__(self, in_features, num_classes):
        super().__init__()
        self.linear = nn.Linear(in_features, num_classes)

    def forward(self, x):
        return self.linear(x)

The model returns logits, not probabilities.

For a batch input with shape

[B, d]

the output has shape

[B, K]

Example training loop:

torch.manual_seed(0)

B = 256
d = 20
K = 5

X = torch.randn(B, d)
labels = torch.randint(0, K, (B,))

model = SoftmaxRegression(d, K)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(200):
    logits = model(X)              # raw class scores, shape [B, K]
    loss = loss_fn(logits, labels)

    optimizer.zero_grad()          # clear gradients from the previous step
    loss.backward()
    optimizer.step()

The labels must have shape

[B]

and contain integer class indices:

0, 1, ..., K - 1

They should not be one-hot encoded when using nn.CrossEntropyLoss in its standard form.
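
If a dataset does provide one-hot labels, argmax recovers the integer indices the loss expects (a small sketch with made-up labels):

import torch

onehot = torch.tensor([[0., 1., 0.],
                       [1., 0., 0.]])  # one-hot labels, shape [B, K]
labels = onehot.argmax(dim=-1)         # integer class indices, shape [B]
print(labels)                          # tensor([1, 0])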

Probabilities During Inference

During inference, logits can be converted to probabilities with softmax:

with torch.no_grad():
    logits = model(X)
    probs = torch.softmax(logits, dim=-1)
    preds = probs.argmax(dim=-1)

The probabilities can be useful when we need confidence estimates, ranking, thresholding, or calibration.

However, raw softmax probabilities are often overconfident. A model may assign high probability to a wrong class, especially on inputs far from the training distribution. Calibration methods are discussed later in the book.

Shape Conventions

For multiclass classification, the standard shapes are:

Quantity        Shape    Meaning
Input batch     [B, d]   B examples, d features
Logits          [B, K]   K scores per example
Labels          [B]      One integer class index per example
Probabilities   [B, K]   Class probabilities
Predictions     [B]      Predicted class indices

A common mistake is passing probabilities to CrossEntropyLoss. Another common mistake is passing one-hot labels where integer labels are expected.

Correct:

logits = model(X)
loss = nn.CrossEntropyLoss()(logits, labels)

Usually incorrect:

probs = torch.softmax(model(X), dim=-1)
loss = nn.CrossEntropyLoss()(probs, labels)

The first version is numerically stable and mathematically correct for PyTorch’s built-in loss.

Linear Decision Boundaries

Softmax regression is still a linear classifier. The decision between class a and class b depends on which logit is larger:

z_a > z_b.

Since

z_a = w_a^\top x + b_a

and

z_b = w_b^\top x + b_b,

the boundary between these two classes is

(w_a - w_b)^\top x + (b_a - b_b) = 0.

This is a hyperplane.
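
The boundary parameters can be read directly off the weight matrix. In this sketch, an untrained nn.Linear stands in for a trained model's layer, and the class pair (0, 1) is an arbitrary choice:

import torch
from torch import nn

linear = nn.Linear(20, 5)  # stands in for a trained model's final layer

a, b = 0, 1
w_diff = linear.weight[a] - linear.weight[b]  # (w_a - w_b), shape [d]
b_diff = linear.bias[a] - linear.bias[b]      # (b_a - b_b), a scalar

x = torch.randn(20)
# The model prefers class a over class b exactly when this score is positive
print(w_diff @ x + b_diff > 0)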

Softmax regression can separate classes using linear boundaries, but it cannot learn nonlinear feature interactions unless such features are already present in the input.

Neural networks extend softmax regression by learning a nonlinear feature map before the final classifier:

h = \phi(x), \qquad z = Wh + b.

The final layer of many classifiers is exactly softmax regression applied to learned features.
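
As a minimal illustration (the layer sizes here are hypothetical), a two-layer network is just a learned feature map \phi followed by a softmax regression head:

import torch
from torch import nn

phi = nn.Sequential(       # learned nonlinear feature map h = phi(x)
    nn.Linear(20, 64),
    nn.ReLU(),
)
head = nn.Linear(64, 5)    # softmax regression on the learned features h

x = torch.randn(8, 20)
logits = head(phi(x))      # shape [8, 5]
print(logits.shape)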

Connection to Neural Networks

Most neural classifiers end with a linear layer that produces logits:

classifier = nn.Linear(hidden_dim, num_classes)

The full network may be deep and nonlinear, but the final classification step is the same as softmax regression.

For example:

features = backbone(images)
logits = classifier(features)
loss = nn.CrossEntropyLoss()(logits, labels)

This pattern appears in image classification, text classification, speech recognition, graph classification, and many other tasks.

Softmax regression is therefore not merely a historical model. It remains the output layer of many modern deep learning systems.

Summary

Softmax regression is the multiclass extension of logistic regression. It computes one logit per class, converts logits to probabilities with softmax, and trains using cross-entropy loss.

In PyTorch, the model should return logits. nn.CrossEntropyLoss combines log-softmax and negative log-likelihood internally, so it is both stable and convenient.

Softmax regression is linear in its input, but when placed on top of learned features, it becomes the standard classifier head for deep neural networks.