# Loss Functions

A loss function measures how wrong a model’s predictions are. During training, the model produces predictions, the loss function converts prediction error into a scalar, and the optimizer changes the parameters to reduce that scalar.

In PyTorch, a loss function usually receives two tensors:

```python
loss = loss_fn(prediction, target)
```

The result is usually a scalar tensor:

```python
torch.Size([])
```

This scalar is the objective used by backpropagation.

### Loss as a Training Objective

A neural network defines a function

$$
f_\theta(x)
$$

where \(\theta\) denotes all trainable parameters. For input \(x_i\), the prediction is

$$
\hat{y}_i = f_\theta(x_i).
$$

A loss function compares \(\hat{y}_i\) with the target \(y_i\):

$$
\ell(\hat{y}_i, y_i).
$$

For a dataset of \(N\) examples, the training objective is usually the average loss:

$$
L(\theta) =
\frac{1}{N}
\sum_{i=1}^{N}
\ell(f_\theta(x_i), y_i).
$$

Training tries to find parameters with low loss:

$$
\theta^\star =
\arg\min_{\theta} L(\theta).
$$

In minibatch training, we approximate the full objective using a batch:

$$
L_B(\theta) =
\frac{1}{B}
\sum_{i=1}^{B}
\ell(f_\theta(x_i), y_i).
$$

This batch loss is the scalar passed to `backward()`.
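
In code, one training step looks roughly like the following. This is a minimal sketch: the model, optimizer, and data are placeholders, and MSE is an arbitrary choice of loss.

```python
import torch
from torch import nn

# Assumed setup: a tiny regression model and a plain SGD optimizer
model = nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

# One minibatch of B = 32 examples
inputs = torch.randn(32, 8)
targets = torch.randn(32, 1)

optimizer.zero_grad()
prediction = model(inputs)           # f_theta(x)
loss = loss_fn(prediction, targets)  # scalar L_B(theta)
loss.backward()                      # gradients of the batch loss
optimizer.step()                     # update theta
```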

### Mean Squared Error

Mean squared error is used for regression. It compares real-valued predictions with real-valued targets:

$$
L =
\frac{1}{N}
\sum_{i=1}^{N}
(\hat{y}_i-y_i)^2.
$$

In PyTorch:

```python
import torch
from torch import nn

prediction = torch.tensor([2.5, 0.0, 4.0])
target = torch.tensor([3.0, -1.0, 2.0])

loss_fn = nn.MSELoss()
loss = loss_fn(prediction, target)

print(loss)
```

MSE penalizes large errors strongly because the error is squared. This makes it useful when large deviations are especially undesirable. It can also make the model sensitive to outliers.

For vector-valued outputs:

```python
prediction = torch.randn(32, 10)
target = torch.randn(32, 10)

loss = nn.MSELoss()(prediction, target)
```

By default, PyTorch averages over all elements.
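
We can confirm this with a manual computation; the mean runs over all 32 × 10 = 320 elements:

```python
# Manual MSE over every element, matching the default reduction
manual = ((prediction - target) ** 2).mean()
print(torch.allclose(loss, manual))  # True
```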

### Mean Absolute Error

Mean absolute error uses absolute differences:

$$
L =
\frac{1}{N}
\sum_{i=1}^{N}
|\hat{y}_i-y_i|.
$$

In PyTorch, this is `nn.L1Loss`:

```python
loss_fn = nn.L1Loss()
loss = loss_fn(prediction, target)
```

MAE is less sensitive to outliers than MSE. The penalty grows linearly rather than quadratically.

However, the absolute value has a kink at zero, where it is not differentiable. This rarely prevents training in practice, but it gives a less smooth objective than MSE.
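
A small comparison makes the difference in outlier sensitivity concrete (illustrative numbers only):

```python
prediction = torch.tensor([1.0, 2.0, 3.0, 10.0])  # last value is an outlier
target = torch.zeros(4)

print(nn.MSELoss()(prediction, target))  # tensor(28.5000), dominated by the outlier
print(nn.L1Loss()(prediction, target))   # tensor(4.)
```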

### Huber Loss

Huber loss combines properties of MSE and MAE. Small errors are penalized quadratically. Large errors are penalized linearly.

For error \(e = \hat{y}-y\), Huber loss is

$$
\ell(e) =
\begin{cases}
\frac{1}{2}e^2, & |e|\le \delta, \\
\delta(|e|-\frac{1}{2}\delta), & |e|>\delta.
\end{cases}
$$

In PyTorch:

```python
loss_fn = nn.HuberLoss(delta=1.0)
loss = loss_fn(prediction, target)
```

Huber loss is useful when regression targets contain occasional large outliers.
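
A quick check against the piecewise formula with \(\delta = 1\):

```python
prediction = torch.tensor([0.3, 4.0])
target = torch.zeros(2)

huber = nn.HuberLoss(delta=1.0, reduction="none")
print(huber(prediction, target))
# tensor([0.0450, 3.5000])
# small error: 0.5 * 0.3^2        = 0.045
# large error: 1.0 * (4.0 - 0.5)  = 3.5
```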

### Binary Cross-Entropy

Binary cross-entropy is used for binary classification. The target is

$$
y \in \{0,1\}.
$$

The model predicts a probability

$$
\hat{p} \in (0,1).
$$

The loss is

$$
\ell(\hat{p},y) =
-\left[
y\log(\hat{p})
+
(1-y)\log(1-\hat{p})
\right].
$$

In PyTorch, the numerically stable form is `nn.BCEWithLogitsLoss`. It expects logits, not probabilities.

```python
logits = torch.randn(32)
targets = torch.randint(0, 2, (32,)).float()

loss_fn = nn.BCEWithLogitsLoss()
loss = loss_fn(logits, targets)
```
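
To connect the module back to the formula, we can compare it with a hand-rolled version. The manual form agrees for moderate logits, but the explicit `log(sigmoid(...))` underflows at extreme values; avoiding that is exactly what `BCEWithLogitsLoss` is for.

```python
# Manual binary cross-entropy; fine for moderate logits only
probs = torch.sigmoid(logits)
manual = -(targets * torch.log(probs)
           + (1 - targets) * torch.log(1 - probs)).mean()

print(torch.allclose(loss, manual))  # True for moderate logits
```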

A logit is the raw output before sigmoid. During inference, we convert logits to probabilities:

```python
probs = torch.sigmoid(logits)
preds = (probs >= 0.5).long()
```

For multi-label classification, use one binary cross-entropy loss per label:

```python
logits = torch.randn(32, 10)
targets = torch.randint(0, 2, (32, 10)).float()

loss = nn.BCEWithLogitsLoss()(logits, targets)
```

Each class is treated independently.
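
At inference time, the independence carries over: each label gets its own sigmoid and its own threshold.

```python
probs = torch.sigmoid(logits)   # shape [32, 10]
preds = (probs >= 0.5).long()   # one independent decision per label
```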

### Cross-Entropy for Multi-Class Classification

Multi-class classification assumes that each example belongs to exactly one class. If there are \(K\) classes, the target is an integer:

$$
y \in \{0,1,\ldots,K-1\}.
$$

The model outputs logits:

$$
z \in \mathbb{R}^{K}.
$$

The softmax function converts logits into probabilities:

$$
p_k =
\frac{\exp(z_k)}
{\sum_{j=1}^{K}\exp(z_j)}.
$$

Cross-entropy loss is

$$
\ell(z,y) =
-\log p_y.
$$

In PyTorch, `nn.CrossEntropyLoss` expects raw logits and integer class labels:

```python
logits = torch.randn(32, 10)
targets = torch.randint(0, 10, (32,))

loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits, targets)
```
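
The result matches the formula \(-\log p_y\); we can verify this by indexing the log-probabilities directly:

```python
# Pick out -log p_y for each example, then average
log_probs = torch.log_softmax(logits, dim=-1)
manual = -log_probs[torch.arange(32), targets].mean()

print(torch.allclose(loss, manual))  # True
```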

Do not apply softmax before `CrossEntropyLoss`:

```python
# Avoid this during training: CrossEntropyLoss applies
# log-softmax itself, so this softmaxes twice
probs = torch.softmax(logits, dim=-1)
loss = nn.CrossEntropyLoss()(probs, targets)
```

`CrossEntropyLoss` already applies a stable log-softmax internally.

### Negative Log-Likelihood

Negative log-likelihood is closely related to cross-entropy. It expects log-probabilities as input:

```python
log_probs = torch.log_softmax(logits, dim=-1)
loss = nn.NLLLoss()(log_probs, targets)
```

The following two forms compute the same value:

```python
loss_ce = nn.CrossEntropyLoss()(logits, targets)

log_probs = torch.log_softmax(logits, dim=-1)
loss_nll = nn.NLLLoss()(log_probs, targets)

print(torch.allclose(loss_ce, loss_nll))  # True
```

`CrossEntropyLoss` is usually preferred because it fuses both steps into one call: the code is simpler, and there is no risk of forgetting the log-softmax.

### Loss Reduction

Most PyTorch loss functions support a `reduction` argument.

```python
nn.MSELoss(reduction="mean")
nn.MSELoss(reduction="sum")
nn.MSELoss(reduction="none")
```

The default is usually `"mean"`.

With `reduction="none"`, the loss is returned per element or per example:

```python
loss_fn = nn.CrossEntropyLoss(reduction="none")

logits = torch.randn(4, 3)
targets = torch.tensor([0, 2, 1, 2])

losses = loss_fn(logits, targets)
print(losses.shape)  # torch.Size([4])
```

This is useful for weighted training, masking, curriculum learning, or debugging.

For example:

```python
weights = torch.tensor([1.0, 0.5, 2.0, 1.0])

loss = (losses * weights).mean()
```
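
Masking works the same way; zeroed entries drop examples from the average:

```python
mask = torch.tensor([1.0, 1.0, 0.0, 1.0])  # zero drops the third example

loss = (losses * mask).sum() / mask.sum()
```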

### Class Imbalance

In classification, some classes may appear much more often than others. A model trained with ordinary cross-entropy may favor frequent classes.

PyTorch allows class weights:

```python
class_weights = torch.tensor([1.0, 3.0, 8.0])

loss_fn = nn.CrossEntropyLoss(weight=class_weights)
```
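
The weights are typically derived from label frequencies. One common heuristic (an assumption about your data, not a PyTorch built-in) is inverse frequency:

```python
# Hypothetical label counts for 3 classes
counts = torch.tensor([900.0, 300.0, 100.0])

# Inverse-frequency heuristic: weight_k = N / (K * count_k)
class_weights = counts.sum() / (len(counts) * counts)
print(class_weights)  # tensor([0.4815, 1.4444, 4.3333])

loss_fn = nn.CrossEntropyLoss(weight=class_weights)
```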

For binary classification, `BCEWithLogitsLoss` supports `pos_weight`:

```python
pos_weight = torch.tensor([5.0])

loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
```

This increases the loss contribution from positive examples.
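
A common heuristic sets `pos_weight` to the ratio of negative to positive examples (again an assumption about your data, not a fixed rule):

```python
num_pos, num_neg = 100, 500  # hypothetical label counts

pos_weight = torch.tensor([num_neg / num_pos])  # 5.0
loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
```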

Class weighting changes the training objective. It may improve recall for rare classes, but it can also change calibration. A model trained with weights may require careful threshold selection.

### Masked Losses

Some tasks contain padding or invalid targets. Sequence models often process batches where sequences have different lengths. Shorter sequences are padded so that tensors have a rectangular shape.

Suppose token logits have shape

$$
[B,T,V],
$$

where \(B\) is batch size, \(T\) is sequence length, and \(V\) is vocabulary size. Targets have shape

$$
[B,T].
$$

Padding tokens should not contribute to the loss.

PyTorch’s `CrossEntropyLoss` supports `ignore_index`:

```python
B, T, V = 4, 8, 1000

logits = torch.randn(B, T, V)
# Draw real targets from 1..V-1 so they never collide with pad_id
targets = torch.randint(1, V, (B, T))

pad_id = 0
targets[0, 5:] = pad_id

loss_fn = nn.CrossEntropyLoss(ignore_index=pad_id)

loss = loss_fn(
    logits.reshape(B * T, V),
    targets.reshape(B * T),
)
```

The padding index is ignored when computing the average loss.
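
An equivalent approach uses `reduction="none"` with an explicit mask, which gives more control over the averaging (a sketch using the tensors above):

```python
loss_fn = nn.CrossEntropyLoss(reduction="none")
per_token = loss_fn(logits.reshape(B * T, V), targets.reshape(B * T))

# Keep only non-padding positions in the average
mask = (targets.reshape(B * T) != pad_id).float()
loss = (per_token * mask).sum() / mask.sum()
```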

### Choosing a Loss Function

The loss should match the statistical structure of the task.

| Task | Model output | Target | Common loss |
|---|---|---|---|
| Regression | Real values | Real values | `MSELoss`, `L1Loss`, `HuberLoss` |
| Binary classification | Logits | 0 or 1 floats | `BCEWithLogitsLoss` |
| Multi-class classification | Logits per class | Integer class IDs | `CrossEntropyLoss` |
| Multi-label classification | Logits per label | 0 or 1 matrix | `BCEWithLogitsLoss` |
| Language modeling | Token logits | Token IDs | `CrossEntropyLoss` |
| Segmentation | Pixel logits | Pixel class IDs | `CrossEntropyLoss` |

A common source of bugs is using the wrong target type or shape. `CrossEntropyLoss` expects integer class labels. `BCEWithLogitsLoss` expects floating-point binary targets.
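
The expected dtypes differ as well. Mixing them up either raises an error or, in recent PyTorch versions, silently computes a different loss (`CrossEntropyLoss` interprets floating-point targets as class probabilities):

```python
# CrossEntropyLoss: integer class IDs (dtype torch.long)
ce_targets = torch.randint(0, 10, (32,))

# BCEWithLogitsLoss: floating-point 0/1 targets (dtype torch.float32)
bce_targets = torch.randint(0, 2, (32,)).float()
```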

### Loss Values and Model Quality

A lower training loss means the model fits the training data better. It does not guarantee better generalization.

A model can reduce training loss while becoming worse on unseen data. This is overfitting. For this reason, training loss should be monitored together with validation loss and task metrics.

For classification, accuracy, precision, recall, F1 score, ROC-AUC, and calibration may matter more than loss alone. For regression, mean absolute error or domain-specific metrics may be easier to interpret than MSE.

The loss function is used for optimization. The evaluation metric is used for judgment. Sometimes they are the same. Often they differ.

### Summary

Loss functions convert prediction errors into scalar objectives for training. Regression commonly uses MSE, MAE, or Huber loss. Binary classification uses binary cross-entropy. Multi-class classification uses softmax cross-entropy.

In PyTorch, use logits directly with `BCEWithLogitsLoss` and `CrossEntropyLoss`. This gives better numerical stability and simpler code. The selected loss must match the task, target shape, target type, and model output.

