# Evaluation Metrics

Evaluation metrics convert model behavior into numbers. A loss function guides training. A metric reports performance. Sometimes they are the same. Often they are different.

For example, a classifier may train with cross-entropy loss, but report accuracy, precision, recall, F1 score, calibration error, and confusion matrices. The loss helps optimization. The metrics help judge whether the model is useful.

### Loss Versus Metric

A loss function is optimized during training:

$$
\theta^\ast =
\arg\min_\theta
L_{\text{train}}(\theta).
$$

A metric is used to evaluate behavior:

$$
M(f_\theta, \mathcal{D}_{\text{eval}}).
$$

A good loss should be differentiable and stable for gradient descent. A good metric should match the practical objective.

For example, accuracy is easy to understand, but it is not differentiable. Cross-entropy is differentiable and works well for training classifiers, even when final reporting uses accuracy.

| Concept | Main role | Example |
|---|---|---|
| Loss | Optimize parameters | Cross-entropy |
| Metric | Evaluate model | Accuracy |
| Training loss | Fit model | Mini-batch objective |
| Validation metric | Select model | F1 score |
| Test metric | Final report | ROC-AUC |

### Accuracy

Accuracy is the fraction of predictions that are correct:

$$
\text{Accuracy} =
\frac{\text{number of correct predictions}}
{\text{number of examples}}.
$$

For \(N\) examples:

$$
\text{Accuracy} =
\frac{1}{N}
\sum_{i=1}^{N}
\mathbf{1}[\hat{y}^{(i)} = y^{(i)}].
$$

Accuracy is useful when classes are balanced and all mistakes have similar cost.

In PyTorch:

```python
import torch

logits = torch.tensor([
    [2.1, 0.3, -1.0],
    [0.2, 1.7, 0.1],
    [0.4, 0.5, 2.2],
])

targets = torch.tensor([0, 1, 1])

preds = logits.argmax(dim=1)
accuracy = (preds == targets).float().mean()

print(preds)      # tensor([0, 1, 2])
print(accuracy)   # tensor(0.6667)
```

Accuracy can be misleading on imbalanced data. If 99% of examples belong to class 0, a model that always predicts class 0 gets 99% accuracy while failing on the rare class.

### Confusion Matrix

A confusion matrix counts predictions by true class and predicted class.

For binary classification:

| | Predicted positive | Predicted negative |
|---|---:|---:|
| Actual positive | TP | FN |
| Actual negative | FP | TN |

The four entries are:

| Term | Meaning |
|---|---|
| True positive | Positive example predicted positive |
| False positive | Negative example predicted positive |
| True negative | Negative example predicted negative |
| False negative | Positive example predicted negative |

A confusion matrix shows which mistakes the model makes. This is often more useful than a single scalar metric.

For multiclass classification with \(K\) classes, the matrix has shape \(K \times K\), where rows usually represent true labels and columns represent predicted labels.
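
As a minimal sketch, a confusion matrix can be accumulated directly from hard predictions. The `confusion_matrix` helper and the small example below are illustrative, not a library API:

```python
import torch

def confusion_matrix(preds, targets, num_classes):
    # Rows are true labels, columns are predicted labels.
    cm = torch.zeros(num_classes, num_classes, dtype=torch.long)
    for t, p in zip(targets.tolist(), preds.tolist()):
        cm[t, p] += 1
    return cm

targets = torch.tensor([0, 1, 1, 2, 2, 2])
preds   = torch.tensor([0, 1, 2, 2, 2, 0])

print(confusion_matrix(preds, targets, num_classes=3))
# tensor([[1, 0, 0],
#         [0, 1, 1],
#         [1, 0, 2]])
```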

### Precision and Recall

Precision measures how many predicted positives are truly positive:

$$
\text{Precision} =
\frac{TP}{TP + FP}.
$$

Recall measures how many true positives are found:

$$
\text{Recall} =
\frac{TP}{TP + FN}.
$$

Precision answers: when the model predicts positive, how often is it right?

Recall answers: among all real positives, how many did the model find?

These metrics matter when classes are imbalanced or when false positives and false negatives have different costs.

Examples:

| Application | More important metric |
|---|---|
| Spam filtering | Precision, to avoid hiding valid email |
| Cancer screening | Recall, to avoid missing disease |
| Fraud detection | Depends on investigation cost |
| Search retrieval | Precision at top ranks and recall |

### F1 Score

The F1 score combines precision and recall using the harmonic mean:

$$
F_1 =
2
\cdot
\frac{
\text{Precision}\cdot\text{Recall}
}{
\text{Precision}+\text{Recall}
}.
$$

F1 is high only when both precision and recall are high.

It is useful when the positive class is important and class imbalance makes accuracy weak.

For multiclass problems, F1 can be averaged in several ways:

| Average | Meaning |
|---|---|
| Micro F1 | Counts global TP, FP, FN |
| Macro F1 | Computes F1 per class, then averages |
| Weighted F1 | Averages per-class F1 weighted by class frequency |

Macro F1 gives each class equal importance. Weighted F1 gives common classes more influence. Micro F1 equals accuracy in single-label multiclass classification, because every misclassification counts once as a false positive and once as a false negative.
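
A minimal sketch of micro and macro F1 from hard predictions, using the equivalent form \(F_1 = 2\,TP / (2\,TP + FP + FN)\); the helper and test tensors are illustrative:

```python
import torch

def f1_scores(preds, targets, num_classes):
    # Per-class and global TP/FP/FN counts from hard predictions.
    per_class_f1 = []
    tp_all = fp_all = fn_all = 0
    for c in range(num_classes):
        tp = ((preds == c) & (targets == c)).sum().item()
        fp = ((preds == c) & (targets != c)).sum().item()
        fn = ((preds != c) & (targets == c)).sum().item()
        tp_all, fp_all, fn_all = tp_all + tp, fp_all + fp, fn_all + fn
        denom = 2 * tp + fp + fn
        per_class_f1.append(2 * tp / denom if denom > 0 else 0.0)
    macro_f1 = sum(per_class_f1) / num_classes
    micro_f1 = 2 * tp_all / (2 * tp_all + fp_all + fn_all)
    return micro_f1, macro_f1

preds   = torch.tensor([0, 1, 2, 2, 2, 0])
targets = torch.tensor([0, 1, 1, 2, 2, 2])

# Micro F1 matches accuracy here: 4 of 6 predictions are correct.
print(f1_scores(preds, targets, num_classes=3))  # (0.667, 0.667)
```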

### Thresholds

Many classifiers output probabilities or scores. A threshold converts scores into class predictions.

For binary classification:

$$
\hat{y} =
\begin{cases}
1 & \text{if } p \ge \tau, \\
0 & \text{if } p < \tau.
\end{cases}
$$

The default threshold is often \(\tau = 0.5\). This may be wrong for imbalanced tasks or asymmetric costs.

Lowering the threshold increases recall and may reduce precision. Raising the threshold increases precision and may reduce recall.

Threshold selection should usually be done on the validation set, not the test set.

```python
import torch

probs = torch.tensor([0.95, 0.71, 0.52, 0.41, 0.08])
threshold = 0.7

preds = (probs >= threshold).long()
print(preds)  # tensor([1, 1, 0, 0, 0])
```
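
The tradeoff is easy to see by sweeping a few thresholds over hypothetical probabilities and labels; this sketch reuses the counting logic from the precision and recall definitions above:

```python
import torch

probs   = torch.tensor([0.95, 0.71, 0.52, 0.41, 0.08])
targets = torch.tensor([1, 1, 1, 0, 0])  # assumed labels for illustration

for threshold in [0.3, 0.5, 0.7]:
    preds = (probs >= threshold).long()
    tp = ((preds == 1) & (targets == 1)).sum().item()
    fp = ((preds == 1) & (targets == 0)).sum().item()
    fn = ((preds == 0) & (targets == 1)).sum().item()
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    print(f"tau={threshold:.1f}  precision={precision:.2f}  recall={recall:.2f}")

# tau=0.3  precision=0.75  recall=1.00
# tau=0.5  precision=1.00  recall=1.00
# tau=0.7  precision=1.00  recall=0.67
```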

### ROC-AUC

The receiver operating characteristic (ROC) curve plots the true positive rate against the false positive rate across thresholds.

The true positive rate is recall:

$$
TPR = \frac{TP}{TP+FN}.
$$

The false positive rate is

$$
FPR = \frac{FP}{FP+TN}.
$$

ROC-AUC is the area under this curve.

A ROC-AUC of 0.5 corresponds to random ranking. A ROC-AUC of 1.0 corresponds to perfect ranking.

ROC-AUC is threshold-independent, but it can look overly optimistic for highly imbalanced datasets. In rare-positive settings, precision-recall curves are often more informative.
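
ROC-AUC also has a ranking interpretation: it is the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as half. A minimal pairwise sketch of that interpretation (quadratic in the number of examples, so suitable only for small evaluation sets):

```python
import torch

def roc_auc(scores, targets):
    # Compare every positive score against every negative score.
    pos = scores[targets == 1]
    neg = scores[targets == 0]
    diff = pos.unsqueeze(1) - neg.unsqueeze(0)  # all positive/negative pairs
    wins = (diff > 0).float() + 0.5 * (diff == 0).float()
    return wins.mean().item()

scores  = torch.tensor([0.9, 0.8, 0.3, 0.2])
targets = torch.tensor([1, 0, 1, 0])

print(roc_auc(scores, targets))  # 0.75: 3 of 4 pairs ranked correctly
```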

### Precision-Recall AUC

A precision-recall curve plots precision against recall across thresholds.

PR-AUC is especially useful when the positive class is rare.

For example, in fraud detection, fraudulent transactions may be less than 1% of the dataset. A classifier can have strong ROC-AUC while still producing too many false positives. PR-AUC focuses directly on positive-class retrieval quality.

Use PR-AUC when the main question is: can the model find rare positives without flooding the system with false alarms?
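
A common estimator of the area under the precision-recall curve is average precision: the mean of precision@k taken at each rank where a true positive occurs. A minimal sketch with hypothetical scores and labels:

```python
import torch

def average_precision(scores, targets):
    # Sort by score descending, then average precision@k
    # over the ranks where a true positive appears.
    order = scores.argsort(descending=True)
    hits = targets[order].float()
    cum_tp = hits.cumsum(dim=0)
    ranks = torch.arange(1, len(hits) + 1, dtype=torch.float)
    precision_at_k = cum_tp / ranks
    return (precision_at_k * hits).sum().item() / hits.sum().item()

scores  = torch.tensor([0.9, 0.8, 0.7, 0.6])
targets = torch.tensor([1, 0, 1, 0])

print(average_precision(scores, targets))  # (1/1 + 2/3) / 2 = 0.833
```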

### Regression Metrics

Regression predicts continuous values. Common metrics include mean absolute error, mean squared error, root mean squared error, and \(R^2\).

Mean absolute error:

$$
MAE =
\frac{1}{N}
\sum_{i=1}^{N}
|y^{(i)}-\hat{y}^{(i)}|.
$$

Mean squared error:

$$
MSE =
\frac{1}{N}
\sum_{i=1}^{N}
(y^{(i)}-\hat{y}^{(i)})^2.
$$

Root mean squared error:

$$
RMSE = \sqrt{MSE}.
$$

The \(R^2\) score measures the fraction of target variance explained by the model:

$$
R^2 =
1 -
\frac{
\sum_i (y^{(i)}-\hat{y}^{(i)})^2
}{
\sum_i (y^{(i)}-\bar{y})^2
}.
$$

| Metric | Useful when | Sensitivity |
|---|---|---|
| MAE | Want error in original units | Robust to outliers |
| MSE | Large errors should hurt more | Sensitive to outliers |
| RMSE | Want original units with squared penalty | Sensitive to outliers |
| \(R^2\) | Need relative explanatory power | Can be negative |
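
A minimal sketch computing all four metrics on hypothetical data; `regression_metrics` is an illustrative helper, not a library function:

```python
import torch

def regression_metrics(y_true, y_pred):
    err = y_true - y_pred
    mae = err.abs().mean()
    mse = (err ** 2).mean()
    rmse = mse.sqrt()
    # R^2: one minus residual sum of squares over total sum of squares.
    ss_res = (err ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    r2 = 1 - ss_res / ss_tot
    return mae.item(), mse.item(), rmse.item(), r2.item()

y_true = torch.tensor([3.0, 5.0, 7.0, 9.0])
y_pred = torch.tensor([2.5, 5.0, 7.5, 9.0])

print(regression_metrics(y_true, y_pred))
# (0.25, 0.125, 0.354, 0.975)
```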

### Calibration

A model is calibrated when its predicted probabilities match empirical frequencies.

If a calibrated classifier predicts probability 0.8 for many examples, then about 80% of those examples should be correct.

Accuracy and calibration are different. A classifier can be accurate but overconfident.

For example, if a model is correct 90% of the time but assigns 99% confidence to its predictions, it is overconfident.

Calibration matters in medical systems, risk scoring, decision support, and any setting where probabilities guide actions.

Common calibration metrics include expected calibration error, reliability diagrams, and Brier score.

The Brier score for binary classification is

$$
\text{Brier} =
\frac{1}{N}
\sum_{i=1}^{N}
(p^{(i)} - y^{(i)})^2.
$$

### Ranking Metrics

Some systems return ranked lists rather than one prediction.

Examples include search engines, recommender systems, retrieval systems, and question answering systems.

Common ranking metrics:

| Metric | Meaning |
|---|---|
| Precision@k | Fraction of top \(k\) results that are relevant |
| Recall@k | Fraction of all relevant items found in top \(k\) |
| Mean reciprocal rank | Average inverse rank of first relevant result |
| Mean average precision | Average precision across recall levels |
| NDCG | Rewards relevant items near the top |

Precision@k is simple:

$$
\text{Precision@k} =
\frac{
\text{number of relevant results in top } k
}{k}.
$$

Ranking metrics are essential for retrieval-augmented generation, recommendation, search, and embedding models.

### Language Model Metrics

Language models are commonly evaluated with negative log-likelihood, cross-entropy, and perplexity.

For a token sequence \(x_1,\dots,x_T\), the average negative log-likelihood is

$$
L =
-\frac{1}{T}
\sum_{t=1}^{T}
\log p_\theta(x_t \mid x_{<t}).
$$

Perplexity is

$$
\text{PPL} = e^L.
$$

Lower perplexity means the model assigns higher probability to the observed text.
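
In PyTorch, perplexity follows directly from the mean cross-entropy, since `F.cross_entropy` returns the average negative log-likelihood in nats. The logits and targets below are random placeholders:

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for a 4-token sequence over a 5-token vocabulary.
logits  = torch.randn(4, 5)
targets = torch.randint(0, 5, (4,))

nll = F.cross_entropy(logits, targets)  # average NLL per token, in nats
ppl = nll.exp()                         # perplexity = e^L
print(ppl)
```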

Perplexity is useful but incomplete. A model with lower perplexity may still produce worse answers for instruction following, reasoning, factuality, or safety. Modern language models also require task benchmarks, human preference evaluation, retrieval metrics, and safety evaluation.

### Segmentation Metrics

Image segmentation predicts a label for each pixel.

Common metrics include pixel accuracy, intersection over union, and Dice coefficient.

For one class, intersection over union is

$$
IoU =
\frac{|P \cap Y|}
{|P \cup Y|},
$$

where \(P\) is the predicted region and \(Y\) is the true region.

The Dice coefficient is

$$
Dice =
\frac{2|P \cap Y|}
{|P| + |Y|}.
$$

IoU penalizes both false positives and false negatives. Dice is widely used in medical image segmentation because it works well for small foreground regions.
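
A minimal sketch of both metrics for a single class using boolean masks; the tiny masks are illustrative, and the empty-vs-empty convention is one reasonable choice:

```python
import torch

def iou_dice(pred_mask, true_mask):
    # Boolean masks for a single class; empty-vs-empty treated as perfect.
    inter = (pred_mask & true_mask).sum().item()
    union = (pred_mask | true_mask).sum().item()
    total = pred_mask.sum().item() + true_mask.sum().item()
    iou = inter / union if union > 0 else 1.0
    dice = 2 * inter / total if total > 0 else 1.0
    return iou, dice

pred = torch.tensor([[1, 1, 0], [0, 1, 0]], dtype=torch.bool)
true = torch.tensor([[1, 0, 0], [0, 1, 1]], dtype=torch.bool)

print(iou_dice(pred, true))  # inter=2, union=4 -> IoU 0.5, Dice 0.667
```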

### Choosing the Right Metric

Metric choice should follow the real cost structure of the problem.

Ask:

| Question | Metric implication |
|---|---|
| Are classes imbalanced? | Use precision, recall, F1, PR-AUC |
| Are false negatives expensive? | Emphasize recall |
| Are false positives expensive? | Emphasize precision |
| Are probabilities used for decisions? | Check calibration |
| Is output ranked? | Use ranking metrics |
| Is target continuous? | Use MAE, RMSE, \(R^2\) |
| Is deployment distribution different? | Evaluate by slice and time |

A single metric rarely gives a complete picture. Practical evaluation usually combines one primary metric with several diagnostic metrics.

### Metric Computation in PyTorch

A simple classification metric function:

```python
import torch

def accuracy_from_logits(logits, targets):
    preds = logits.argmax(dim=-1)
    return (preds == targets).float().mean().item()
```

For binary precision and recall:

```python
def precision_recall_from_probs(probs, targets, threshold=0.5):
    preds = (probs >= threshold).long()
    targets = targets.long()

    tp = ((preds == 1) & (targets == 1)).sum().item()
    fp = ((preds == 1) & (targets == 0)).sum().item()
    fn = ((preds == 0) & (targets == 1)).sum().item()

    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0

    return precision, recall
```

For larger projects, metrics should be accumulated across the entire validation or test set, not computed as a mean of per-batch values, which biases the result when batches differ in size. A minimal accumulation pattern is sketched below.
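
This sketch assumes a hypothetical `model` and `eval_loader`; only integer counts are accumulated, so the final division is exact regardless of batch sizes:

```python
import torch

correct = 0
total = 0

model.eval()
with torch.no_grad():
    for inputs, targets in eval_loader:
        preds = model(inputs).argmax(dim=-1)
        correct += (preds == targets).sum().item()  # accumulate raw counts
        total += targets.numel()

# Divide once at the end; a mean of per-batch accuracies would
# over-weight a smaller final batch.
accuracy = correct / total
```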

### Evaluation by Slice

Aggregate metrics can hide failures.

A model may have good overall accuracy but poor performance on specific groups, languages, devices, regions, classes, or time periods.

Slice evaluation computes metrics on meaningful subsets:

| Slice | Example |
|---|---|
| Class | Accuracy per category |
| Difficulty | Short versus long sequences |
| Source | Different websites or sensors |
| Time | Recent versus old data |
| User group | New users versus returning users |
| Input length | Short prompts versus long prompts |

Slice evaluation often reveals the real weaknesses of a system.
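
A minimal sketch of per-slice accuracy, where `slice_ids` is a hypothetical tensor assigning each example to a group:

```python
import torch

def accuracy_per_slice(preds, targets, slice_ids):
    # slice_ids maps each example to a group (class, source, cohort, ...).
    results = {}
    for s in slice_ids.unique().tolist():
        mask = slice_ids == s
        results[s] = (preds[mask] == targets[mask]).float().mean().item()
    return results

preds     = torch.tensor([0, 1, 1, 0, 1, 1])
targets   = torch.tensor([0, 1, 0, 0, 1, 1])
slice_ids = torch.tensor([0, 0, 0, 1, 1, 1])

print(accuracy_per_slice(preds, targets, slice_ids))
# {0: 0.667, 1: 1.0}
```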

### Summary

Evaluation metrics measure model performance. Loss functions optimize models, while metrics judge whether models solve the intended task.

Accuracy is useful for balanced classification. Precision, recall, F1, ROC-AUC, and PR-AUC are better for imbalanced or asymmetric problems. Regression uses MAE, MSE, RMSE, and \(R^2\). Ranking systems need ranking metrics. Language models require likelihood metrics plus task and human-centered evaluations.

Metric selection is part of problem design. A poor metric can reward the wrong behavior even when the model trains correctly.

