Evaluation Metrics

Evaluation metrics convert model behavior into numbers. A loss function guides training. A metric reports performance. Sometimes they are the same. Often they are different.

For example, a classifier may train with cross-entropy loss, but report accuracy, precision, recall, F1 score, calibration error, and confusion matrices. The loss helps optimization. The metrics help judge whether the model is useful.

Loss Versus Metric

A loss function is optimized during training:

\theta^\ast = \arg\min_\theta L_{\text{train}}(\theta).

A metric is used to evaluate behavior:

M(f_\theta, \mathcal{D}_{\text{eval}}).

A good loss should be differentiable and stable for gradient descent. A good metric should match the practical objective.

For example, accuracy is easy to understand, but it is not differentiable. Cross-entropy is differentiable and works well for training classifiers, even when final reporting uses accuracy.

Concept | Main role | Example
Loss | Optimize parameters | Cross-entropy
Metric | Evaluate model | Accuracy
Training loss | Fit model | Mini-batch objective
Validation metric | Select model | F1 score
Test metric | Final report | ROC-AUC

Accuracy

Accuracy is the fraction of predictions that are correct:

\text{Accuracy} = \frac{\text{number of correct predictions}}{\text{number of examples}}.

For N examples:

\text{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}[\hat{y}^{(i)} = y^{(i)}].

Accuracy is useful when classes are balanced and all mistakes have similar cost.

In PyTorch:

import torch

logits = torch.tensor([
    [2.1, 0.3, -1.0],
    [0.2, 1.7, 0.1],
    [0.4, 0.5, 2.2],
])

targets = torch.tensor([0, 1, 1])

preds = logits.argmax(dim=1)
accuracy = (preds == targets).float().mean()

print(preds)      # tensor([0, 1, 2])
print(accuracy)   # tensor(0.6667)

Accuracy can be misleading on imbalanced data. If 99% of examples belong to class 0, a model that always predicts class 0 gets 99% accuracy while failing on the rare class.

Confusion Matrix

A confusion matrix counts predictions by true class and predicted class.

For binary classification:

                | Predicted positive | Predicted negative
Actual positive | TP | FN
Actual negative | FP | TN

The four entries are:

Term | Meaning
True positive | Positive example predicted positive
False positive | Negative example predicted positive
True negative | Negative example predicted negative
False negative | Positive example predicted negative

A confusion matrix shows which mistakes the model makes. This is often more useful than a single scalar metric.

For multiclass classification, the matrix has shape

K \times K,

where rows usually represent true labels and columns represent predicted labels.
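
As a minimal sketch, a multiclass confusion matrix can be built directly from integer predictions and targets; the example below assumes both tensors hold class indices in the range 0 to K - 1.

import torch

def confusion_matrix(preds, targets, num_classes):
    # rows are true labels, columns are predicted labels
    indices = targets * num_classes + preds
    counts = torch.bincount(indices, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

preds = torch.tensor([0, 1, 2, 1, 0])
targets = torch.tensor([0, 1, 1, 1, 2])
print(confusion_matrix(preds, targets, num_classes=3))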

Precision and Recall

Precision measures how many predicted positives are truly positive:

\text{Precision} = \frac{TP}{TP + FP}.

Recall measures how many true positives are found:

\text{Recall} = \frac{TP}{TP + FN}.

Precision answers: when the model predicts positive, how often is it right?

Recall answers: among all real positives, how many did the model find?

These metrics matter when classes are imbalanced or when false positives and false negatives have different costs.

Examples:

Application | More important metric
Spam filtering | Precision, to avoid hiding valid email
Cancer screening | Recall, to avoid missing disease
Fraud detection | Depends on investigation cost
Search retrieval | Precision at top ranks and recall

F1 Score

The F1 score combines precision and recall using the harmonic mean:

F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.

F1 is high only when both precision and recall are high.

It is useful when the positive class is important and class imbalance makes accuracy weak.

For multiclass problems, F1 can be averaged in several ways:

Average | Meaning
Micro F1 | Counts global TP, FP, FN
Macro F1 | Computes F1 per class, then averages
Weighted F1 | Averages per-class F1 weighted by class frequency

Macro F1 gives each class equal importance. Weighted F1 gives common classes more influence. Micro F1 behaves similarly to accuracy in single-label multiclass classification.
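
A minimal sketch of macro and micro F1 from class predictions, assuming single-label targets with classes indexed 0 to num_classes - 1:

import torch

def f1_macro_micro(preds, targets, num_classes):
    per_class_f1 = []
    tp_total, fp_total, fn_total = 0, 0, 0
    for c in range(num_classes):
        tp = ((preds == c) & (targets == c)).sum().item()
        fp = ((preds == c) & (targets != c)).sum().item()
        fn = ((preds != c) & (targets == c)).sum().item()
        tp_total, fp_total, fn_total = tp_total + tp, fp_total + fp, fn_total + fn
        # F1 = 2 * P * R / (P + R) simplifies to 2 * TP / (2 * TP + FP + FN)
        denom = 2 * tp + fp + fn
        per_class_f1.append(2 * tp / denom if denom > 0 else 0.0)
    macro_f1 = sum(per_class_f1) / num_classes
    # with single-label predictions, micro F1 equals accuracy
    micro_f1 = 2 * tp_total / (2 * tp_total + fp_total + fn_total)
    return macro_f1, micro_f1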

Thresholds

Many classifiers output probabilities or scores. A threshold converts scores into class predictions.

For binary classification:

\hat{y} = \begin{cases} 1 & \text{if } p \ge \tau, \\ 0 & \text{if } p < \tau. \end{cases}

The default threshold is often \tau = 0.5. This may be wrong for imbalanced tasks or asymmetric costs.

Lowering the threshold increases recall and may reduce precision. Raising the threshold increases precision and may reduce recall.

Threshold selection should usually be done on the validation set, not the test set.

probs = torch.tensor([0.95, 0.71, 0.52, 0.41, 0.08])
threshold = 0.7

preds = (probs >= threshold).long()
print(preds)  # tensor([1, 1, 0, 0, 0])

ROC-AUC

The receiver operating characteristic curve plots true positive rate against false positive rate across thresholds.

The true positive rate is recall:

TPR = \frac{TP}{TP + FN}.

The false positive rate is

FPR = \frac{FP}{FP + TN}.

ROC-AUC is the area under this curve.

A ROC-AUC of 0.5 corresponds to random ranking. A ROC-AUC of 1.0 corresponds to perfect ranking.

ROC-AUC is threshold-independent, but it can look overly optimistic for highly imbalanced datasets. In rare-positive settings, precision-recall curves are often more informative.
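
As a sketch, ROC-AUC can be computed from its ranking interpretation: the probability that a randomly chosen positive example scores higher than a randomly chosen negative example, with ties counting half. The pairwise version below compares every positive-negative pair, so it only suits small evaluation sets.

import torch

def roc_auc(scores, targets):
    pos = scores[targets == 1]
    neg = scores[targets == 0]
    # all positive-negative score differences
    diff = pos.unsqueeze(1) - neg.unsqueeze(0)
    return ((diff > 0).float() + 0.5 * (diff == 0).float()).mean().item()

scores = torch.tensor([0.9, 0.8, 0.35, 0.3, 0.1])
targets = torch.tensor([1, 1, 0, 1, 0])
print(roc_auc(scores, targets))  # 5 of 6 pairs ranked correctly: 0.8333...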

Precision-Recall AUC

A precision-recall curve plots precision against recall across thresholds.

PR-AUC is especially useful when the positive class is rare.

For example, in fraud detection, fraudulent transactions may be less than 1% of the dataset. A classifier can have strong ROC-AUC while still producing too many false positives. PR-AUC focuses directly on positive-class retrieval quality.

Use PR-AUC when the main question is: can the model find rare positives without flooding the system with false alarms?
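
PR-AUC is often reported as average precision. A minimal sketch, assuming scores holds one score per example, targets marks positives with 1, and ties are broken arbitrarily by the sort:

import torch

def average_precision(scores, targets):
    # sort examples by score, highest first
    order = torch.argsort(scores, descending=True)
    hits = targets[order].float()
    # precision after each rank in the sorted list
    precision_at_rank = torch.cumsum(hits, dim=0) / torch.arange(1, len(hits) + 1)
    # mean precision at the ranks where positives appear
    return (precision_at_rank * hits).sum().item() / hits.sum().item()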

Regression Metrics

Regression predicts continuous values. Common metrics include mean absolute error, mean squared error, root mean squared error, and R^2.

Mean absolute error:

MAE = \frac{1}{N} \sum_{i=1}^{N} |y^{(i)} - \hat{y}^{(i)}|.

Mean squared error:

MSE = \frac{1}{N} \sum_{i=1}^{N} (y^{(i)} - \hat{y}^{(i)})^2.

Root mean squared error:

RMSE = \sqrt{MSE}.

The R^2 score measures the fraction of target variance explained by the model:

R^2 = 1 - \frac{\sum_i (y^{(i)} - \hat{y}^{(i)})^2}{\sum_i (y^{(i)} - \bar{y})^2}.

Metric | Useful when | Sensitivity
MAE | Want error in original units | Robust to outliers
MSE | Large errors should hurt more | Sensitive to outliers
RMSE | Want original units with squared penalty | Sensitive to outliers
R^2 | Need relative explanatory power | Can be negative
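
A minimal sketch of these four regression metrics in PyTorch, assuming preds and targets are 1-D float tensors:

import torch

def regression_metrics(preds, targets):
    errors = preds - targets
    mae = errors.abs().mean()
    mse = (errors ** 2).mean()
    rmse = mse.sqrt()
    # R^2: one minus the ratio of residual sum of squares to total sum of squares
    ss_res = (errors ** 2).sum()
    ss_tot = ((targets - targets.mean()) ** 2).sum()
    r2 = 1.0 - ss_res / ss_tot
    return mae.item(), mse.item(), rmse.item(), r2.item()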

Calibration

A model is calibrated when its predicted probabilities match empirical frequencies.

If a calibrated classifier predicts probability 0.8 for many examples, then about 80% of those examples should be correct.

Accuracy and calibration are different. A classifier can be accurate but overconfident.

For example, if a model is correct 90% of the time but assigns 99% confidence to its predictions, it is overconfident.

Calibration matters in medical systems, risk scoring, decision support, and any setting where probabilities guide actions.

Common calibration metrics include expected calibration error, reliability diagrams, and Brier score.

The Brier score for binary classification is

\text{Brier} = \frac{1}{N} \sum_{i=1}^{N} (p^{(i)} - y^{(i)})^2.
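
A sketch of the Brier score and one common binned form of expected calibration error for binary classifiers, assuming probs holds predicted probabilities of the positive class and targets holds 0/1 labels:

import torch

def brier_score(probs, targets):
    return ((probs - targets.float()) ** 2).mean().item()

def expected_calibration_error(probs, targets, n_bins=10):
    # assign each example to a confidence bin, then compare the mean predicted
    # probability in the bin to the empirical positive rate in the bin
    bin_ids = torch.clamp((probs * n_bins).long(), max=n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.sum() == 0:
            continue
        confidence = probs[mask].mean()
        frequency = targets[mask].float().mean()
        # weight each bin by the fraction of examples it contains
        ece += (mask.float().mean() * (confidence - frequency).abs()).item()
    return ece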

Ranking Metrics

Some systems return ranked lists rather than one prediction.

Examples include search engines, recommender systems, retrieval systems, and question answering systems.

Common ranking metrics:

Metric | Meaning
Precision@k | Fraction of the top k results that are relevant
Recall@k | Fraction of all relevant items found in the top k
Mean reciprocal rank | Average inverse rank of the first relevant result
Mean average precision | Average precision across recall levels
NDCG | Rewards relevant items near the top

Precision@k is simple:

\text{Precision@}k = \frac{\text{number of relevant results in top } k}{k}.

Ranking metrics are essential for retrieval-augmented generation, recommendation, search, and embedding models.
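
A minimal single-query sketch of Precision@k and Recall@k, assuming scores ranks candidate items and relevant marks relevant items with 1:

import torch

def precision_recall_at_k(scores, relevant, k):
    top_k = torch.topk(scores, k).indices    # indices of the k highest-scoring items
    hits = relevant[top_k].sum().item()      # relevant items retrieved in the top k
    return hits / k, hits / relevant.sum().item()

scores = torch.tensor([0.9, 0.1, 0.8, 0.7, 0.3])
relevant = torch.tensor([1, 1, 0, 1, 0])
print(precision_recall_at_k(scores, relevant, k=3))  # (0.6667, 0.6667)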

Language Model Metrics

Language models are commonly evaluated with negative log-likelihood, cross-entropy, and perplexity.

For a token sequence x_1, \dots, x_T, the average negative log-likelihood is

L = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}).

Perplexity is

\text{PPL} = e^L.

Lower perplexity means the model assigns higher probability to the observed text.
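
A minimal sketch of perplexity from token-level cross-entropy, assuming the model's next-token logits are already aligned with the target tokens; the shapes below are placeholders.

import torch
import torch.nn.functional as F

# hypothetical shapes: (sequence_length, vocab_size) logits, (sequence_length,) targets
logits = torch.randn(12, 100)
targets = torch.randint(0, 100, (12,))

nll = F.cross_entropy(logits, targets)  # average negative log-likelihood per token
perplexity = torch.exp(nll)
print(perplexity.item())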

Perplexity is useful but incomplete. A model with lower perplexity may still produce worse answers for instruction following, reasoning, factuality, or safety. Modern language models also require task benchmarks, human preference evaluation, retrieval metrics, and safety evaluation.

Segmentation Metrics

Image segmentation predicts a label for each pixel.

Common metrics include pixel accuracy, intersection over union, and Dice coefficient.

For one class, intersection over union is

IoU = \frac{|P \cap Y|}{|P \cup Y|},

where P is the predicted region and Y is the true region.

The Dice coefficient is

Dice = \frac{2\,|P \cap Y|}{|P| + |Y|}.

IoU penalizes both false positives and false negatives. Dice is widely used in medical image segmentation because it works well for small foreground regions.
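
A minimal sketch of IoU and Dice for one class, assuming boolean prediction and ground-truth masks of the same shape; a small epsilon guards against division by zero for empty masks.

import torch

def iou_and_dice(pred_mask, true_mask, eps=1e-7):
    pred = pred_mask.bool()
    true = true_mask.bool()
    intersection = (pred & true).sum().float()
    union = (pred | true).sum().float()
    iou = intersection / (union + eps)
    dice = 2 * intersection / (pred.sum() + true.sum() + eps)
    return iou.item(), dice.item()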

Choosing the Right Metric

Metric choice should follow the real cost structure of the problem.

Ask:

Question | Metric implication
Are classes imbalanced? | Use precision, recall, F1, PR-AUC
Are false negatives expensive? | Emphasize recall
Are false positives expensive? | Emphasize precision
Are probabilities used for decisions? | Check calibration
Is the output ranked? | Use ranking metrics
Is the target continuous? | Use MAE, RMSE, R^2
Is the deployment distribution different? | Evaluate by slice and time

A single metric rarely gives a complete picture. Practical evaluation usually combines one primary metric with several diagnostic metrics.

Metric Computation in PyTorch

A simple classification metric function:

import torch

def accuracy_from_logits(logits, targets):
    preds = logits.argmax(dim=-1)
    return (preds == targets).float().mean().item()

For binary precision and recall:

def precision_recall_from_probs(probs, targets, threshold=0.5):
    preds = (probs >= threshold).long()
    targets = targets.long()

    tp = ((preds == 1) & (targets == 1)).sum().item()
    fp = ((preds == 1) & (targets == 0)).sum().item()
    fn = ((preds == 0) & (targets == 1)).sum().item()

    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0

    return precision, recall

For larger projects, metrics should be accumulated across the full validation or test set; averaging per-batch values gives incorrect results when batches have different sizes.
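
A sketch of that pattern for accuracy, assuming a hypothetical model and dataloader that yield (inputs, targets) batches; raw counts are accumulated over the whole set and divided once at the end.

import torch

def evaluate_accuracy(model, dataloader, device="cpu"):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, targets in dataloader:
            inputs, targets = inputs.to(device), targets.to(device)
            preds = model(inputs).argmax(dim=-1)
            correct += (preds == targets).sum().item()
            total += targets.numel()
    # dividing accumulated counts weights every example equally,
    # regardless of batch size
    return correct / total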

Evaluation by Slice

Aggregate metrics can hide failures.

A model may have good overall accuracy but poor performance on specific groups, languages, devices, regions, classes, or time periods.

Slice evaluation computes metrics on meaningful subsets:

Slice | Example
Class | Accuracy per category
Difficulty | Short versus long sequences
Source | Different websites or sensors
Time | Recent versus old data
User group | New users versus returning users
Input length | Short prompts versus long prompts

Slice evaluation often reveals the real weaknesses of a system.
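
A minimal sketch of per-slice accuracy, assuming slice_ids is a Python list with one group label per example, aligned with preds and targets:

import torch

def accuracy_by_slice(preds, targets, slice_ids):
    results = {}
    for group in sorted(set(slice_ids)):
        mask = torch.tensor([sid == group for sid in slice_ids])
        results[group] = (preds[mask] == targets[mask]).float().mean().item()
    return results

preds = torch.tensor([0, 1, 1, 0, 1])
targets = torch.tensor([0, 1, 0, 0, 0])
slices = ["mobile", "mobile", "desktop", "desktop", "desktop"]
print(accuracy_by_slice(preds, targets, slices))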

Summary

Evaluation metrics measure model performance. Loss functions optimize models, while metrics judge whether models solve the intended task.

Accuracy is useful for balanced classification. Precision, recall, F1, ROC-AUC, and PR-AUC are better for imbalanced or asymmetric problems. Regression uses MAE, MSE, RMSE, and R^2. Ranking systems need ranking metrics. Language models require likelihood metrics plus task and human-centered evaluations.

Metric selection is part of problem design. A poor metric can reward the wrong behavior even when the model trains correctly.