Evaluation Metrics

Evaluation metrics convert model behavior into numbers. A loss function guides training. A metric reports performance. Sometimes they are the same. Often they are different.

For example, a classifier may train with cross-entropy loss, but report accuracy, precision, recall, F1 score, calibration error, and confusion matrices. The loss helps optimization. The metrics help judge whether the model is useful.

Loss Versus Metric

A loss function is optimized during training:

\theta^\ast = \arg\min_\theta L_{\text{train}}(\theta).

A metric is used to evaluate behavior:

M(f_\theta, \mathcal{D}_{\text{eval}}).

A good loss should be differentiable and stable for gradient descent. A good metric should match the practical objective.

For example, accuracy is easy to understand, but it is not differentiable. Cross-entropy is differentiable and works well for training classifiers, even when final reporting uses accuracy.

Concept | Main role | Example
Loss | Optimize parameters | Cross-entropy
Metric | Evaluate model | Accuracy
Training loss | Fit model | Mini-batch objective
Validation metric | Select model | F1 score
Test metric | Final report | ROC-AUC

Accuracy

Accuracy is the fraction of predictions that are correct:

\text{Accuracy} = \frac{\text{number of correct predictions}}{\text{number of examples}}.

For N examples:

\text{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}[\hat{y}^{(i)} = y^{(i)}].

Accuracy is useful when classes are balanced and all mistakes have similar cost.

In PyTorch:

import torch

logits = torch.tensor([
    [2.1, 0.3, -1.0],
    [0.2, 1.7, 0.1],
    [0.4, 0.5, 2.2],
])

targets = torch.tensor([0, 1, 1])

preds = logits.argmax(dim=1)
accuracy = (preds == targets).float().mean()

print(preds)      # tensor([0, 1, 2])
print(accuracy)   # tensor(0.6667)

Accuracy can be misleading on imbalanced data. If 99% of examples belong to class 0, a model that always predicts class 0 gets 99% accuracy while failing on the rare class.

Confusion Matrix

A confusion matrix counts predictions by true class and predicted class.

For binary classification:

                | Predicted positive | Predicted negative
Actual positive | TP | FN
Actual negative | FP | TN

The four entries are:

Term | Meaning
True positive | Positive example predicted positive
False positive | Negative example predicted positive
True negative | Negative example predicted negative
False negative | Positive example predicted negative

A confusion matrix shows which mistakes the model makes. This is often more useful than a single scalar metric.

For multiclass classification, the matrix has shape

K \times K,

where rows usually represent true labels and columns represent predicted labels.
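
As a minimal sketch, a multiclass confusion matrix can be built directly from integer predictions and targets; the example below assumes both tensors hold class indices in the range 0 to K - 1.

import torch

def confusion_matrix(preds, targets, num_classes):
    # rows are true labels, columns are predicted labels
    indices = targets * num_classes + preds
    counts = torch.bincount(indices, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

preds = torch.tensor([0, 1, 2, 1, 0])
targets = torch.tensor([0, 1, 1, 1, 2])
print(confusion_matrix(preds, targets, num_classes=3))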

Precision and Recall

Precision measures how many predicted positives are truly positive:

\text{Precision} = \frac{TP}{TP + FP}.

Recall measures how many true positives are found:

\text{Recall} = \frac{TP}{TP + FN}.

Precision answers: when the model predicts positive, how often is it right?

Recall answers: among all real positives, how many did the model find?

These metrics matter when classes are imbalanced or when false positives and false negatives have different costs.

Examples:

Application | More important metric
Spam filtering | Precision, to avoid hiding valid email
Cancer screening | Recall, to avoid missing disease
Fraud detection | Depends on investigation cost
Search retrieval | Precision at top ranks and recall

F1 Score

The F1 score combines precision and recall using the harmonic mean:

F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.

F1 is high only when both precision and recall are high.

It is useful when the positive class is important and class imbalance makes accuracy weak.

For multiclass problems, F1 can be averaged in several ways:

Average | Meaning
Micro F1 | Counts global TP, FP, FN
Macro F1 | Computes F1 per class, then averages
Weighted F1 | Averages per-class F1 weighted by class frequency

Macro F1 gives each class equal importance. Weighted F1 gives common classes more influence. Micro F1 behaves similarly to accuracy in single-label multiclass classification.
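
A minimal sketch of macro and micro F1 from class predictions, assuming single-label targets with classes indexed 0 to num_classes - 1:

import torch

def f1_macro_micro(preds, targets, num_classes):
    per_class_f1 = []
    tp_total, fp_total, fn_total = 0, 0, 0
    for c in range(num_classes):
        tp = ((preds == c) & (targets == c)).sum().item()
        fp = ((preds == c) & (targets != c)).sum().item()
        fn = ((preds != c) & (targets == c)).sum().item()
        tp_total, fp_total, fn_total = tp_total + tp, fp_total + fp, fn_total + fn
        # F1 = 2 * P * R / (P + R) simplifies to 2 * TP / (2 * TP + FP + FN)
        denom = 2 * tp + fp + fn
        per_class_f1.append(2 * tp / denom if denom > 0 else 0.0)
    macro_f1 = sum(per_class_f1) / num_classes
    # with single-label predictions, micro F1 equals accuracy
    micro_f1 = 2 * tp_total / (2 * tp_total + fp_total + fn_total)
    return macro_f1, micro_f1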

Thresholds

Many classifiers output probabilities or scores. A threshold converts scores into class predictions.

For binary classification:

\hat{y} = \begin{cases} 1 & \text{if } p \ge \tau, \\ 0 & \text{if } p < \tau. \end{cases}

The default threshold is often \tau = 0.5. This may be wrong for imbalanced tasks or asymmetric costs.

Lowering the threshold increases recall and may reduce precision. Raising the threshold increases precision and may reduce recall.

Threshold selection should usually be done on the validation set, not the test set.

probs = torch.tensor([0.95, 0.71, 0.52, 0.41, 0.08])
threshold = 0.7

preds = (probs >= threshold).long()
print(preds)  # tensor([1, 1, 0, 0, 0])

ROC-AUC

The receiver operating characteristic curve plots true positive rate against false positive rate across thresholds.

The true positive rate is recall:

TPR = \frac{TP}{TP + FN}.

The false positive rate is

FPR = \frac{FP}{FP + TN}.

ROC-AUC is the area under this curve.

A ROC-AUC of 0.5 corresponds to random ranking. A ROC-AUC of 1.0 corresponds to perfect ranking.

ROC-AUC is threshold-independent, but it can look overly optimistic for highly imbalanced datasets. In rare-positive settings, precision-recall curves are often more informative.
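
As a sketch, ROC-AUC can be computed from its ranking interpretation: the probability that a randomly chosen positive example scores higher than a randomly chosen negative example, with ties counting half. The pairwise version below compares every positive-negative pair, so it only suits small evaluation sets.

import torch

def roc_auc(scores, targets):
    pos = scores[targets == 1]
    neg = scores[targets == 0]
    # all positive-negative score differences
    diff = pos.unsqueeze(1) - neg.unsqueeze(0)
    return ((diff > 0).float() + 0.5 * (diff == 0).float()).mean().item()

scores = torch.tensor([0.9, 0.8, 0.35, 0.3, 0.1])
targets = torch.tensor([1, 1, 0, 1, 0])
print(roc_auc(scores, targets))  # 5 of 6 pairs ranked correctly: 0.8333...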

Precision-Recall AUC

A precision-recall curve plots precision against recall across thresholds.

PR-AUC is especially useful when the positive class is rare.

For example, in fraud detection, fraudulent transactions may be less than 1% of the dataset. A classifier can have strong ROC-AUC while still producing too many false positives. PR-AUC focuses directly on positive-class retrieval quality.

Use PR-AUC when the main question is: can the model find rare positives without flooding the system with false alarms?
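
PR-AUC is often reported as average precision. A minimal sketch, assuming scores holds one score per example, targets marks positives with 1, and ties are broken arbitrarily by the sort:

import torch

def average_precision(scores, targets):
    # sort examples by score, highest first
    order = torch.argsort(scores, descending=True)
    hits = targets[order].float()
    # precision after each rank in the sorted list
    precision_at_rank = torch.cumsum(hits, dim=0) / torch.arange(1, len(hits) + 1)
    # mean precision at the ranks where positives appear
    return (precision_at_rank * hits).sum().item() / hits.sum().item()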

Regression Metrics

Regression predicts continuous values. Common metrics include mean absolute error, mean squared error, root mean squared error, and R^2.

Mean absolute error:

MAE = \frac{1}{N} \sum_{i=1}^{N} |y^{(i)} - \hat{y}^{(i)}|.

Mean squared error:

MSE = \frac{1}{N} \sum_{i=1}^{N} (y^{(i)} - \hat{y}^{(i)})^2.

Root mean squared error:

RMSE = \sqrt{MSE}.

The R^2 score measures the fraction of target variance explained by the model:

R^2 = 1 - \frac{\sum_i (y^{(i)} - \hat{y}^{(i)})^2}{\sum_i (y^{(i)} - \bar{y})^2}.

Metric | Useful when | Sensitivity
MAE | Want error in original units | Robust to outliers
MSE | Large errors should hurt more | Sensitive to outliers
RMSE | Want original units with squared penalty | Sensitive to outliers
R^2 | Need relative explanatory power | Can be negative
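
A minimal sketch of these four regression metrics in PyTorch, assuming preds and targets are 1-D float tensors:

import torch

def regression_metrics(preds, targets):
    errors = preds - targets
    mae = errors.abs().mean()
    mse = (errors ** 2).mean()
    rmse = mse.sqrt()
    # R^2: one minus the ratio of residual sum of squares to total sum of squares
    ss_res = (errors ** 2).sum()
    ss_tot = ((targets - targets.mean()) ** 2).sum()
    r2 = 1.0 - ss_res / ss_tot
    return mae.item(), mse.item(), rmse.item(), r2.item()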

Calibration

A model is calibrated when its predicted probabilities match empirical frequencies.

If a calibrated classifier predicts probability 0.8 for many examples, then about 80% of those examples should be correct.

Accuracy and calibration are different. A classifier can be accurate but overconfident.

For example, if a model is correct 90% of the time but assigns 99% confidence to its predictions, it is overconfident.

Calibration matters in medical systems, risk scoring, decision support, and any setting where probabilities guide actions.

Common calibration metrics include expected calibration error, reliability diagrams, and Brier score.

The Brier score for binary classification is

\text{Brier} = \frac{1}{N} \sum_{i=1}^{N} (p^{(i)} - y^{(i)})^2.
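
A sketch of the Brier score and one common binned form of expected calibration error for binary classifiers, assuming probs holds predicted probabilities of the positive class and targets holds 0/1 labels:

import torch

def brier_score(probs, targets):
    return ((probs - targets.float()) ** 2).mean().item()

def expected_calibration_error(probs, targets, n_bins=10):
    # assign each example to a confidence bin, then compare the mean predicted
    # probability in the bin to the empirical positive rate in the bin
    bin_ids = torch.clamp((probs * n_bins).long(), max=n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.sum() == 0:
            continue
        confidence = probs[mask].mean()
        frequency = targets[mask].float().mean()
        # weight each bin by the fraction of examples it contains
        ece += (mask.float().mean() * (confidence - frequency).abs()).item()
    return ece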

Ranking Metrics

Some systems return ranked lists rather than one prediction.

Examples include search engines, recommender systems, retrieval systems, and question answering systems.

Common ranking metrics:

Metric | Meaning
Precision@k | Fraction of the top k results that are relevant
Recall@k | Fraction of all relevant items found in the top k
Mean reciprocal rank | Average inverse rank of the first relevant result
Mean average precision | Average precision across recall levels
NDCG | Rewards relevant items near the top

Precision@k is simple:

\text{Precision@}k = \frac{\text{number of relevant results in top } k}{k}.

Ranking metrics are essential for retrieval-augmented generation, recommendation, search, and embedding models.
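
A minimal single-query sketch of Precision@k and Recall@k, assuming scores ranks candidate items and relevant marks relevant items with 1:

import torch

def precision_recall_at_k(scores, relevant, k):
    top_k = torch.topk(scores, k).indices    # indices of the k highest-scoring items
    hits = relevant[top_k].sum().item()      # relevant items retrieved in the top k
    return hits / k, hits / relevant.sum().item()

scores = torch.tensor([0.9, 0.1, 0.8, 0.7, 0.3])
relevant = torch.tensor([1, 1, 0, 1, 0])
print(precision_recall_at_k(scores, relevant, k=3))  # (0.6667, 0.6667)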

Language Model Metrics

Language models are commonly evaluated with negative log-likelihood, cross-entropy, and perplexity.

For a token sequence x_1, \dots, x_T, the average negative log-likelihood is

L = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}).

Perplexity is

\text{PPL} = e^L.

Lower perplexity means the model assigns higher probability to the observed text.
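
A minimal sketch of perplexity from token-level cross-entropy, assuming the model's next-token logits are already aligned with the target tokens; the shapes below are placeholders.

import torch
import torch.nn.functional as F

# hypothetical shapes: (sequence_length, vocab_size) logits, (sequence_length,) targets
logits = torch.randn(12, 100)
targets = torch.randint(0, 100, (12,))

nll = F.cross_entropy(logits, targets)  # average negative log-likelihood per token
perplexity = torch.exp(nll)
print(perplexity.item())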

Perplexity is useful but incomplete. A model with lower perplexity may still produce worse answers for instruction following, reasoning, factuality, or safety. Modern language models also require task benchmarks, human preference evaluation, retrieval metrics, and safety evaluation.

Segmentation Metrics

Image segmentation predicts a label for each pixel.

Common metrics include pixel accuracy, intersection over union, and Dice coefficient.

For one class, intersection over union is

IoU = \frac{|P \cap Y|}{|P \cup Y|},

where P is the predicted region and Y is the true region.

The Dice coefficient is

Dice = \frac{2\,|P \cap Y|}{|P| + |Y|}.

IoU penalizes both false positives and false negatives. Dice is widely used in medical image segmentation because it works well for small foreground regions.
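
A minimal sketch of IoU and Dice for one class, assuming boolean prediction and ground-truth masks of the same shape; a small epsilon guards against division by zero for empty masks.

import torch

def iou_and_dice(pred_mask, true_mask, eps=1e-7):
    pred = pred_mask.bool()
    true = true_mask.bool()
    intersection = (pred & true).sum().float()
    union = (pred | true).sum().float()
    iou = intersection / (union + eps)
    dice = 2 * intersection / (pred.sum() + true.sum() + eps)
    return iou.item(), dice.item()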

Choosing the Right Metric

Metric choice should follow the real cost structure of the problem.

Ask:

Question | Metric implication
Are classes imbalanced? | Use precision, recall, F1, PR-AUC
Are false negatives expensive? | Emphasize recall
Are false positives expensive? | Emphasize precision
Are probabilities used for decisions? | Check calibration
Is the output ranked? | Use ranking metrics
Is the target continuous? | Use MAE, RMSE, R^2
Is the deployment distribution different? | Evaluate by slice and time

A single metric rarely gives a complete picture. Practical evaluation usually combines one primary metric with several diagnostic metrics.

Metric Computation in PyTorch

A simple classification metric function:

import torch

def accuracy_from_logits(logits, targets):
    preds = logits.argmax(dim=-1)
    return (preds == targets).float().mean().item()

For binary precision and recall:

def precision_recall_from_probs(probs, targets, threshold=0.5):
    preds = (probs >= threshold).long()
    targets = targets.long()

    tp = ((preds == 1) & (targets == 1)).sum().item()
    fp = ((preds == 1) & (targets == 0)).sum().item()
    fn = ((preds == 0) & (targets == 1)).sum().item()

    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0

    return precision, recall

For larger projects, metrics should be accumulated across the full validation or test set; averaging per-batch values gives incorrect results when batches have different sizes.
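
A sketch of that pattern for accuracy, assuming a hypothetical model and dataloader that yield (inputs, targets) batches; raw counts are accumulated over the whole set and divided once at the end.

import torch

def evaluate_accuracy(model, dataloader, device="cpu"):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, targets in dataloader:
            inputs, targets = inputs.to(device), targets.to(device)
            preds = model(inputs).argmax(dim=-1)
            correct += (preds == targets).sum().item()
            total += targets.numel()
    # dividing accumulated counts weights every example equally,
    # regardless of batch size
    return correct / total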

Evaluation by Slice

Aggregate metrics can hide failures.

A model may have good overall accuracy but poor performance on specific groups, languages, devices, regions, classes, or time periods.

Slice evaluation computes metrics on meaningful subsets:

Slice | Example
Class | Accuracy per category
Difficulty | Short versus long sequences
Source | Different websites or sensors
Time | Recent versus old data
User group | New users versus returning users
Input length | Short prompts versus long prompts

Slice evaluation often reveals the real weaknesses of a system.
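
A minimal sketch of per-slice accuracy, assuming slice_ids is a Python list with one group label per example, aligned with preds and targets:

import torch

def accuracy_by_slice(preds, targets, slice_ids):
    results = {}
    for group in sorted(set(slice_ids)):
        mask = torch.tensor([sid == group for sid in slice_ids])
        results[group] = (preds[mask] == targets[mask]).float().mean().item()
    return results

preds = torch.tensor([0, 1, 1, 0, 1])
targets = torch.tensor([0, 1, 0, 0, 0])
slices = ["mobile", "mobile", "desktop", "desktop", "desktop"]
print(accuracy_by_slice(preds, targets, slices))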

Summary

Evaluation metrics measure model performance. Loss functions optimize models, while metrics judge whether models solve the intended task.

Accuracy is useful for balanced classification. Precision, recall, F1, ROC-AUC, and PR-AUC are better for imbalanced or asymmetric problems. Regression uses MAE, MSE, RMSE, and R^2. Ranking systems need ranking metrics. Language models require likelihood metrics plus task and human-centered evaluations.

Metric selection is part of problem design. A poor metric can reward the wrong behavior even when the model trains correctly.