Calibration and Confidence

A classifier returns scores. Users often interpret those scores as confidence. This interpretation is safe only when the scores are calibrated. A calibrated model assigns probabilities that match empirical correctness.

If a model predicts class “cat” with probability 0.8 on many images, then about 80 percent of those predictions should be correct. If only 60 percent are correct, the model is overconfident. If 95 percent are correct, the model is underconfident.
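
This check can be carried out directly on model outputs. A minimal sketch with made-up confidence scores and correctness flags standing in for real predictions:

```python
import torch

# Synthetic stand-ins for real model outputs: stated confidences
# and whether each corresponding prediction was correct.
confidences = torch.tensor([0.81, 0.79, 0.80, 0.82, 0.78])
correct = torch.tensor([True, True, False, True, True])

# Gather predictions whose stated confidence is near 0.8.
near_08 = (confidences >= 0.75) & (confidences < 0.85)

empirical_accuracy = correct[near_08].float().mean().item()
stated_confidence = confidences[near_08].mean().item()

# Calibrated behavior: these two numbers should roughly agree.
print(empirical_accuracy, stated_confidence)
```

On five samples this is only illustrative; in practice the comparison needs many predictions per confidence level.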

Logits, Probabilities, and Confidence

A classifier produces logits

z \in \mathbb{R}^{K},

where K is the number of classes. The softmax function converts logits into probabilities:

p_k = \frac{\exp(z_k)}{\sum_{j=1}^{K}\exp(z_j)}.

The predicted class is

\hat{y} = \arg\max_k p_k.

The confidence score is often taken as

\max_k p_k.

In PyTorch:

import torch

logits = model(images)
probs = torch.softmax(logits, dim=1)

confidence, preds = probs.max(dim=1)

This confidence value is convenient, but it may be wrong as a probability estimate. Modern neural networks can assign very high softmax scores to incorrect predictions.

Accuracy Versus Calibration

Accuracy measures how often predictions are correct. Calibration measures whether confidence values are numerically meaningful.

A model can have high accuracy and poor calibration. For example, a classifier may be correct 90 percent of the time while assigning 99 percent confidence to most predictions. It is accurate but overconfident.
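
The gap can be made concrete with synthetic numbers (hypothetical, for illustration): a model that is right 90 percent of the time but reports 0.99 confidence on every prediction:

```python
import torch

n = 1000
correct = torch.zeros(n, dtype=torch.bool)
correct[:900] = True                   # 90 percent of predictions are right
confidences = torch.full((n,), 0.99)   # but every prediction claims 99 percent

# Average confidence minus accuracy: a crude overconfidence gap.
gap = confidences.mean() - correct.float().mean()
print(gap.item())
```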

A model can also have lower accuracy but better calibration. This matters in systems where confidence controls downstream decisions: deferral to humans, abstention, medical review, search ranking, alert thresholds, or safety checks.

Property    | Question answered
Accuracy    | How often is the prediction correct?
Confidence  | How certain does the model say it is?
Calibration | Does stated confidence match empirical correctness?

Reliability Diagrams

A reliability diagram groups predictions by confidence and compares average confidence with empirical accuracy.

For example, collect all predictions with confidence between 0.7 and 0.8. If their average confidence is 0.75 and their accuracy is 0.75, that bin is well calibrated. If their accuracy is 0.60, the model is overconfident in that bin.

The ideal reliability diagram lies on the diagonal line:

\text{accuracy} = \text{confidence}.

A simple binning procedure:

def calibration_bins(confidences, correct, num_bins=10):
    bins = torch.linspace(0, 1, num_bins + 1)

    bin_acc = []
    bin_conf = []
    bin_count = []

    for i in range(num_bins):
        low = bins[i]
        high = bins[i + 1]

        if i == num_bins - 1:
            mask = (confidences >= low) & (confidences <= high)
        else:
            mask = (confidences >= low) & (confidences < high)

        count = mask.sum().item()

        if count == 0:
            bin_acc.append(float("nan"))
            bin_conf.append(float("nan"))
            bin_count.append(0)
            continue

        bin_acc.append(correct[mask].float().mean().item())
        bin_conf.append(confidences[mask].mean().item())
        bin_count.append(count)

    return bin_acc, bin_conf, bin_count
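
The same computation can be vectorized; a sketch using torch.bucketize (note one assumed difference in edge convention: values exactly on an interior bin edge land in the lower bin, unlike the loop above):

```python
import torch

def reliability_table(confidences, correct, num_bins=10):
    # Vectorized binning: map each prediction to a bin index, then
    # aggregate per-bin sums with index_add_. Empty bins come out NaN (0/0).
    edges = torch.linspace(0, 1, num_bins + 1)
    idx = (torch.bucketize(confidences, edges) - 1).clamp(0, num_bins - 1)

    count = torch.zeros(num_bins).index_add_(0, idx, torch.ones_like(confidences))
    bin_acc = torch.zeros(num_bins).index_add_(0, idx, correct.float()) / count
    bin_conf = torch.zeros(num_bins).index_add_(0, idx, confidences) / count
    return bin_acc, bin_conf, count
```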

Collecting these statistics over a validation loader:

@torch.no_grad()
def collect_confidence_stats(model, loader, device):
    model.eval()

    all_confidences = []
    all_correct = []

    for images, labels in loader:
        images = images.to(device)
        labels = labels.to(device)

        logits = model(images)
        probs = torch.softmax(logits, dim=1)

        confidences, preds = probs.max(dim=1)
        correct = preds.eq(labels)

        all_confidences.append(confidences.cpu())
        all_correct.append(correct.cpu())

    confidences = torch.cat(all_confidences)
    correct = torch.cat(all_correct)

    return confidences, correct

Expected Calibration Error

Expected Calibration Error, usually called ECE, summarizes calibration error across confidence bins.

For each bin B_m, compute the bin accuracy and bin confidence. ECE is the weighted average absolute gap:

\operatorname{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \operatorname{acc}(B_m) - \operatorname{conf}(B_m) \right|.

Here M is the number of bins and n is the total number of predictions.

PyTorch implementation:

def expected_calibration_error(confidences, correct, num_bins=15):
    bins = torch.linspace(0, 1, num_bins + 1)

    ece = torch.zeros((), dtype=torch.float32)
    n = confidences.numel()

    for i in range(num_bins):
        low = bins[i]
        high = bins[i + 1]

        if i == num_bins - 1:
            mask = (confidences >= low) & (confidences <= high)
        else:
            mask = (confidences >= low) & (confidences < high)

        count = mask.sum()

        if count == 0:
            continue

        acc = correct[mask].float().mean()
        conf = confidences[mask].mean()

        ece += count.float() / n * torch.abs(acc - conf)

    return ece.item()

ECE is easy to compute, but it depends on bin count and binning scheme. It should be used as a diagnostic, not as the only measure of model quality.
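
One common alternative is equal-mass (adaptive) binning, where bins are chosen so each holds roughly the same number of predictions; a sketch of that variant (not the fixed-width ECE defined above):

```python
import torch

def adaptive_ece(confidences, correct, num_bins=15):
    # Equal-mass binning: sort by confidence, then split into chunks of
    # roughly equal size, so no bin is empty and no bin dominates.
    order = confidences.argsort()
    conf_sorted = confidences[order]
    correct_sorted = correct[order].float()

    n = confidences.numel()
    ece = 0.0
    for conf_bin, corr_bin in zip(conf_sorted.chunk(num_bins),
                                  correct_sorted.chunk(num_bins)):
        if conf_bin.numel() == 0:
            continue
        ece += conf_bin.numel() / n * (corr_bin.mean() - conf_bin.mean()).abs().item()
    return ece
```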

Negative Log Likelihood and Brier Score

Calibration can also be evaluated with proper scoring rules. Two common choices are negative log likelihood and Brier score.

For the correct class y, the negative log likelihood is:

-\log p_y.

This is the same objective minimized by the cross-entropy loss used for classification training.

In PyTorch:

loss_fn = torch.nn.CrossEntropyLoss()
nll = loss_fn(logits, labels)

The Brier score measures squared error between the predicted probability vector and the one-hot label vector:

\operatorname{Brier} = \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} (p_{ik} - y_{ik})^2.

Implementation:

import torch.nn.functional as F

def brier_score(logits, labels, num_classes):
    probs = torch.softmax(logits, dim=1)
    targets = F.one_hot(labels, num_classes=num_classes).float()
    return ((probs - targets) ** 2).sum(dim=1).mean()

NLL strongly penalizes confident wrong predictions. Brier score gives a squared-error view of probability quality.
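
A single confidently wrong prediction makes the difference concrete. The numbers below are made up for illustration:

```python
import torch
import torch.nn.functional as F

# True class is 0, but the model puts almost all mass on class 1.
probs = torch.tensor([[0.01, 0.99]])
label = torch.tensor([0])
logits = probs.log()  # softmax of log-probabilities recovers probs

nll = F.cross_entropy(logits, label)  # -log(0.01), about 4.61
brier = ((probs - F.one_hot(label, 2).float()) ** 2).sum(dim=1).mean()  # about 1.96

print(nll.item(), brier.item())
```

NLL grows without bound as the probability on the true class approaches zero, while the per-example Brier score is capped at 2.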

Temperature Scaling

Temperature scaling is a simple post-training calibration method. It divides logits by a positive scalar temperature T:

p_k = \frac{\exp(z_k/T)}{\sum_j \exp(z_j/T)}.

When T > 1, probabilities become softer and confidence decreases. When T < 1, probabilities become sharper and confidence increases.

Temperature scaling does not change the predicted class because dividing all logits by the same positive number preserves their order. It changes confidence, not accuracy.
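
This invariance is easy to verify on random logits; a quick sketch with an assumed temperature of 2:

```python
import torch

torch.manual_seed(0)
logits = torch.randn(4, 10)

probs = torch.softmax(logits, dim=1)
probs_soft = torch.softmax(logits / 2.0, dim=1)  # T = 2: softer probabilities

# Same predicted classes in every row, but lower maximum probability.
assert torch.equal(probs.argmax(dim=1), probs_soft.argmax(dim=1))
print(probs.max(dim=1).values)
print(probs_soft.max(dim=1).values)
```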

A PyTorch module:

import torch
import torch.nn as nn

class TemperatureScaler(nn.Module):
    def __init__(self, init_temperature=1.0):
        super().__init__()
        self.log_temperature = nn.Parameter(
            torch.log(torch.tensor(float(init_temperature)))
        )

    def forward(self, logits):
        temperature = torch.exp(self.log_temperature)
        return logits / temperature

Fit the temperature on a validation set by minimizing cross-entropy:

def fit_temperature(model, loader, device):
    model.eval()

    all_logits = []
    all_labels = []

    with torch.no_grad():
        for images, labels in loader:
            images = images.to(device)
            logits = model(images)

            all_logits.append(logits.detach())
            all_labels.append(labels.to(device))

    logits = torch.cat(all_logits)
    labels = torch.cat(all_labels)

    scaler = TemperatureScaler().to(device)

    optimizer = torch.optim.LBFGS(
        scaler.parameters(),
        lr=0.1,
        max_iter=50,
    )

    loss_fn = nn.CrossEntropyLoss()

    def closure():
        optimizer.zero_grad()
        scaled_logits = scaler(logits)
        loss = loss_fn(scaled_logits, labels)
        loss.backward()
        return loss

    optimizer.step(closure)

    return scaler

Use it during inference:

scaler = fit_temperature(model, val_loader, device)

with torch.no_grad():
    logits = model(images)
    calibrated_logits = scaler(logits)
    probs = torch.softmax(calibrated_logits, dim=1)

The validation set used for temperature fitting should be separate from the training set. Ideally, it should also be separate from the final test set.

Label Smoothing and Calibration

Label smoothing replaces hard one-hot labels with softened targets. For K classes and smoothing value ε, the target probability for the correct class becomes approximately 1 - ε, while the remaining probability mass is spread across other classes.
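
The smoothed target can also be built explicitly. A sketch following PyTorch's convention, where the ε mass is spread uniformly over all K classes (so the correct class ends up with 1 - ε + ε/K):

```python
import torch
import torch.nn.functional as F

def smooth_targets(labels, num_classes, epsilon=0.1):
    # Mix the one-hot target with a uniform distribution:
    # correct class gets 1 - eps + eps/K, every other class gets eps/K.
    one_hot = F.one_hot(labels, num_classes=num_classes).float()
    return one_hot * (1.0 - epsilon) + epsilon / num_classes

print(smooth_targets(torch.tensor([2]), num_classes=5))
```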

In PyTorch:

loss_fn = torch.nn.CrossEntropyLoss(label_smoothing=0.1)

Label smoothing can reduce overconfidence because the model is not trained to assign probability 1 to the correct class. It often improves calibration and generalization, although it can sometimes make confidence too conservative.

Label smoothing changes training. Temperature scaling changes only post-training probabilities. These methods can be combined, but validation metrics should decide whether the combination is useful.

Confidence Thresholding

A calibrated confidence score can be used to abstain from uncertain predictions. Instead of forcing every image into a class, the system predicts only when confidence exceeds a threshold.

def predict_with_threshold(model, images, threshold):
    logits = model(images)
    probs = torch.softmax(logits, dim=1)

    confidence, preds = probs.max(dim=1)
    accepted = confidence >= threshold

    return preds, confidence, accepted

This produces two useful metrics:

Metric             | Meaning
Coverage           | Fraction of examples accepted
Selective accuracy | Accuracy on accepted examples

Coverage decreases as the threshold increases. Selective accuracy usually increases.

A validation procedure can choose a threshold:

def evaluate_threshold(confidences, correct, threshold):
    accepted = confidences >= threshold

    coverage = accepted.float().mean().item()

    if accepted.sum() == 0:
        selective_accuracy = float("nan")
    else:
        selective_accuracy = correct[accepted].float().mean().item()

    return coverage, selective_accuracy
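
Sweeping the threshold makes the trade-off visible; a sketch on synthetic confidences where higher confidence correlates with correctness:

```python
import torch

torch.manual_seed(0)
confidences = torch.rand(1000)
correct = torch.rand(1000) < confidences  # synthetic: confident -> usually right

for threshold in (0.5, 0.7, 0.9):
    accepted = confidences >= threshold
    coverage = accepted.float().mean().item()
    selective_acc = correct[accepted].float().mean().item()
    print(f"threshold={threshold:.1f}  coverage={coverage:.2f}  "
          f"selective_accuracy={selective_acc:.2f}")
```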

In high-stakes settings, confidence thresholding should be paired with human review or a fallback system.

Top-k Confidence

For some applications, the top prediction alone is too narrow. Top-k evaluation checks whether the true class appears among the k highest-scoring classes.

def topk_accuracy(logits, labels, k=5):
    _, topk = logits.topk(k, dim=1)
    correct = topk.eq(labels.view(-1, 1))
    return correct.any(dim=1).float().mean().item()

Top-k confidence can be useful for recommendation, retrieval, diagnosis assistance, or human-in-the-loop classification. It should not be confused with calibrated top-1 probability.

Out-of-Distribution Inputs

Softmax confidence can be misleading on inputs far from the training distribution. A model trained on animals may confidently classify a medical scan as “dog” or “cat” because softmax always distributes probability across known classes.

Calibration on in-distribution validation data does not guarantee good behavior on out-of-distribution inputs.

Common approaches include:

Method                      | Basic idea
Maximum softmax probability | Reject low-confidence predictions
Entropy threshold           | Reject high-entropy predictions
Energy score                | Use logit energy for detection
Ensembles                   | Use disagreement across models
Feature distance            | Reject inputs far from training features
Specialized OOD detector    | Train a separate detector

Maximum softmax probability is simple but limited. Out-of-distribution detection is a separate problem from ordinary calibration.
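
The entropy row of the table can be sketched in a few lines; the uniform distribution below is a stand-in for an OOD-like input:

```python
import torch

def entropy_score(probs):
    # Predictive entropy: near zero for peaked distributions,
    # maximal (log K) for a uniform distribution over K classes.
    return -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)

peaked = torch.tensor([[0.97, 0.01, 0.01, 0.01]])  # confident, in-distribution-like
flat = torch.full((1, 4), 0.25)                    # uniform, OOD-like

print(entropy_score(peaked), entropy_score(flat))
```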

Confidence Under Distribution Shift

A model calibrated on one validation set may become miscalibrated under distribution shift. Examples include different cameras, different hospitals, new lighting conditions, seasonal changes, corrupted images, or changed class frequencies.

When deployment data changes, monitor both accuracy proxies and confidence distributions:

def confidence_summary(confidences):
    return {
        "mean": confidences.mean().item(),
        "median": confidences.median().item(),
        "p90": confidences.quantile(0.90).item(),
        "p99": confidences.quantile(0.99).item(),
    }

A sudden increase in average confidence can indicate overconfident failure. A sudden decrease can indicate unfamiliar inputs. Either case deserves inspection.

Common Calibration Mistakes

Mistake                                             | Consequence
Treating softmax as automatically calibrated        | Overconfident decisions
Calibrating on the test set                         | Biased final estimate
Reporting only accuracy                             | Hidden confidence errors
Using random validation transforms                  | Noisy calibration estimates
Changing preprocessing after calibration            | Invalid temperature
Ignoring distribution shift                         | Deployment confidence drift
Using confidence as an OOD detector without testing | False sense of safety

Calibration depends on the full pipeline: data, preprocessing, architecture, loss, augmentation, and deployment distribution.

Practical Calibration Workflow

A practical workflow is:

  1. Train the classifier normally.
  2. Evaluate accuracy, NLL, Brier score, and ECE on validation data.
  3. Plot or inspect reliability bins.
  4. Fit temperature scaling on validation logits.
  5. Recompute calibration metrics after scaling.
  6. Choose confidence thresholds if the application allows abstention.
  7. Test once on the held-out test set.
  8. Monitor confidence distributions after deployment.

A compact evaluation function:

@torch.no_grad()
def evaluate_calibration(model, loader, device, num_classes, scaler=None):
    model.eval()

    logits_list = []
    labels_list = []

    for images, labels in loader:
        images = images.to(device)
        labels = labels.to(device)

        logits = model(images)

        if scaler is not None:
            logits = scaler(logits)

        logits_list.append(logits.cpu())
        labels_list.append(labels.cpu())

    logits = torch.cat(logits_list)
    labels = torch.cat(labels_list)

    probs = torch.softmax(logits, dim=1)
    confidences, preds = probs.max(dim=1)
    correct = preds.eq(labels)

    return {
        "accuracy": correct.float().mean().item(),
        "nll": torch.nn.functional.cross_entropy(logits, labels).item(),
        "brier": brier_score(logits, labels, num_classes).item(),
        "ece": expected_calibration_error(confidences, correct),
        "mean_confidence": confidences.mean().item(),
    }

This gives a compact view of both correctness and probability quality.

Summary

Calibration asks whether confidence values mean what they claim. Accuracy measures correctness. Calibration measures whether predicted probabilities match observed correctness.

For PyTorch classifiers, calibration starts by collecting logits and labels on validation data. Reliability diagrams, ECE, NLL, and Brier score expose confidence errors. Temperature scaling is a simple and effective post-training method that adjusts confidence without changing predicted classes.

Confidence should be used carefully. It can support abstention, human review, and risk-aware deployment, but it does not solve out-of-distribution detection by itself.