# Calibration and Confidence

A classifier returns scores. Users often interpret those scores as confidence. This interpretation is safe only when the scores are calibrated. A calibrated model assigns probabilities that match empirical correctness.

If a model predicts class “cat” with probability 0.8 on many images, then about 80 percent of those predictions should be correct. If only 60 percent are correct, the model is overconfident. If 95 percent are correct, the model is underconfident.
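As a toy illustration (the tensors below are made up, not taken from any real model), calibration at a given confidence level can be checked directly:

```python
import torch

# Synthetic example: six predictions, all reported with confidence near 0.8.
confidences = torch.tensor([0.81, 0.79, 0.80, 0.82, 0.78, 0.80])
correct = torch.tensor([True, True, False, True, True, False])

# Select the predictions whose confidence falls near 0.8.
near_08 = (confidences >= 0.75) & (confidences < 0.85)

# For a calibrated model this empirical accuracy should be close to 0.8.
# Here it is about 0.67, so this toy "model" would be overconfident.
print(correct[near_08].float().mean())
```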

### Logits, Probabilities, and Confidence

A classifier produces logits

$$
z \in \mathbb{R}^{K},
$$

where $K$ is the number of classes. The softmax function converts logits into probabilities:

$$
p_k = \frac{\exp(z_k)}{\sum_{j=1}^{K}\exp(z_j)}.
$$

The predicted class is

$$
\hat{y} = \arg\max_k p_k.
$$

The confidence score is often taken as

$$
\max_k p_k.
$$

In PyTorch:

```python
import torch

logits = model(images)
probs = torch.softmax(logits, dim=1)

confidence, preds = probs.max(dim=1)
```

This confidence value is convenient, but it is not guaranteed to be a reliable probability estimate. Modern neural networks can assign very high softmax scores to incorrect predictions.

### Accuracy Versus Calibration

Accuracy measures how often predictions are correct. Calibration measures whether confidence values are numerically meaningful.

A model can have high accuracy and poor calibration. For example, a classifier may be correct 90 percent of the time while assigning 99 percent confidence to most predictions. It is accurate but overconfident.

A model can also have lower accuracy but better calibration. This matters in systems where confidence controls downstream decisions: deferral to humans, abstention, medical review, search ranking, alert thresholds, or safety checks.

| Property | Question answered |
|---|---|
| Accuracy | How often is the prediction correct? |
| Confidence | How certain does the model say it is? |
| Calibration | Does stated confidence match empirical correctness? |

### Reliability Diagrams

A reliability diagram groups predictions by confidence and compares average confidence with empirical accuracy.

For example, collect all predictions with confidence between 0.7 and 0.8. If their average confidence is 0.75 and their accuracy is 0.75, that bin is well calibrated. If their accuracy is 0.60, the model is overconfident in that bin.

The ideal reliability diagram lies on the diagonal line:

$$
\text{accuracy} = \text{confidence}.
$$

A simple binning procedure:

```python
def calibration_bins(confidences, correct, num_bins=10):
    bins = torch.linspace(0, 1, num_bins + 1)

    bin_acc = []
    bin_conf = []
    bin_count = []

    for i in range(num_bins):
        low = bins[i]
        high = bins[i + 1]

        if i == num_bins - 1:
            mask = (confidences >= low) & (confidences <= high)
        else:
            mask = (confidences >= low) & (confidences < high)

        count = mask.sum().item()

        if count == 0:
            bin_acc.append(float("nan"))
            bin_conf.append(float("nan"))
            bin_count.append(0)
            continue

        bin_acc.append(correct[mask].float().mean().item())
        bin_conf.append(confidences[mask].mean().item())
        bin_count.append(count)

    return bin_acc, bin_conf, bin_count
```

Validation code:

```python
@torch.no_grad()
def collect_confidence_stats(model, loader, device):
    model.eval()

    all_confidences = []
    all_correct = []

    for images, labels in loader:
        images = images.to(device)
        labels = labels.to(device)

        logits = model(images)
        probs = torch.softmax(logits, dim=1)

        confidences, preds = probs.max(dim=1)
        correct = preds.eq(labels)

        all_confidences.append(confidences.cpu())
        all_correct.append(correct.cpu())

    confidences = torch.cat(all_confidences)
    correct = torch.cat(all_correct)

    return confidences, correct
```
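Putting the two helpers together, the sketch below (assuming `matplotlib` is installed, and reusing `model`, `val_loader`, and `device` from the surrounding code) draws a basic reliability diagram from the binned statistics:

```python
import matplotlib.pyplot as plt

confidences, correct = collect_confidence_stats(model, val_loader, device)
bin_acc, bin_conf, bin_count = calibration_bins(confidences, correct, num_bins=10)

# Keep only bins that actually contain predictions (empty bins hold NaN).
points = [(c, a) for c, a, n in zip(bin_conf, bin_acc, bin_count) if n > 0]
xs = [c for c, _ in points]
ys = [a for _, a in points]

plt.plot([0, 1], [0, 1], linestyle="--", label="perfect calibration")
plt.plot(xs, ys, marker="o", label="model")
plt.xlabel("confidence")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```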

### Expected Calibration Error

Expected Calibration Error, usually called ECE, summarizes calibration error across confidence bins.

For each bin $B_m$, compute the bin accuracy and bin confidence. ECE is the weighted average absolute gap:

$$
\operatorname{ECE} =
\sum_{m=1}^{M}
\frac{|B_m|}{n}
\left|
\operatorname{acc}(B_m) -
\operatorname{conf}(B_m)
\right|.
$$

Here $M$ is the number of bins and $n$ is the total number of predictions.

PyTorch implementation:

```python
def expected_calibration_error(confidences, correct, num_bins=15):
    bins = torch.linspace(0, 1, num_bins + 1)

    ece = torch.zeros((), dtype=torch.float32)
    n = confidences.numel()

    for i in range(num_bins):
        low = bins[i]
        high = bins[i + 1]

        if i == num_bins - 1:
            mask = (confidences >= low) & (confidences <= high)
        else:
            mask = (confidences >= low) & (confidences < high)

        count = mask.sum()

        if count == 0:
            continue

        acc = correct[mask].float().mean()
        conf = confidences[mask].mean()

        ece += count.float() / n * torch.abs(acc - conf)

    return ece.item()
```

ECE is easy to compute, but it depends on bin count and binning scheme. It should be used as a diagnostic, not as the only measure of model quality.
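A quick way to see this sensitivity is to recompute ECE with several bin counts on the same predictions; if the value moves substantially, report the binning scheme alongside the number. A small sketch, reusing the `confidences` and `correct` tensors collected above:

```python
# Reuse confidences and correct collected on validation data.
for num_bins in (5, 10, 15, 30):
    ece = expected_calibration_error(confidences, correct, num_bins=num_bins)
    print(f"bins={num_bins:2d}  ECE={ece:.4f}")
```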

### Negative Log Likelihood and Brier Score

Calibration can also be evaluated with proper scoring rules. Two common choices are negative log likelihood and Brier score.

For an example whose true class is $y$, the negative log likelihood is:

$$
-\log p_y.
$$

This is the same quantity minimized by standard cross-entropy training.

In PyTorch:

```python
loss_fn = torch.nn.CrossEntropyLoss()
nll = loss_fn(logits, labels)
```

The Brier score measures squared error between the predicted probability vector and the one-hot label vector:

$$
\operatorname{Brier} =
\frac{1}{n}
\sum_{i=1}^{n}
\sum_{k=1}^{K}
(p_{ik} - y_{ik})^2.
$$

Implementation:

```python
import torch.nn.functional as F

def brier_score(logits, labels, num_classes):
    probs = torch.softmax(logits, dim=1)
    targets = F.one_hot(labels, num_classes=num_classes).float()
    return ((probs - targets) ** 2).sum(dim=1).mean()
```

NLL strongly penalizes confident wrong predictions. Brier score gives a squared-error view of probability quality.
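A small made-up numeric comparison illustrates the difference. For a prediction whose true class is class 0, a confidently wrong probability vector incurs a much larger NLL than Brier penalty, because NLL is unbounded while the per-example Brier score is at most 2:

```python
import torch

target = torch.tensor([1.0, 0.0])             # one-hot target, true class 0
wrong_confident = torch.tensor([0.01, 0.99])  # confidently wrong
wrong_unsure = torch.tensor([0.40, 0.60])     # wrong but hedging

# Negative log likelihood of the true class.
print(-torch.log(wrong_confident[0]))  # ~4.61
print(-torch.log(wrong_unsure[0]))     # ~0.92

# Per-example Brier score (squared error against the one-hot target).
print(((wrong_confident - target) ** 2).sum())  # ~1.96
print(((wrong_unsure - target) ** 2).sum())     # ~0.72
```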

### Temperature Scaling

Temperature scaling is a simple post-training calibration method. It divides logits by a positive scalar temperature $T$:

$$
p_k =
\frac{\exp(z_k/T)}
{\sum_j \exp(z_j/T)}.
$$

When $T > 1$, probabilities become softer and confidence decreases. When $T < 1$, probabilities become sharper and confidence increases.

Temperature scaling does not change the predicted class because dividing all logits by the same positive number preserves their order. It changes confidence, not accuracy.
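A quick sanity check of this invariance, using arbitrary example logits:

```python
import torch

logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.3, 0.2]])

for T in (0.5, 1.0, 2.0, 10.0):
    scaled = logits / T
    # The argmax never changes; only the confidence does.
    same_prediction = torch.equal(logits.argmax(dim=1), scaled.argmax(dim=1))
    confidence = torch.softmax(scaled, dim=1).max(dim=1).values
    print(T, same_prediction, confidence)
```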

A PyTorch module:

```python
import torch
import torch.nn as nn

class TemperatureScaler(nn.Module):
    def __init__(self, init_temperature=1.0):
        super().__init__()
        self.log_temperature = nn.Parameter(
            torch.log(torch.tensor(float(init_temperature)))
        )

    def forward(self, logits):
        temperature = torch.exp(self.log_temperature)
        return logits / temperature
```

Fit the temperature on a validation set by minimizing cross-entropy:

```python
def fit_temperature(model, loader, device):
    model.eval()

    all_logits = []
    all_labels = []

    with torch.no_grad():
        for images, labels in loader:
            images = images.to(device)
            logits = model(images)

            all_logits.append(logits.detach())
            all_labels.append(labels.to(device))

    logits = torch.cat(all_logits)
    labels = torch.cat(all_labels)

    scaler = TemperatureScaler().to(device)

    optimizer = torch.optim.LBFGS(
        scaler.parameters(),
        lr=0.1,
        max_iter=50,
    )

    loss_fn = nn.CrossEntropyLoss()

    def closure():
        optimizer.zero_grad()
        scaled_logits = scaler(logits)
        loss = loss_fn(scaled_logits, labels)
        loss.backward()
        return loss

    optimizer.step(closure)

    return scaler
```

Use it during inference:

```python
scaler = fit_temperature(model, val_loader, device)

with torch.no_grad():
    logits = model(images)
    calibrated_logits = scaler(logits)
    probs = torch.softmax(calibrated_logits, dim=1)
```

The validation set used for temperature fitting should be separate from the training set. Ideally, it should also be separate from the final test set.

### Label Smoothing and Calibration

Label smoothing replaces hard one-hot labels with softened targets. For $K$ classes and smoothing value $\epsilon$, the probability mass $\epsilon$ is spread uniformly over all classes: the target for the correct class becomes $1-\epsilon+\epsilon/K$, and every other class receives $\epsilon/K$.

In PyTorch:

```python
loss_fn = torch.nn.CrossEntropyLoss(label_smoothing=0.1)
```
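To see the smoothed targets explicitly, here is a minimal sketch that constructs them by hand (the helper `smoothed_targets` is only for illustration). With $\epsilon = 0.1$ and $K = 5$, the correct class receives $0.92$ and every other class $0.02$:

```python
import torch
import torch.nn.functional as F

def smoothed_targets(labels, num_classes, epsilon):
    # Mix the one-hot targets with a uniform distribution over all classes.
    one_hot = F.one_hot(labels, num_classes=num_classes).float()
    uniform = torch.full_like(one_hot, 1.0 / num_classes)
    return (1.0 - epsilon) * one_hot + epsilon * uniform

print(smoothed_targets(torch.tensor([2]), num_classes=5, epsilon=0.1))
# tensor([[0.0200, 0.0200, 0.9200, 0.0200, 0.0200]])
```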

Label smoothing can reduce overconfidence because the model is not trained to assign probability 1 to the correct class. It often improves calibration and generalization, although it can sometimes make confidence too conservative.

Label smoothing changes training. Temperature scaling changes only post-training probabilities. These methods can be combined, but validation metrics should decide whether the combination is useful.

### Confidence Thresholding

A calibrated confidence score can be used to abstain from uncertain predictions. Instead of forcing every image into a class, the system predicts only when confidence exceeds a threshold.

```python
def predict_with_threshold(model, images, threshold):
    logits = model(images)
    probs = torch.softmax(logits, dim=1)

    confidence, preds = probs.max(dim=1)
    accepted = confidence >= threshold

    return preds, confidence, accepted
```

This produces two useful metrics:

| Metric | Meaning |
|---|---|
| Coverage | Fraction of examples accepted |
| Selective accuracy | Accuracy on accepted examples |

Coverage decreases as the threshold increases. Selective accuracy usually increases.

A validation procedure can choose a threshold:

```python
def evaluate_threshold(confidences, correct, threshold):
    accepted = confidences >= threshold

    coverage = accepted.float().mean().item()

    if accepted.sum() == 0:
        selective_accuracy = float("nan")
    else:
        selective_accuracy = correct[accepted].float().mean().item()

    return coverage, selective_accuracy
```
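Sweeping a range of candidate thresholds on validation data then gives a coverage versus selective-accuracy trade-off to choose from:

```python
# Reuse confidences and correct collected on validation data.
for threshold in torch.linspace(0.5, 0.95, 10).tolist():
    coverage, selective_accuracy = evaluate_threshold(confidences, correct, threshold)
    print(f"threshold={threshold:.2f}  coverage={coverage:.3f}  "
          f"selective_accuracy={selective_accuracy:.3f}")
```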

In high-stakes settings, confidence thresholding should be paired with human review or a fallback system.

### Top-k Confidence

For some applications, the top prediction alone is too narrow. Top-k evaluation checks whether the true class appears among the $k$ highest-scoring classes.

```python
def topk_accuracy(logits, labels, k=5):
    _, topk = logits.topk(k, dim=1)
    correct = topk.eq(labels.view(-1, 1))
    return correct.any(dim=1).float().mean().item()
```

Top-k confidence can be useful for recommendation, retrieval, diagnosis assistance, or human-in-the-loop classification. It should not be confused with calibrated top-1 probability.

### Out-of-Distribution Inputs

Softmax confidence can be misleading on inputs far from the training distribution. A model trained on animals may confidently classify a medical scan as “dog” or “cat” because softmax always distributes probability across known classes.

Calibration on in-distribution validation data does not guarantee good behavior on out-of-distribution inputs.

Common approaches include:

| Method | Basic idea |
|---|---|
| Maximum softmax probability | Reject low-confidence predictions |
| Entropy threshold | Reject high-entropy predictions |
| Energy score | Use logit energy for detection |
| Ensembles | Use disagreement across models |
| Feature distance | Reject inputs far from training features |
| Specialized OOD detector | Train a separate detector |

Maximum softmax probability is simple but limited. Out-of-distribution detection is a separate problem from ordinary calibration.
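As one example from the table above, here is a hedged sketch of an entropy-based rejection rule; `entropy_reject` is a hypothetical helper, and the threshold would need tuning on data that includes known outliers:

```python
import torch

@torch.no_grad()
def entropy_reject(model, images, entropy_threshold):
    # Predictive entropy is high when probability mass is spread across classes.
    logits = model(images)
    probs = torch.softmax(logits, dim=1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum(dim=1)

    preds = probs.argmax(dim=1)
    rejected = entropy > entropy_threshold
    return preds, entropy, rejected
```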

### Confidence Under Distribution Shift

A model calibrated on one validation set may become miscalibrated under distribution shift. Examples include different cameras, different hospitals, new lighting conditions, seasonal changes, corrupted images, or changed class frequencies.

When deployment data changes, monitor both accuracy proxies and confidence distributions:

```python
def confidence_summary(confidences):
    return {
        "mean": confidences.mean().item(),
        "median": confidences.median().item(),
        "p90": confidences.quantile(0.90).item(),
        "p99": confidences.quantile(0.99).item(),
    }
```

A sudden increase in average confidence can indicate overconfident failure. A sudden decrease can indicate unfamiliar inputs. Either case deserves inspection.
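One way to use this summary in monitoring is to compare a stored reference from validation against recent production traffic; `val_confidences`, `recent_confidences`, and the 0.05 tolerance below are illustrative placeholders:

```python
reference = confidence_summary(val_confidences)
production = confidence_summary(recent_confidences)

for key in reference:
    drift = production[key] - reference[key]
    status = "CHECK" if abs(drift) > 0.05 else "ok"
    print(f"{key}: reference={reference[key]:.3f}  production={production[key]:.3f}  {status}")
```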

### Common Calibration Mistakes

| Mistake | Consequence |
|---|---|
| Treating softmax as automatically calibrated | Overconfident decisions |
| Calibrating on the test set | Biased final estimate |
| Reporting only accuracy | Hidden confidence errors |
| Using random validation transforms | Noisy calibration estimates |
| Changing preprocessing after calibration | Invalid temperature |
| Ignoring distribution shift | Deployment confidence drift |
| Using confidence as an OOD detector without testing | False sense of safety |

Calibration depends on the full pipeline: data, preprocessing, architecture, loss, augmentation, and deployment distribution.

### Practical Calibration Workflow

A practical workflow is:

1. Train the classifier normally.
2. Evaluate accuracy, NLL, Brier score, and ECE on validation data.
3. Plot or inspect reliability bins.
4. Fit temperature scaling on validation logits.
5. Recompute calibration metrics after scaling.
6. Choose confidence thresholds if the application allows abstention.
7. Test once on the held-out test set.
8. Monitor confidence distributions after deployment.

A compact evaluation function:

```python
@torch.no_grad()
def evaluate_calibration(model, loader, device, num_classes, scaler=None):
    model.eval()

    logits_list = []
    labels_list = []

    for images, labels in loader:
        images = images.to(device)
        labels = labels.to(device)

        logits = model(images)

        if scaler is not None:
            logits = scaler(logits)

        logits_list.append(logits.cpu())
        labels_list.append(labels.cpu())

    logits = torch.cat(logits_list)
    labels = torch.cat(labels_list)

    probs = torch.softmax(logits, dim=1)
    confidences, preds = probs.max(dim=1)
    correct = preds.eq(labels)

    return {
        "accuracy": correct.float().mean().item(),
        "nll": torch.nn.functional.cross_entropy(logits, labels).item(),
        "brier": brier_score(logits, labels, num_classes).item(),
        "ece": expected_calibration_error(confidences, correct),
        "mean_confidence": confidences.mean().item(),
    }
```

This gives a compact view of both correctness and probability quality.
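For example, the function can be run twice on validation data, before and after fitting the temperature scaler, to confirm that scaling improves NLL and ECE while leaving accuracy unchanged (`num_classes` is assumed to be defined; the held-out test set in step 7 then gives the unbiased final numbers):

```python
before = evaluate_calibration(model, val_loader, device, num_classes)

scaler = fit_temperature(model, val_loader, device)
after = evaluate_calibration(model, val_loader, device, num_classes, scaler=scaler)

for key in before:
    print(f"{key}: before={before[key]:.4f}  after={after[key]:.4f}")
```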

### Summary

Calibration asks whether confidence values mean what they claim. Accuracy measures correctness. Calibration measures whether predicted probabilities match observed correctness.

For PyTorch classifiers, calibration starts by collecting logits and labels on validation data. Reliability diagrams, ECE, NLL, and Brier score expose confidence errors. Temperature scaling is a simple and effective post-training method that adjusts confidence without changing predicted classes.

Confidence should be used carefully. It can support abstention, human review, and risk-aware deployment, but it does not solve out-of-distribution detection by itself.

