Uncertainty Estimation

Uncertainty estimation measures how much confidence a model should place in its own predictions. In ordinary supervised learning, a model usually returns a point prediction or a probability distribution over classes. In uncertainty-aware learning, the model also reports how reliable that prediction is.

This matters because high accuracy on a test set does not guarantee safe behavior under distribution shift, noisy inputs, missing features, adversarial perturbations, or rare cases. A model should know when it has enough evidence, and when it should defer, request more information, or warn the downstream system.

Predictive Uncertainty

For a supervised model, the predictive distribution is

p(y \mid x, D),

where D is the training data. This distribution describes what the model believes about the output y for a new input x.

In classification, predictive uncertainty appears as a probability vector:

p(y \mid x, D) = [p_1, p_2, \ldots, p_K],

where K is the number of classes.

A confident prediction may look like

[0.97, 0.01, 0.02].

An uncertain prediction may look like

[0.36, 0.33, 0.31].

In regression, uncertainty is often represented by a predictive mean and variance:

p(y \mid x, D) = \mathcal{N}(\mu(x), \sigma^2(x)).

The mean gives the prediction. The variance gives the uncertainty.

Aleatoric and Epistemic Uncertainty

Uncertainty is usually divided into two types.

Aleatoric uncertainty comes from noise in the data-generating process. It remains even if the model has unlimited training data. Sensor noise, ambiguous labels, measurement error, and inherently random outcomes produce aleatoric uncertainty.

Epistemic uncertainty comes from lack of knowledge. It appears when the model has limited data, sees unfamiliar inputs, or extrapolates outside the training distribution. More relevant data can reduce epistemic uncertainty.

For example, in image classification, a blurry image of a cat may have high aleatoric uncertainty because the input is intrinsically ambiguous. An image of a medical device given to a model trained only on animals may have high epistemic uncertainty because the input is outside the model’s experience.

A useful model should represent both.

Uncertainty in Classification

A classifier usually outputs logits

z = f_\theta(x).

Softmax converts logits into class probabilities:

p_k = \frac{\exp(z_k)}{\sum_{j=1}^{K}\exp(z_j)}.

The largest probability is often used as a confidence score:

\max_k p_k.

This score is simple, but it can be misleading. Neural networks can assign high softmax confidence to incorrect predictions, especially on out-of-distribution inputs.

A better uncertainty measure is predictive entropy:

H[p(y \mid x)] = -\sum_{k=1}^{K} p_k \log p_k.

Low entropy means the probability mass is concentrated on one class. High entropy means the probability mass is spread across many classes.

import torch

def predictive_entropy(probs, eps=1e-8):
    return -(probs * torch.log(probs + eps)).sum(dim=-1)
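
For example, the two probability vectors shown above give clearly different entropies. A quick check (the values depend only on the vectors themselves, not on any model):

probs = torch.tensor([[0.97, 0.01, 0.02],
                      [0.36, 0.33, 0.31]])
print(predictive_entropy(probs))
# roughly tensor([0.15, 1.10]); the maximum for three classes is log 3 ≈ 1.10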

Uncertainty in Regression

For regression, a model can output both a mean and a variance:

\mu(x), \quad \sigma^2(x).

The prediction is then modeled as

y \sim \mathcal{N}(\mu(x), \sigma^2(x)).

The loss is the negative log likelihood:

\mathcal{L} = \frac{1}{2}\log\sigma^2(x) + \frac{(y - \mu(x))^2}{2\sigma^2(x)}.

This objective encourages the model to predict large variance where the target is noisy, but penalizes unnecessary inflation of uncertainty through the log variance term.

In PyTorch:

import torch
from torch import nn

class GaussianRegressionHead(nn.Module):
    def __init__(self, in_features):
        super().__init__()
        self.mean = nn.Linear(in_features, 1)
        self.log_var = nn.Linear(in_features, 1)

    def forward(self, x):
        mean = self.mean(x)
        log_var = self.log_var(x).clamp(min=-10.0, max=10.0)
        return mean, log_var

def gaussian_nll(mean, log_var, target):
    inv_var = torch.exp(-log_var)
    loss = 0.5 * log_var + 0.5 * (target - mean).pow(2) * inv_var
    return loss.mean()
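
A minimal training step might look like the following sketch; the random features and targets are stand-ins for a real feature extractor and dataset.

head = GaussianRegressionHead(in_features=32)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

features = torch.randn(64, 32)  # stand-in for backbone features
targets = torch.randn(64, 1)    # stand-in for noisy regression targets

mean, log_var = head(features)
loss = gaussian_nll(mean, log_var, targets)

optimizer.zero_grad()
loss.backward()
optimizer.step()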

This approach mainly captures aleatoric uncertainty: the predicted variance models noise in the targets given the input, while a single network provides no estimate of its own parameter uncertainty.

Bayesian Uncertainty

Bayesian models estimate uncertainty by maintaining a distribution over model parameters:

p(\theta \mid D).

Prediction averages over this posterior:

p(y \mid x, D) = \int p(y \mid x, \theta)\, p(\theta \mid D)\, d\theta.

When different plausible parameter settings disagree, epistemic uncertainty is high.

In practice, this integral is approximated with samples:

p(y \mid x, D) \approx \frac{1}{S} \sum_{s=1}^{S} p(y \mid x, \theta_s).

The spread across sampled predictions measures uncertainty.

Deep Ensembles

A deep ensemble trains several models independently:

f_{\theta_1}, f_{\theta_2}, \ldots, f_{\theta_M}.

Each model may use a different initialization, data order, bootstrap sample, or training configuration.

At inference time, predictions are averaged:

p(y \mid x) = \frac{1}{M} \sum_{m=1}^{M} p(y \mid x, \theta_m).

Disagreement among ensemble members indicates epistemic uncertainty.

Deep ensembles are often strong in practice because they are simple, parallelizable, and compatible with existing training pipelines. Their main cost is that training and serving require multiple models.

import torch
import torch.nn.functional as F

@torch.no_grad()
def ensemble_predict(models, x):
    probs = []

    for model in models:
        model.eval()
        logits = model(x)
        probs.append(F.softmax(logits, dim=-1))

    probs = torch.stack(probs, dim=0)
    mean_probs = probs.mean(dim=0)
    disagreement = probs.var(dim=0).mean(dim=-1)

    return mean_probs, disagreement

Monte Carlo Dropout

Monte Carlo dropout keeps dropout active during inference. Each forward pass samples a different dropout mask, giving a different subnetwork.

p(y \mid x) \approx \frac{1}{S} \sum_{s=1}^{S} p^{(s)}(y \mid x).

This gives approximate uncertainty at low implementation cost.

from torch import nn

def enable_dropout(model):
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.train()

Then run several stochastic forward passes and average the probabilities.
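
A minimal sketch of that loop for a classifier, assuming the model returns logits:

import torch
import torch.nn.functional as F

@torch.no_grad()
def mc_dropout_predict(model, x, num_samples=20):
    model.eval()           # put the model in eval mode first
    enable_dropout(model)  # then re-enable only the dropout layers
    probs = torch.stack(
        [F.softmax(model(x), dim=-1) for _ in range(num_samples)], dim=0
    )
    return probs.mean(dim=0)  # average over stochastic forward passes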

Monte Carlo dropout is easy to add to existing models, but it usually gives weaker uncertainty estimates than well-trained ensembles.

Mutual Information

For Bayesian or ensemble classifiers, predictive entropy mixes aleatoric and epistemic uncertainty. Mutual information can isolate model uncertainty.

Let

p_m(y \mid x)

be the prediction of ensemble member m, and let

\bar{p}(y \mid x) = \frac{1}{M} \sum_{m=1}^{M} p_m(y \mid x).

Predictive entropy is

H[\bar{p}(y \mid x)].

Expected entropy is

\frac{1}{M} \sum_{m=1}^{M} H[p_m(y \mid x)].

Mutual information is

I[y, \theta \mid x, D] = H[\bar{p}(y \mid x)] - \frac{1}{M} \sum_{m=1}^{M} H[p_m(y \mid x)].

High mutual information means the models disagree. This usually indicates epistemic uncertainty.

import torch

def ensemble_mutual_information(probs, eps=1e-8):
    # probs shape: [num_models, batch, num_classes]
    mean_probs = probs.mean(dim=0)

    predictive_entropy = -(mean_probs * torch.log(mean_probs + eps)).sum(dim=-1)
    expected_entropy = -(probs * torch.log(probs + eps)).sum(dim=-1).mean(dim=0)

    return predictive_entropy - expected_entropy

Calibration

A model is calibrated when its predicted probabilities match empirical frequencies.

If a classifier predicts many examples with confidence 0.8, about 80 percent of those predictions should be correct.

Calibration differs from accuracy. A model can be accurate but overconfident, or less accurate but better calibrated.

A simple calibration metric is expected calibration error, or ECE. It groups predictions into confidence bins and compares average confidence with average accuracy in each bin.

import torch

def expected_calibration_error(probs, labels, num_bins=15):
    confidences, predictions = probs.max(dim=-1)
    accuracies = predictions.eq(labels)

    ece = torch.zeros((), device=probs.device)

    bin_edges = torch.linspace(0, 1, num_bins + 1, device=probs.device)

    for i in range(num_bins):
        lower = bin_edges[i]
        upper = bin_edges[i + 1]

        in_bin = confidences.gt(lower) & confidences.le(upper)
        prop_in_bin = in_bin.float().mean()

        if prop_in_bin.item() > 0:
            accuracy = accuracies[in_bin].float().mean()
            confidence = confidences[in_bin].mean()
            ece += prop_in_bin * torch.abs(confidence - accuracy)

    return ece

Calibration is especially important when probabilities are used for decisions, not just ranking.

Temperature Scaling

Temperature scaling is a simple post-training calibration method.

Given logits z, calibrated probabilities are computed as

p_k = \frac{\exp(z_k / T)}{\sum_j \exp(z_j / T)}.

The scalar T > 0 is fitted on a validation set.

If T > 1, the softmax distribution becomes softer and less confident. If T < 1, it becomes sharper.

Temperature scaling does not change the predicted class. It only changes confidence.

import torch
from torch import nn

class TemperatureScaler(nn.Module):
    def __init__(self):
        super().__init__()
        self.log_temperature = nn.Parameter(torch.zeros(()))

    def forward(self, logits):
        temperature = torch.exp(self.log_temperature)
        return logits / temperature
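
One common way to fit the temperature, sketched below, is to minimize the negative log likelihood of held-out validation logits with LBFGS; the validation logits and labels are assumed to be precomputed.

import torch.nn.functional as F

def fit_temperature(scaler, val_logits, val_labels):
    # Optimize only the temperature parameter on validation data.
    optimizer = torch.optim.LBFGS([scaler.log_temperature], lr=0.1, max_iter=50)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(scaler(val_logits), val_labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return torch.exp(scaler.log_temperature).item()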

Out-of-Distribution Detection

Out-of-distribution detection asks whether an input comes from a different distribution than the training data.

A classifier trained on handwritten digits may receive a photo of a car. A medical model trained on one scanner may receive images from another scanner. A language model trained for one domain may receive legal or medical text.

Uncertainty measures can help detect such cases.

Common scores include:

| Score | Idea |
| --- | --- |
| Maximum softmax probability | Low max probability suggests uncertainty |
| Predictive entropy | High entropy suggests uncertainty |
| Ensemble disagreement | High disagreement suggests epistemic uncertainty |
| Energy score | Uses the log-sum-exp of the logits |
| Mahalanobis distance | Measures distance in feature space |
| Density score | Measures likelihood under a learned data model |

No score is universally reliable. OOD detection should be evaluated on realistic shift scenarios.
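
As one concrete example, the energy score from the table is a single reduction over the logits. A minimal sketch, following the usual convention that lower energy suggests an in-distribution input:

import torch

def energy_score(logits, temperature=1.0):
    # Negative temperature-scaled log-sum-exp of the logits.
    # Higher values suggest the input may be out-of-distribution.
    return -temperature * torch.logsumexp(logits / temperature, dim=-1)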

Conformal Prediction

Conformal prediction builds prediction sets with statistical coverage guarantees under exchangeability assumptions.

Instead of returning one class, a conformal classifier returns a set of possible classes:

\Gamma(x) \subseteq \{1, \ldots, K\}.

For example, with 90 percent coverage, the method aims to ensure that the true label lies in the set about 90 percent of the time.

For regression, conformal prediction returns intervals:

[\hat{y}_{\text{lower}}, \hat{y}_{\text{upper}}].

Conformal methods can be combined with neural networks. The neural network provides scores, and conformal calibration converts those scores into sets or intervals.

This is useful when reliable coverage matters more than always producing a single prediction.
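
A minimal split conformal sketch for classification, assuming a held-out calibration set and using the common score of one minus the probability assigned to the true class:

import math
import torch

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    # Nonconformity score: 1 - probability of the true class.
    scores = 1.0 - cal_probs.gather(1, cal_labels.unsqueeze(1)).squeeze(1)
    n = scores.numel()
    # Finite-sample corrected quantile level for (1 - alpha) coverage.
    level = min(1.0, math.ceil((n + 1) * (1 - alpha)) / n)
    return torch.quantile(scores, level, interpolation="higher").item()

def conformal_set(probs, threshold):
    # Boolean mask over classes: True means the class is in the set.
    return (1.0 - probs) <= threshold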

Practical Comparison

| Method | Captures | Cost | Notes |
| --- | --- | --- | --- |
| Softmax confidence | Rough predictive confidence | Low | Often overconfident |
| Entropy | Predictive uncertainty | Low | Simple classification score |
| Gaussian regression head | Aleatoric uncertainty | Low | Good for noisy regression |
| MC dropout | Approximate epistemic uncertainty | Medium | Easy to retrofit |
| Deep ensembles | Epistemic uncertainty | High | Strong practical baseline |
| Bayesian neural networks | Principled posterior uncertainty | High | Requires approximate inference |
| Conformal prediction | Coverage guarantees | Medium | Uses calibration data |

Practical Guidance

Use softmax confidence only as a baseline. It is easy to compute, but it often fails under distribution shift.

For regression with noisy labels, use probabilistic output heads and train with negative log likelihood.

For classification systems where uncertainty matters, evaluate calibration using ECE, reliability diagrams, and validation likelihood.

For high-stakes systems, compare at least one ensemble method and one post-hoc calibration method.

For deployment, decide what the system should do when uncertainty is high. It may abstain, ask a human, request more input, route to a stronger model, or refuse to automate the decision.

Summary

Uncertainty estimation turns a neural network from a pure predictor into a risk-aware model.

Aleatoric uncertainty comes from noise in the data. Epistemic uncertainty comes from lack of knowledge. Bayesian models, ensembles, dropout sampling, probabilistic output heads, calibration, and conformal prediction are common tools for estimating and controlling uncertainty.

Good uncertainty estimates are essential when models operate outside clean benchmark settings.