Bayesian Neural Networks

A Bayesian neural network is a neural network whose parameters are treated as random variables rather than fixed unknown constants.

In an ordinary neural network, training produces one set of weights. After training, the model makes predictions using those weights. In a Bayesian neural network, training produces a probability distribution over possible weights. Prediction then averages over many plausible networks, weighted by how well each network explains the data.

The central idea is uncertainty. A standard neural network gives a prediction. A Bayesian neural network gives a prediction together with a measure of how uncertain the model is about that prediction.

From Deterministic Weights to Random Weights

Consider a neural network

f_\theta(x),

where x is the input and \theta denotes all trainable parameters: weights, biases, embeddings, normalization parameters, and so on.

In ordinary deep learning, \theta is a fixed vector learned by optimization. Training solves a problem of the form

\theta^\star = \arg\min_\theta L(\theta).

After training, prediction uses

\hat{y} = f_{\theta^\star}(x).

This gives one model.

Bayesian learning uses a different view. The parameters \theta are uncertain. Before seeing data, we describe that uncertainty using a prior distribution

p(\theta).

After observing a dataset

D = \{(x_i, y_i)\}_{i=1}^{N},

we update our belief about θ\theta using Bayes’ rule:

p(\theta \mid D) = \frac{p(D \mid \theta)p(\theta)}{p(D)}.

The posterior distribution p(\theta \mid D) describes which parameter values remain plausible after observing the data.

The Prior

The prior p(\theta) encodes assumptions about the parameters before training data is observed.

A common simple prior is an independent Gaussian prior:

p(\theta) = \mathcal{N}(0, \sigma^2 I).

This prior says that small weights are more plausible than large weights. It is closely related to L2 regularization in ordinary neural network training.
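
Taking the negative logarithm makes the connection explicit:

-\log p(\theta) = \frac{1}{2\sigma^2} \|\theta\|^2 + \text{const},

so the Gaussian prior contributes exactly an L2 penalty with strength 1/(2\sigma^2).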

For a single weight w, the prior may be

w \sim \mathcal{N}(0, \sigma^2).

A small value of \sigma^2 strongly prefers weights near zero. A large value of \sigma^2 allows more flexible functions.

The prior affects generalization. If the prior is too restrictive, the model may underfit. If the prior is too broad, the model may become too uncertain or computationally difficult to infer.

The Likelihood

The likelihood p(D \mid \theta) describes how likely the observed data is under a given parameter setting.

For independent observations,

p(D \mid \theta) = \prod_{i=1}^{N} p(y_i \mid x_i, \theta).
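
Taking logarithms turns the product into a sum, which is the form used in training objectives:

\log p(D \mid \theta) = \sum_{i=1}^{N} \log p(y_i \mid x_i, \theta).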

For regression, we often assume Gaussian observation noise:

y_i = f_\theta(x_i) + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma_y^2).

Then

p(y_i \mid x_i, \theta) = \mathcal{N}(y_i; f_\theta(x_i), \sigma_y^2).
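
Taking the negative logarithm shows why this assumption corresponds to squared-error training:

-\log p(y_i \mid x_i, \theta) = \frac{\left( y_i - f_\theta(x_i) \right)^2}{2\sigma_y^2} + \text{const}.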

For classification, the network usually outputs logits. A softmax converts logits into class probabilities:

p(y_i = k \mid x_i, \theta) = \frac{\exp(z_k)}{\sum_j \exp(z_j)},

where

z = f_\theta(x_i).

The likelihood connects the neural network to probability theory. The network does not merely output a value. It defines a probability model for the target.
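
As a small illustration, PyTorch's cross-entropy loss computes exactly the negative log likelihood of this softmax model. The tensors below are made-up placeholders:

import torch
import torch.nn.functional as F

logits = torch.randn(4, 3)            # z = f_theta(x) for a batch of 4 inputs, 3 classes
targets = torch.tensor([0, 2, 1, 2])  # observed class labels

# Average negative log likelihood over the batch:
# -(1/N) * sum_i log p(y_i | x_i, theta).
nll = F.cross_entropy(logits, targets)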

The Posterior

The posterior distribution is the main object of Bayesian learning:

p(\theta \mid D) \propto p(D \mid \theta)p(\theta).

The symbol \propto means “proportional to.” The omitted term is the evidence

p(D) = \int p(D \mid \theta)p(\theta)\,d\theta.

The evidence normalizes the posterior so that it integrates to one.

In small Bayesian models, the posterior can sometimes be computed exactly. In neural networks, exact computation is usually intractable. The parameter space may contain millions or billions of dimensions, and the posterior can be highly non-Gaussian.

This is the central computational difficulty of Bayesian neural networks. The theory is simple. The exact inference problem is usually impossible to solve directly.

Bayesian Prediction

A Bayesian neural network predicts by averaging over the posterior distribution:

p(y^\star \mid x^\star, D) = \int p(y^\star \mid x^\star, \theta)p(\theta \mid D)\,d\theta.

This is called the posterior predictive distribution.

It says: to predict the output for a new input x^\star, consider every plausible parameter setting \theta, compute the prediction under that setting, and average these predictions according to posterior probability.

In practice, the integral is approximated using samples:

p(y^\star \mid x^\star, D) \approx \frac{1}{S} \sum_{s=1}^{S} p(y^\star \mid x^\star, \theta_s), \quad \theta_s \sim p(\theta \mid D).

For regression, we may sample several networks and compute several outputs:

f_{\theta_1}(x^\star), f_{\theta_2}(x^\star), \ldots, f_{\theta_S}(x^\star).

The mean gives the prediction. The spread gives uncertainty.

Types of Uncertainty

Bayesian neural networks are mainly used to represent two kinds of uncertainty.

Aleatoric uncertainty comes from noise in the data itself. For example, a blurry image, a noisy sensor, or an inherently random process may produce uncertain labels. More training data cannot fully remove aleatoric uncertainty.

Epistemic uncertainty comes from lack of knowledge. The model is uncertain because it has not seen enough relevant data. More training data can reduce epistemic uncertainty.

This distinction matters in deployment. A medical model should be uncertain when an input is far from its training distribution. An autonomous system should know when the environment differs from its experience. A scientific model should report uncertainty when extrapolating beyond observed data.

Ordinary neural networks often produce overconfident predictions. Bayesian methods attempt to make uncertainty explicit.

Maximum A Posteriori Estimation

A useful bridge between ordinary training and Bayesian training is maximum a posteriori estimation.

Instead of computing the full posterior, MAP estimation finds the most probable parameter setting:

\theta_{\text{MAP}} = \arg\max_\theta p(\theta \mid D).

Using Bayes’ rule, taking logarithms, and dropping the evidence term (which does not depend on \theta),

\theta_{\text{MAP}} = \arg\max_\theta \left[ \log p(D \mid \theta) + \log p(\theta) \right].

Equivalently, we minimize the negative log posterior:

\theta_{\text{MAP}} = \arg\min_\theta \left[ -\log p(D \mid \theta) - \log p(\theta) \right].

If the likelihood corresponds to cross-entropy loss and the prior is Gaussian, then MAP estimation becomes standard neural network training with weight decay.

Thus, ordinary regularized training can be interpreted as a point estimate approximation to Bayesian learning. It finds one good parameter setting instead of representing a distribution over many plausible settings.
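
A minimal sketch of a MAP training step in PyTorch; model and loader are assumed to exist, and the weight_decay value supplies the Gaussian prior gradient:

import torch
import torch.nn.functional as F

# weight_decay plays the role of the Gaussian prior: its strength
# corresponds to 1 / (2 sigma^2) in -log p(theta), up to scaling.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-4)

for x, y in loader:  # `model` and `loader` are assumed placeholders
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), y)  # the -log p(D | theta) term
    loss.backward()
    optimizer.step()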

Why Exact Bayesian Neural Networks Are Hard

Exact Bayesian inference in neural networks is difficult for several reasons.

The parameter dimension is large. Modern neural networks may have millions, billions, or trillions of parameters. A full posterior over such a space is too large to store explicitly.

The posterior is complex. Neural networks are nonlinear in their parameters. This creates multimodal, curved, and highly correlated posterior distributions.

The evidence integral is intractable. Computing

p(D) = \int p(D \mid \theta)p(\theta)\,d\theta

requires integrating over all parameter settings.

Prediction is also intractable. The posterior predictive distribution contains another integral over parameters.

Because of these difficulties, practical Bayesian neural networks rely on approximation.

Approximate Inference Methods

Several methods are used to approximate Bayesian neural networks.

Variational inference replaces the true posterior p(\theta \mid D) with a simpler distribution q_\phi(\theta). The parameters \phi are optimized so that q_\phi is close to the true posterior.

A common choice is a factorized Gaussian:

q_\phi(\theta) = \prod_j \mathcal{N}(\theta_j; \mu_j, \sigma_j^2).

This approximation is easy to sample from, but it ignores many correlations between parameters.

Monte Carlo methods approximate the posterior using samples. Markov chain Monte Carlo can be accurate in principle, but it is often too expensive for large neural networks.

Laplace approximation fits a Gaussian distribution around a trained solution. It uses local curvature information to approximate posterior uncertainty.

Deep ensembles train several neural networks independently and treat their variation as an approximate uncertainty estimate. They are not fully Bayesian, but they often perform well in practice.

Monte Carlo dropout uses dropout at inference time. Each forward pass samples a different dropped-out network. The variation across predictions provides an approximate uncertainty estimate.
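
A minimal sketch of Monte Carlo dropout, assuming model contains nn.Dropout layers:

import torch
from torch import nn

@torch.no_grad()
def mc_dropout_predict(model, x, num_samples=30):
    # Put the model in eval mode, then re-enable dropout modules so
    # each forward pass samples a different thinned network.
    model.eval()
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.train()
    preds = torch.stack([model(x) for _ in range(num_samples)], dim=0)
    return preds.mean(dim=0), preds.std(dim=0)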

Bayesian Linear Layer

A Bayesian neural network can be built by replacing deterministic layers with Bayesian layers.

A deterministic linear layer computes

y = Wx + b.

A Bayesian linear layer treats W and b as random variables:

W \sim q_\phi(W), \quad b \sim q_\phi(b).

A forward pass samples weights:

W^{(s)} \sim q_\phi(W), \quad b^{(s)} \sim q_\phi(b),

then computes

y^{(s)} = W^{(s)}x + b^{(s)}.

Multiple forward passes give multiple outputs. Their variation represents model uncertainty.

In PyTorch, a simplified Bayesian linear layer may store a mean and log standard deviation for each weight:

import torch
from torch import nn
import torch.nn.functional as F

class BayesianLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()

        # Variational parameters: a mean and a log standard deviation
        # for every weight and bias. Initializing log sigma at -5 keeps
        # the initial noise small (sigma = exp(-5), about 0.0067).
        self.weight_mu = nn.Parameter(torch.zeros(out_features, in_features))
        self.weight_log_sigma = nn.Parameter(torch.full((out_features, in_features), -5.0))

        self.bias_mu = nn.Parameter(torch.zeros(out_features))
        self.bias_log_sigma = nn.Parameter(torch.full((out_features,), -5.0))

    def forward(self, x):
        weight_sigma = torch.exp(self.weight_log_sigma)
        bias_sigma = torch.exp(self.bias_log_sigma)

        # Reparameterization trick: sample standard normal noise, then
        # shift and scale it, so gradients flow to mu and log sigma.
        weight_eps = torch.randn_like(self.weight_mu)
        bias_eps = torch.randn_like(self.bias_mu)

        weight = self.weight_mu + weight_sigma * weight_eps
        bias = self.bias_mu + bias_sigma * bias_eps

        return F.linear(x, weight, bias)
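
A quick usage check: two forward passes through the same layer produce different outputs, because each pass samples fresh weights.

layer = BayesianLinear(3, 2)
x = torch.randn(5, 3)

y1 = layer(x)  # one sampled network
y2 = layer(x)  # a different sampled network; y1 != y2 in general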

This layer uses the reparameterization trick. Instead of sampling directly from

\mathcal{N}(\mu, \sigma^2),

we sample

\epsilon \sim \mathcal{N}(0, 1)

and compute

w = \mu + \sigma \epsilon.

This makes the sampling operation differentiable with respect to \mu and \sigma.

Variational Objective

In variational Bayesian neural networks, we choose an approximate posterior q_\phi(\theta). We want it to be close to the true posterior p(\theta \mid D).

The standard objective is the evidence lower bound, or ELBO:

\mathcal{L}(\phi) = \mathbb{E}_{q_\phi(\theta)} [ \log p(D \mid \theta) ] - \mathrm{KL}(q_\phi(\theta) \| p(\theta)).

The first term rewards parameter samples that explain the data. The second term penalizes approximate posteriors that move too far from the prior.

Training maximizes the ELBO. Equivalently, one may minimize the negative ELBO:

-\mathcal{L}(\phi) = -\mathbb{E}_{q_\phi(\theta)} [ \log p(D \mid \theta) ] + \mathrm{KL}(q_\phi(\theta) \| p(\theta)).

This has the same structure as ordinary training plus regularization. The data term is a likelihood loss. The KL term is a Bayesian regularizer.
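
For the factorized Gaussian approximation with a standard normal prior, the KL term has a closed form:

\mathrm{KL}(q_\phi \| p) = \frac{1}{2} \sum_j \left( \mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2 \right).

A sketch of this term for the BayesianLinear layer above, assuming a \mathcal{N}(0, 1) prior on every parameter:

def kl_to_standard_normal(layer):
    # Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over all
    # weights and biases. Note that log sigma^2 = 2 * log_sigma.
    kl = 0.0
    for mu, log_sigma in [(layer.weight_mu, layer.weight_log_sigma),
                          (layer.bias_mu, layer.bias_log_sigma)]:
        sigma_sq = torch.exp(2.0 * log_sigma)
        kl = kl + 0.5 * torch.sum(mu ** 2 + sigma_sq - 1.0 - 2.0 * log_sigma)
    return kl

The negative ELBO is then the likelihood loss plus this KL term summed over all Bayesian layers.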

Prediction by Monte Carlo Sampling

At inference time, a Bayesian neural network usually makes several stochastic forward passes.

For classification:

@torch.no_grad()
def predict_proba(model, x, num_samples=30):
    probs = []

    # Each stochastic forward pass samples different weights, so the
    # predicted class probabilities vary from pass to pass.
    for _ in range(num_samples):
        logits = model(x)
        probs.append(torch.softmax(logits, dim=-1))

    probs = torch.stack(probs, dim=0)
    return probs.mean(dim=0), probs.std(dim=0)

The mean gives the predictive probability. The standard deviation gives a simple uncertainty measure.

For regression:

@torch.no_grad()
def predict_regression(model, x, num_samples=30):
    preds = []

    for _ in range(num_samples):
        preds.append(model(x))

    preds = torch.stack(preds, dim=0)
    return preds.mean(dim=0), preds.std(dim=0)

The mean prediction is the posterior predictive mean. The standard deviation measures predictive spread.

Bayesian Neural Networks and Ensembles

Deep ensembles are closely related to Bayesian prediction.

Instead of sampling weights from a posterior distribution, we train several models with different random initializations, data orders, or bootstrap samples. At inference time, we average their predictions:

p(y \mid x) \approx \frac{1}{M} \sum_{m=1}^{M} p(y \mid x, \theta_m).
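
A sketch of ensemble prediction for classification, assuming models is a list of independently trained networks:

import torch

@torch.no_grad()
def ensemble_predict(models, x):
    # Average the predictive distributions of the ensemble members.
    probs = torch.stack([torch.softmax(m(x), dim=-1) for m in models], dim=0)
    return probs.mean(dim=0)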

This resembles Bayesian model averaging, although the \theta_m values are not exact posterior samples.

Deep ensembles are often simpler and more scalable than fully Bayesian neural networks. Their disadvantage is cost. Training and serving multiple models may be expensive.

Bayesian neural networks try to obtain similar uncertainty benefits within a single probabilistic model.

Calibration

A model is calibrated when its confidence matches its accuracy.

Suppose a classifier assigns probability 0.8 to many predictions. If the model is well calibrated, about 80 percent of those predictions should be correct.
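
One common way to quantify this is a binned calibration check. The sketch below computes a simple expected calibration error, assuming probs holds predicted class probabilities and labels the true classes:

import torch

def expected_calibration_error(probs, labels, num_bins=10):
    # Compare average confidence to average accuracy within confidence bins.
    conf, preds = probs.max(dim=-1)
    correct = (preds == labels).float()
    edges = torch.linspace(0.0, 1.0, num_bins + 1)
    ece = torch.tensor(0.0)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            gap = (conf[mask].mean() - correct[mask].mean()).abs()
            ece = ece + mask.float().mean() * gap
    return ece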

Many deep networks are poorly calibrated. They can assign high confidence to wrong predictions, especially under distribution shift.

Bayesian neural networks can improve calibration because they average over parameter uncertainty. When the model lacks evidence, different parameter samples may disagree. This disagreement reduces confidence.

Calibration is important in medicine, finance, robotics, scientific modeling, and any setting where an incorrect high-confidence prediction is costly.

Strengths and Weaknesses

Bayesian neural networks provide a principled treatment of uncertainty. They combine neural network function approximation with Bayesian probability. They can express epistemic uncertainty, improve calibration, and support decision making under uncertainty.

Their main weakness is computational cost. Exact inference is intractable, and approximate inference can be expensive or inaccurate. Simple approximations may underestimate uncertainty. More accurate methods may fail to scale to modern architectures.

A practical view is that Bayesian neural networks are one point in a broader design space. Other uncertainty methods, such as ensembles, dropout sampling, Laplace approximation, and probabilistic output heads, may be preferable depending on the application.

Summary

A Bayesian neural network treats model parameters as random variables. It begins with a prior over weights, updates this prior using data, and obtains a posterior distribution over plausible networks.

Prediction uses the posterior predictive distribution, which averages over parameter uncertainty. This gives both predictions and uncertainty estimates.

The exact posterior is usually intractable for neural networks, so practical Bayesian neural networks rely on approximate inference. Variational inference, Monte Carlo sampling, Laplace approximation, dropout sampling, and ensembles are common approaches.

The main value of Bayesian neural networks is not higher accuracy alone. Their value is better uncertainty estimation, especially when the model operates under limited data, noisy observations, or distribution shift.