Bayesian neural networks require inference over a posterior distribution:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$$
For modern neural networks, this posterior is usually impossible to compute exactly. The parameter space is extremely high-dimensional, the likelihood is nonlinear, and the evidence integral is intractable.
Variational inference transforms this inference problem into an optimization problem. Instead of computing the true posterior directly, we approximate it with a simpler distribution and optimize that approximation.
This is one of the most important ideas in modern probabilistic deep learning.
The Central Idea
Suppose the true posterior is $p(\theta \mid D)$.

We introduce a simpler distribution

$$q_\phi(\theta)$$

called the variational distribution.

The parameters $\phi$ control the shape of the approximation. Variational inference chooses $\phi$ so that

$$q_\phi(\theta) \approx p(\theta \mid D)$$
The true posterior may be complicated, multimodal, and correlated. The variational distribution is chosen to be computationally manageable.
The core problem becomes:

$$\phi^* = \arg\min_{\phi}\, D_{KL}\big(q_\phi(\theta)\,\|\,p(\theta \mid D)\big)$$

where the divergence measure $D_{KL}$ is defined in the next section.
We therefore replace probabilistic integration with optimization.
The Kullback-Leibler Divergence
Variational inference measures the difference between two probability distributions using the Kullback-Leibler divergence.
For two distributions $q(x)$ and $p(x)$,
$$D_{KL}\big(q(x)\,\|\,p(x)\big) = \int q(x)\,\log\frac{q(x)}{p(x)}\,dx$$
The KL divergence is always nonnegative:

$$D_{KL}\big(q(x)\,\|\,p(x)\big) \ge 0$$
It equals zero only when the two distributions are identical almost everywhere.
The KL divergence is not symmetric:

$$D_{KL}\big(q(x)\,\|\,p(x)\big) \ne D_{KL}\big(p(x)\,\|\,q(x)\big) \quad \text{in general}$$
This asymmetry matters in variational inference because minimizing

$$D_{KL}\big(q_\phi(\theta)\,\|\,p(\theta \mid D)\big)$$

encourages the approximation to concentrate on regions where the posterior density is large.
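As a small numerical illustration of the asymmetry, using `torch.distributions` (the two Gaussians below are arbitrary choices for demonstration), the two KL directions generally give different values:

```python
import torch
from torch.distributions import Normal, kl_divergence

# Two illustrative one-dimensional Gaussians.
q = Normal(loc=torch.tensor(0.0), scale=torch.tensor(1.0))
p = Normal(loc=torch.tensor(2.0), scale=torch.tensor(0.5))

# kl_divergence(a, b) computes D_KL(a || b); the two directions differ.
print(kl_divergence(q, p))  # D_KL(q || p)
print(kl_divergence(p, q))  # D_KL(p || q)
```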
Why Direct KL Minimization Is Impossible
The posterior is

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$$

where

$$p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta$$

The evidence term $p(D)$ is usually intractable.

The KL divergence becomes

$$D_{KL}\big(q_\phi(\theta)\,\|\,p(\theta \mid D)\big) = \mathbb{E}_{q_\phi(\theta)}\big[\log q_\phi(\theta) - \log p(\theta \mid D)\big]$$

Substituting Bayes' rule:

$$D_{KL}\big(q_\phi(\theta)\,\|\,p(\theta \mid D)\big) = \mathbb{E}_{q_\phi(\theta)}\big[\log q_\phi(\theta) - \log p(D \mid \theta) - \log p(\theta)\big] + \log p(D)$$

The unknown evidence term $\log p(D)$ prevents us from evaluating this objective directly.
Variational inference avoids this difficulty by deriving a lower bound on the log evidence.
The Evidence Lower Bound
The central objective in variational inference is the evidence lower bound, or ELBO.
Starting from

$$\log p(D) = \log \int p(D \mid \theta)\, p(\theta)\, d\theta$$

we insert the variational distribution:

$$\log p(D) = \log \int q_\phi(\theta)\, \frac{p(D \mid \theta)\, p(\theta)}{q_\phi(\theta)}\, d\theta$$

Applying Jensen's inequality gives

$$\log p(D) \ge \mathbb{E}_{q_\phi(\theta)}\!\left[\log \frac{p(D \mid \theta)\, p(\theta)}{q_\phi(\theta)}\right]$$

This lower bound is the ELBO:

$$\mathcal{L}(\phi) = \mathbb{E}_{q_\phi(\theta)}\!\left[\log \frac{p(D \mid \theta)\, p(\theta)}{q_\phi(\theta)}\right]$$
Equivalently,
$$\mathcal{L}(\phi) = \mathbb{E}_{q_\phi(\theta)}\big[\log p(D \mid \theta)\big] - D_{KL}\big(q_\phi(\theta)\,\|\,p(\theta)\big)$$
This form is more intuitive.
The first term is the expected log-likelihood of the data under the variational distribution. It rewards parameter samples that explain the observed data.
The second term is a regularizer. It penalizes variational distributions that deviate too far from the prior.
Relationship Between ELBO and Posterior Approximation
The ELBO connects directly to posterior approximation.
The following identity holds:

$$\log p(D) = \mathcal{L}(\phi) + D_{KL}\big(q_\phi(\theta)\,\|\,p(\theta \mid D)\big)$$

Because the KL divergence is nonnegative,

$$\log p(D) \ge \mathcal{L}(\phi)$$
Thus the ELBO is a lower bound on the log evidence.
Maximizing the ELBO is equivalent to minimizing the KL divergence to the true posterior.
When the variational distribution exactly equals the posterior,
the KL divergence becomes zero and the ELBO becomes tight.
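For completeness, the identity above can be verified by expanding the KL divergence and substituting Bayes' rule:

$$
\begin{aligned}
D_{KL}\big(q_\phi(\theta)\,\|\,p(\theta \mid D)\big)
  &= \mathbb{E}_{q_\phi(\theta)}\big[\log q_\phi(\theta) - \log p(\theta \mid D)\big] \\
  &= \mathbb{E}_{q_\phi(\theta)}\big[\log q_\phi(\theta) - \log p(D \mid \theta) - \log p(\theta)\big] + \log p(D) \\
  &= -\mathcal{L}(\phi) + \log p(D)
\end{aligned}
$$

Rearranging recovers $\log p(D) = \mathcal{L}(\phi) + D_{KL}\big(q_\phi(\theta)\,\|\,p(\theta \mid D)\big)$.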
Mean-Field Variational Inference
A common approximation is mean-field variational inference.
The variational distribution factorizes across parameters:

$$q_\phi(\theta) = \prod_i q_{\phi_i}(\theta_i)$$

The simplest choice uses independent Gaussians:

$$q_{\phi_i}(\theta_i) = \mathcal{N}\big(\theta_i;\, \mu_i,\, \sigma_i^2\big)$$

The variational parameters are therefore

$$\phi = \{\mu_i, \sigma_i\}_i$$
Each parameter has a learned mean and variance.
This approximation is computationally efficient because sampling and KL computations become simple.
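As a minimal sketch of why the KL term is cheap in this setting (assuming, for this example, a standard normal prior on each parameter; the helper name is illustrative), the KL has a closed form that reduces to an elementwise sum:

```python
import torch

def kl_to_standard_normal(mu, log_sigma):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over all parameters."""
    sigma_sq = torch.exp(2.0 * log_sigma)
    return 0.5 * torch.sum(mu ** 2 + sigma_sq - 2.0 * log_sigma - 1.0)

# Example: a vector of 10 independent variational parameters.
mu = torch.zeros(10)
log_sigma = torch.full((10,), -1.0)
print(kl_to_standard_normal(mu, log_sigma))
```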
Its main weakness is independence. Real posterior distributions often contain strong parameter correlations, but mean-field inference ignores them.
Variational Parameters in Neural Networks
In a Bayesian neural network, every parameter tensor becomes probabilistic.
For a deterministic weight tensor:

$$W \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$$

Variational inference replaces it with a distribution

$$q_\phi(W)$$

Using a Gaussian approximation:

$$q_\phi(W_{ij}) = \mathcal{N}\big(\mu_{ij},\, \sigma_{ij}^2\big)$$
The model therefore learns both
- parameter means $\mu_{ij}$
- parameter uncertainties $\sigma_{ij}$
instead of only point estimates.
The parameter count roughly doubles because each weight now requires both a mean and a variance parameter.
The Reparameterization Trick
The ELBO contains expectations over stochastic parameters. To optimize the ELBO using gradient descent, we need differentiable sampling.
The reparameterization trick solves this problem.
Instead of sampling directly from

$$\theta \sim \mathcal{N}(\mu, \sigma^2)$$

we sample

$$\epsilon \sim \mathcal{N}(0, 1)$$

then compute

$$\theta = \mu + \sigma\,\epsilon$$

The randomness is isolated in $\epsilon$, which is independent of the learnable parameters.

Gradients can now flow through $\mu$ and $\sigma$.
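A minimal sketch of this in PyTorch (the variable names and the toy loss are illustrative): the reparameterized sample leaves a differentiable path back to $\mu$ and $\sigma$.

```python
import torch

# Learnable variational parameters for a single scalar weight.
mu = torch.tensor(0.5, requires_grad=True)
log_sigma = torch.tensor(-1.0, requires_grad=True)

# Reparameterized sample: the randomness lives only in eps.
eps = torch.randn(())
theta = mu + torch.exp(log_sigma) * eps

# Any loss built from theta now has gradients w.r.t. mu and log_sigma.
loss = (theta - 1.0) ** 2
loss.backward()
print(mu.grad, log_sigma.grad)
```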
This trick is central to variational autoencoders, Bayesian neural networks, diffusion models, and many probabilistic deep learning systems.
Monte Carlo Estimation of the ELBO
The ELBO expectation is usually estimated using Monte Carlo sampling.
Suppose we sample

$$\theta^{(s)} \sim q_\phi(\theta), \qquad s = 1, \dots, S$$

The expected likelihood term becomes

$$\mathbb{E}_{q_\phi(\theta)}\big[\log p(D \mid \theta)\big] \approx \frac{1}{S}\sum_{s=1}^{S} \log p\big(D \mid \theta^{(s)}\big)$$

In practice, even one or a few samples per minibatch may work well.

The full stochastic objective becomes

$$\hat{\mathcal{L}}(\phi) = \frac{1}{S}\sum_{s=1}^{S} \log p\big(D \mid \theta^{(s)}\big) - D_{KL}\big(q_\phi(\theta)\,\|\,p(\theta)\big)$$
Training therefore resembles ordinary stochastic optimization, except the parameters are sampled from probability distributions.
Variational Bayesian Linear Layer
A variational Bayesian linear layer stores a mean and a log standard deviation for each weight and bias.
```python
import torch
from torch import nn
import torch.nn.functional as F


class VariationalLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        # Variational means and log standard deviations for the weights.
        self.weight_mu = nn.Parameter(
            torch.zeros(out_features, in_features)
        )
        self.weight_log_sigma = nn.Parameter(
            torch.full((out_features, in_features), -5.0)
        )
        # Variational means and log standard deviations for the biases.
        self.bias_mu = nn.Parameter(
            torch.zeros(out_features)
        )
        self.bias_log_sigma = nn.Parameter(
            torch.full((out_features,), -5.0)
        )

    def sample_weights(self):
        # Reparameterization trick: theta = mu + sigma * eps.
        weight_eps = torch.randn_like(self.weight_mu)
        bias_eps = torch.randn_like(self.bias_mu)
        weight_sigma = torch.exp(self.weight_log_sigma)
        bias_sigma = torch.exp(self.bias_log_sigma)
        weight = self.weight_mu + weight_sigma * weight_eps
        bias = self.bias_mu + bias_sigma * bias_eps
        return weight, bias

    def forward(self, x):
        weight, bias = self.sample_weights()
        return F.linear(x, weight, bias)
```

Each forward pass samples a different linear transformation.
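To connect this layer to the stochastic objective above, here is a rough training-step sketch. The toy regression data, the Gaussian (mean-squared-error) likelihood, and the standard normal prior behind the KL helper are all illustrative assumptions, not part of the layer itself.

```python
import torch
from torch import nn

def kl_to_standard_normal(mu, log_sigma):
    # Closed-form KL( N(mu, sigma^2) || N(0, 1) ), as in the mean-field section.
    return 0.5 * torch.sum(mu ** 2 + torch.exp(2.0 * log_sigma) - 2.0 * log_sigma - 1.0)

# Toy regression data and a single variational layer (both illustrative only).
layer = VariationalLinear(in_features=4, out_features=1)
optimizer = torch.optim.Adam(layer.parameters(), lr=1e-3)
x, y = torch.randn(32, 4), torch.randn(32, 1)

optimizer.zero_grad()
prediction = layer(x)                                  # one Monte Carlo sample (S = 1)
nll = nn.functional.mse_loss(prediction, y, reduction="sum")
kl = (kl_to_standard_normal(layer.weight_mu, layer.weight_log_sigma)
      + kl_to_standard_normal(layer.bias_mu, layer.bias_log_sigma))
loss = nll + kl                                        # negative ELBO (up to constants)
loss.backward()
optimizer.step()
```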
KL Regularization
The KL term acts as a probabilistic regularizer.
Suppose the prior is

$$p(\theta) = \mathcal{N}(0, \sigma_p^2 I)$$

If the variational posterior drifts far from this prior, with large means or inflated variances, the KL penalty increases.
This prevents overfitting by discouraging unnecessarily confident parameter distributions.
Ordinary L2 regularization can therefore be interpreted as a simple Bayesian prior penalty.
Variational inference generalizes this idea from point estimates to full probability distributions.
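To make the connection concrete, for a single parameter with variational posterior $\mathcal{N}(\mu, \sigma^2)$ and a standard normal prior, the KL term has a closed form:

$$D_{KL}\big(\mathcal{N}(\mu, \sigma^2)\,\|\,\mathcal{N}(0, 1)\big) = \tfrac{1}{2}\big(\mu^2 + \sigma^2 - \log \sigma^2 - 1\big)$$

Holding the variance fixed, only the $\tfrac{1}{2}\mu^2$ term depends on the learned value, which is exactly an L2 penalty on the mean.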
Variational Free Energy Interpretation
The ELBO is also called variational free energy in statistical physics.
The expected likelihood term encourages accurate data modeling.
The KL term penalizes complexity.
Thus variational inference balances
- data fit
- model simplicity
This resembles many principles in machine learning, including minimum description length and information bottleneck methods.
Variational Inference in Large Models
Modern deep learning systems often use scalable forms of variational inference.
Examples include:
| Method | Idea |
|---|---|
| Variational Bayesian neural networks | Posterior over weights |
| Variational autoencoders | Posterior over latent variables |
| Bayesian transformers | Posterior over transformer parameters |
| Monte Carlo dropout | Approximate variational inference |
| Deep latent variable models | Learned probabilistic representations |
In very large foundation models, full Bayesian inference is usually too expensive. Approximate uncertainty methods such as ensembles, dropout sampling, or probabilistic output layers are more common.
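As a rough sketch of one such lightweight approach, Monte Carlo dropout keeps dropout active at prediction time and averages several stochastic forward passes; the model architecture and number of samples below are placeholder choices.

```python
import torch
from torch import nn

# Placeholder model with dropout between layers.
model = nn.Sequential(
    nn.Linear(16, 64), nn.ReLU(), nn.Dropout(p=0.1), nn.Linear(64, 1)
)

def mc_dropout_predict(model, x, num_samples=50):
    model.train()  # keep dropout stochastic at prediction time
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(num_samples)])
    # Predictive mean and a simple spread-based uncertainty estimate.
    return samples.mean(dim=0), samples.std(dim=0)

mean, std = mc_dropout_predict(model, torch.randn(8, 16))
```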
However, the conceptual framework of variational inference remains central to probabilistic deep learning.
Variational Autoencoders as Variational Inference
Variational autoencoders are one of the most successful applications of variational inference.
The encoder approximates a posterior:

$$q_\phi(z \mid x)$$

The decoder defines a likelihood:

$$p_\theta(x \mid z)$$

Training maximizes the ELBO:

$$\mathcal{L}(\theta, \phi) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)$$

The latent variable $z$ replaces neural network weights as the random variable being inferred.
This demonstrates that variational inference is not limited to Bayesian weights. It is a general framework for approximate probabilistic inference.
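A compact sketch of the per-example VAE objective follows; the linear encoder and decoder, the Bernoulli likelihood, and the standard normal prior on $z$ are all illustrative choices, not the only ones possible.

```python
import torch
from torch import nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, data_dim=784, latent_dim=16):
        super().__init__()
        self.encoder = nn.Linear(data_dim, 2 * latent_dim)  # outputs mu and log_sigma
        self.decoder = nn.Linear(latent_dim, data_dim)

    def forward(self, x):
        mu, log_sigma = self.encoder(x).chunk(2, dim=-1)
        z = mu + torch.exp(log_sigma) * torch.randn_like(mu)   # reparameterized sample
        logits = self.decoder(z)
        # Negative ELBO: reconstruction term plus KL( q(z|x) || N(0, I) ).
        recon = F.binary_cross_entropy_with_logits(logits, x, reduction="none").sum(-1)
        kl = 0.5 * (mu ** 2 + torch.exp(2 * log_sigma) - 2 * log_sigma - 1).sum(-1)
        return (recon + kl).mean()
```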
Limitations of Variational Inference
Variational inference is scalable and differentiable, but it has limitations.
The variational family may be too simple. Mean-field approximations ignore correlations and may underestimate uncertainty.
Optimization can become unstable in large probabilistic models.
The KL direction

$$D_{KL}\big(q_\phi(\theta)\,\|\,p(\theta \mid D)\big)$$

often encourages mode-seeking behavior. The approximation may collapse onto one dominant mode instead of representing the full posterior structure.
Monte Carlo estimates introduce stochastic noise into training.
Despite these limitations, variational inference remains one of the most practical and widely used approaches for approximate Bayesian deep learning.
Summary
Variational inference approximates an intractable posterior distribution using a simpler parameterized distribution.
Instead of exact Bayesian inference, the problem becomes optimization of the evidence lower bound.
The ELBO combines two objectives:
- maximize expected data likelihood
- minimize divergence from the prior
The reparameterization trick enables gradient-based optimization of stochastic models. Mean-field Gaussian approximations provide scalable Bayesian neural networks.
Variational inference forms the foundation of many modern probabilistic deep learning systems, including Bayesian neural networks, variational autoencoders, latent variable models, and uncertainty-aware deep learning architectures.