# Variational Inference

Bayesian neural networks require inference over a posterior distribution:

$$
p(\theta \mid D) =
\frac{p(D \mid \theta)p(\theta)}{p(D)}.
$$

For modern neural networks, this posterior is usually impossible to compute exactly. The parameter space is extremely high-dimensional, the likelihood is nonlinear, and the evidence integral is intractable.

Variational inference transforms this inference problem into an optimization problem. Instead of computing the true posterior directly, we approximate it with a simpler distribution and optimize that approximation.

This is one of the most important ideas in modern probabilistic deep learning.

### The Central Idea

Suppose the true posterior is

$$
p(\theta \mid D).
$$

We introduce a simpler distribution

$$
q_\phi(\theta),
$$

called the variational distribution.

The parameters $\phi$ control the shape of the approximation. Variational inference chooses $\phi$ so that

$$
q_\phi(\theta)
\approx
p(\theta \mid D).
$$

The true posterior may be complicated, multimodal, and correlated. The variational distribution is chosen to be computationally manageable.

The core problem becomes:

$$
\phi^\star =
\arg\min_\phi
\mathrm{KL}(q_\phi(\theta)\|p(\theta \mid D)).
$$

We therefore replace probabilistic integration with optimization.

### The Kullback-Leibler Divergence

Variational inference measures the difference between two probability distributions using the Kullback-Leibler divergence.

For two distributions $q(x)$ and $p(x)$,

genui{"math_block_widget_always_prefetch_v2":{"content":"D_{KL}(q(x)\\|p(x))=\\int q(x)\\log\\frac{q(x)}{p(x)}\\,dx"}}

The KL divergence is always nonnegative:

$$
\mathrm{KL}(q\|p)\geq0.
$$

It equals zero only when the two distributions are identical almost everywhere.

The KL divergence is not symmetric:

$$
\mathrm{KL}(q\|p)
\neq
\mathrm{KL}(p\|q).
$$

This asymmetry matters in variational inference because minimizing

$$
\mathrm{KL}(q_\phi(\theta)\|p(\theta\mid D))
$$

heavily penalizes placing probability mass where the posterior density is near zero. The approximation therefore tends to concentrate on high-density regions of the posterior rather than spread over its full support.
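
As a small illustration, here is a minimal sketch (assuming PyTorch, whose `torch.distributions` module provides closed-form Gaussian KL divergences; the means and scales are arbitrary) showing that the two directions generally give different values:

```python
import torch
from torch.distributions import Normal, kl_divergence

# Two univariate Gaussians with different means and scales.
q = Normal(loc=0.0, scale=1.0)
p = Normal(loc=1.0, scale=2.0)

# Closed-form KL divergences in both directions.
kl_q_to_p = kl_divergence(q, p)  # KL(q || p)
kl_p_to_q = kl_divergence(p, q)  # KL(p || q)

print(kl_q_to_p.item(), kl_p_to_q.item())  # the two values differ
```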

### Why Direct KL Minimization Is Intractable

The posterior is

$$
p(\theta \mid D) =
\frac{p(D,\theta)}{p(D)},
$$

where

$$
p(D) =
\int p(D,\theta)\,d\theta.
$$

The evidence term $p(D)$ is usually intractable.

The KL divergence becomes

$$
\mathrm{KL}(q_\phi(\theta)\|p(\theta\mid D)) =
\int
q_\phi(\theta)
\log
\frac{q_\phi(\theta)}{p(\theta\mid D)}
\,d\theta.
$$

Substituting Bayes’ rule:

$$
=
\int
q_\phi(\theta)
\log
\frac{q_\phi(\theta)\,p(D)}
{p(D,\theta)}
\,d\theta.
$$

The unknown evidence $p(D)$ prevents this expression from being evaluated directly. However, $\log p(D)$ does not depend on $\phi$, so it enters the objective only as an additive constant.

Variational inference exploits this by optimizing a tractable lower bound on the log evidence instead.

### The Evidence Lower Bound

The central objective in variational inference is the evidence lower bound, or ELBO.

Starting from

$$
\log p(D),
$$

we insert the variational distribution:

$$
\log p(D) =
\log
\int
q_\phi(\theta)
\frac{p(D,\theta)}
{q_\phi(\theta)}
\,d\theta.
$$

Applying Jensen’s inequality to the concave logarithm gives

$$
\log p(D)
\geq
\mathbb{E}_{q_\phi(\theta)}
\left[
\log
\frac{p(D,\theta)}
{q_\phi(\theta)}
\right].
$$

This lower bound is the ELBO:

$$
\mathcal{L}(\phi) =
\mathbb{E}_{q_\phi(\theta)}
[
\log p(D,\theta) -
\log q_\phi(\theta)
].
$$

Equivalently,

genui{"math_block_widget_always_prefetch_v2":{"content":"\\mathcal{L}(\\phi)=\\mathbb{E}_{q_\\phi(\\theta)}[\\log p(D\\mid\\theta)]-D_{KL}(q_\\phi(\\theta)\\|p(\\theta))"}}

This form is more intuitive.

The first term is the expected log-likelihood of the data. It rewards parameter samples that explain the observed data.

The second term is a regularizer. It penalizes variational distributions that deviate too far from the prior.
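
To see that the two forms agree, here is a minimal sketch (assuming PyTorch; the toy Gaussian model, data values, and variational parameters are all illustrative) that estimates both by Monte Carlo on a one-dimensional conjugate model:

```python
import torch
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)

# Toy conjugate model: theta ~ N(0, 1) and x_i | theta ~ N(theta, 1).
data = torch.tensor([0.8, 1.1, 0.9])
prior = Normal(0.0, 1.0)

# A candidate variational distribution q_phi(theta).
q = Normal(loc=0.5, scale=0.3)

# Joint form: E_q[log p(D, theta) - log q(theta)], estimated by Monte Carlo.
theta = q.sample((10_000,))
log_lik = Normal(theta.unsqueeze(-1), 1.0).log_prob(data).sum(-1)
elbo_joint = (log_lik + prior.log_prob(theta) - q.log_prob(theta)).mean()

# Likelihood-minus-KL form: E_q[log p(D | theta)] - KL(q || prior).
elbo_kl_form = log_lik.mean() - kl_divergence(q, prior)

print(elbo_joint.item(), elbo_kl_form.item())
```

Up to Monte Carlo noise, the two estimates coincide.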

### Relationship Between ELBO and Posterior Approximation

The ELBO connects directly to posterior approximation.

The following identity holds:

$$
\log p(D) =
\mathcal{L}(\phi)
+
\mathrm{KL}(q_\phi(\theta)\|p(\theta\mid D)).
$$

Because the KL divergence is nonnegative,

$$
\mathcal{L}(\phi)
\leq
\log p(D).
$$

Thus the ELBO is a lower bound on the log evidence.

Maximizing the ELBO is equivalent to minimizing the KL divergence to the true posterior.

When the variational distribution exactly equals the posterior,

$$
q_\phi(\theta) =
p(\theta\mid D),
$$

the KL divergence becomes zero and the ELBO becomes tight.

### Mean-Field Variational Inference

A common approximation is mean-field variational inference.

The variational distribution factorizes across parameters:

$$
q_\phi(\theta) =
\prod_j q_{\phi_j}(\theta_j).
$$

The simplest choice uses independent Gaussians:

$$
q_\phi(\theta_j) =
\mathcal{N}(\theta_j;\mu_j,\sigma_j^2).
$$

The variational parameters are therefore

$$
\phi =
\{\mu_j,\sigma_j\}.
$$

Each parameter has a learned mean and variance.

This approximation is computationally efficient because sampling and KL computations become simple.
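
As a brief sketch (assuming PyTorch; the tensor shape and initial values are illustrative), a fully factorized Gaussian over a weight matrix can be built directly from per-entry means and standard deviations:

```python
import torch
from torch.distributions import Normal, Independent, kl_divergence

# One mean and one standard deviation per entry of a (4, 3) weight matrix.
mu = torch.zeros(4, 3)
log_sigma = torch.full((4, 3), -2.0)

# Independent(...) treats the matrix dimensions as a single event, so the
# distribution factorizes over all 12 entries.
q = Independent(Normal(mu, log_sigma.exp()), 2)
prior = Independent(Normal(torch.zeros(4, 3), torch.ones(4, 3)), 2)

sample = q.rsample()          # reparameterized sample of shape (4, 3)
kl = kl_divergence(q, prior)  # scalar: sum of the per-entry KL terms
```

Because the factors are independent, the total KL divergence is simply the sum of the per-entry Gaussian KL terms.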

Its main weakness is independence. Real posterior distributions often contain strong parameter correlations, but mean-field inference ignores them.

### Variational Parameters in Neural Networks

In a Bayesian neural network, every parameter tensor becomes probabilistic.

In a standard network, each weight tensor $W$ is a single deterministic value.

Variational inference replaces it with

$$
W \sim q_\phi(W).
$$

Using a Gaussian approximation:

$$
W_{ij}
\sim
\mathcal{N}(\mu_{ij}, \sigma_{ij}^2).
$$

The model therefore learns both

- parameter means
- parameter uncertainties

instead of only point estimates.

The parameter count roughly doubles because each weight now requires both a mean and a variance parameter.

### The Reparameterization Trick

The ELBO contains expectations over stochastic parameters. To optimize the ELBO using gradient descent, we need differentiable sampling.

The reparameterization trick solves this problem.

Instead of sampling directly from

$$
\theta
\sim
\mathcal{N}(\mu,\sigma^2),
$$

we sample

$$
\epsilon
\sim
\mathcal{N}(0,1),
$$

then compute

genui{"math_block_widget_always_prefetch_v2":{"content":"\\theta=\\mu+\\sigma\\epsilon"}}

The randomness is isolated in $\epsilon$, which is independent of the learnable parameters.

Gradients can now flow through $\mu$ and $\sigma$.

This trick is central to variational autoencoders, Bayesian neural networks, diffusion models, and many probabilistic deep learning systems.
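
Here is a minimal sketch (assuming PyTorch; the target value in the toy loss is arbitrary) showing gradients flowing through both variational parameters of a single scalar weight:

```python
import torch

# Learnable variational parameters for one scalar weight.
mu = torch.tensor(0.0, requires_grad=True)
log_sigma = torch.tensor(-1.0, requires_grad=True)

# Reparameterized sample: the randomness lives in eps, not in mu or sigma.
eps = torch.randn(())
theta = mu + torch.exp(log_sigma) * eps

# Any downstream loss is now differentiable with respect to mu and log_sigma.
loss = (theta - 2.0) ** 2
loss.backward()

print(mu.grad, log_sigma.grad)
```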

### Monte Carlo Estimation of the ELBO

The ELBO expectation is usually estimated using Monte Carlo sampling.

Suppose we sample

$$
\theta^{(1)},\ldots,\theta^{(S)}
\sim
q_\phi(\theta).
$$

The expected likelihood term becomes

$$
\mathbb{E}_{q_\phi(\theta)}
[
\log p(D\mid\theta)
]
\approx
\frac{1}{S}
\sum_{s=1}^{S}
\log p(D\mid\theta^{(s)}).
$$

In practice, even one or a few samples per minibatch may work well.

The full stochastic objective becomes

$$
\mathcal{L}(\phi)
\approx
\frac{1}{S}
\sum_{s=1}^{S}
\log p(D\mid\theta^{(s)}) -
\mathrm{KL}(q_\phi(\theta)\|p(\theta)).
$$

Training therefore resembles ordinary stochastic optimization, except the parameters are sampled from probability distributions.

### Variational Bayesian Linear Layer

A variational Bayesian linear layer stores means and variances for weights.

```python
import torch
from torch import nn
import torch.nn.functional as F

class VariationalLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()

        # Mean of the Gaussian variational distribution for each weight entry.
        self.weight_mu = nn.Parameter(
            torch.zeros(out_features, in_features)
        )

        # Log standard deviation per weight, initialized to a small value
        # (sigma = exp(-5)) so the initial sampling noise is small.
        self.weight_log_sigma = nn.Parameter(
            torch.full((out_features, in_features), -5.0)
        )

        # Corresponding mean and log standard deviation for the bias.
        self.bias_mu = nn.Parameter(
            torch.zeros(out_features)
        )

        self.bias_log_sigma = nn.Parameter(
            torch.full((out_features,), -5.0)
        )

    def sample_weights(self):
        # Reparameterization trick: theta = mu + sigma * eps with eps ~ N(0, 1).
        weight_eps = torch.randn_like(self.weight_mu)
        bias_eps = torch.randn_like(self.bias_mu)

        weight_sigma = torch.exp(self.weight_log_sigma)
        bias_sigma = torch.exp(self.bias_log_sigma)

        weight = self.weight_mu + weight_sigma * weight_eps
        bias = self.bias_mu + bias_sigma * bias_eps

        return weight, bias

    def forward(self, x):
        weight, bias = self.sample_weights()
        return F.linear(x, weight, bias)
```

Each forward pass samples a different linear transformation.
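
As a hedged sketch of how this layer might be trained, the following adds a closed-form KL term against a standard normal prior and minimizes a single-sample negative ELBO per step, matching the stochastic objective above; the regression data, the unit-variance Gaussian likelihood, and all hyperparameters are illustrative assumptions.

```python
import torch

def gaussian_kl_to_standard_normal(mu, log_sigma):
    # Closed-form KL(N(mu, sigma^2) || N(0, 1)), summed over all entries.
    sigma_sq = torch.exp(2.0 * log_sigma)
    return 0.5 * (sigma_sq + mu ** 2 - 1.0 - 2.0 * log_sigma).sum()

def layer_kl(layer):
    # Total KL for the weight and bias factors of the layer defined above.
    return (
        gaussian_kl_to_standard_normal(layer.weight_mu, layer.weight_log_sigma)
        + gaussian_kl_to_standard_normal(layer.bias_mu, layer.bias_log_sigma)
    )

# Toy regression data and a single variational layer.
torch.manual_seed(0)
x = torch.randn(64, 2)
y = x.sum(dim=1, keepdim=True) + 0.1 * torch.randn(64, 1)

layer = VariationalLinear(in_features=2, out_features=1)
optimizer = torch.optim.Adam(layer.parameters(), lr=1e-2)

for step in range(200):
    optimizer.zero_grad()
    pred = layer(x)                        # single-sample forward pass
    nll = 0.5 * ((pred - y) ** 2).sum()    # -log p(D | theta) up to a constant
    loss = nll + layer_kl(layer)           # negative single-sample ELBO
    loss.backward()
    optimizer.step()
```

When training on minibatches of a larger dataset, the KL term is typically rescaled so that it is counted once per pass over the full dataset rather than once per minibatch.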

### KL Regularization

The KL term acts as a probabilistic regularizer.

Suppose the prior is

$$
p(\theta) =
\mathcal{N}(0,1).
$$

If the variational posterior drifts far from this prior, whether through large means or variances far from one, the KL penalty increases.
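
For a Gaussian variational factor $\mathcal{N}(\mu,\sigma^2)$ and this prior, the penalty is available in closed form:

$$
\mathrm{KL}\big(\mathcal{N}(\mu,\sigma^2)\,\|\,\mathcal{N}(0,1)\big)
=
\frac{1}{2}\left(\sigma^2 + \mu^2 - 1 - \log\sigma^2\right).
$$

Both large means and variances far from one increase the penalty; in particular, the $-\log\sigma^2$ term grows as $\sigma \to 0$.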

This prevents overfitting by discouraging unnecessarily confident parameter distributions.

Ordinary L2 regularization can therefore be interpreted as the point-estimate version of this penalty: it corresponds to maximum a posteriori estimation under a Gaussian prior.

Variational inference generalizes this idea from point estimates to full probability distributions.

### Variational Free Energy Interpretation

In statistical physics terminology, the negative ELBO is known as the variational free energy; maximizing the ELBO corresponds to minimizing this free energy.

The expected likelihood term encourages accurate data modeling.

The KL term penalizes complexity.

Thus variational inference balances

- data fit
- model simplicity

This resembles many principles in machine learning, including minimum description length and information bottleneck methods.

### Variational Inference in Large Models

Modern deep learning systems often use scalable forms of variational inference.

Examples include:

| Method | Idea |
|---|---|
| Variational Bayesian neural networks | Posterior over weights |
| Variational autoencoders | Posterior over latent variables |
| Bayesian transformers | Posterior over transformer parameters |
| Monte Carlo dropout | Approximate variational inference |
| Deep latent variable models | Learned probabilistic representations |

In very large foundation models, full Bayesian inference is usually too expensive. Approximate uncertainty methods such as ensembles, dropout sampling, or probabilistic output layers are more common.

However, the conceptual framework of variational inference remains central to probabilistic deep learning.

### Variational Autoencoders as Variational Inference

Variational autoencoders are one of the most successful applications of variational inference.

The encoder approximates a posterior:

$$
q_\phi(z\mid x).
$$

The decoder defines a likelihood:

$$
p_\theta(x\mid z).
$$

Training maximizes the ELBO:

$$
\mathbb{E}_{q_\phi(z\mid x)}
[
\log p_\theta(x\mid z)
] -
\mathrm{KL}(q_\phi(z\mid x)\|p(z)).
$$

The latent variable $z$ replaces neural network weights as the random variable being inferred.

This demonstrates that variational inference is not limited to Bayesian weights. It is a general framework for approximate probabilistic inference.
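
A minimal sketch of this objective (assuming PyTorch, a diagonal Gaussian encoder, and a Bernoulli decoder whose outputs are probabilities; the argument names are illustrative) might look like:

```python
import torch
import torch.nn.functional as F

def negative_vae_elbo(x, x_recon, z_mu, z_log_sigma):
    # Reconstruction term: single-sample estimate of -E_q[log p(x | z)]
    # for a Bernoulli decoder (x_recon holds the decoder's output probabilities).
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")

    # KL(q(z | x) || N(0, I)) in closed form for a diagonal Gaussian encoder.
    kl = 0.5 * (torch.exp(2.0 * z_log_sigma) + z_mu ** 2 - 1.0 - 2.0 * z_log_sigma).sum()

    return recon + kl
```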

### Limitations of Variational Inference

Variational inference is scalable and differentiable, but it has limitations.

The variational family may be too simple. Mean-field approximations ignore correlations and may underestimate uncertainty.

Optimization can become unstable in large probabilistic models.

The KL direction

$$
\mathrm{KL}(q\|p)
$$

often encourages mode-seeking behavior. The approximation may collapse onto one dominant mode instead of representing the full posterior structure.

Monte Carlo estimates introduce stochastic noise into training.

Despite these limitations, variational inference remains one of the most practical and widely used approaches for approximate Bayesian deep learning.

### Summary

Variational inference approximates an intractable posterior distribution using a simpler parameterized distribution.

Instead of exact Bayesian inference, the problem becomes optimization of the evidence lower bound.

The ELBO combines two objectives:

- maximize expected data likelihood
- minimize divergence from the prior

The reparameterization trick enables gradient-based optimization of stochastic models. Mean-field Gaussian approximations provide scalable Bayesian neural networks.

Variational inference forms the foundation of many modern probabilistic deep learning systems, including Bayesian neural networks, variational autoencoders, latent variable models, and uncertainty-aware deep learning architectures.

