Monte Carlo methods approximate difficult mathematical quantities using random samples. In probabilistic deep learning, they are used when expectations, integrals, or posterior predictive distributions cannot be computed exactly.
The basic idea is simple. If a quantity is an expectation under a distribution, we can estimate it by drawing samples from that distribution and averaging the result.
Expectations as Averages
Many probabilistic learning problems require computing an expectation:

$$\mathbb{E}_{p(x)}[f(x)] = \int f(x)\, p(x)\, dx.$$

When this integral is intractable, Monte Carlo approximation uses samples

$$x_1, x_2, \ldots, x_N \sim p(x)$$

and estimates

$$\mathbb{E}_{p(x)}[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i).$$
As the number of samples increases, the estimate usually becomes more accurate.
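As a concrete sketch, the toy helper below (the name `mc_expectation` and the example are illustrative, not from the chapter) estimates an expectation by plain averaging. For $X \sim \mathcal{N}(0, 1)$ and $f(x) = x^2$, the true expectation is $\operatorname{Var}(X) = 1$:

```python
import random

random.seed(0)

def mc_expectation(f, sampler, num_samples=100_000):
    # Estimate E[f(X)] by averaging f over draws from `sampler`.
    total = 0.0
    for _ in range(num_samples):
        total += f(sampler())
    return total / num_samples

# Toy check: X ~ N(0, 1), f(x) = x^2, so the true value is 1.0.
estimate = mc_expectation(lambda x: x * x, lambda: random.gauss(0.0, 1.0))
```

With 100,000 samples the estimate typically lands within a few thousandths of the true value.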
Monte Carlo Error
The error of a Monte Carlo estimate decreases at rate

$$\mathcal{O}\!\left(\frac{1}{\sqrt{N}}\right),$$

where $N$ is the number of samples.
This rate is slow, but it has an important property: it does not directly depend on dimension. That makes Monte Carlo methods useful for high-dimensional probabilistic models, including Bayesian neural networks and latent variable models.
If the samples are independent, the variance of the sample mean is

$$\operatorname{Var}\!\left[\frac{1}{N}\sum_{i=1}^{N} f(x_i)\right] = \frac{\operatorname{Var}[f(x)]}{N}.$$
Thus, more samples reduce estimator variance.
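The $1/\sqrt{N}$ behavior can be checked empirically. This toy sketch (sample sizes and repeat counts chosen only for illustration) measures the estimator's spread at two sample counts; quadrupling the count should roughly halve the spread:

```python
import random

random.seed(0)

def estimate(num_samples):
    # Monte Carlo estimate of E[X^2] for X ~ N(0, 1); the true value is 1.0.
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(num_samples)) / num_samples

def estimator_std(num_samples, repeats=200):
    # Empirical standard deviation of the estimator over independent repetitions.
    vals = [estimate(num_samples) for _ in range(repeats)]
    mean = sum(vals) / repeats
    return (sum((v - mean) ** 2 for v in vals) / repeats) ** 0.5

std_100 = estimator_std(100)   # theory predicts sqrt(Var[X^2] / 100)
std_400 = estimator_std(400)   # four times the samples, about half the error
```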
Monte Carlo Prediction
For a Bayesian neural network, the posterior predictive distribution is

$$p(y \mid x, \mathcal{D}) = \int p(y \mid x, w)\, p(w \mid \mathcal{D})\, dw.$$
This integral is usually intractable.
Monte Carlo prediction approximates it using posterior samples:

$$w_1, \ldots, w_N \sim p(w \mid \mathcal{D}).$$

Then

$$p(y \mid x, \mathcal{D}) \approx \frac{1}{N} \sum_{i=1}^{N} p(y \mid x, w_i).$$

For regression, we average predicted values:

$$\hat{y} = \frac{1}{N} \sum_{i=1}^{N} f(x; w_i).$$
The empirical variance across samples gives an uncertainty estimate.
Simple Monte Carlo in PyTorch
Suppose a model produces stochastic predictions because it samples weights, uses dropout, or samples latent variables.
A Monte Carlo prediction loop is straightforward:
```python
import torch

@torch.no_grad()
def monte_carlo_predict(model, x, num_samples=50):
    preds = []
    for _ in range(num_samples):
        y = model(x)          # each pass draws fresh weights, masks, or latents
        preds.append(y)
    preds = torch.stack(preds, dim=0)
    mean = preds.mean(dim=0)  # Monte Carlo average over stochastic passes
    std = preds.std(dim=0)    # spread across passes as an uncertainty signal
    return mean, std
```

The returned mean is the prediction. The returned standard deviation measures predictive spread.
For classification, we usually average probabilities rather than logits:
```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def monte_carlo_classify(model, x, num_samples=50):
    probs = []
    for _ in range(num_samples):
        logits = model(x)
        probs.append(F.softmax(logits, dim=-1))  # convert each draw to probabilities
    probs = torch.stack(probs, dim=0)
    mean_probs = probs.mean(dim=0)   # average of predictive distributions
    uncertainty = probs.std(dim=0)   # per-class disagreement across samples
    return mean_probs, uncertainty
```

Averaging logits can give a different result from averaging probabilities. Bayesian prediction averages predictive distributions, so probability averaging is usually the safer choice.
Importance Sampling
Sometimes we need an expectation under a distribution $p$, but sampling from $p$ is difficult. Importance sampling draws samples from another distribution $q$, called the proposal distribution.
Starting with

$$\mathbb{E}_{p(x)}[f(x)] = \int f(x)\, p(x)\, dx,$$

we multiply and divide by $q(x)$:

$$\int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx.$$

Therefore,

$$\mathbb{E}_{p(x)}[f(x)] = \mathbb{E}_{q(x)}\!\left[f(x)\, \frac{p(x)}{q(x)}\right].$$

The Monte Carlo estimate is

$$\mathbb{E}_{p(x)}[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i)\, \frac{p(x_i)}{q(x_i)}, \qquad x_i \sim q(x).$$

The ratio

$$w(x_i) = \frac{p(x_i)}{q(x_i)}$$

is called the importance weight.
Importance sampling works well only when $q$ places probability mass where $p$ and $f$ are large. Poor proposal distributions cause high variance.
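A minimal numeric sketch, assuming a standard-normal target and a wider normal proposal (both chosen purely for illustration): samples come from the proposal, and each is reweighted by the density ratio. The true value of $\mathbb{E}_p[x^2]$ here is 1.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

random.seed(0)
num_samples = 100_000

total = 0.0
for _ in range(num_samples):
    x = random.gauss(0.0, 2.0)                              # sample from proposal q = N(0, 2)
    w = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0)   # importance weight p(x)/q(x)
    total += w * x * x                                      # weighted f(x) = x^2
estimate = total / num_samples                              # estimates E_p[x^2] = 1
```

The wider proposal covers the target's support, which keeps the weights well behaved.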
Self-Normalized Importance Sampling
In many Bayesian problems, the target distribution is known only up to a constant:

$$p(x) = \frac{\tilde{p}(x)}{Z}.$$

The normalizing constant $Z$ may be unknown. Self-normalized importance sampling avoids computing it.
Define unnormalized weights

$$\tilde{w}_i = \frac{\tilde{p}(x_i)}{q(x_i)}, \qquad x_i \sim q(x).$$

Then normalize them:

$$w_i = \frac{\tilde{w}_i}{\sum_{j=1}^{N} \tilde{w}_j}.$$

The estimate becomes

$$\mathbb{E}_{p(x)}[f(x)] \approx \sum_{i=1}^{N} w_i\, f(x_i).$$
This estimator is biased for finite $N$, but the bias decreases as the sample count increases.
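The same toy problem works as a sketch here, now pretending the target density is only known up to a constant ($\tilde{p}(x) = e^{-x^2/2}$, with $Z$ treated as unknown); the proposal is again an illustrative $\mathcal{N}(0, 2)$:

```python
import math
import random

random.seed(0)
num_samples = 100_000

def unnorm_target(x):
    # Unnormalized standard-normal density; the constant Z is treated as unknown.
    return math.exp(-0.5 * x * x)

def proposal_pdf(x):
    sigma = 2.0
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

xs = [random.gauss(0.0, 2.0) for _ in range(num_samples)]
w_tilde = [unnorm_target(x) / proposal_pdf(x) for x in xs]  # unnormalized weights
w_sum = sum(w_tilde)
weights = [w / w_sum for w in w_tilde]                      # normalized to sum to 1
estimate = sum(w * x * x for w, x in zip(weights, xs))      # estimates E_p[x^2] = 1
```

The normalization step cancels $Z$, which is why the constant never needs to be computed.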
Markov Chain Monte Carlo
Markov chain Monte Carlo, or MCMC, constructs a Markov chain whose stationary distribution is the target distribution.
Instead of drawing independent samples directly from $p(x)$, MCMC generates a sequence

$$x_1 \to x_2 \to x_3 \to \cdots$$

in which each state depends only on the previous one.
After a warmup period, the samples approximate draws from the target distribution.
MCMC is useful when the target density can be evaluated up to a constant but direct sampling is hard.
In Bayesian deep learning, the target distribution is often the posterior:

$$p(w \mid \mathcal{D}) \propto p(\mathcal{D} \mid w)\, p(w).$$
MCMC can sample from this posterior in principle. In large neural networks, exact MCMC is usually expensive, but it remains important conceptually and in smaller models.
Metropolis-Hastings
Metropolis-Hastings is a general MCMC algorithm.
Suppose the current state is $x_t$. A proposal distribution suggests a new state:

$$x' \sim q(x' \mid x_t).$$

The algorithm accepts the proposal with probability

$$\alpha = \min\!\left(1,\; \frac{p(x')\, q(x_t \mid x')}{p(x_t)\, q(x' \mid x_t)}\right).$$
If the proposal is accepted, the chain moves to $x'$. If rejected, it stays at $x_t$.
The algorithm only requires evaluating $p$ up to a proportional constant, because the constant cancels in the ratio. This makes it suitable for Bayesian posteriors where the evidence term is unknown.
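A minimal random-walk Metropolis-Hastings sketch on a toy target (an unnormalized standard normal, chosen for illustration). With a symmetric Gaussian proposal, the $q$ terms cancel and the acceptance ratio reduces to $p(x')/p(x_t)$:

```python
import math
import random

random.seed(0)

def log_unnorm_target(x):
    # Log of an unnormalized standard-normal density; the constant is irrelevant.
    return -0.5 * x * x

x = 0.0
samples = []
for _ in range(50_000):
    x_prop = x + random.gauss(0.0, 1.0)                     # symmetric random-walk proposal
    log_alpha = log_unnorm_target(x_prop) - log_unnorm_target(x)
    if random.random() < math.exp(min(0.0, log_alpha)):     # accept with prob min(1, ratio)
        x = x_prop                                          # move; otherwise stay put
    samples.append(x)

burned = samples[5_000:]                                    # discard warmup samples
mean = sum(burned) / len(burned)
var = sum((s - mean) ** 2 for s in burned) / len(burned)
```

After warmup, the empirical mean and variance approach the target's 0 and 1.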
Langevin Monte Carlo
Langevin Monte Carlo improves sampling by using gradients of the log density.
Instead of proposing random moves blindly, it moves toward regions of higher probability:

$$x_{t+1} = x_t + \frac{\epsilon}{2}\, \nabla_x \log p(x_t) + \sqrt{\epsilon}\, \eta_t, \qquad \eta_t \sim \mathcal{N}(0, I).$$
The gradient term guides the sampler. The noise term maintains exploration.
In Bayesian neural networks, the gradient of the log posterior is

$$\nabla_w \log p(w \mid \mathcal{D}) = \nabla_w \log p(\mathcal{D} \mid w) + \nabla_w \log p(w).$$
This resembles ordinary neural network training, with added sampling noise.
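The update is short enough to sketch directly. This toy example runs unadjusted Langevin dynamics (no Metropolis correction, so there is a small discretization bias) on a standard-normal target; the step size and chain length are illustrative:

```python
import math
import random

random.seed(0)

def grad_log_p(x):
    # Gradient of log N(0, 1) density: d/dx (-x^2 / 2) = -x.
    return -x

eps = 0.1
x = 3.0                                        # start far from the mode
samples = []
for _ in range(20_000):
    noise = random.gauss(0.0, 1.0)
    # Langevin step: gradient drift plus scaled Gaussian noise.
    x = x + 0.5 * eps * grad_log_p(x) + math.sqrt(eps) * noise
    samples.append(x)

burned = samples[2_000:]                       # drop the transient from x = 3.0
mean = sum(burned) / len(burned)
var = sum((s - mean) ** 2 for s in burned) / len(burned)
```

The gradient pulls the chain toward the mode, while the noise keeps it exploring the full distribution.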
Hamiltonian Monte Carlo
Hamiltonian Monte Carlo, or HMC, uses auxiliary momentum variables to explore probability distributions more efficiently.
It simulates dynamics based on an energy function. For a target density $p(x)$, define potential energy

$$U(x) = -\log p(x).$$

HMC introduces momentum $r$ with kinetic energy

$$K(r) = \frac{1}{2}\, r^\top r.$$

The total energy is

$$H(x, r) = U(x) + K(r).$$
By approximately simulating Hamiltonian dynamics, HMC can move long distances through parameter space while maintaining high acceptance probability.
HMC is powerful for many Bayesian models, but full HMC is usually too expensive for very large neural networks.
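A compact one-dimensional sketch, again on a toy standard-normal target with illustrative hyperparameters: simulate the dynamics with the leapfrog integrator, then apply a Metropolis correction based on the energy change:

```python
import math
import random

random.seed(0)

def U(x):        # potential energy -log p(x) for p = N(0, 1), up to a constant
    return 0.5 * x * x

def grad_U(x):
    return x

def leapfrog(x, r, eps, steps):
    # Approximate Hamiltonian dynamics: half momentum step, full position
    # steps interleaved with momentum steps, final half momentum step.
    r = r - 0.5 * eps * grad_U(x)
    for _ in range(steps - 1):
        x = x + eps * r
        r = r - eps * grad_U(x)
    x = x + eps * r
    r = r - 0.5 * eps * grad_U(x)
    return x, r

x = 0.0
samples = []
for _ in range(5_000):
    r0 = random.gauss(0.0, 1.0)                      # resample momentum each iteration
    x_new, r_new = leapfrog(x, r0, eps=0.2, steps=10)
    h_old = U(x) + 0.5 * r0 * r0                     # total energy before
    h_new = U(x_new) + 0.5 * r_new * r_new           # total energy after
    if random.random() < math.exp(min(0.0, h_old - h_new)):
        x = x_new                                    # Metropolis correction
    samples.append(x)

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

Because leapfrog nearly conserves the total energy, most proposals are accepted even though each one moves far from the current state.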
Stochastic Gradient MCMC
Large datasets make full-gradient MCMC costly. Stochastic gradient MCMC uses minibatch gradients instead.
A common method is stochastic gradient Langevin dynamics, or SGLD:

$$w_{t+1} = w_t + \frac{\epsilon_t}{2} \left( \nabla_w \log p(w_t) + \frac{|\mathcal{D}|}{|\mathcal{B}_t|} \sum_{i \in \mathcal{B}_t} \nabla_w \log p(y_i \mid x_i, w_t) \right) + \eta_t,$$

where $\mathcal{B}_t$ is a minibatch of the dataset $\mathcal{D}$ and

$$\eta_t \sim \mathcal{N}(0, \epsilon_t I).$$
SGLD resembles stochastic gradient descent with carefully scaled Gaussian noise.
With suitable learning rate schedules, SGLD can approximate posterior sampling rather than converge to a single optimum.
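A toy sketch of the update, assuming a deliberately simple model so the posterior is known: inferring the mean of a Gaussian from synthetic data, with a wide Gaussian prior. All names and hyperparameters here are illustrative, and a fixed small step size is used instead of a decaying schedule:

```python
import math
import random

random.seed(0)

# Synthetic data: y_i ~ N(2.0, 1). We sample the posterior over the mean theta.
data = [random.gauss(2.0, 1.0) for _ in range(1_000)]
n = len(data)
batch_size = 32
prior_var = 100.0

def grad_log_prior(theta):
    return -theta / prior_var

def grad_log_lik(theta, y):
    # Gradient of log N(y | theta, 1) with respect to theta.
    return y - theta

theta = 0.0
eps = 1e-4                                      # fixed small step size
samples = []
for _ in range(20_000):
    batch = random.sample(data, batch_size)
    # Rescale the minibatch gradient by n / batch_size to estimate the full sum.
    grad = grad_log_prior(theta) + (n / batch_size) * sum(
        grad_log_lik(theta, y) for y in batch)
    theta += 0.5 * eps * grad + math.sqrt(eps) * random.gauss(0.0, 1.0)
    samples.append(theta)

burned = samples[5_000:]
post_mean = sum(burned) / len(burned)           # should sit near the data mean, ~2.0
```

Without the injected noise this is just SGD on the log posterior; with it, the iterates wander around the posterior instead of collapsing to its mode.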
Monte Carlo Dropout
Dropout is usually introduced as a regularization technique. In probabilistic deep learning, dropout can also be interpreted as approximate Bayesian inference.
During ordinary inference, dropout is disabled. In Monte Carlo dropout, dropout remains active at inference time.
Each forward pass samples a different dropout mask $m_i$. This gives a different subnetwork:

$$\hat{y}_i = f(x;\, w \odot m_i).$$

Predictions are averaged:

$$\hat{y} = \frac{1}{N} \sum_{i=1}^{N} \hat{y}_i.$$
In PyTorch, dropout is active when the model is in training mode. To use Monte Carlo dropout during inference, we must enable dropout while avoiding gradient computation.
```python
import torch
from torch import nn

def enable_dropout(model):
    # Switch only the dropout layers back to train mode so masks are resampled.
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.train()

@torch.no_grad()
def mc_dropout_predict(model, x, num_samples=50):
    model.eval()           # freeze batch norm and other train-time behavior
    enable_dropout(model)  # but keep dropout stochastic
    probs = []
    for _ in range(num_samples):
        logits = model(x)
        probs.append(torch.softmax(logits, dim=-1))
    probs = torch.stack(probs, dim=0)
    return probs.mean(dim=0), probs.std(dim=0)
```

This method is easy to implement, but its uncertainty estimates are approximate.
Variance Reduction
Monte Carlo estimators can have high variance. Variance reduction methods improve estimator quality without simply increasing sample count.
Common techniques include:
| Method | Idea |
|---|---|
| Antithetic sampling | Use negatively correlated samples |
| Control variates | Subtract a correlated quantity with known expectation |
| Stratified sampling | Divide sample space into regions and sample each region |
| Importance sampling | Sample more often from important regions |
| Rao-Blackwellization | Analytically integrate part of the randomness |
In deep learning, control variates are especially important in gradient estimation for stochastic computation graphs and reinforcement learning.
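As a sketch of the control-variate idea (a toy integration problem, not a deep learning example): estimate $\mathbb{E}[e^X]$ for $X \sim \mathrm{Uniform}(0, 1)$, whose true value is $e - 1$, using $g(x) = x$ with known mean $0.5$ as the control:

```python
import math
import random

random.seed(0)
num_samples = 10_000
xs = [random.random() for _ in range(num_samples)]

f_vals = [math.exp(x) for x in xs]   # target: E[e^X] = e - 1 for X ~ U(0, 1)
g_vals = xs                          # control variate with known mean 0.5

plain = sum(f_vals) / num_samples    # ordinary Monte Carlo estimate

# Estimate the coefficient c = Cov(f, g) / Var(g) from the same samples.
mean_g = sum(g_vals) / num_samples
cov_fg = sum((f - plain) * (g - mean_g) for f, g in zip(f_vals, g_vals)) / num_samples
var_g = sum((g - mean_g) ** 2 for g in g_vals) / num_samples
c = cov_fg / var_g

# Subtract the correlated quantity and add back its known expectation.
cv = plain - c * (mean_g - 0.5)
```

Because $e^x$ and $x$ are highly correlated on $[0, 1]$, the corrected estimate has far lower variance than the plain average at the same sample count.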
Monte Carlo Gradients
Some models contain random variables that affect the loss. We then need gradients of expectations:

$$\nabla_\theta\, \mathbb{E}_{p_\theta(z)}[f(z)].$$
There are two common strategies.
The first is the reparameterization gradient. If

$$z = g(\theta, \epsilon), \qquad \epsilon \sim p(\epsilon),$$

then

$$\nabla_\theta\, \mathbb{E}_{p_\theta(z)}[f(z)] = \mathbb{E}_{p(\epsilon)}\!\left[\nabla_\theta\, f(g(\theta, \epsilon))\right].$$
This is used in variational autoencoders and Bayesian neural networks.
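A minimal PyTorch sketch of the reparameterization trick on a toy objective: for $z \sim \mathcal{N}(\mu, \sigma^2)$ and $f(z) = z^2$, we have $\mathbb{E}[z^2] = \mu^2 + \sigma^2$, so the gradient with respect to $\mu$ is $2\mu$. Writing $z = \mu + \sigma\epsilon$ lets autograd differentiate through the sampling step:

```python
import torch

torch.manual_seed(0)
mu = torch.tensor(1.5, requires_grad=True)
log_sigma = torch.tensor(0.0, requires_grad=True)

num_samples = 100_000
eps = torch.randn(num_samples)            # base noise, independent of parameters
z = mu + torch.exp(log_sigma) * eps       # reparameterized samples z ~ N(mu, sigma^2)
loss = (z ** 2).mean()                    # Monte Carlo estimate of E[z^2]
loss.backward()

# Analytically, d/d mu of E[z^2] = 2 * mu = 3.0 here.
grad_mu = mu.grad.item()
```

The same pattern, with a neural network producing `mu` and `log_sigma`, is the sampling step of a variational autoencoder.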
The second is the score-function estimator:

$$\nabla_\theta\, \mathbb{E}_{p_\theta(z)}[f(z)] = \mathbb{E}_{p_\theta(z)}\!\left[f(z)\, \nabla_\theta \log p_\theta(z)\right].$$
This estimator works for discrete variables, but it often has high variance. It appears in reinforcement learning as the policy gradient estimator.
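A sketch for a discrete case, with a toy choice of $f$: for $z \sim \mathrm{Bernoulli}(\theta)$ and $f(z) = z$, the expectation is $\theta$, so the true gradient is 1. The score is $\nabla_\theta \log p_\theta(z) = z/\theta - (1 - z)/(1 - \theta)$:

```python
import random

random.seed(0)
theta = 0.3
num_samples = 200_000

total = 0.0
for _ in range(num_samples):
    z = 1 if random.random() < theta else 0      # z ~ Bernoulli(theta)
    f = float(z)                                 # f(z) = z, so E[f] = theta
    score = z / theta - (1 - z) / (1 - theta)    # d/d theta of log p(z; theta)
    total += f * score                           # score-function gradient sample
grad_estimate = total / num_samples              # true gradient is 1.0
```

Note how many samples this needs for a stable answer compared with the reparameterization example; the high variance is the usual price of the score-function estimator.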
Monte Carlo in Probabilistic Deep Learning
Monte Carlo methods appear throughout modern deep learning:
| Area | Monte Carlo role |
|---|---|
| Bayesian neural networks | Approximate posterior predictive distributions |
| Variational inference | Estimate ELBO expectations |
| Variational autoencoders | Sample latent variables |
| Diffusion models | Sample reverse stochastic processes |
| Reinforcement learning | Estimate returns and policy gradients |
| Dropout uncertainty | Sample subnetworks |
| Ensembles | Average over trained models |
| Probabilistic forecasting | Estimate predictive intervals |
The common pattern is the same: replace an exact expectation with a sample average.
Practical Guidance
Monte Carlo methods are simple but easy to misuse.
Use enough samples to stabilize the estimate. For rough uncertainty, 10 to 30 samples may be enough. For calibrated uncertainty, more samples may be needed.
Average probabilities, not class labels.
For classification, measure uncertainty using entropy, predictive variance, or mutual information rather than relying only on the maximum probability.
For stochastic layers, separate training randomness from inference-time sampling. Some layers, such as dropout and batch normalization, behave differently in train and eval modes.
Track compute cost. Monte Carlo inference multiplies serving cost by the number of samples.
Summary
Monte Carlo methods approximate expectations and integrals using random samples.
They are central to probabilistic deep learning because exact inference is usually intractable. Bayesian prediction, variational inference, dropout uncertainty, latent variable models, and reinforcement learning all rely on Monte Carlo estimation.
The main advantage is generality. The same sample-average principle applies to many models. The main cost is variance. More samples improve accuracy, but also increase computation.