Monte Carlo methods approximate difficult mathematical quantities using random samples. In probabilistic deep learning, they are used when expectations, integrals, or posterior predictive distributions cannot be computed exactly.
The basic idea is simple. If a quantity is an expectation under a distribution, we can estimate it by drawing samples from that distribution and averaging the result.
Expectations as Averages
Many probabilistic learning problems require computing an expectation:

$$\mathbb{E}_{p(x)}[f(x)] = \int f(x)\, p(x)\, dx.$$

When this integral is intractable, Monte Carlo approximation uses samples

$$x_1, x_2, \ldots, x_N \sim p(x)$$

and estimates

$$\mathbb{E}_{p(x)}[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i).$$
As the number of samples increases, the estimate usually becomes more accurate.
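As a concrete sketch, the toy helper below (the name `mc_expectation` and the example are illustrative, not from the chapter) estimates an expectation by plain averaging. For $X \sim \mathcal{N}(0, 1)$ and $f(x) = x^2$, the true expectation is $\operatorname{Var}(X) = 1$:

```python
import random

random.seed(0)

def mc_expectation(f, sampler, num_samples=100_000):
    # Estimate E[f(X)] by averaging f over draws from `sampler`.
    total = 0.0
    for _ in range(num_samples):
        total += f(sampler())
    return total / num_samples

# Toy check: X ~ N(0, 1), f(x) = x^2, so the true value is 1.0.
estimate = mc_expectation(lambda x: x * x, lambda: random.gauss(0.0, 1.0))
```

With 100,000 samples the estimate typically lands within a few thousandths of the true value.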
Monte Carlo Error
The error of a Monte Carlo estimate decreases at rate

$$\mathcal{O}\!\left(\frac{1}{\sqrt{N}}\right),$$

where $N$ is the number of samples.
This rate is slow, but it has an important property: it does not directly depend on dimension. That makes Monte Carlo methods useful for high-dimensional probabilistic models, including Bayesian neural networks and latent variable models.
If the samples are independent, the variance of the sample mean is

$$\operatorname{Var}\!\left[\frac{1}{N}\sum_{i=1}^{N} f(x_i)\right] = \frac{\operatorname{Var}[f(x)]}{N}.$$
Thus, more samples reduce estimator variance.
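The $1/\sqrt{N}$ behavior can be checked empirically. This toy sketch (sample sizes and repeat counts chosen only for illustration) measures the estimator's spread at two sample counts; quadrupling the count should roughly halve the spread:

```python
import random

random.seed(0)

def estimate(num_samples):
    # Monte Carlo estimate of E[X^2] for X ~ N(0, 1); the true value is 1.0.
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(num_samples)) / num_samples

def estimator_std(num_samples, repeats=200):
    # Empirical standard deviation of the estimator over independent repetitions.
    vals = [estimate(num_samples) for _ in range(repeats)]
    mean = sum(vals) / repeats
    return (sum((v - mean) ** 2 for v in vals) / repeats) ** 0.5

std_100 = estimator_std(100)   # theory predicts sqrt(Var[X^2] / 100)
std_400 = estimator_std(400)   # four times the samples, about half the error
```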
Monte Carlo Prediction
For a Bayesian neural network, the posterior predictive distribution is

$$p(y \mid x, \mathcal{D}) = \int p(y \mid x, w)\, p(w \mid \mathcal{D})\, dw.$$
This integral is usually intractable.
Monte Carlo prediction approximates it using posterior samples:

$$w_1, \ldots, w_N \sim p(w \mid \mathcal{D}).$$

Then

$$p(y \mid x, \mathcal{D}) \approx \frac{1}{N} \sum_{i=1}^{N} p(y \mid x, w_i).$$

For regression, we average predicted values:

$$\hat{y} = \frac{1}{N} \sum_{i=1}^{N} f(x; w_i).$$
The empirical variance across samples gives an uncertainty estimate.
Simple Monte Carlo in PyTorch
Suppose a model produces stochastic predictions because it samples weights, uses dropout, or samples latent variables.
A Monte Carlo prediction loop is straightforward:
```python
import torch

@torch.no_grad()
def monte_carlo_predict(model, x, num_samples=50):
    preds = []
    for _ in range(num_samples):
        y = model(x)          # each pass draws fresh weights, masks, or latents
        preds.append(y)
    preds = torch.stack(preds, dim=0)
    mean = preds.mean(dim=0)  # Monte Carlo average over stochastic passes
    std = preds.std(dim=0)    # spread across passes as an uncertainty signal
    return mean, std
```

The returned mean is the prediction. The returned standard deviation measures predictive spread.
For classification, we usually average probabilities rather than logits:
```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def monte_carlo_classify(model, x, num_samples=50):
    probs = []
    for _ in range(num_samples):
        logits = model(x)
        probs.append(F.softmax(logits, dim=-1))  # convert each draw to probabilities
    probs = torch.stack(probs, dim=0)
    mean_probs = probs.mean(dim=0)   # average of predictive distributions
    uncertainty = probs.std(dim=0)   # per-class disagreement across samples
    return mean_probs, uncertainty
```

Averaging logits can give a different result from averaging probabilities. Bayesian prediction averages predictive distributions, so probability averaging is usually the safer choice.
Importance Sampling
Sometimes we need an expectation under a distribution $p$, but sampling from $p$ is difficult. Importance sampling draws samples from another distribution $q$, called the proposal distribution.
Starting with

$$\mathbb{E}_{p(x)}[f(x)] = \int f(x)\, p(x)\, dx,$$

we multiply and divide by $q(x)$:

$$\int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx.$$

Therefore,

$$\mathbb{E}_{p(x)}[f(x)] = \mathbb{E}_{q(x)}\!\left[f(x)\, \frac{p(x)}{q(x)}\right].$$

The Monte Carlo estimate is

$$\mathbb{E}_{p(x)}[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i)\, \frac{p(x_i)}{q(x_i)}, \qquad x_i \sim q(x).$$

The ratio

$$w(x_i) = \frac{p(x_i)}{q(x_i)}$$

is called the importance weight.
Importance sampling works well only when $q$ places probability mass where $p$ and $f$ are large. Poor proposal distributions cause high variance.
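A minimal numeric sketch, assuming a standard-normal target and a wider normal proposal (both chosen purely for illustration): samples come from the proposal, and each is reweighted by the density ratio. The true value of $\mathbb{E}_p[x^2]$ here is 1.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

random.seed(0)
num_samples = 100_000

total = 0.0
for _ in range(num_samples):
    x = random.gauss(0.0, 2.0)                              # sample from proposal q = N(0, 2)
    w = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0)   # importance weight p(x)/q(x)
    total += w * x * x                                      # weighted f(x) = x^2
estimate = total / num_samples                              # estimates E_p[x^2] = 1
```

The wider proposal covers the target's support, which keeps the weights well behaved.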
Self-Normalized Importance Sampling
In many Bayesian problems, the target distribution is known only up to a constant:

$$p(x) = \frac{\tilde{p}(x)}{Z}.$$

The normalizing constant $Z$ may be unknown. Self-normalized importance sampling avoids computing it.
Define unnormalized weights

$$\tilde{w}_i = \frac{\tilde{p}(x_i)}{q(x_i)}, \qquad x_i \sim q(x).$$

Then normalize them:

$$w_i = \frac{\tilde{w}_i}{\sum_{j=1}^{N} \tilde{w}_j}.$$

The estimate becomes

$$\mathbb{E}_{p(x)}[f(x)] \approx \sum_{i=1}^{N} w_i\, f(x_i).$$
This estimator is biased for finite $N$, but the bias decreases as the sample count increases.
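The same toy problem works as a sketch here, now pretending the target density is only known up to a constant ($\tilde{p}(x) = e^{-x^2/2}$, with $Z$ treated as unknown); the proposal is again an illustrative $\mathcal{N}(0, 2)$:

```python
import math
import random

random.seed(0)
num_samples = 100_000

def unnorm_target(x):
    # Unnormalized standard-normal density; the constant Z is treated as unknown.
    return math.exp(-0.5 * x * x)

def proposal_pdf(x):
    sigma = 2.0
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

xs = [random.gauss(0.0, 2.0) for _ in range(num_samples)]
w_tilde = [unnorm_target(x) / proposal_pdf(x) for x in xs]  # unnormalized weights
w_sum = sum(w_tilde)
weights = [w / w_sum for w in w_tilde]                      # normalized to sum to 1
estimate = sum(w * x * x for w, x in zip(weights, xs))      # estimates E_p[x^2] = 1
```

The normalization step cancels $Z$, which is why the constant never needs to be computed.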
Markov Chain Monte Carlo
Markov chain Monte Carlo, or MCMC, constructs a Markov chain whose stationary distribution is the target distribution.
Instead of drawing independent samples directly from $p(x)$, MCMC generates a sequence

$$x_1 \to x_2 \to x_3 \to \cdots$$

in which each state depends only on the previous one.
After a warmup period, the samples approximate draws from the target distribution.
MCMC is useful when the target density can be evaluated up to a constant but direct sampling is hard.
In Bayesian deep learning, the target distribution is often the posterior:

$$p(w \mid \mathcal{D}) \propto p(\mathcal{D} \mid w)\, p(w).$$
MCMC can sample from this posterior in principle. In large neural networks, exact MCMC is usually expensive, but it remains important conceptually and in smaller models.
Metropolis-Hastings
Metropolis-Hastings is a general MCMC algorithm.
Suppose the current state is $x_t$. A proposal distribution suggests a new state:

$$x' \sim q(x' \mid x_t).$$

The algorithm accepts the proposal with probability

$$\alpha = \min\!\left(1,\; \frac{p(x')\, q(x_t \mid x')}{p(x_t)\, q(x' \mid x_t)}\right).$$
If the proposal is accepted, the chain moves to $x'$. If rejected, it stays at $x_t$.
The algorithm only requires evaluating $p$ up to a proportional constant, because the constant cancels in the ratio. This makes it suitable for Bayesian posteriors where the evidence term is unknown.
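A minimal random-walk Metropolis-Hastings sketch on a toy target (an unnormalized standard normal, chosen for illustration). With a symmetric Gaussian proposal, the $q$ terms cancel and the acceptance ratio reduces to $p(x')/p(x_t)$:

```python
import math
import random

random.seed(0)

def log_unnorm_target(x):
    # Log of an unnormalized standard-normal density; the constant is irrelevant.
    return -0.5 * x * x

x = 0.0
samples = []
for _ in range(50_000):
    x_prop = x + random.gauss(0.0, 1.0)                     # symmetric random-walk proposal
    log_alpha = log_unnorm_target(x_prop) - log_unnorm_target(x)
    if random.random() < math.exp(min(0.0, log_alpha)):     # accept with prob min(1, ratio)
        x = x_prop                                          # move; otherwise stay put
    samples.append(x)

burned = samples[5_000:]                                    # discard warmup samples
mean = sum(burned) / len(burned)
var = sum((s - mean) ** 2 for s in burned) / len(burned)
```

After warmup, the empirical mean and variance approach the target's 0 and 1.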
Langevin Monte Carlo
Langevin Monte Carlo improves sampling by using gradients of the log density.
Instead of proposing random moves blindly, it moves toward regions of higher probability:

$$x_{t+1} = x_t + \frac{\epsilon}{2}\, \nabla_x \log p(x_t) + \sqrt{\epsilon}\, \eta_t, \qquad \eta_t \sim \mathcal{N}(0, I).$$
The gradient term guides the sampler. The noise term maintains exploration.
In Bayesian neural networks, the gradient of the log posterior is

$$\nabla_w \log p(w \mid \mathcal{D}) = \nabla_w \log p(\mathcal{D} \mid w) + \nabla_w \log p(w).$$
This resembles ordinary neural network training, with added sampling noise.
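The update is short enough to sketch directly. This toy example runs unadjusted Langevin dynamics (no Metropolis correction, so there is a small discretization bias) on a standard-normal target; the step size and chain length are illustrative:

```python
import math
import random

random.seed(0)

def grad_log_p(x):
    # Gradient of log N(0, 1) density: d/dx (-x^2 / 2) = -x.
    return -x

eps = 0.1
x = 3.0                                        # start far from the mode
samples = []
for _ in range(20_000):
    noise = random.gauss(0.0, 1.0)
    # Langevin step: gradient drift plus scaled Gaussian noise.
    x = x + 0.5 * eps * grad_log_p(x) + math.sqrt(eps) * noise
    samples.append(x)

burned = samples[2_000:]                       # drop the transient from x = 3.0
mean = sum(burned) / len(burned)
var = sum((s - mean) ** 2 for s in burned) / len(burned)
```

The gradient pulls the chain toward the mode, while the noise keeps it exploring the full distribution.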
Hamiltonian Monte Carlo
Hamiltonian Monte Carlo, or HMC, uses auxiliary momentum variables to explore probability distributions more efficiently.
It simulates dynamics based on an energy function. For a target density $p(x)$, define potential energy

$$U(x) = -\log p(x).$$

HMC introduces momentum $r$ with kinetic energy

$$K(r) = \frac{1}{2}\, r^\top r.$$

The total energy is

$$H(x, r) = U(x) + K(r).$$
By approximately simulating Hamiltonian dynamics, HMC can move long distances through parameter space while maintaining high acceptance probability.
HMC is powerful for many Bayesian models, but full HMC is usually too expensive for very large neural networks.
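A compact one-dimensional sketch, again on a toy standard-normal target with illustrative hyperparameters: simulate the dynamics with the leapfrog integrator, then apply a Metropolis correction based on the energy change:

```python
import math
import random

random.seed(0)

def U(x):        # potential energy -log p(x) for p = N(0, 1), up to a constant
    return 0.5 * x * x

def grad_U(x):
    return x

def leapfrog(x, r, eps, steps):
    # Approximate Hamiltonian dynamics: half momentum step, full position
    # steps interleaved with momentum steps, final half momentum step.
    r = r - 0.5 * eps * grad_U(x)
    for _ in range(steps - 1):
        x = x + eps * r
        r = r - eps * grad_U(x)
    x = x + eps * r
    r = r - 0.5 * eps * grad_U(x)
    return x, r

x = 0.0
samples = []
for _ in range(5_000):
    r0 = random.gauss(0.0, 1.0)                      # resample momentum each iteration
    x_new, r_new = leapfrog(x, r0, eps=0.2, steps=10)
    h_old = U(x) + 0.5 * r0 * r0                     # total energy before
    h_new = U(x_new) + 0.5 * r_new * r_new           # total energy after
    if random.random() < math.exp(min(0.0, h_old - h_new)):
        x = x_new                                    # Metropolis correction
    samples.append(x)

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

Because leapfrog nearly conserves the total energy, most proposals are accepted even though each one moves far from the current state.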
Stochastic Gradient MCMC
Large datasets make full-gradient MCMC costly. Stochastic gradient MCMC uses minibatch gradients instead.
A common method is stochastic gradient Langevin dynamics, or SGLD:

$$w_{t+1} = w_t + \frac{\epsilon_t}{2} \left( \nabla_w \log p(w_t) + \frac{|\mathcal{D}|}{|\mathcal{B}_t|} \sum_{i \in \mathcal{B}_t} \nabla_w \log p(y_i \mid x_i, w_t) \right) + \eta_t,$$

where $\mathcal{B}_t$ is a minibatch of the dataset $\mathcal{D}$ and

$$\eta_t \sim \mathcal{N}(0, \epsilon_t I).$$
SGLD resembles stochastic gradient descent with carefully scaled Gaussian noise.
With suitable learning rate schedules, SGLD can approximate posterior sampling rather than converge to a single optimum.
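A toy sketch of the update, assuming a deliberately simple model so the posterior is known: inferring the mean of a Gaussian from synthetic data, with a wide Gaussian prior. All names and hyperparameters here are illustrative, and a fixed small step size is used instead of a decaying schedule:

```python
import math
import random

random.seed(0)

# Synthetic data: y_i ~ N(2.0, 1). We sample the posterior over the mean theta.
data = [random.gauss(2.0, 1.0) for _ in range(1_000)]
n = len(data)
batch_size = 32
prior_var = 100.0

def grad_log_prior(theta):
    return -theta / prior_var

def grad_log_lik(theta, y):
    # Gradient of log N(y | theta, 1) with respect to theta.
    return y - theta

theta = 0.0
eps = 1e-4                                      # fixed small step size
samples = []
for _ in range(20_000):
    batch = random.sample(data, batch_size)
    # Rescale the minibatch gradient by n / batch_size to estimate the full sum.
    grad = grad_log_prior(theta) + (n / batch_size) * sum(
        grad_log_lik(theta, y) for y in batch)
    theta += 0.5 * eps * grad + math.sqrt(eps) * random.gauss(0.0, 1.0)
    samples.append(theta)

burned = samples[5_000:]
post_mean = sum(burned) / len(burned)           # should sit near the data mean, ~2.0
```

Without the injected noise this is just SGD on the log posterior; with it, the iterates wander around the posterior instead of collapsing to its mode.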
Monte Carlo Dropout
Dropout is usually introduced as a regularization technique. In probabilistic deep learning, dropout can also be interpreted as approximate Bayesian inference.
During ordinary inference, dropout is disabled. In Monte Carlo dropout, dropout remains active at inference time.
Each forward pass samples a different dropout mask $m_i$. This gives a different subnetwork:

$$\hat{y}_i = f(x;\, w \odot m_i).$$

Predictions are averaged:

$$\hat{y} = \frac{1}{N} \sum_{i=1}^{N} \hat{y}_i.$$
In PyTorch, dropout is active when the model is in training mode. To use Monte Carlo dropout during inference, we must enable dropout while avoiding gradient computation.
```python
import torch
from torch import nn

def enable_dropout(model):
    # Switch only the dropout layers back to train mode so masks are resampled.
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.train()

@torch.no_grad()
def mc_dropout_predict(model, x, num_samples=50):
    model.eval()           # freeze batch norm and other train-time behavior
    enable_dropout(model)  # but keep dropout stochastic
    probs = []
    for _ in range(num_samples):
        logits = model(x)
        probs.append(torch.softmax(logits, dim=-1))
    probs = torch.stack(probs, dim=0)
    return probs.mean(dim=0), probs.std(dim=0)
```

This method is easy to implement, but its uncertainty estimates are approximate.
Variance Reduction
Monte Carlo estimators can have high variance. Variance reduction methods improve estimator quality without simply increasing sample count.
Common techniques include:
| Method | Idea |
|---|---|
| Antithetic sampling | Use negatively correlated samples |
| Control variates | Subtract a correlated quantity with known expectation |
| Stratified sampling | Divide sample space into regions and sample each region |
| Importance sampling | Sample more often from important regions |
| Rao-Blackwellization | Analytically integrate part of the randomness |
In deep learning, control variates are especially important in gradient estimation for stochastic computation graphs and reinforcement learning.
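As a sketch of the control-variate idea (a toy integration problem, not a deep learning example): estimate $\mathbb{E}[e^X]$ for $X \sim \mathrm{Uniform}(0, 1)$, whose true value is $e - 1$, using $g(x) = x$ with known mean $0.5$ as the control:

```python
import math
import random

random.seed(0)
num_samples = 10_000
xs = [random.random() for _ in range(num_samples)]

f_vals = [math.exp(x) for x in xs]   # target: E[e^X] = e - 1 for X ~ U(0, 1)
g_vals = xs                          # control variate with known mean 0.5

plain = sum(f_vals) / num_samples    # ordinary Monte Carlo estimate

# Estimate the coefficient c = Cov(f, g) / Var(g) from the same samples.
mean_g = sum(g_vals) / num_samples
cov_fg = sum((f - plain) * (g - mean_g) for f, g in zip(f_vals, g_vals)) / num_samples
var_g = sum((g - mean_g) ** 2 for g in g_vals) / num_samples
c = cov_fg / var_g

# Subtract the correlated quantity and add back its known expectation.
cv = plain - c * (mean_g - 0.5)
```

Because $e^x$ and $x$ are highly correlated on $[0, 1]$, the corrected estimate has far lower variance than the plain average at the same sample count.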
Monte Carlo Gradients
Some models contain random variables that affect the loss. We then need gradients of expectations:

$$\nabla_\theta\, \mathbb{E}_{p_\theta(z)}[f(z)].$$
There are two common strategies.
The first is the reparameterization gradient. If

$$z = g(\theta, \epsilon), \qquad \epsilon \sim p(\epsilon),$$

then

$$\nabla_\theta\, \mathbb{E}_{p_\theta(z)}[f(z)] = \mathbb{E}_{p(\epsilon)}\!\left[\nabla_\theta\, f(g(\theta, \epsilon))\right].$$
This is used in variational autoencoders and Bayesian neural networks.
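A minimal PyTorch sketch of the reparameterization trick on a toy objective: for $z \sim \mathcal{N}(\mu, \sigma^2)$ and $f(z) = z^2$, we have $\mathbb{E}[z^2] = \mu^2 + \sigma^2$, so the gradient with respect to $\mu$ is $2\mu$. Writing $z = \mu + \sigma\epsilon$ lets autograd differentiate through the sampling step:

```python
import torch

torch.manual_seed(0)
mu = torch.tensor(1.5, requires_grad=True)
log_sigma = torch.tensor(0.0, requires_grad=True)

num_samples = 100_000
eps = torch.randn(num_samples)            # base noise, independent of parameters
z = mu + torch.exp(log_sigma) * eps       # reparameterized samples z ~ N(mu, sigma^2)
loss = (z ** 2).mean()                    # Monte Carlo estimate of E[z^2]
loss.backward()

# Analytically, d/d mu of E[z^2] = 2 * mu = 3.0 here.
grad_mu = mu.grad.item()
```

The same pattern, with a neural network producing `mu` and `log_sigma`, is the sampling step of a variational autoencoder.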
The second is the score-function estimator:

$$\nabla_\theta\, \mathbb{E}_{p_\theta(z)}[f(z)] = \mathbb{E}_{p_\theta(z)}\!\left[f(z)\, \nabla_\theta \log p_\theta(z)\right].$$
This estimator works for discrete variables, but it often has high variance. It appears in reinforcement learning as the policy gradient estimator.
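A sketch for a discrete case, with a toy choice of $f$: for $z \sim \mathrm{Bernoulli}(\theta)$ and $f(z) = z$, the expectation is $\theta$, so the true gradient is 1. The score is $\nabla_\theta \log p_\theta(z) = z/\theta - (1 - z)/(1 - \theta)$:

```python
import random

random.seed(0)
theta = 0.3
num_samples = 200_000

total = 0.0
for _ in range(num_samples):
    z = 1 if random.random() < theta else 0      # z ~ Bernoulli(theta)
    f = float(z)                                 # f(z) = z, so E[f] = theta
    score = z / theta - (1 - z) / (1 - theta)    # d/d theta of log p(z; theta)
    total += f * score                           # score-function gradient sample
grad_estimate = total / num_samples              # true gradient is 1.0
```

Note how many samples this needs for a stable answer compared with the reparameterization example; the high variance is the usual price of the score-function estimator.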
Monte Carlo in Probabilistic Deep Learning
Monte Carlo methods appear throughout modern deep learning:
| Area | Monte Carlo role |
|---|---|
| Bayesian neural networks | Approximate posterior predictive distributions |
| Variational inference | Estimate ELBO expectations |
| Variational autoencoders | Sample latent variables |
| Diffusion models | Sample reverse stochastic processes |
| Reinforcement learning | Estimate returns and policy gradients |
| Dropout uncertainty | Sample subnetworks |
| Ensembles | Average over trained models |
| Probabilistic forecasting | Estimate predictive intervals |
The common pattern is the same: replace an exact expectation with a sample average.
Practical Guidance
Monte Carlo methods are simple but easy to misuse.
Use enough samples to stabilize the estimate. For rough uncertainty, 10 to 30 samples may be enough. For calibrated uncertainty, more samples may be needed.
Average probabilities, not class labels.
For classification, measure uncertainty using entropy, predictive variance, or mutual information rather than relying only on the maximum probability.
For stochastic layers, separate training randomness from inference-time sampling. Some layers, such as dropout and batch normalization, behave differently in train and eval modes.
Track compute cost. Monte Carlo inference multiplies serving cost by the number of samples.
Summary
Monte Carlo methods approximate expectations and integrals using random samples.
They are central to probabilistic deep learning because exact inference is usually intractable. Bayesian prediction, variational inference, dropout uncertainty, latent variable models, and reinforcement learning all rely on Monte Carlo estimation.
The main advantage is generality. The same sample-average principle applies to many models. The main cost is variance. More samples improve accuracy, but also increase computation.