Probabilistic deep learning adds distributions to ordinary neural networks. A model no longer predicts only a value. It predicts parameters of a probability distribution, samples from latent variables, estimates uncertainty, or defines a likelihood for observed data.
In PyTorch, this usually means combining three pieces:
| Piece | Role |
|---|---|
| Neural network | Computes distribution parameters |
| Probability distribution | Defines likelihood or sampling rule |
| Loss function | Optimizes negative log likelihood or ELBO |
The neural network remains deterministic in its computation. The probabilistic part appears in how we interpret its outputs.
## Deterministic Versus Probabilistic Outputs
A deterministic regression model predicts a single point estimate:

$$\hat{y} = f_\theta(x)$$

A probabilistic regression model predicts a distribution:

$$p(y \mid x) = \mathcal{N}\big(y;\ \mu_\theta(x),\ \sigma_\theta(x)^2\big)$$

The network outputs both $\mu_\theta(x)$ and $\sigma_\theta(x)$. The mean gives the central prediction. The variance gives uncertainty.

For classification, the network outputs logits:

$$\ell = f_\theta(x) \in \mathbb{R}^K$$

The logits define a categorical distribution:

$$p(y = k \mid x) = \operatorname{softmax}(\ell)_k = \frac{\exp(\ell_k)}{\sum_{j=1}^{K} \exp(\ell_j)}$$

Thus, ordinary classification with cross-entropy already has a probabilistic interpretation.
## Using `torch.distributions`

PyTorch provides probability distributions through the `torch.distributions` module.
A normal distribution can be created as follows:
```python
import torch
from torch.distributions import Normal

mu = torch.tensor([0.0])
sigma = torch.tensor([1.0])

dist = Normal(mu, sigma)
sample = dist.sample()
log_prob = dist.log_prob(sample)
```

The method `sample()` draws random values. The method `log_prob()` evaluates the log probability density.

This is useful because many probabilistic losses are negative log likelihoods:

$$\mathcal{L}(\theta) = -\log p_\theta(y \mid x)$$
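As a quick sanity check on `log_prob`, the value for a Normal matches the closed-form Gaussian log density. A small sketch (the parameter values here are arbitrary):

```python
import math
import torch
from torch.distributions import Normal

mu, sigma = 1.5, 0.7
dist = Normal(torch.tensor(mu), torch.tensor(sigma))
y = torch.tensor(2.0)

# Closed-form log density of N(mu, sigma^2) evaluated at y
manual = -0.5 * ((y - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)

assert torch.allclose(dist.log_prob(y), manual, atol=1e-6)
```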
## Gaussian Regression Model
A Gaussian regression model predicts a mean and a variance.
```python
import torch
from torch import nn
from torch.distributions import Normal

class GaussianRegressor(nn.Module):
    def __init__(self, in_features, hidden_features):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden_features),
            nn.ReLU(),
            nn.Linear(hidden_features, hidden_features),
            nn.ReLU(),
        )
        self.mean = nn.Linear(hidden_features, 1)
        self.log_std = nn.Linear(hidden_features, 1)

    def forward(self, x):
        h = self.net(x)
        mean = self.mean(h)
        log_std = self.log_std(h).clamp(-10.0, 5.0)
        std = log_std.exp()
        return Normal(mean, std)
```

The model returns a distribution object, not a raw tensor.
Training uses negative log likelihood:
```python
def gaussian_nll_loss(model, x, y):
    dist = model(x)
    loss = -dist.log_prob(y).mean()
    return loss
```

This loss rewards accurate means and calibrated variances. A model cannot make the variance arbitrarily large without paying a log-standard-deviation penalty.
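A minimal training sketch on synthetic heteroscedastic data, where the noise level depends on the input. The architecture and hyperparameters here are illustrative choices, not prescriptions:

```python
import torch
from torch import nn
from torch.distributions import Normal

torch.manual_seed(0)

# Synthetic data: noise magnitude grows with |x|
x = torch.rand(256, 1) * 4 - 2
y = torch.sin(x) + torch.randn_like(x) * (0.1 + 0.2 * x.abs())

# Tiny two-headed network (same idea as GaussianRegressor above)
body = nn.Sequential(nn.Linear(1, 32), nn.ReLU())
mean_head = nn.Linear(32, 1)
log_std_head = nn.Linear(32, 1)
params = list(body.parameters()) + list(mean_head.parameters()) + list(log_std_head.parameters())
opt = torch.optim.Adam(params, lr=1e-2)

losses = []
for _ in range(200):
    h = body(x)
    dist = Normal(mean_head(h), log_std_head(h).clamp(-10.0, 5.0).exp())
    loss = -dist.log_prob(y).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())
```

After training, the predicted standard deviation should be larger for inputs far from zero, where the data is noisier.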
## Bernoulli Models
For binary prediction, the Bernoulli distribution is common.
A network can output logits:
```python
from torch.distributions import Bernoulli

class BernoulliClassifier(nn.Module):
    def __init__(self, in_features, hidden_features):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden_features),
            nn.ReLU(),
            nn.Linear(hidden_features, 1),
        )

    def forward(self, x):
        logits = self.net(x)
        return Bernoulli(logits=logits)
```

The loss is again negative log likelihood:

```python
def bernoulli_nll_loss(model, x, y):
    dist = model(x)
    return -dist.log_prob(y).mean()
```

Here `y` should contain floating-point values in {0, 1}.

This is equivalent to binary cross-entropy with logits, but the probabilistic version makes the modeling assumption explicit.
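The equivalence can be verified numerically. A small sketch with arbitrary logits and targets:

```python
import torch
import torch.nn.functional as F
from torch.distributions import Bernoulli

torch.manual_seed(0)
logits = torch.randn(16, 1)
y = torch.randint(0, 2, (16, 1)).float()   # targets as floats in {0, 1}

# NLL via the distribution object
nll_dist = -Bernoulli(logits=logits).log_prob(y).mean()

# The same quantity via the standard loss function
nll_bce = F.binary_cross_entropy_with_logits(logits, y)

assert torch.allclose(nll_dist, nll_bce, atol=1e-6)
```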
## Categorical Models
Multiclass classification uses the categorical distribution.
```python
from torch.distributions import Categorical

class CategoricalClassifier(nn.Module):
    def __init__(self, in_features, hidden_features, num_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden_features),
            nn.ReLU(),
            nn.Linear(hidden_features, num_classes),
        )

    def forward(self, x):
        logits = self.net(x)
        return Categorical(logits=logits)
```

The negative log likelihood is:

```python
def categorical_nll_loss(model, x, y):
    dist = model(x)
    return -dist.log_prob(y).mean()
```

Here `y` should contain integer class indices.

This is equivalent to `torch.nn.functional.cross_entropy`, but the distribution form generalizes more naturally to probabilistic models.
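As with the Bernoulli case, the equivalence can be checked numerically. A small sketch with arbitrary logits and labels:

```python
import torch
import torch.nn.functional as F
from torch.distributions import Categorical

torch.manual_seed(0)
logits = torch.randn(8, 5)          # batch of 8, 5 classes
y = torch.randint(0, 5, (8,))       # integer class indices

# NLL via the distribution object
nll_dist = -Categorical(logits=logits).log_prob(y).mean()

# The same quantity via the standard classification loss
nll_ce = F.cross_entropy(logits, y)

assert torch.allclose(nll_dist, nll_ce, atol=1e-6)
```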
## Mixture Density Networks
A single Gaussian may be too simple. Some regression problems are multimodal: the same input can lead to several plausible outputs.
A mixture density network predicts a mixture distribution:

$$p(y \mid x) = \sum_{k=1}^{K} \pi_k(x)\, \mathcal{N}\big(y;\ \mu_k(x),\ \sigma_k(x)^2\big)$$

Here:

| Symbol | Meaning |
|---|---|
| $K$ | Number of mixture components |
| $\pi_k(x)$ | Mixture weight |
| $\mu_k(x)$ | Component mean |
| $\sigma_k(x)$ | Component standard deviation |
In PyTorch:
```python
from torch.distributions import Categorical, Normal, MixtureSameFamily

class MixtureDensityNetwork(nn.Module):
    def __init__(self, in_features, hidden_features, num_components):
        super().__init__()
        self.num_components = num_components
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden_features),
            nn.Tanh(),
            nn.Linear(hidden_features, hidden_features),
            nn.Tanh(),
        )
        self.logits = nn.Linear(hidden_features, num_components)
        self.means = nn.Linear(hidden_features, num_components)
        self.log_stds = nn.Linear(hidden_features, num_components)

    def forward(self, x):
        h = self.net(x)
        logits = self.logits(h)
        means = self.means(h)
        stds = self.log_stds(h).clamp(-10.0, 5.0).exp()
        mixture = Categorical(logits=logits)
        components = Normal(means, stds)
        return MixtureSameFamily(mixture, components)
```

Training:

```python
def mdn_loss(model, x, y):
    dist = model(x)
    return -dist.log_prob(y.squeeze(-1)).mean()
```

Mixture density networks are useful when the conditional target distribution has several modes.
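The `squeeze(-1)` matters because `MixtureSameFamily` here has a scalar event shape: a target of shape `(batch, 1)` would broadcast against the batch dimension instead of aligning with it. A small shape check (parameters are random, no training involved):

```python
import torch
from torch.distributions import Categorical, Normal, MixtureSameFamily

torch.manual_seed(0)
batch, K = 4, 3
logits = torch.randn(batch, K)
means = torch.randn(batch, K)
stds = torch.rand(batch, K) + 0.5

mix = MixtureSameFamily(Categorical(logits=logits), Normal(means, stds))

y = torch.randn(batch)            # one scalar target per batch element
log_p = mix.log_prob(y)           # one log likelihood per batch element
samples = mix.sample((10,))       # 10 draws per batch element
```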
## Latent Variable Models
A latent variable model introduces an unobserved variable $z$:

$$p(x, z) = p(x \mid z)\, p(z)$$

The latent variable may represent hidden factors of variation. In an image model, $z$ may encode object identity, pose, lighting, or style.

The marginal likelihood is

$$p(x) = \int p(x \mid z)\, p(z)\, dz$$

This integral is usually intractable. Variational inference introduces an approximate posterior:

$$q_\phi(z \mid x) \approx p(z \mid x)$$
This is the foundation of variational autoencoders.
## Minimal Variational Autoencoder
A variational autoencoder contains an encoder, a decoder, and a latent distribution.
The encoder predicts the approximate posterior:

$$q_\phi(z \mid x) = \mathcal{N}\big(z;\ \mu_\phi(x),\ \sigma_\phi(x)^2\big)$$

The decoder predicts the conditional likelihood:

$$p_\theta(x \mid z)$$
A minimal PyTorch VAE:
```python
import torch
from torch import nn
from torch.distributions import Normal, Bernoulli

class VAE(nn.Module):
    def __init__(self, input_dim, hidden_dim, latent_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
        )
        self.z_mean = nn.Linear(hidden_dim, latent_dim)
        self.z_log_std = nn.Linear(hidden_dim, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
        )

    def encode(self, x):
        h = self.encoder(x)
        mean = self.z_mean(h)
        log_std = self.z_log_std(h).clamp(-10.0, 5.0)
        std = log_std.exp()
        return Normal(mean, std)

    def decode(self, z):
        logits = self.decoder(z)
        return Bernoulli(logits=logits)

    def forward(self, x):
        q_z = self.encode(x)
        z = q_z.rsample()
        p_x = self.decode(z)
        return p_x, q_z, z
```

The method `rsample()` uses the reparameterization trick, allowing gradients to pass through the sampled latent variable.
## VAE Loss
The VAE optimizes the negative ELBO:

$$-\mathrm{ELBO}(x) = -\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] + \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)$$

For a standard normal prior $p(z) = \mathcal{N}(0, I)$, the KL term has a closed form when $q_\phi(z \mid x)$ is a diagonal Gaussian.
```python
def vae_loss(model, x):
    p_x, q_z, z = model(x)
    reconstruction_loss = -p_x.log_prob(x).sum(dim=-1).mean()
    mean = q_z.loc
    std = q_z.scale
    var = std.pow(2)
    kl = 0.5 * (mean.pow(2) + var - var.log() - 1.0).sum(dim=-1).mean()
    return reconstruction_loss + kl
```

The reconstruction term trains the decoder. The KL term keeps the approximate posterior close to the prior.
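The closed-form KL term used above can be checked against `torch.distributions.kl_divergence`. A small sketch with arbitrary posterior parameters:

```python
import torch
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)
mean = torch.randn(8, 4)
std = torch.rand(8, 4) + 0.1

q_z = Normal(mean, std)
prior = Normal(torch.zeros_like(mean), torch.ones_like(mean))

# Closed form, as in vae_loss
var = std.pow(2)
kl_manual = 0.5 * (mean.pow(2) + var - var.log() - 1.0).sum(dim=-1)

# Library computation, summed over latent dimensions
kl_lib = kl_divergence(q_z, prior).sum(dim=-1)

assert torch.allclose(kl_manual, kl_lib, atol=1e-5)
```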
## Sampling from a VAE
After training, we can generate samples by drawing from the prior:

$$z \sim \mathcal{N}(0, I)$$

then decoding:

$$x \sim p_\theta(x \mid z)$$
In PyTorch:
```python
@torch.no_grad()
def sample_vae(model, num_samples, latent_dim, device):
    z = torch.randn(num_samples, latent_dim, device=device)
    p_x = model.decode(z)
    return p_x.sample()
```

For image data, the generated tensor can be reshaped back into image format.
## Normalizing Flows
Normalizing flows transform a simple distribution into a complex distribution using invertible mappings.
Let

$$z \sim p_z(z)$$

and

$$x = f(z)$$

where $f$ is invertible. The density of $x$ is computed using the change-of-variables formula:

$$p_x(x) = p_z\big(f^{-1}(x)\big)\, \left|\det \frac{\partial f^{-1}(x)}{\partial x}\right|$$
Flows are useful because they provide exact likelihoods and expressive distributions.
Their main constraint is architectural: transformations must be invertible and have tractable Jacobian determinants.
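PyTorch's `TransformedDistribution` implements this formula for a chain of invertible transforms. A one-transform sketch, using an affine map because its pushforward density is known exactly:

```python
import torch
from torch.distributions import Normal, TransformedDistribution
from torch.distributions.transforms import AffineTransform

base = Normal(torch.tensor(0.0), torch.tensor(1.0))

# A one-layer "flow": x = 2z + 3 (invertible, with tractable log-det)
flow = TransformedDistribution(base, [AffineTransform(loc=3.0, scale=2.0)])

x = torch.tensor(3.0)
# The change of variables must reproduce the density of N(3, 2^2)
target = Normal(torch.tensor(3.0), torch.tensor(2.0)).log_prob(x)
assert torch.allclose(flow.log_prob(x), target, atol=1e-6)
```

Real flows stack many such transforms, with the affine parameters themselves produced by neural networks.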
## Probabilistic Forecasting
Probabilistic modeling is also useful in time series and forecasting.
Instead of predicting a single future value, the model predicts a distribution over future values:

$$p(y_{t+1} \mid y_{1:t})$$
Common output distributions include:
| Distribution | Use case |
|---|---|
| Gaussian | Continuous targets |
| Student-t | Heavy-tailed noise |
| Negative binomial | Count data |
| Quantile outputs | Prediction intervals |
| Mixture distributions | Multimodal futures |
A probabilistic forecast can answer questions such as:
- What is the expected value?
- What is the 90 percent prediction interval?
- What is the probability of exceeding a threshold?
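With a distribution object, these questions map onto `mean`, `icdf`, and `cdf`. A sketch assuming a Gaussian predictive distribution with illustrative parameters:

```python
import torch
from torch.distributions import Normal

# Suppose a model produced this predictive distribution for the next step
forecast = Normal(torch.tensor(10.0), torch.tensor(2.0))

expected = forecast.mean                               # point forecast
low = forecast.icdf(torch.tensor(0.05))                # 90% prediction interval
high = forecast.icdf(torch.tensor(0.95))
p_exceed = 1.0 - forecast.cdf(torch.tensor(13.0))      # P(y > 13)
```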
## Choosing a Distribution
The output distribution should match the target type.
| Target type | Common distribution |
|---|---|
| Binary value | Bernoulli |
| Class label | Categorical |
| Real value | Normal, Student-t, mixture of normals |
| Positive value | LogNormal, Gamma |
| Count | Poisson, negative binomial |
| Bounded value | Beta |
| Sequence | Autoregressive categorical or continuous likelihood |
Poor distribution choice can produce misleading uncertainty. For example, a Gaussian likelihood handles symmetric noise well but can perform poorly on heavy-tailed data.
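The effect shows up directly in the likelihood an outlier receives. A small sketch comparing a Gaussian with a heavy-tailed Student-t (parameters chosen arbitrarily):

```python
import torch
from torch.distributions import Normal, StudentT

gauss = Normal(torch.tensor(0.0), torch.tensor(1.0))
heavy = StudentT(df=torch.tensor(3.0), loc=torch.tensor(0.0), scale=torch.tensor(1.0))

outlier = torch.tensor(6.0)
# The Gaussian assigns vanishing likelihood to the outlier;
# the Student-t penalizes it far less, so training is less distorted by it
assert heavy.log_prob(outlier) > gauss.log_prob(outlier)
```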
## Training Loop Pattern
Most probabilistic models follow a common training pattern:
```python
def train_step(model, optimizer, x, y):
    model.train()
    optimizer.zero_grad()
    dist = model(x)
    loss = -dist.log_prob(y).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```

For latent variable models, the loss may include extra KL terms:

```python
loss = reconstruction_loss + kl_loss
```

For Bayesian neural networks, the loss may include a KL term between the variational posterior and the prior.
## Numerical Stability
Probabilistic models require careful numerical handling.
Use logits instead of probabilities when possible. For example:

```python
Bernoulli(logits=logits)
Categorical(logits=logits)
```

This avoids unstable calls to `log(sigmoid(x))` or `log(softmax(x))`.

Clamp log standard deviations:

```python
log_std = log_std.clamp(-10.0, 5.0)
```

This prevents standard deviations from becoming zero or exploding.

Add small epsilons when computing logs manually:

```python
torch.log(x + 1e-8)
```

Prefer library distribution methods when possible, since they often implement stable formulas internally.
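The instability of `log(sigmoid(x))` is easy to demonstrate at extreme logits; the input here is chosen so the sigmoid underflows in float32:

```python
import torch
import torch.nn.functional as F

x = torch.tensor(-200.0)

naive = torch.log(torch.sigmoid(x))   # sigmoid underflows to 0, log gives -inf
stable = F.logsigmoid(x)              # computed in log space, stays finite
```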
## Evaluation
Probabilistic models should be evaluated with probabilistic metrics.
| Metric | Meaning |
|---|---|
| Negative log likelihood | Quality of predicted distribution |
| Calibration error | Whether probabilities match frequencies |
| Prediction interval coverage | Whether intervals contain true values |
| Sharpness | How concentrated predictions are |
| Brier score | Accuracy of probabilistic classification |
| CRPS | Probabilistic forecasting quality |
Accuracy alone is insufficient. A probabilistic model should be both accurate and calibrated.
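Prediction interval coverage can be checked directly. A sketch using simulated data where the model is perfectly calibrated by construction, so empirical coverage should sit near the nominal 90 percent:

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)

# A perfectly calibrated "model": predictions match the data-generating process
y_true = torch.randn(10000)
pred = Normal(torch.zeros(10000), torch.ones(10000))

low = pred.icdf(torch.tensor(0.05))
high = pred.icdf(torch.tensor(0.95))
coverage = ((y_true >= low) & (y_true <= high)).float().mean()
# Empirical coverage of the 90% interval, close to 0.90 here
```

A miscalibrated model shows up as coverage well above or below the nominal level.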
## Summary
Practical probabilistic modeling in PyTorch usually means making neural networks output probability distributions.
Regression models can output Gaussian or mixture distributions. Classifiers define Bernoulli or categorical distributions. Latent variable models introduce hidden variables and optimize ELBO-style objectives.
The key implementation pattern is simple: the network computes distribution parameters, the distribution defines log_prob, and training minimizes negative log likelihood or negative ELBO.