Practical Probabilistic Modeling in PyTorch

Probabilistic deep learning adds distributions to ordinary neural networks. A model no longer predicts only a value. It predicts parameters of a probability distribution, samples from latent variables, estimates uncertainty, or defines a likelihood for observed data.

In PyTorch, this usually means combining three pieces:

Piece                    | Role
Neural network           | Computes distribution parameters
Probability distribution | Defines likelihood or sampling rule
Loss function            | Optimizes negative log likelihood or ELBO

The neural network remains deterministic in its computation. The probabilistic part appears in how we interpret its outputs.

Deterministic Versus Probabilistic Outputs

A deterministic regression model predicts

\hat{y} = f_\theta(x).

A probabilistic regression model predicts a distribution:

p(y \mid x) = \mathcal{N}(y; \mu_\theta(x), \sigma_\theta^2(x)).

The network outputs both \mu_\theta(x) and \sigma_\theta^2(x). The mean gives the central prediction. The variance gives uncertainty.

For classification, the network outputs logits:

z = f_\theta(x).

The logits define a categorical distribution:

p(y = k \mid x) = \mathrm{softmax}(z)_k.

Thus, ordinary classification with cross-entropy already has a probabilistic interpretation.

Using torch.distributions

PyTorch provides probability distributions through torch.distributions.

A normal distribution can be created as follows:

import torch
from torch.distributions import Normal

mu = torch.tensor([0.0])
sigma = torch.tensor([1.0])

dist = Normal(mu, sigma)

sample = dist.sample()
log_prob = dist.log_prob(sample)

The method sample() draws random values. The method log_prob() evaluates the log probability density.

This is useful because many probabilistic losses are negative log likelihoods:

\mathcal{L} = -\log p(y \mid x).
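
As a minimal sketch with made-up observations, the negative log likelihood follows directly from log_prob:

import torch
from torch.distributions import Normal

# Hypothetical observed values and a unit Gaussian centered at zero
y = torch.tensor([0.5, -1.2, 0.3])
dist = Normal(torch.zeros(3), torch.ones(3))

# Negative log likelihood averaged over the observations
nll = -dist.log_prob(y).mean()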

Gaussian Regression Model

A Gaussian regression model predicts a mean and a variance.

import torch
from torch import nn
from torch.distributions import Normal

class GaussianRegressor(nn.Module):
    def __init__(self, in_features, hidden_features):
        super().__init__()

        self.net = nn.Sequential(
            nn.Linear(in_features, hidden_features),
            nn.ReLU(),
            nn.Linear(hidden_features, hidden_features),
            nn.ReLU(),
        )

        self.mean = nn.Linear(hidden_features, 1)
        self.log_std = nn.Linear(hidden_features, 1)

    def forward(self, x):
        h = self.net(x)

        mean = self.mean(h)
        log_std = self.log_std(h).clamp(-10.0, 5.0)
        std = log_std.exp()

        return Normal(mean, std)

The model returns a distribution object, not a raw tensor.

Training uses negative log likelihood:

def gaussian_nll_loss(model, x, y):
    dist = model(x)
    loss = -dist.log_prob(y).mean()
    return loss

This loss rewards accurate means and calibrated variances. A model cannot make the variance arbitrarily large to hide errors, because the \log \sigma term of the negative log likelihood penalizes large standard deviations.
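
A minimal training sketch might look like the following. The optimizer choice, learning rate, feature dimensions, and random data are illustrative assumptions, not part of the model above:

# Illustrative setup: random regression data and an Adam optimizer
model = GaussianRegressor(in_features=8, hidden_features=64)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(128, 8)
y = torch.randn(128, 1)

for step in range(100):
    optimizer.zero_grad()
    loss = gaussian_nll_loss(model, x, y)
    loss.backward()
    optimizer.step()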

Bernoulli Models

For binary prediction, the Bernoulli distribution is common.

A network can output logits:

from torch.distributions import Bernoulli

class BernoulliClassifier(nn.Module):
    def __init__(self, in_features, hidden_features):
        super().__init__()

        self.net = nn.Sequential(
            nn.Linear(in_features, hidden_features),
            nn.ReLU(),
            nn.Linear(hidden_features, 1),
        )

    def forward(self, x):
        logits = self.net(x)
        return Bernoulli(logits=logits)

The loss is again negative log likelihood:

def bernoulli_nll_loss(model, x, y):
    dist = model(x)
    return -dist.log_prob(y).mean()

This is equivalent to binary cross-entropy with logits, but the probabilistic version makes the modeling assumption explicit.
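
The equivalence can be checked numerically; the shapes and random data below are illustrative:

import torch
import torch.nn.functional as F
from torch.distributions import Bernoulli

logits = torch.randn(16, 1)
y = torch.randint(0, 2, (16, 1)).float()

nll = -Bernoulli(logits=logits).log_prob(y).mean()
bce = F.binary_cross_entropy_with_logits(logits, y)

# The two values agree up to floating-point error
print(torch.allclose(nll, bce))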

Categorical Models

Multiclass classification uses the categorical distribution.

from torch.distributions import Categorical

class CategoricalClassifier(nn.Module):
    def __init__(self, in_features, hidden_features, num_classes):
        super().__init__()

        self.net = nn.Sequential(
            nn.Linear(in_features, hidden_features),
            nn.ReLU(),
            nn.Linear(hidden_features, num_classes),
        )

    def forward(self, x):
        logits = self.net(x)
        return Categorical(logits=logits)

The negative log likelihood is:

def categorical_nll_loss(model, x, y):
    dist = model(x)
    return -dist.log_prob(y).mean()

Here y should contain integer class indices.

This is equivalent to torch.nn.functional.cross_entropy, but the distribution form generalizes more naturally to probabilistic models.
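
Again, the equivalence can be verified with a quick check on random data (shapes are illustrative):

import torch
import torch.nn.functional as F
from torch.distributions import Categorical

logits = torch.randn(16, 5)
y = torch.randint(0, 5, (16,))

nll = -Categorical(logits=logits).log_prob(y).mean()
ce = F.cross_entropy(logits, y)

# Both compute the same negative log likelihood
print(torch.allclose(nll, ce))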

Mixture Density Networks

A single Gaussian may be too simple. Some regression problems are multimodal: the same input can lead to several plausible outputs.

A mixture density network predicts a mixture distribution:

p(y \mid x) = \sum_{k=1}^{K} \pi_k(x)\, \mathcal{N}(y; \mu_k(x), \sigma_k^2(x)).

Here:

Symbol      | Meaning
K           | Number of mixture components
\pi_k(x)    | Mixture weight
\mu_k(x)    | Component mean
\sigma_k(x) | Component standard deviation

In PyTorch:

from torch.distributions import Categorical, Normal, MixtureSameFamily

class MixtureDensityNetwork(nn.Module):
    def __init__(self, in_features, hidden_features, num_components):
        super().__init__()

        self.num_components = num_components

        self.net = nn.Sequential(
            nn.Linear(in_features, hidden_features),
            nn.Tanh(),
            nn.Linear(hidden_features, hidden_features),
            nn.Tanh(),
        )

        self.logits = nn.Linear(hidden_features, num_components)
        self.means = nn.Linear(hidden_features, num_components)
        self.log_stds = nn.Linear(hidden_features, num_components)

    def forward(self, x):
        h = self.net(x)

        logits = self.logits(h)
        means = self.means(h)
        stds = self.log_stds(h).clamp(-10.0, 5.0).exp()

        mixture = Categorical(logits=logits)
        components = Normal(means, stds)

        return MixtureSameFamily(mixture, components)

Training:

def mdn_loss(model, x, y):
    dist = model(x)
    return -dist.log_prob(y.squeeze(-1)).mean()

Mixture density networks are useful when the conditional target distribution has several modes.
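
After training, the returned MixtureSameFamily object can be queried directly. This usage sketch assumes eight input features and five components, which are illustrative values:

model = MixtureDensityNetwork(in_features=8, hidden_features=64, num_components=5)
x = torch.randn(4, 8)

dist = model(x)

mean = dist.mean              # mixture mean per input, shape (4,)
samples = dist.sample((10,))  # 10 plausible outputs per input, shape (10, 4)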

Latent Variable Models

A latent variable model introduces an unobserved variable z:

p_\theta(x, z) = p_\theta(x \mid z)\, p(z).

The latent variable may represent hidden factors of variation. In an image model, z may encode object identity, pose, lighting, or style.

The marginal likelihood is

p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz.

This integral is usually intractable. Variational inference introduces an approximate posterior:

q_\phi(z \mid x).

This is the foundation of variational autoencoders.

Minimal Variational Autoencoder

A variational autoencoder contains an encoder, a decoder, and a latent distribution.

The encoder predicts

q_\phi(z \mid x) = \mathcal{N}(z; \mu_\phi(x), \sigma_\phi^2(x)).

The decoder predicts

p_\theta(x \mid z).

A minimal PyTorch VAE:

import torch
from torch import nn
from torch.distributions import Normal, Bernoulli
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim, hidden_dim, latent_dim):
        super().__init__()

        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
        )

        self.z_mean = nn.Linear(hidden_dim, latent_dim)
        self.z_log_std = nn.Linear(hidden_dim, latent_dim)

        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
        )

    def encode(self, x):
        h = self.encoder(x)
        mean = self.z_mean(h)
        log_std = self.z_log_std(h).clamp(-10.0, 5.0)
        std = log_std.exp()

        return Normal(mean, std)

    def decode(self, z):
        logits = self.decoder(z)
        return Bernoulli(logits=logits)

    def forward(self, x):
        q_z = self.encode(x)
        z = q_z.rsample()
        p_x = self.decode(z)

        return p_x, q_z, z

The method rsample() uses the reparameterization trick, allowing gradients to pass through the sampled latent variable.
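
For a diagonal Gaussian, rsample() is equivalent to the following manual reparameterization, shown here only as a sketch for intuition:

# Sketch: manual reparameterization for a diagonal Gaussian
mean = torch.randn(4, 2, requires_grad=True)
log_std = torch.randn(4, 2, requires_grad=True)
std = log_std.exp()

eps = torch.randn_like(std)  # noise drawn independently of the parameters
z = mean + std * eps         # differentiable w.r.t. mean and log_std, like rsample()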

VAE Loss

The VAE optimizes the negative ELBO.

\mathcal{L} = -\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] + D_{\mathrm{KL}}(q_\phi(z \mid x) \,\|\, p(z)).

For a standard normal prior p(z) = \mathcal{N}(0, I), the KL term has a closed form when q_\phi is a diagonal Gaussian.

def vae_loss(model, x):
    p_x, q_z, z = model(x)

    reconstruction_loss = -p_x.log_prob(x).sum(dim=-1).mean()

    mean = q_z.loc
    std = q_z.scale
    var = std.pow(2)

    kl = 0.5 * (
        mean.pow(2) + var - var.log() - 1.0
    ).sum(dim=-1).mean()

    return reconstruction_loss + kl

The reconstruction term trains the decoder. The KL term keeps the approximate posterior close to the prior.
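
A minimal training step for the VAE could look like this; the optimizer is an assumption for illustration:

def vae_train_step(model, optimizer, x):
    model.train()
    optimizer.zero_grad()

    # Negative ELBO: reconstruction term plus KL term
    loss = vae_loss(model, x)
    loss.backward()
    optimizer.step()

    return loss.item()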

Sampling from a VAE

After training, we can generate samples by drawing from the prior:

z \sim \mathcal{N}(0, I),

then decoding:

x \sim p_\theta(x \mid z).

In PyTorch:

@torch.no_grad()
def sample_vae(model, num_samples, latent_dim, device):
    z = torch.randn(num_samples, latent_dim, device=device)
    p_x = model.decode(z)

    return p_x.sample()

For image data, the generated tensor can be reshaped back into image format.
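
For example, assuming 28-by-28 grayscale images flattened to 784-dimensional inputs and a 32-dimensional latent space (illustrative assumptions), the samples can be reshaped for viewing:

samples = sample_vae(model, num_samples=16, latent_dim=32, device="cpu")
images = samples.view(-1, 1, 28, 28)  # assumes input_dim == 784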

Normalizing Flows

Normalizing flows transform a simple distribution into a complex distribution using invertible mappings.

Let

z_0 \sim p_0(z_0)

and

z_K = f_K \circ \cdots \circ f_1(z_0).

The density is computed using the change-of-variables formula:

\log p_K(z_K) = \log p_0(z_0) - \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right|.

Flows are useful because they provide exact likelihoods and expressive distributions.

Their main constraint is architectural: transformations must be invertible and have tractable Jacobian determinants.
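
PyTorch exposes this machinery through torch.distributions.TransformedDistribution. The sketch below builds a single-step flow with a fixed affine transform; the loc and scale values are illustrative, and a learned flow would make them outputs of a network:

import torch
from torch.distributions import Normal, TransformedDistribution
from torch.distributions.transforms import AffineTransform

# Base distribution p_0
base = Normal(torch.zeros(2), torch.ones(2))

# One invertible transform f with a tractable log-det Jacobian
flow = TransformedDistribution(
    base,
    [AffineTransform(loc=torch.tensor([1.0, -1.0]), scale=torch.tensor([2.0, 0.5]))],
)

x = flow.sample()
log_prob = flow.log_prob(x)  # applies the change-of-variables formula internally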

Probabilistic Forecasting

Probabilistic modeling is also useful in time series and forecasting.

Instead of predicting a single future value, the model predicts a distribution over future values:

p(y_{t+1:t+H} \mid x_{1:t}).

Common output distributions include:

Distribution          | Use case
Gaussian              | Continuous targets
Student-t             | Heavy-tailed noise
Negative binomial     | Count data
Quantile outputs      | Prediction intervals
Mixture distributions | Multimodal futures

A probabilistic forecast can answer questions such as the following (a short sketch after the list shows how to compute them from a Normal output):

  • What is the expected value?
  • What is the 90 percent prediction interval?
  • What is the probability of exceeding a threshold?
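
Given a Gaussian predictive distribution, these quantities follow directly from the distribution object. The parameters and threshold below are illustrative:

import torch
from torch.distributions import Normal

# Suppose the model predicted this distribution for the next time step
forecast = Normal(loc=torch.tensor(10.0), scale=torch.tensor(2.0))

expected = forecast.mean                           # expected value
lower = forecast.icdf(torch.tensor(0.05))          # 5th percentile
upper = forecast.icdf(torch.tensor(0.95))          # 95th percentile -> 90% interval
p_exceed = 1.0 - forecast.cdf(torch.tensor(12.0))  # probability of exceeding 12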

Choosing a Distribution

The output distribution should match the target type.

Target type    | Common distribution
Binary value   | Bernoulli
Class label    | Categorical
Real value     | Normal, Student-t, mixture of normals
Positive value | LogNormal, Gamma
Count          | Poisson, negative binomial
Bounded value  | Beta
Sequence       | Autoregressive categorical or continuous likelihood

Poor distribution choice can produce misleading uncertainty. For example, a Gaussian likelihood may handle symmetric noise well but may perform poorly on heavy-tailed data.

Training Loop Pattern

Most probabilistic models follow a common training pattern:

def train_step(model, optimizer, x, y):
    model.train()

    optimizer.zero_grad()

    dist = model(x)
    loss = -dist.log_prob(y).mean()

    loss.backward()
    optimizer.step()

    return loss.item()

For latent variable models, the loss may include extra KL terms:

loss = reconstruction_loss + kl_loss

For Bayesian neural networks, the loss may include a KL term between the variational posterior and the prior.

Numerical Stability

Probabilistic models require careful numerical handling.

Use logits instead of probabilities when possible. For example:

Bernoulli(logits=logits)
Categorical(logits=logits)

This avoids unstable calls to log(sigmoid(x)) or log(softmax(x)).

Clamp log standard deviations:

log_std = log_std.clamp(-10.0, 5.0)

This prevents standard deviations from becoming zero or exploding.

Add small epsilons when computing logs manually:

torch.log(x + 1e-8)

Prefer library distribution methods when possible, since they often implement stable formulas internally.

Evaluation

Probabilistic models should be evaluated with probabilistic metrics.

Metric                       | Meaning
Negative log likelihood      | Quality of predicted distribution
Calibration error            | Whether probabilities match frequencies
Prediction interval coverage | Whether intervals contain true values
Sharpness                    | How concentrated predictions are
Brier score                  | Accuracy of probabilistic classification
CRPS                         | Probabilistic forecasting quality

Accuracy alone is insufficient. A probabilistic model should be both accurate and calibrated.
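
As a small sketch, assuming a model that returns a Normal distribution (like GaussianRegressor above), negative log likelihood and 90 percent interval coverage can be computed on held-out data:

@torch.no_grad()
def evaluate(model, x, y):
    model.eval()
    dist = model(x)

    # Average negative log likelihood on held-out data
    nll = -dist.log_prob(y).mean()

    # 90 percent central prediction interval and its empirical coverage
    lower = dist.icdf(torch.tensor(0.05))
    upper = dist.icdf(torch.tensor(0.95))
    coverage = ((y >= lower) & (y <= upper)).float().mean()

    return nll.item(), coverage.item()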

Summary

Practical probabilistic modeling in PyTorch usually means making neural networks output probability distributions.

Regression models can output Gaussian or mixture distributions. Classifiers define Bernoulli or categorical distributions. Latent variable models introduce hidden variables and optimize ELBO-style objectives.

The key implementation pattern is simple: the network computes distribution parameters, the distribution defines log_prob, and training minimizes negative log likelihood or negative ELBO.