# Practical Probabilistic Modeling in PyTorch

Probabilistic deep learning adds probability distributions to ordinary neural networks. Instead of predicting only a value, a model predicts the parameters of a distribution, samples latent variables, estimates uncertainty, or defines a likelihood for observed data.

In PyTorch, this usually means combining three pieces:

| Piece | Role |
|---|---|
| Neural network | Computes distribution parameters |
| Probability distribution | Defines likelihood or sampling rule |
| Loss function | Defines the objective: negative log likelihood or negative ELBO |

The neural network remains deterministic in its computation. The probabilistic part appears in how we interpret its outputs.

### Deterministic Versus Probabilistic Outputs

A deterministic regression model predicts

$$
\hat{y}=f_\theta(x).
$$

A probabilistic regression model predicts a distribution:

$$
p(y\mid x) =
\mathcal{N}(y;\mu_\theta(x),\sigma_\theta^2(x)).
$$

The network outputs both $\mu_\theta(x)$ and $\sigma_\theta^2(x)$. The mean gives the central prediction. The variance gives uncertainty.

For classification, the network outputs logits:

$$
z=f_\theta(x).
$$

The logits define a categorical distribution:

$$
p(y=k\mid x) =
\mathrm{softmax}(z)_k.
$$

Thus, ordinary classification with cross-entropy already has a probabilistic interpretation.

### Using `torch.distributions`

PyTorch provides probability distributions through `torch.distributions`.

A normal distribution can be created as follows:

```python
import torch
from torch.distributions import Normal

mu = torch.tensor([0.0])
sigma = torch.tensor([1.0])

dist = Normal(mu, sigma)

sample = dist.sample()
log_prob = dist.log_prob(sample)
```

The method `sample()` draws random values. The method `log_prob()` evaluates the log probability density.
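
Parameters can also be batched; each element then defines an independent distribution, and sampling broadcasts accordingly. A quick sketch of the shapes:

```python
mu = torch.zeros(3)
sigma = torch.ones(3)
batch_dist = Normal(mu, sigma)

batch_dist.sample().shape       # torch.Size([3]): one draw per distribution
batch_dist.sample((10,)).shape  # torch.Size([10, 3]): ten draws per distribution
```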

This is useful because many probabilistic losses are negative log likelihoods:

$$
\mathcal{L} =
-\log p(y\mid x).
$$

### Gaussian Regression Model

A Gaussian regression model predicts a mean and a variance.

```python
import torch
from torch import nn
from torch.distributions import Normal

class GaussianRegressor(nn.Module):
    def __init__(self, in_features, hidden_features):
        super().__init__()

        self.net = nn.Sequential(
            nn.Linear(in_features, hidden_features),
            nn.ReLU(),
            nn.Linear(hidden_features, hidden_features),
            nn.ReLU(),
        )

        self.mean = nn.Linear(hidden_features, 1)
        self.log_std = nn.Linear(hidden_features, 1)

    def forward(self, x):
        h = self.net(x)

        mean = self.mean(h)
        log_std = self.log_std(h).clamp(-10.0, 5.0)
        std = log_std.exp()

        return Normal(mean, std)
```

The model returns a distribution object, not a raw tensor.

Training uses negative log likelihood:

```python
def gaussian_nll_loss(model, x, y):
    dist = model(x)
    loss = -dist.log_prob(y).mean()
    return loss
```

This loss rewards accurate means and calibrated variances. Up to a constant, the Gaussian NLL is $\log\sigma + (y-\mu)^2/(2\sigma^2)$, so the model cannot inflate the variance to explain away errors without paying the $\log\sigma$ penalty.
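
A quick usage sketch (the feature sizes here are illustrative):

```python
model = GaussianRegressor(in_features=8, hidden_features=64)
x = torch.randn(32, 8)

dist = model(x)
point = dist.mean      # central prediction, shape (32, 1)
spread = dist.stddev   # predictive uncertainty, shape (32, 1)
```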

### Bernoulli Models

For binary prediction, the Bernoulli distribution is common.

A network can output logits:

```python
from torch.distributions import Bernoulli

class BernoulliClassifier(nn.Module):
    def __init__(self, in_features, hidden_features):
        super().__init__()

        self.net = nn.Sequential(
            nn.Linear(in_features, hidden_features),
            nn.ReLU(),
            nn.Linear(hidden_features, 1),
        )

    def forward(self, x):
        logits = self.net(x)
        return Bernoulli(logits=logits)
```

The loss is again negative log likelihood:

```python
def bernoulli_nll_loss(model, x, y):
    dist = model(x)
    return -dist.log_prob(y).mean()
```

Here `y` should be a float tensor of zeros and ones. The loss is equivalent to binary cross-entropy with logits, but the probabilistic version makes the modeling assumption explicit.
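
A quick numerical check of that equivalence (shapes are illustrative):

```python
import torch.nn.functional as F

logits = torch.randn(4, 1)
y = torch.randint(0, 2, (4, 1)).float()

nll = -Bernoulli(logits=logits).log_prob(y).mean()
bce = F.binary_cross_entropy_with_logits(logits, y)

torch.testing.assert_close(nll, bce)  # agrees up to floating point
```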

### Categorical Models

Multiclass classification uses the categorical distribution.

```python
from torch.distributions import Categorical

class CategoricalClassifier(nn.Module):
    def __init__(self, in_features, hidden_features, num_classes):
        super().__init__()

        self.net = nn.Sequential(
            nn.Linear(in_features, hidden_features),
            nn.ReLU(),
            nn.Linear(hidden_features, num_classes),
        )

    def forward(self, x):
        logits = self.net(x)
        return Categorical(logits=logits)
```

The negative log likelihood is:

```python
def categorical_nll_loss(model, x, y):
    dist = model(x)
    return -dist.log_prob(y).mean()
```

Here `y` should contain integer class indices.

This is equivalent to `torch.nn.functional.cross_entropy`, but the distribution form generalizes more naturally to probabilistic models.
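
Again, the equivalence can be checked numerically:

```python
import torch.nn.functional as F

logits = torch.randn(4, 3)
y = torch.randint(0, 3, (4,))

nll = -Categorical(logits=logits).log_prob(y).mean()
ce = F.cross_entropy(logits, y)

torch.testing.assert_close(nll, ce)  # agrees up to floating point
```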

### Mixture Density Networks

A single Gaussian may be too simple. Some regression problems are multimodal: the same input can lead to several plausible outputs.

A mixture density network predicts a mixture distribution:

$$
p(y\mid x) =
\sum_{k=1}^{K}
\pi_k(x)
\mathcal{N}(y;\mu_k(x),\sigma_k^2(x)).
$$

Here:

| Symbol | Meaning |
|---|---|
| $K$ | Number of mixture components |
| $\pi_k(x)$ | Mixture weight |
| $\mu_k(x)$ | Component mean |
| $\sigma_k(x)$ | Component standard deviation |

In PyTorch:

```python
from torch.distributions import Categorical, Normal, MixtureSameFamily

class MixtureDensityNetwork(nn.Module):
    def __init__(self, in_features, hidden_features, num_components):
        super().__init__()

        self.num_components = num_components

        self.net = nn.Sequential(
            nn.Linear(in_features, hidden_features),
            nn.Tanh(),
            nn.Linear(hidden_features, hidden_features),
            nn.Tanh(),
        )

        self.logits = nn.Linear(hidden_features, num_components)
        self.means = nn.Linear(hidden_features, num_components)
        self.log_stds = nn.Linear(hidden_features, num_components)

    def forward(self, x):
        h = self.net(x)

        logits = self.logits(h)
        means = self.means(h)
        stds = self.log_stds(h).clamp(-10.0, 5.0).exp()

        mixture = Categorical(logits=logits)
        components = Normal(means, stds)

        return MixtureSameFamily(mixture, components)
```

Training:

```python
def mdn_loss(model, x, y):
    dist = model(x)
    # The mixture has a scalar event shape, so a (batch, 1) target is
    # squeezed to (batch,) before evaluating log_prob.
    return -dist.log_prob(y.squeeze(-1)).mean()
```

Mixture density networks are useful when the conditional target distribution has several modes.
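
A usage sketch (sizes are illustrative); note that the mixture has a scalar event shape:

```python
mdn = MixtureDensityNetwork(in_features=1, hidden_features=32, num_components=5)

x = torch.randn(16, 1)
y = torch.randn(16, 1)

dist = mdn(x)
print(dist.batch_shape)   # torch.Size([16])
samples = dist.sample()   # shape (16,): one draw per input
loss = mdn_loss(mdn, x, y)
```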

### Latent Variable Models

A latent variable model introduces an unobserved variable $z$:

$$
p_\theta(x,z) =
p_\theta(x\mid z)p(z).
$$

The latent variable may represent hidden factors of variation. In an image model, $z$ may encode object identity, pose, lighting, or style.

The marginal likelihood is

$$
p_\theta(x) =
\int p_\theta(x\mid z)p(z)\,dz.
$$

This integral is usually intractable. Variational inference introduces an approximate posterior:

$$
q_\phi(z\mid x).
$$
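
Training then maximizes the evidence lower bound (ELBO), which bounds the intractable marginal likelihood from below:

$$
\log p_\theta(x) \ge
\mathbb{E}_{q_\phi(z\mid x)}
[\log p_\theta(x\mid z)]
-
D_{KL}(q_\phi(z\mid x)\,\|\,p(z)).
$$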

This is the foundation of variational autoencoders.

### Minimal Variational Autoencoder

A variational autoencoder contains an encoder, a decoder, and a latent distribution.

The encoder predicts

$$
q_\phi(z\mid x) =
\mathcal{N}(z;\mu_\phi(x),\sigma_\phi^2(x)).
$$

The decoder predicts

$$
p_\theta(x\mid z).
$$

A minimal PyTorch VAE:

```python
import torch
from torch import nn
from torch.distributions import Normal, Bernoulli

class VAE(nn.Module):
    def __init__(self, input_dim, hidden_dim, latent_dim):
        super().__init__()

        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
        )

        self.z_mean = nn.Linear(hidden_dim, latent_dim)
        self.z_log_std = nn.Linear(hidden_dim, latent_dim)

        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
        )

    def encode(self, x):
        h = self.encoder(x)
        mean = self.z_mean(h)
        log_std = self.z_log_std(h).clamp(-10.0, 5.0)
        std = log_std.exp()

        return Normal(mean, std)

    def decode(self, z):
        logits = self.decoder(z)
        return Bernoulli(logits=logits)

    def forward(self, x):
        q_z = self.encode(x)
        z = q_z.rsample()
        p_x = self.decode(z)

        return p_x, q_z, z
```

The method `rsample()` uses the reparameterization trick, allowing gradients to pass through the sampled latent variable.
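
For a diagonal Gaussian, `rsample()` computes

$$
z = \mu_\phi(x) + \sigma_\phi(x)\odot\epsilon,
\qquad
\epsilon\sim\mathcal{N}(0,I),
$$

so the randomness is isolated in $\epsilon$ and gradients flow to $\mu_\phi$ and $\sigma_\phi$.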

### VAE Loss

The VAE is trained by minimizing the negative ELBO:

$$
\mathcal{L} =
-\mathbb{E}_{q_\phi(z\mid x)}
[
\log p_\theta(x\mid z)
]
+
D_{KL}(q_\phi(z\mid x)\|p(z)).
$$

For a standard normal prior $p(z)=\mathcal{N}(0,I)$, the KL term has a closed form when $q_\phi$ is a diagonal Gaussian.
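
Specifically, with $\mu_j$ and $\sigma_j$ denoting the posterior parameters of latent dimension $j$,

$$
D_{KL}(q_\phi(z\mid x)\,\|\,p(z)) =
\frac{1}{2}\sum_{j}
\left(
\mu_j^2 + \sigma_j^2 - \log\sigma_j^2 - 1
\right),
$$

which is exactly what the code below computes.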

```python
def vae_loss(model, x):
    p_x, q_z, z = model(x)

    reconstruction_loss = -p_x.log_prob(x).sum(dim=-1).mean()

    mean = q_z.loc
    std = q_z.scale
    var = std.pow(2)

    kl = 0.5 * (
        mean.pow(2) + var - var.log() - 1.0
    ).sum(dim=-1).mean()

    return reconstruction_loss + kl
```

The reconstruction term trains the decoder, and the KL term keeps the approximate posterior close to the prior. Because the decoder defines a Bernoulli likelihood, `x` is assumed to be binary (for example, binarized image pixels).

### Sampling from a VAE

After training, we can generate samples by drawing from the prior:

$$
z\sim\mathcal{N}(0,I),
$$

then decoding:

$$
x\sim p_\theta(x\mid z).
$$

In PyTorch:

```python
@torch.no_grad()
def sample_vae(model, num_samples, latent_dim, device):
    z = torch.randn(num_samples, latent_dim, device=device)
    p_x = model.decode(z)

    return p_x.sample()
```

For image data, the generated tensor can be reshaped back into image format. Returning `p_x.mean` (the per-pixel Bernoulli probabilities) instead of a hard sample gives smoother outputs.

### Normalizing Flows

Normalizing flows transform a simple distribution into a complex distribution using invertible mappings.

Let

$$
z_0 \sim p_0(z_0)
$$

and

$$
z_K = f_K\circ \cdots \circ f_1(z_0).
$$

The density is computed using the change-of-variables formula:

$$
\log p_K(z_K) =
\log p_0(z_0) -
\sum_{k=1}^{K}
\log
\left|
\det
\frac{\partial f_k}{\partial z_{k-1}}
\right|.
$$

Flows are useful because they provide exact likelihoods and expressive distributions.

Their main constraint is architectural: transformations must be invertible and have tractable Jacobian determinants.
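
PyTorch exposes the change-of-variables machinery through `torch.distributions.TransformedDistribution`. A minimal, non-learnable sketch (the two fixed transforms are chosen only for illustration):

```python
import torch
from torch.distributions import Normal, TransformedDistribution
from torch.distributions.transforms import AffineTransform, SigmoidTransform

# Base distribution p_0: a standard normal
base = Normal(torch.zeros(1), torch.ones(1))

# Two invertible steps with tractable Jacobians: affine, then sigmoid
flow = TransformedDistribution(
    base, [AffineTransform(loc=0.0, scale=2.0), SigmoidTransform()]
)

x = flow.sample()
log_px = flow.log_prob(x)  # applies the change-of-variables formula internally
```

Learnable flows such as RealNVP or masked autoregressive flows parameterize the transforms with neural networks, subject to the same invertibility constraints.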

### Probabilistic Forecasting

Probabilistic modeling is also useful in time series and forecasting.

Instead of predicting a single future value, the model predicts a distribution over future values:

$$
p(y_{t+1:t+H}\mid x_{1:t}).
$$

Common output distributions include:

| Distribution | Use case |
|---|---|
| Gaussian | Continuous targets |
| Student-t | Heavy-tailed noise |
| Negative binomial | Count data |
| Quantile outputs | Prediction intervals |
| Mixture distributions | Multimodal futures |

A probabilistic forecast can answer questions such as:

- What is the expected value?
- What is the 90 percent prediction interval?
- What is the probability of exceeding a threshold?
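
A minimal sketch of this pattern, assuming a hypothetical `GaussianForecaster` with a fixed horizon and pre-computed history features (any distribution from the table above could replace the normal):

```python
from torch.distributions import Independent, Normal

class GaussianForecaster(nn.Module):
    def __init__(self, in_features, hidden_features, horizon):
        super().__init__()

        self.net = nn.Sequential(
            nn.Linear(in_features, hidden_features),
            nn.ReLU(),
        )

        self.means = nn.Linear(hidden_features, horizon)
        self.log_stds = nn.Linear(hidden_features, horizon)

    def forward(self, x):
        h = self.net(x)

        means = self.means(h)
        stds = self.log_stds(h).clamp(-10.0, 5.0).exp()

        # Treat the H future steps as one joint event so that
        # log_prob sums over the horizon dimension.
        return Independent(Normal(means, stds), 1)
```

Prediction intervals and exceedance probabilities then follow from the predicted per-step distributions.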

### Choosing a Distribution

The output distribution should match the target type.

| Target type | Common distribution |
|---|---|
| Binary value | Bernoulli |
| Class label | Categorical |
| Real value | Normal, Student-t, mixture of normals |
| Positive value | LogNormal, Gamma |
| Count | Poisson, negative binomial |
| Bounded value | Beta |
| Sequence | Autoregressive categorical or continuous likelihood |

A poor choice of distribution can produce misleading uncertainty. For example, a Gaussian likelihood handles roughly symmetric noise well but underestimates the probability of extreme values on heavy-tailed data.
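
For instance, a heavier-tailed likelihood can reuse the same mean/scale head as the Gaussian regressor (here `mean`, `std`, and `y` are assumed from that example, and the degrees of freedom are fixed only for illustration):

```python
from torch.distributions import StudentT

dist = StudentT(df=4.0, loc=mean, scale=std)
loss = -dist.log_prob(y).mean()
```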

### Training Loop Pattern

Most probabilistic models follow a common training pattern:

```python
def train_step(model, optimizer, x, y):
    model.train()

    optimizer.zero_grad()

    dist = model(x)
    loss = -dist.log_prob(y).mean()

    loss.backward()
    optimizer.step()

    return loss.item()
```

For latent variable models, the loss may include extra KL terms:

```python
loss = reconstruction_loss + kl_loss
```

For Bayesian neural networks, the loss may include a KL term between the variational posterior and the prior.

### Numerical Stability

Probabilistic models require careful numerical handling.

Use logits instead of probabilities when possible. For example:

```python
Bernoulli(logits=logits)
Categorical(logits=logits)
```

This avoids unstable calls to `log(sigmoid(x))` or `log(softmax(x))`.
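
A small float32 illustration of the difference:

```python
p = torch.sigmoid(torch.tensor([30.0]))
torch.log(1.0 - p)  # -inf: 1 - p underflows to zero in float32

# The logit form stays finite because it never materializes the probability
Bernoulli(logits=torch.tensor([30.0])).log_prob(torch.tensor([0.0]))  # ≈ -30.0
```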

Clamp log standard deviations:

```python
log_std = log_std.clamp(-10.0, 5.0)
```

This prevents standard deviations from becoming zero or exploding.

Add small epsilons when computing logs manually:

```python
torch.log(x + 1e-8)
```

Prefer library distribution methods when possible, since they often implement stable formulas internally.

### Evaluation

Probabilistic models should be evaluated with probabilistic metrics.

| Metric | Meaning |
|---|---|
| Negative log likelihood | Quality of predicted distribution |
| Calibration error | Whether probabilities match frequencies |
| Prediction interval coverage | Whether intervals contain true values |
| Sharpness | How concentrated predictions are |
| Brier score | Accuracy of probabilistic classification |
| CRPS | Probabilistic forecasting quality |

Accuracy alone is insufficient. A probabilistic model should be both accurate and calibrated.

### Summary

Practical probabilistic modeling in PyTorch usually means making neural networks output probability distributions.

Regression models can output Gaussian or mixture distributions. Classifiers define Bernoulli or categorical distributions. Latent variable models introduce hidden variables and optimize ELBO-style objectives.

The key implementation pattern is simple: the network computes distribution parameters, the distribution defines `log_prob`, and training minimizes negative log likelihood or negative ELBO.

