# Summary and Further Reading

Probabilistic deep learning extends neural networks with explicit probability models. Instead of producing only point estimates, a probabilistic model represents uncertainty explicitly, whether as a predictive likelihood over outputs, a distribution over latent structure, or a posterior distribution over parameters.

This chapter covered five core ideas.

First, Bayesian neural networks treat weights as random variables. A prior describes plausible parameters before observing data. The posterior updates this belief after seeing data. Prediction averages over plausible networks rather than relying on one fixed parameter setting.

Second, variational inference turns posterior inference into optimization. It introduces an approximate posterior and fits it by maximizing the evidence lower bound. This makes Bayesian learning usable with neural networks, although the approximation can be crude.

Third, Monte Carlo methods approximate expectations using samples. They appear in posterior prediction, variational inference, dropout uncertainty, latent variable models, diffusion models, and reinforcement learning.

Fourth, uncertainty estimation separates prediction from confidence. Aleatoric uncertainty comes from irreducible noise in the data. Epistemic uncertainty comes from limited knowledge of the model and can be reduced with more data. Ensembles, Bayesian methods, dropout sampling, probabilistic output heads, calibration, and conformal prediction are the common tools.

Fifth, Gaussian processes provide a distribution over functions. They give exact Bayesian regression in small settings, strong uncertainty estimates, and a theoretical bridge between kernel methods and infinitely wide neural networks.

### Core Equations

Bayesian posterior:

$$
p(\theta \mid D) =
\frac{p(D \mid \theta)p(\theta)}{p(D)}.
$$

Posterior predictive distribution:

$$
p(y^\star \mid x^\star,D) =
\int p(y^\star \mid x^\star,\theta)p(\theta\mid D)\,d\theta.
$$

Monte Carlo approximation:

$$
p(y^\star \mid x^\star,D)
\approx
\frac{1}{S}
\sum_{s=1}^{S}
p(y^\star \mid x^\star,\theta_s).
$$
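
As a minimal sketch of this estimator, assuming some mechanism for drawing posterior weight samples (the `model` callable and `weight_samples` list below are placeholders for whatever sampler is used, for example a variational posterior, stochastic gradient MCMC, or dropout masks), the predictive probabilities are averaged over samples:

```python
import torch

def mc_predictive(model, weight_samples, x_star):
    """Average class probabilities over S posterior weight samples.

    `model(x, theta)` and `weight_samples` are hypothetical placeholders
    for the network and its sampled parameters.
    """
    probs = []
    for theta_s in weight_samples:
        logits = model(x_star, theta_s)            # forward pass with theta_s
        probs.append(torch.softmax(logits, dim=-1))
    return torch.stack(probs).mean(dim=0)          # (1/S) * sum_s p(y* | x*, theta_s)
```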

Variational ELBO:

$$
\mathcal{L}(\phi) =
\mathbb{E}_{q_\phi(\theta)}
[
\log p(D\mid\theta)
] -
\mathrm{KL}(q_\phi(\theta)\|p(\theta)).
$$
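
One way to make this concrete is a single-sample, reparameterized estimate of the ELBO for a mean-field Gaussian posterior. The sketch below assumes a hypothetical `log_likelihood_fn(theta)` that returns $\log p(D \mid \theta)$ and a standard normal prior; it is an illustration, not a full Bayes by Backprop implementation.

```python
import torch
import torch.distributions as dist

def elbo(mu, log_sigma, log_likelihood_fn, prior_std=1.0):
    """One-sample ELBO estimate for a factorized Gaussian q(theta).

    `log_likelihood_fn` is a placeholder for the model's data log likelihood.
    """
    q = dist.Normal(mu, log_sigma.exp())
    prior = dist.Normal(torch.zeros_like(mu), prior_std)
    theta = q.rsample()                          # reparameterized sample from q
    expected_ll = log_likelihood_fn(theta)       # one-sample estimate of E_q[log p(D | theta)]
    kl = dist.kl_divergence(q, prior).sum()      # KL(q || p), analytic for two Gaussians
    return expected_ll - kl
```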

Gaussian process prior:

$$
f \sim \mathcal{GP}(m(x),k(x,x')).
$$

Gaussian process predictive mean:

$$
\mu_\star =
k_\star^\top(K+\sigma_n^2I)^{-1}y.
$$

Gaussian process predictive variance:

$$
\sigma_\star^2 =
k(x_\star,x_\star) -
k_\star^\top(K+\sigma_n^2I)^{-1}k_\star.
$$
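
These two formulas translate directly into code. The sketch below assumes an RBF kernel and an illustrative noise variance; a practical implementation would use a Cholesky factorization and learned hyperparameters, as GPyTorch does.

```python
import torch

def rbf_kernel(a, b, lengthscale=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 * lengthscale^2))
    d2 = torch.cdist(a, b).pow(2)
    return torch.exp(-0.5 * d2 / lengthscale**2)

def gp_predict(X, y, x_star, noise_var=0.1, lengthscale=1.0):
    """Exact GP regression posterior at a single test point x_star (shape (1, d))."""
    K = rbf_kernel(X, X, lengthscale)                  # (n, n)
    k_star = rbf_kernel(X, x_star, lengthscale)        # (n, 1)
    A = K + noise_var * torch.eye(len(X))              # K + sigma_n^2 I
    alpha = torch.linalg.solve(A, y.unsqueeze(-1))     # (K + sigma_n^2 I)^{-1} y
    mean = (k_star.T @ alpha).squeeze()                # predictive mean mu_*
    v = torch.linalg.solve(A, k_star)                  # (K + sigma_n^2 I)^{-1} k_*
    var = rbf_kernel(x_star, x_star, lengthscale).squeeze() - (k_star.T @ v).squeeze()
    return mean, var                                   # predictive mean and variance
```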

### Practical Patterns

Most probabilistic PyTorch models follow one of these patterns.

| Pattern | Network output | Training loss or inference rule |
|---|---|---|
| Gaussian regression | Mean and variance | Gaussian negative log likelihood |
| Classification | Logits | Cross-entropy (categorical negative log likelihood) |
| Mixture density network | Mixture weights, means, variances | Mixture negative log likelihood |
| VAE | Latent posterior and decoder likelihood | Negative ELBO |
| Bayesian neural network | Weight posterior parameters | Negative ELBO |
| Ensemble | Predictions from several independently trained models | Standard loss per member; predictions averaged at inference |
| MC dropout | Stochastic forward passes with dropout active | Standard loss; predictions averaged at inference |

The common implementation idea is direct: compute distribution parameters with the network, construct a `torch.distributions` object from them, evaluate `log_prob` on the targets, and minimize the negative log probability.
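
A minimal sketch of that pattern for Gaussian regression, using `torch.distributions` (the layer sizes and random batch below are placeholders):

```python
import torch
import torch.nn as nn
import torch.distributions as dist

class GaussianHead(nn.Module):
    """Predict a mean and a standard deviation, train with Gaussian NLL."""
    def __init__(self, in_features=16, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_features, hidden), nn.ReLU())
        self.mean = nn.Linear(hidden, 1)
        self.log_std = nn.Linear(hidden, 1)

    def forward(self, x):
        h = self.body(x)
        # Construct the predictive distribution from the network's outputs.
        return dist.Normal(self.mean(h), self.log_std(h).exp())

model = GaussianHead()
x, y = torch.randn(32, 16), torch.randn(32, 1)   # placeholder batch
pred = model(x)
loss = -pred.log_prob(y).mean()                  # negative log likelihood
loss.backward()
```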

### When to Use Probabilistic Deep Learning

Use probabilistic methods when the output distribution matters, not only the best prediction.

They are appropriate in situations such as these:

| Situation | Useful method |
|---|---|
| Noisy regression targets | Gaussian or Student-t output head |
| Multimodal targets | Mixture density network |
| Limited data | Bayesian neural network or Gaussian process |
| Need calibrated probabilities | Temperature scaling, ensembles, Bayesian methods |
| Need prediction intervals | Probabilistic regression or conformal prediction |
| Need latent representations | Variational autoencoder |
| Need sample generation | VAE, flow, diffusion, autoregressive model |
| Expensive experiments | Gaussian process Bayesian optimization |
| Safety-critical deployment | Ensembles, uncertainty thresholds, conformal prediction |

For many production systems, deep ensembles plus calibration provide a strong baseline. Full Bayesian neural networks are conceptually clean but can be harder to scale and tune.
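
As an illustration of the ensemble half of that baseline, prediction averages the softmax outputs of several independently trained members; `models` below is assumed to be a list of trained classifiers sharing one architecture, with the training loop omitted.

```python
import torch

def ensemble_probs(models, x):
    """Average softmax probabilities across ensemble members."""
    with torch.no_grad():
        probs = torch.stack([torch.softmax(m(x), dim=-1) for m in models])
    return probs.mean(dim=0)   # averaged predictive distribution
```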

### Common Failure Modes

Probabilistic outputs can look precise while being poorly calibrated. A model can produce a variance, probability, or interval that does not match empirical reality.

Common problems include:

| Failure mode | Description |
|---|---|
| Overconfident softmax | Classifier assigns high probability to wrong outputs |
| Underestimated variance | Regression intervals are too narrow |
| Poor posterior approximation | Variational family misses important uncertainty |
| Bad prior choice | Prior conflicts with the task or data scale |
| Distribution mismatch | Gaussian likelihood used for heavy-tailed targets |
| Out-of-distribution overconfidence | Model remains confident on inputs far from the training data |
| Sample inefficiency | Monte Carlo estimate has high variance |
| Excessive serving cost | Ensembles or MC sampling are too expensive |

Uncertainty estimates must be validated empirically. Calibration plots, negative log likelihood, coverage tests, and out-of-distribution benchmarks are more informative than accuracy alone.
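
As one example of such a check, the empirical coverage of a central prediction interval can be compared with its nominal level; the sketch below assumes a model that returns a `torch.distributions.Normal`, as in the regression pattern shown earlier.

```python
import torch

def empirical_coverage(pred_dist, y, level=0.9):
    """Fraction of targets falling inside the central `level` prediction interval."""
    alpha = (1.0 - level) / 2.0
    lo = pred_dist.icdf(torch.tensor(alpha))        # lower interval bound
    hi = pred_dist.icdf(torch.tensor(1.0 - alpha))  # upper interval bound
    inside = (y >= lo) & (y <= hi)
    return inside.float().mean().item()             # compare with the nominal `level`
```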

### Recommended Further Reading

For Bayesian modeling and probabilistic machine learning, read Kevin Murphy’s *Probabilistic Machine Learning: An Introduction* and *Probabilistic Machine Learning: Advanced Topics*.

For Gaussian processes, read Rasmussen and Williams, *Gaussian Processes for Machine Learning*.

For variational inference, read Blei, Kucukelbir, and McAuliffe, “Variational Inference: A Review for Statisticians.”

For Bayesian deep learning, read work on Bayes by Backprop, Monte Carlo dropout, deep ensembles, Laplace approximations, and stochastic gradient MCMC.

For practical PyTorch modeling, study `torch.distributions`, Pyro, NumPyro, GPyTorch, and BoTorch.

### Exercises

1. Implement a Gaussian regression model that predicts both mean and variance. Compare mean squared error with Gaussian negative log likelihood.

2. Train a classifier and measure its expected calibration error. Then apply temperature scaling on a validation set.

3. Implement Monte Carlo dropout for a small image classifier. Compare predictive entropy on in-distribution and out-of-distribution examples.

4. Train an ensemble of three neural networks. Measure disagreement across ensemble members.

5. Implement a mixture density network on a synthetic dataset where one input can map to two possible outputs.

6. Train a small variational autoencoder. Plot samples decoded from the prior and reconstructions of held-out inputs.

7. Use GPyTorch to fit a Gaussian process regression model on a one-dimensional dataset. Plot predictive mean and uncertainty intervals.

8. Compare a Gaussian process and a neural network on a small-data regression problem.

9. For a Bayesian linear layer, inspect how increasing the KL penalty affects posterior variance.

10. Evaluate prediction interval coverage for a probabilistic regression model. Compare nominal coverage with empirical coverage.

