Summary and Further Reading

Probabilistic deep learning extends neural networks with explicit probability models. Instead of producing only point estimates, a probabilistic model represents uncertainty, likelihood, latent structure, or a posterior distribution over parameters.

This chapter covered five core ideas.

First, Bayesian neural networks treat weights as random variables. A prior describes plausible parameters before observing data. The posterior updates this belief after seeing data. Prediction averages over plausible networks rather than relying on one fixed parameter setting.

Second, variational inference turns posterior inference into optimization. It introduces an approximate posterior and fits it by maximizing the evidence lower bound. This makes Bayesian learning usable with neural networks, although the approximation can be crude.
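
To make the ELBO concrete, the sketch below uses a conjugate toy model in which every term is available in closed form: Gaussian observations with unknown mean, a Gaussian prior, and a Gaussian variational family. All names (`elbo`, `sigma`, `tau`) are local to this example, and the data are synthetic. Because the variational family contains the true posterior here, the ELBO is maximized exactly at the analytic posterior.

```python
import numpy as np

# Conjugate toy model where every ELBO term is closed form:
# y_i ~ N(theta, sigma^2), prior theta ~ N(0, tau^2),
# variational family q(theta) = N(m, s^2).
rng = np.random.default_rng(0)
sigma, tau = 1.0, 2.0
y = rng.normal(1.5, sigma, size=20)

def elbo(m, s):
    # E_q[log p(D | theta)]: Gaussian log likelihood averaged over q
    exp_loglik = np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                        - ((y - m) ** 2 + s**2) / (2 * sigma**2))
    # KL(N(m, s^2) || N(0, tau^2)) in closed form
    kl = np.log(tau / s) + (s**2 + m**2) / (2 * tau**2) - 0.5
    return exp_loglik - kl

# Exact posterior for this conjugate model
post_var = 1.0 / (len(y) / sigma**2 + 1.0 / tau**2)
post_mean = post_var * y.sum() / sigma**2
# Since the family contains the true posterior, the ELBO peaks exactly
# there; any other (m, s) gives a strictly smaller value.
```

In realistic models the expected log likelihood has no closed form and is estimated with samples from q, which is where the approximation can become crude.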

Third, Monte Carlo methods approximate expectations using samples. They appear in posterior prediction, variational inference, dropout uncertainty, latent variable models, diffusion models, and reinforcement learning.
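
A minimal sketch of the idea in NumPy, with a toy integrand chosen here because its expectation is known exactly:

```python
import numpy as np

# Monte Carlo estimate of E[f(theta)] for theta ~ N(0, 1),
# with f(theta) = theta^2, whose true expectation is exactly 1.
rng = np.random.default_rng(0)

def mc_estimate(n_samples):
    theta = rng.normal(size=n_samples)   # draw theta_s from p(theta)
    return np.mean(theta ** 2)           # (1/S) * sum of f(theta_s)

small = mc_estimate(100)
large = mc_estimate(1_000_000)
# The error shrinks at rate O(1/sqrt(S)), independent of dimension,
# which is why the same recipe reappears across so many methods.
```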

Fourth, uncertainty estimation separates prediction from confidence. Aleatoric uncertainty comes from irreducible data noise. Epistemic uncertainty comes from lack of knowledge. Ensembles, Bayesian methods, dropout sampling, probabilistic heads, calibration, and conformal prediction are common tools.
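
One common recipe for separating the two, sketched here with made-up outputs from a four-member ensemble, is the law of total variance: average the members' predicted variances for the aleatoric part, and take the variance of their predicted means for the epistemic part.

```python
import numpy as np

# Law-of-total-variance decomposition for a single input, using
# illustrative per-member predictions (not real model outputs).
mu = np.array([2.1, 1.9, 2.4, 2.0])    # each member's predicted mean
var = np.array([0.5, 0.6, 0.4, 0.5])   # each member's predicted variance

aleatoric = var.mean()     # noise the members agree is in the data
epistemic = mu.var()       # disagreement between members
total = aleatoric + epistemic
```

More data shrinks the epistemic term as the members converge, while the aleatoric term reflects noise that no amount of data removes.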

Fifth, Gaussian processes provide a distribution over functions. They give exact Bayesian regression in small settings, strong uncertainty estimates, and a theoretical bridge between kernel methods and infinitely wide neural networks.

Core Equations

Bayesian posterior:

p(\theta \mid D) = \frac{p(D \mid \theta)\,p(\theta)}{p(D)}.

Posterior predictive distribution:

p(y^\star \mid x^\star, D) = \int p(y^\star \mid x^\star, \theta)\,p(\theta \mid D)\,d\theta.

Monte Carlo approximation:

p(y^\star \mid x^\star, D) \approx \frac{1}{S} \sum_{s=1}^{S} p(y^\star \mid x^\star, \theta_s).

Variational ELBO:

\mathcal{L}(\phi) = \mathbb{E}_{q_\phi(\theta)}[\log p(D \mid \theta)] - \mathrm{KL}(q_\phi(\theta) \,\|\, p(\theta)).

Gaussian process prior:

f \sim \mathcal{GP}(m(x), k(x, x')).

Gaussian process predictive mean:

\mu_\star = k_\star^\top (K + \sigma_n^2 I)^{-1} y.

Gaussian process predictive variance:

\sigma_\star^2 = k(x_\star, x_\star) - k_\star^\top (K + \sigma_n^2 I)^{-1} k_\star.
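
The two predictive equations translate directly into code. A small-data sketch in NumPy, assuming an RBF kernel with fixed hyperparameters (nothing is fitted here):

```python
import numpy as np

# Direct implementation of the GP predictive mean and variance above,
# assuming an RBF kernel with a fixed lengthscale and fixed noise.
def rbf(a, b, lengthscale=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 20)
y = np.sin(x) + 0.1 * rng.normal(size=x.size)
x_star = np.array([0.0, 10.0])      # one test point near data, one far away
sigma_n2 = 0.01                     # observation noise variance

K = rbf(x, x)                       # K
k_star = rbf(x, x_star)             # k_star, one column per test point
C = K + sigma_n2 * np.eye(x.size)   # K + sigma_n^2 I

alpha = np.linalg.solve(C, y)
mu_star = k_star.T @ alpha          # predictive mean
v = np.linalg.solve(C, k_star)
var_star = rbf(x_star, x_star).diagonal() - np.sum(k_star * v, axis=0)
```

Far from the data, the mean reverts to the prior mean and the variance approaches the prior variance, which is exactly the behavior the equations encode.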

Practical Patterns

Most probabilistic PyTorch models follow one of these patterns.

| Pattern | Network output | Loss |
| --- | --- | --- |
| Gaussian regression | Mean and variance | Gaussian negative log likelihood |
| Classification | Logits | Cross-entropy or categorical NLL |
| Mixture density network | Mixture weights, means, variances | Mixture negative log likelihood |
| VAE | Latent posterior and decoder likelihood | Negative ELBO |
| Bayesian neural network | Weight posterior parameters | Negative ELBO |
| Ensemble | Multiple model predictions | Averaged likelihood or probability |
| MC dropout | Stochastic forward passes | Averaged prediction at inference |

The common implementation idea is direct: compute distribution parameters with the network, construct a probability distribution from them, evaluate log_prob on the observed targets, and minimize the negative log probability.
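
A minimal sketch of this pattern with torch.distributions; the GaussianHead module and the random data are made up for illustration, and no training loop is shown.

```python
import torch
from torch import nn

# Pattern sketch: the head outputs distribution parameters, a
# torch.distributions object evaluates log_prob, and the loss is the
# negative log probability of the observed targets.
class GaussianHead(nn.Module):
    def __init__(self, in_features):
        super().__init__()
        self.mean = nn.Linear(in_features, 1)
        self.log_std = nn.Linear(in_features, 1)  # log std keeps scale positive

    def forward(self, x):
        return torch.distributions.Normal(
            loc=self.mean(x), scale=self.log_std(x).exp()
        )

torch.manual_seed(0)
model = GaussianHead(in_features=4)
x, y = torch.randn(8, 4), torch.randn(8, 1)

dist = model(x)                   # construct the distribution
loss = -dist.log_prob(y).mean()   # negative log likelihood
loss.backward()                   # trainable like any other loss
```

Swapping the distribution (Categorical, StudentT, a mixture) changes the head and the log_prob call but not the overall recipe.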

When to Use Probabilistic Deep Learning

Use probabilistic methods when the output distribution matters, not only the best prediction.

They are appropriate when:

| Situation | Useful method |
| --- | --- |
| Noisy regression targets | Gaussian or Student-t output head |
| Multimodal targets | Mixture density network |
| Limited data | Bayesian neural network or Gaussian process |
| Need calibrated probabilities | Temperature scaling, ensembles, Bayesian methods |
| Need prediction intervals | Probabilistic regression or conformal prediction |
| Need latent representations | Variational autoencoder |
| Need sample generation | VAE, flow, diffusion, autoregressive model |
| Expensive experiments | Gaussian process Bayesian optimization |
| Safety-critical deployment | Ensembles, uncertainty thresholds, conformal prediction |
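
As one example from the table, split conformal prediction needs only held-out residuals from any point predictor. A sketch with a synthetic dataset and a stand-in model (in practice `predict` would be a trained network):

```python
import numpy as np

# Split conformal sketch: calibrate a symmetric interval width from
# held-out absolute residuals of an arbitrary point predictor.
rng = np.random.default_rng(0)
predict = lambda x: 2.0 * x                   # stand-in for a trained model
x_cal = rng.uniform(0, 1, 500)
y_cal = 2.0 * x_cal + rng.normal(0, 0.3, 500)

scores = np.abs(y_cal - predict(x_cal))       # nonconformity scores
alpha = 0.1                                   # target 90% coverage
n = len(scores)
q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n)

x_new = 0.5
interval = (predict(x_new) - q, predict(x_new) + q)
# Under exchangeability, this interval covers a new target with
# probability at least 1 - alpha, regardless of model quality.
```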

For many production systems, deep ensembles plus calibration provide a strong baseline. Full Bayesian neural networks are conceptually clean but can be harder to scale and tune.

Common Failure Modes

Probabilistic outputs can look precise while being poorly calibrated. A model can produce a variance, probability, or interval that does not match empirical reality.

Common problems include:

| Failure mode | Description |
| --- | --- |
| Overconfident softmax | Classifier assigns high probability to wrong outputs |
| Underestimated variance | Regression intervals are too narrow |
| Poor posterior approximation | Variational family misses important uncertainty |
| Bad prior choice | Prior conflicts with the task or data scale |
| Distribution mismatch | Gaussian likelihood used for heavy-tailed targets |
| OOD overconfidence | Model is confident far from training data |
| Sample inefficiency | Monte Carlo estimate has high variance |
| Excessive serving cost | Ensembles or MC sampling are too expensive |

Uncertainty estimates must be validated empirically. Calibration plots, negative log likelihood, coverage tests, and out-of-distribution benchmarks are more informative than accuracy alone.
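
A sketch of one such check, expected calibration error, on synthetic predictions; the ten-bin scheme and the toy data are illustrative choices, not a standard library API.

```python
import numpy as np

# Expected calibration error: bin predictions by confidence and compare
# each bin's average confidence with its empirical accuracy.
def ece(confidences, correct, n_bins=10):
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            total += mask.mean() * gap     # weight gap by bin population
    return total

# Perfectly calibrated toy predictions: accuracy matches confidence.
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, 10_000)
correct = (rng.uniform(size=10_000) < conf).astype(float)
```

For calibrated predictions the value is near zero; replacing `correct` with all-wrong outcomes drives it toward the average confidence.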

Recommended Further Reading

For Bayesian modeling and probabilistic machine learning, read Kevin Murphy’s Probabilistic Machine Learning: An Introduction and Probabilistic Machine Learning: Advanced Topics.

For Gaussian processes, read Rasmussen and Williams, Gaussian Processes for Machine Learning.

For variational inference, read Blei, Kucukelbir, and McAuliffe, “Variational Inference: A Review for Statisticians.”

For Bayesian deep learning, read work on Bayes by Backprop, Monte Carlo dropout, deep ensembles, Laplace approximations, and stochastic gradient MCMC.

For practical PyTorch modeling, study torch.distributions, Pyro, NumPyro, GPyTorch, and BoTorch.

Exercises

  1. Implement a Gaussian regression model that predicts both mean and variance. Compare mean squared error with Gaussian negative log likelihood.

  2. Train a classifier and measure its expected calibration error. Then apply temperature scaling on a validation set.

  3. Implement Monte Carlo dropout for a small image classifier. Compare predictive entropy on in-distribution and out-of-distribution examples.

  4. Train an ensemble of three neural networks. Measure disagreement across ensemble members.

  5. Implement a mixture density network on a synthetic dataset where one input can map to two possible outputs.

  6. Train a small variational autoencoder. Plot samples from the prior and reconstructions from the encoder.

  7. Use GPyTorch to fit a Gaussian process regression model on a one-dimensional dataset. Plot predictive mean and uncertainty intervals.

  8. Compare a Gaussian process and a neural network on a small-data regression problem.

  9. For a Bayesian linear layer, inspect how increasing the KL penalty affects posterior variance.

  10. Evaluate prediction interval coverage for a probabilistic regression model. Compare nominal coverage with empirical coverage.