Probabilistic Programming

Probabilistic programming represents uncertainty using executable probabilistic models. A probabilistic program defines a distribution rather than only a deterministic computation.

Automatic differentiation is important because most modern probabilistic inference methods rely on gradients of probability densities, likelihoods, or variational objectives.

A probabilistic model typically defines:

$$x \sim p(x \mid \theta),$$

where:

| Symbol | Meaning |
|---|---|
| $x$ | latent variables |
| $\theta$ | model parameters |
| $p$ | probability distribution |

Given observations $y$, inference computes the posterior:

$$p(x,\theta \mid y).$$

AD enables efficient optimization and sampling in high-dimensional probabilistic systems.

Probabilistic Programs as Computation Graphs

A probabilistic program mixes deterministic computation with random sampling.

Example:

latent variables
    -> deterministic transforms
    -> likelihood computation
    -> log probability
    -> inference objective

The deterministic parts are ordinary differentiable programs. The probabilistic parts introduce distributions, expectations, and stochasticity.

Inference algorithms often differentiate quantities such as:

$$\log p(y \mid \theta), \qquad \log p(x,\theta,y),$$

or variational objectives.

Log Probability and Gradients

Most gradient-based inference methods work with log densities.

Suppose

$$p(x \mid y,\theta) \propto p(y \mid x,\theta)\,p(x \mid \theta).$$

Then the log posterior is

$$\log p(x \mid y,\theta) = \log p(y \mid x,\theta) + \log p(x \mid \theta) - \log Z.$$

The normalization constant $Z$ is usually ignored during gradient computation because it does not depend on $x$.

AD computes gradients such as

$$\nabla_x \log p(x \mid y,\theta).$$

These gradients drive modern inference algorithms.
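As a concrete check of this idea, consider a toy model with a unit-variance Gaussian prior and likelihood; the sketch below uses a hand-derived gradient where AD would normally stand in, and compares it against a finite-difference estimate:

```python
import math

def log_joint(x, y):
    # log p(y | x) + log p(x): unit-variance Gaussian likelihood and prior,
    # dropping constants that do not depend on x
    return -0.5 * (y - x) ** 2 - 0.5 * x ** 2

def grad_log_joint(x, y):
    # analytic gradient of the log joint with respect to x
    return (y - x) - x

# central finite-difference check at x = 0.3, y = 1.0
eps = 1e-6
x, y = 0.3, 1.0
fd = (log_joint(x + eps, y) - log_joint(x - eps, y)) / (2 * eps)
assert abs(fd - grad_log_joint(x, y)) < 1e-6
```

Because the unnormalized log joint differs from the log posterior only by $\log Z$, this gradient is exactly $\nabla_x \log p(x \mid y)$.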

Hamiltonian Monte Carlo

Hamiltonian Monte Carlo augments latent variables with momentum variables:

$$(x, p).$$

The Hamiltonian is

$$H(x,p) = U(x) + K(p),$$

where

$$U(x) = -\log p(x \mid y),$$

and

$$K(p) = \frac{1}{2} p^\top M^{-1} p.$$

The dynamics are

$$\frac{dx}{dt} = \frac{\partial H}{\partial p}, \qquad \frac{dp}{dt} = -\frac{\partial H}{\partial x}.$$

AD computes

$$\nabla_x U(x),$$

which is the gradient of the negative log posterior.

Without AD, deriving gradients for large probabilistic models would be extremely tedious.
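The leapfrog integrator that simulates these dynamics only needs $\nabla_x U(x)$. A minimal sketch for a standard normal target with identity mass matrix, with the gradient written by hand where AD would normally supply it:

```python
def grad_U(x):
    # gradient of U(x) = -log p(x | y); standard normal target, so U(x) = x^2/2
    return x

def leapfrog(x, p, step, n_steps):
    # leapfrog integration of dx/dt = p, dp/dt = -dU/dx:
    # half kick, alternating drifts and kicks, final half kick
    p = p - 0.5 * step * grad_U(x)
    for _ in range(n_steps - 1):
        x = x + step * p
        p = p - step * grad_U(x)
    x = x + step * p
    p = p - 0.5 * step * grad_U(x)
    return x, p

def hamiltonian(x, p):
    return 0.5 * x ** 2 + 0.5 * p ** 2  # U(x) + K(p) with M = I

x0, p0 = 1.0, 0.5
x1, p1 = leapfrog(x0, p0, step=0.1, n_steps=20)
# the leapfrog integrator nearly conserves H, which keeps acceptance high
assert abs(hamiltonian(x1, p1) - hamiltonian(x0, p0)) < 1e-2
```

In a full HMC implementation, the trajectory endpoint is accepted or rejected with a Metropolis correction based on the change in $H$.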

Variational Inference

Variational inference approximates the posterior using a simpler distribution

$$q_\phi(x).$$

The objective is usually the evidence lower bound:

$$\mathcal{L}(\phi) = \mathbb{E}_{q_\phi(x)}\left[\log p(x,y) - \log q_\phi(x)\right].$$

The goal is

$$\max_\phi \mathcal{L}(\phi).$$

AD computes gradients of this objective with respect to variational parameters.

Reparameterization Trick

Direct differentiation through random sampling is difficult.

Suppose

$$x \sim q_\phi(x).$$

Instead of sampling $x$ directly, we write

$$x = g(\epsilon, \phi), \qquad \epsilon \sim p(\epsilon),$$

where the randomness is separated from the parameters.

For a Gaussian:

$$x = \mu + \sigma \epsilon, \qquad \epsilon \sim \mathcal{N}(0,1).$$

Now expectations become

$$\mathbb{E}_{q_\phi(x)}[f(x)] = \mathbb{E}_{p(\epsilon)}\left[f(g(\epsilon,\phi))\right].$$

AD can differentiate through $g$.

This produces lower-variance gradient estimators than score-function methods.
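A small Monte Carlo check of the pathwise estimator, assuming $f(x) = x^2$ and a Gaussian $q$: since $\mathbb{E}[x^2] = \mu^2 + \sigma^2$, the true gradient with respect to $\mu$ is $2\mu$.

```python
import random

random.seed(0)

def reparam_grad_mu(mu, sigma, n=100_000):
    # pathwise (reparameterization) estimate of d/dmu E[f(x)] with f(x) = x^2:
    # x = mu + sigma * eps, so d/dmu f(x) = f'(x) = 2 * x
    total = 0.0
    for _ in range(n):
        eps = random.gauss(0.0, 1.0)
        x = mu + sigma * eps
        total += 2.0 * x
    return total / n

# true gradient of E[x^2] = mu^2 + sigma^2 w.r.t. mu is 2 * mu = 3.0
g = reparam_grad_mu(mu=1.5, sigma=0.7)
assert abs(g - 3.0) < 0.05
```

The derivative `2.0 * x` is written by hand here; in a real system it is produced by AD applied to `f(g(eps, phi))`.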

Score-Function Estimators

Some distributions cannot be reparameterized easily.

In that case, we use the identity

$$\nabla_\phi \mathbb{E}_{q_\phi(x)}[f(x)] = \mathbb{E}_{q_\phi(x)}\left[f(x)\,\nabla_\phi \log q_\phi(x)\right].$$

This is the score-function estimator, also called REINFORCE.

It works for discrete variables and non-reparameterizable distributions, but it often has high variance.
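A minimal numerical check of the identity, using a Bernoulli variable with $f(z) = z$, so that $\mathbb{E}[f(z)] = \theta$ and the true gradient is 1 for every $\theta$:

```python
import random

random.seed(1)

def score_grad(theta, n=100_000):
    # REINFORCE estimate of d/dtheta E[f(z)] for z ~ Bernoulli(theta), f(z) = z;
    # the score is grad log p(z; theta) = z/theta - (1 - z)/(1 - theta)
    total = 0.0
    for _ in range(n):
        z = 1.0 if random.random() < theta else 0.0
        score = z / theta - (1.0 - z) / (1.0 - theta)
        total += z * score
    return total / n

assert abs(score_grad(0.3) - 1.0) < 0.05
```

Even in this tiny example the estimator's variance grows as $\theta$ shrinks, which is why variance-reduction baselines are common in practice.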

| Method | Strength | Weakness |
|---|---|---|
| Reparameterization | Low variance | Requires differentiable transform |
| Score-function estimator | General | High variance |
| Hybrid methods | Flexible | More complex |

AD computes the deterministic derivatives inside these estimators.

Stochastic Computation Graphs

Probabilistic programs are stochastic computation graphs.

Nodes may represent:

| Node type | Meaning |
|---|---|
| Deterministic node | Ordinary differentiable computation |
| Random node | Sampling operation |
| Observation node | Conditioning on data |
| Objective node | Log probability or reward |

Gradient propagation depends on node type.

Deterministic paths use ordinary chain rule differentiation. Random paths require estimator identities.

Discrete Random Variables

Discrete latent variables create differentiation problems because sampling is discontinuous.

Example:

$$z \sim \operatorname{Categorical}(\pi).$$

The sampled value changes discontinuously as $\pi$ changes.

Approaches include:

| Method | Idea |
|---|---|
| Score-function estimator | Differentiate probability, not sample |
| Gumbel-softmax | Continuous relaxation |
| Marginalization | Sum exactly over states |
| Straight-through estimator | Approximate backward pass |

The Gumbel-softmax relaxation replaces hard categories with differentiable soft assignments:

$$y_i = \frac{\exp\left((\log \pi_i + g_i)/\tau\right)}{\sum_j \exp\left((\log \pi_j + g_j)/\tau\right)}.$$

As the temperature $\tau$ decreases, the soft assignments approach one-hot vectors, trading gradient bias for fidelity to the discrete distribution.
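A minimal sketch of drawing a Gumbel-softmax sample, using the standard $-\log(-\log U)$ construction for Gumbel noise:

```python
import math, random

random.seed(2)

def gumbel_softmax(log_pi, tau):
    # relaxed categorical sample: softmax of (log_pi + Gumbel noise) / tau
    g = [-math.log(-math.log(random.random())) for _ in log_pi]
    scores = [(lp + gi) / tau for lp, gi in zip(log_pi, g)]
    m = max(scores)                      # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

log_pi = [math.log(p) for p in (0.2, 0.5, 0.3)]
soft = gumbel_softmax(log_pi, tau=1.0)   # smooth, all entries nonzero
hard = gumbel_softmax(log_pi, tau=0.01)  # typically nearly one-hot
assert abs(sum(soft) - 1.0) < 1e-9
```

At $\tau \to 0$ the argmax of the perturbed logits recovers an exact categorical sample (the Gumbel-max trick), but the gradient signal vanishes.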

Probabilistic Graphical Models

Many probabilistic systems are structured graphical models.

Examples include:

  • Bayesian networks,
  • hidden Markov models,
  • factor graphs,
  • state-space models,
  • probabilistic circuits.

Inference often reduces to repeated local message computations.

AD is useful for:

  • learning parameters,
  • differentiable message passing,
  • variational optimization,
  • differentiable filtering and smoothing.

State-Space Models

A probabilistic state-space model evolves latent states:

$$x_{t+1} \sim p(x_{t+1} \mid x_t), \qquad y_t \sim p(y_t \mid x_t).$$

Inference computes posterior trajectories:

$$p(x_{0:T} \mid y_{0:T}).$$

Differentiable filtering and smoothing algorithms allow gradient-based learning of:

  • transition dynamics,
  • observation models,
  • noise covariances,
  • control parameters.

Kalman filters are differentiable under standard linear algebra assumptions. Particle filters are more difficult because resampling is discrete.

Particle Filters

Particle filters approximate distributions using weighted samples.

A resampling step selects particles according to weights:

$$w_i \propto p(y_t \mid x_i).$$

Resampling is discrete and therefore non-differentiable in the usual sense.

Differentiable particle filtering methods use:

| Method | Idea |
|---|---|
| Soft resampling | Smooth weight redistribution |
| Relaxed categorical sampling | Continuous approximation |
| Gradient estimators | Differentiate expectation |
| Deterministic transport | Replace stochastic resampling |

This remains an active research area.
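As one illustration, soft resampling draws indices from a mixture of the particle weights and a uniform distribution, then corrects with importance weights; the corrected weights retain a smooth dependence on the original ones. A sketch under these assumptions:

```python
import random

random.seed(3)

def soft_resample(particles, weights, alpha=0.5):
    # sample indices from the mixture q_i = alpha * w_i + (1 - alpha) / N,
    # then reweight by w_i / q_i; for alpha < 1 the new weights depend
    # smoothly on the original weights, unlike hard multinomial resampling
    n = len(particles)
    q = [alpha * w + (1.0 - alpha) / n for w in weights]
    idx = random.choices(range(n), weights=q, k=n)
    new_particles = [particles[i] for i in idx]
    new_weights = [weights[i] / q[i] for i in idx]
    total = sum(new_weights)
    return new_particles, [w / total for w in new_weights]

parts, ws = soft_resample([0.0, 1.0, 2.0, 3.0], [0.1, 0.2, 0.3, 0.4])
assert len(parts) == 4 and abs(sum(ws) - 1.0) < 1e-9
```

The `alpha` parameter trades off variance (alpha near 1 behaves like standard resampling) against gradient quality.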

Bayesian Neural Networks

A Bayesian neural network places distributions over weights:

$$w \sim p(w).$$

Predictions marginalize over weights:

$$p(y \mid x) = \int p(y \mid x, w)\,p(w)\,dw.$$

Variational inference approximates the posterior over weights.

AD computes gradients through:

  • neural network computations,
  • variational objectives,
  • stochastic parameter samples.

This combines probabilistic inference with deep learning infrastructure.

Normalizing Flows

A normalizing flow transforms a simple distribution into a complex one using invertible mappings.

Suppose

$$z_0 \sim p_0(z),$$

and

$$z_K = f_K \circ \cdots \circ f_1(z_0).$$

The density becomes

$$\log p(z_K) = \log p_0(z_0) - \sum_k \log\left|\det \frac{\partial f_k}{\partial z_{k-1}}\right|.$$

AD computes:

  • Jacobian determinants,
  • parameter gradients,
  • inverse transform sensitivities.

Flows are fundamentally differentiable probabilistic systems.
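A single affine layer $x = a z + b$ already illustrates the change-of-variables formula; the sketch below checks that the flow density matches the known $\mathcal{N}(b, a^2)$ log density:

```python
import math

def base_logpdf(z):
    # standard normal log density
    return -0.5 * z * z - 0.5 * math.log(2 * math.pi)

def affine_flow_logpdf(x, a, b):
    # one affine flow x = a*z + b; change of variables gives
    # log p(x) = log p0(z) - log|det df/dz| with df/dz = a
    z = (x - b) / a
    return base_logpdf(z) - math.log(abs(a))

# pushing N(0,1) through x = a*z + b yields N(b, a^2)
x, a, b = 1.3, 2.0, 0.5
expected = -0.5 * ((x - b) / a) ** 2 - math.log(a) - 0.5 * math.log(2 * math.pi)
assert abs(affine_flow_logpdf(x, a, b) - expected) < 1e-12
```

For deep flows the same recipe applies layer by layer, with AD supplying the Jacobian terms that are not available in closed form.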

Probabilistic Programs with Control Flow

Probabilistic programs may contain loops, branching, and recursion.

Example:

sample latent state
if condition:
    sample another variable
else:
    apply deterministic transform
accumulate log probability

Control flow creates challenges:

| Issue | Consequence |
|---|---|
| Dynamic graph structure | Variable execution paths |
| Discrete branching | Piecewise derivatives |
| Variable-dimensional latent space | Complex gradient semantics |
| Recursive stochastic structure | Dynamic memory requirements |

Modern probabilistic programming systems often use tracing or graph capture to represent execution.

Log-Density Stability

Probabilistic inference is numerically sensitive.

Probabilities may underflow:

$$p(x) \approx 10^{-300}.$$

Therefore computations are usually performed in log space.

AD systems must preserve stable numerical forms such as:

$$\log \sum_i e^{x_i}.$$

Direct exponentiation and summation can produce catastrophic overflow or underflow.

Stable primitives should expose stable derivatives.
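The standard shift-by-the-maximum implementation of this primitive, as a sketch:

```python
import math

def logsumexp(xs):
    # stable log(sum(exp(x_i))): subtract the max so every exponent
    # is <= 0, avoiding overflow; underflow of small terms is benign
    m = max(xs)
    if math.isinf(m):
        return m
    return m + math.log(sum(math.exp(x - m) for x in xs))

# naive exp(-1000) underflows to 0.0 and log(0) fails,
# but the shifted form is exact up to rounding
assert abs(logsumexp([-1000.0, -1000.0]) - (-1000.0 + math.log(2.0))) < 1e-9
```

The same shift appears in the derivative: the gradient of `logsumexp` is the softmax of its inputs, which should also be computed in shifted form.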

Automatic Differentiation Variational Inference

Automatic differentiation variational inference, or ADVI, treats inference generically:

  1. transform latent variables to unconstrained space,
  2. define variational distribution,
  3. estimate ELBO gradients,
  4. optimize with stochastic gradient methods.

The important idea is that inference becomes a differentiable optimization problem.

The probabilistic model no longer needs manually derived updates.
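A toy version of this recipe, fitting $q = \mathcal{N}(m, s^2)$ to a $\mathcal{N}(2, 1)$ target by stochastic gradient ascent on the ELBO, with $s$ parameterized as $\exp(\text{log\_s})$ so the optimization is unconstrained and with reparameterized gradients derived by hand where a real system would use AD:

```python
import math, random

random.seed(4)

m, log_s = 0.0, 0.0          # variational parameters, unconstrained
lr, batch = 0.05, 32
for _ in range(2000):
    gm, gs = 0.0, 0.0
    for _ in range(batch):
        eps = random.gauss(0.0, 1.0)
        s = math.exp(log_s)
        x = m + s * eps                # reparameterized sample from q
        dlogp = -(x - 2.0)             # d/dx log N(x; 2, 1)
        gm += dlogp                    # pathwise gradient w.r.t. m
        gs += dlogp * s * eps + 1.0    # pathwise term + entropy grad w.r.t. log_s
    m += lr * gm / batch               # gradient ascent on the ELBO
    log_s += lr * gs / batch

# the optimal q matches the target: m -> 2, s -> 1
assert abs(m - 2.0) < 0.3 and abs(math.exp(log_s) - 1.0) < 0.3
```

Nothing in the loop is specific to this model: swapping in a different `dlogp` (supplied by AD) reuses the same inference machinery.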

Constraints and Transformations

Latent variables may have constraints:

| Variable type | Constraint |
|---|---|
| Variance | positive |
| Probability vector | sums to one |
| Covariance matrix | positive definite |
| Correlation | bounded |

Inference systems transform constrained variables into unconstrained coordinates.

Examples:

$$x = \exp(z), \qquad p_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}.$$

AD differentiates through these transforms automatically.
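Minimal sketches of the two transforms, an exponential map for positive variables and a softmax for probability vectors:

```python
import math

def to_positive(z):
    # unconstrained real -> positive value, e.g. a variance parameter
    return math.exp(z)

def to_simplex(zs):
    # unconstrained reals -> probability vector via a stable softmax
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

assert to_positive(-3.0) > 0.0
p = to_simplex([0.1, -2.0, 1.5])
assert abs(sum(p) - 1.0) < 1e-12 and all(pi > 0 for pi in p)
```

When these transforms appear inside a log density, the change of variables also contributes a log-Jacobian term, which AD and the inference system handle together.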

Sparse and Structured Probabilistic Systems

Large probabilistic systems often have sparse dependence graphs.

Examples:

  • sparse Gaussian processes,
  • factor graphs,
  • probabilistic graphical models,
  • structured variational families.

Efficient differentiation exploits this structure.

Dense Jacobian construction is usually infeasible.

Failure Modes

Differentiable probabilistic systems fail in characteristic ways.

| Failure mode | Cause |
|---|---|
| Exploding gradients | Sharp posterior geometry |
| Vanishing gradients | Saturated likelihoods |
| High-variance estimators | Stochastic gradient noise |
| Unstable ELBO optimization | Poor variational family |
| NaNs in log probability | Underflow or invalid parameters |
| Broken inference | Discrete non-differentiable operations |
| Memory explosion | Reverse mode through long stochastic traces |

Gradient correctness alone does not guarantee good inference behavior.

Practical Architecture

A robust differentiable probabilistic system typically separates:

| Layer | Responsibility |
|---|---|
| Random variable representation | Sampling and constraints |
| Deterministic transforms | Differentiable computation |
| Log-density accumulation | Stable probability evaluation |
| Inference objective | ELBO, posterior log density, likelihood |
| Gradient estimator | Reparameterization or score-function |
| Optimizer or sampler | Variational optimization or MCMC |

Each layer needs carefully defined derivative semantics.

Summary

Probabilistic programming turns inference into a differentiable computation problem. Automatic differentiation supplies gradients for posterior densities, variational objectives, stochastic simulators, and probabilistic transformations.

The main challenges are stochastic sampling, discrete variables, estimator variance, dynamic control flow, and numerical stability in probability computations. Effective systems combine AD with reparameterization, stable log-density algebra, structured inference algorithms, and carefully designed gradient estimators.