Probabilistic Automatic Differentiation

Classical automatic differentiation computes derivatives of deterministic programs.

A probabilistic program instead describes random variables, probability distributions, and stochastic transformations. The output is not a single value but a distribution, expectation, likelihood, or sampled trajectory.

Probabilistic automatic differentiation studies how derivatives propagate through such systems.

This includes:

| Problem | Example |
| --- | --- |
| differentiating expectations | stochastic optimization |
| differentiating sampling procedures | variational inference |
| differentiating probabilistic programs | Bayesian learning |
| differentiating Monte Carlo estimators | simulation gradients |
| differentiating stochastic dynamics | diffusion models |

The central challenge is that randomness introduces discontinuities, variance, and estimator bias into the differentiation process.

Deterministic vs Stochastic Computation

A deterministic computation defines

$$y = f(x,\theta).$$

A stochastic computation introduces random variables:

$$y = f(x,\theta,\omega),$$

where

$$\omega \sim p(\omega).$$

The quantity of interest is often an expectation:

$$L(\theta) = \mathbb{E}_{\omega \sim p_\theta}[\ell(f(\theta,\omega))].$$

The derivative becomes

$$\nabla_\theta L(\theta).$$

The difficulty is that both the sampled value and the distribution itself may depend on θ.

Differentiating Expectations

Suppose

$$L(\theta) = \mathbb{E}_{x \sim p_\theta(x)}[\ell(x)].$$

Expanding the expectation:

$$L(\theta) = \int \ell(x)\,p_\theta(x)\,dx.$$

Differentiate under the integral:

$$\nabla_\theta L = \int \ell(x)\,\nabla_\theta p_\theta(x)\,dx.$$

Using

$$\nabla_\theta p_\theta(x) = p_\theta(x)\,\nabla_\theta \log p_\theta(x),$$

we obtain

$$\nabla_\theta L = \mathbb{E}_{x \sim p_\theta}[\ell(x)\,\nabla_\theta \log p_\theta(x)].$$

This is the score-function estimator.

It is also called:

| Name | Context |
| --- | --- |
| REINFORCE estimator | reinforcement learning |
| likelihood-ratio estimator | statistics |
| score-function gradient | probabilistic inference |

The estimator does not require differentiating through the sampled value itself.

Score-Function Estimator

The score-function estimator is

$$\nabla_\theta \mathbb{E}_{x \sim p_\theta}[\ell(x)] = \mathbb{E}[\ell(x)\,\nabla_\theta \log p_\theta(x)].$$

Monte Carlo approximation gives

$$\nabla_\theta L \approx \frac{1}{N} \sum_{i=1}^N \ell(x_i)\,\nabla_\theta \log p_\theta(x_i).$$

This estimator is general.

It works even when:

| Situation | Supported |
| --- | --- |
| discrete random variables | yes |
| non-differentiable samples | yes |
| black-box simulators | yes |

However, it often has very high variance.

Large variance leads to unstable optimization and slow convergence.
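As a minimal NumPy sketch (the quadratic loss ℓ(x) = x² and the Gaussian family p_θ = N(θ, 1) are illustrative assumptions, not fixed by the text), the Monte Carlo score-function estimator can be written directly:

```python
import numpy as np

def score_function_grad(theta, n=100_000, seed=0):
    """Score-function (REINFORCE) estimate of d/dtheta E_{x ~ N(theta,1)}[x^2]."""
    rng = np.random.default_rng(seed)
    x = rng.normal(theta, 1.0, size=n)   # samples from p_theta
    score = x - theta                    # grad_theta log N(x | theta, 1)
    return np.mean(x**2 * score)         # ell(x) * score, averaged

# For ell(x) = x^2, E[x^2] = theta^2 + 1, so the true gradient is 2*theta.
est = score_function_grad(2.0)
```

Note that the loss ℓ is only evaluated, never differentiated: the gradient enters through the log-density, which is what makes the estimator applicable to black-box simulators.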

Reparameterization Trick

Suppose samples can be written as

$$x = g(\theta,\epsilon), \qquad \epsilon \sim p(\epsilon),$$

where the randomness is independent of θ.

Then

$$L(\theta) = \mathbb{E}_{\epsilon}[\ell(g(\theta,\epsilon))].$$

Now the expectation is over a fixed distribution. The derivative becomes

$$\nabla_\theta L = \mathbb{E}_{\epsilon}[\nabla_\theta \ell(g(\theta,\epsilon))].$$

This allows ordinary reverse-mode AD through the sampled computation.

This is the reparameterization estimator.

Gaussian Example

Suppose

$$x \sim \mathcal{N}(\mu,\sigma^2).$$

Reparameterize:

$$x = \mu + \sigma\epsilon, \qquad \epsilon \sim \mathcal{N}(0,1).$$

Then

$$L(\mu,\sigma) = \mathbb{E}_\epsilon[\ell(\mu+\sigma\epsilon)].$$

Gradients become

$$\nabla_\mu L = \mathbb{E}[\nabla_x \ell(x)],$$

and

$$\nabla_\sigma L = \mathbb{E}[\epsilon\,\nabla_x \ell(x)].$$

The stochasticity is isolated in ε. The remaining computation is differentiable.

This estimator usually has much lower variance than the score-function estimator.
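A small NumPy sketch makes the variance gap concrete. The objective E[x²] under x ~ N(μ, σ²) is an assumed example with known gradient 2μ with respect to μ; both estimators below target the same quantity:

```python
import numpy as np

def reparam_grad_mu(mu, sigma, n=10_000, seed=0):
    """Pathwise estimate of d/dmu E_{x ~ N(mu,sigma^2)}[x^2] via x = mu + sigma*eps."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n)   # eps ~ N(0, 1), independent of mu
    x = mu + sigma * eps           # differentiable transformation
    return np.mean(2.0 * x)        # grad_x ell(x) * dx/dmu = 2x * 1

def score_grad_mu(mu, sigma, n=10_000, seed=0):
    """Score-function estimate of the same gradient, for comparison."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu, sigma, size=n)
    return np.mean(x**2 * (x - mu) / sigma**2)

# Both estimate 2*mu; the pathwise version typically has far lower variance.
```

In an AD framework the pathwise version needs no hand-derived gradient at all: reverse mode differentiates `mu + sigma * eps` and the loss automatically.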

Pathwise Derivatives

The reparameterization trick is also called the pathwise derivative estimator.

The derivative propagates through the sampled path itself:

epsilon -> sample x -> loss

The stochastic node becomes a differentiable transformation.

This makes probabilistic programs compatible with ordinary reverse-mode AD systems.

Comparison of Gradient Estimators

| Estimator | Requires differentiable sample path | Supports discrete variables | Variance |
| --- | --- | --- | --- |
| score-function | no | yes | high |
| reparameterization | yes | usually no | lower |
| finite differences | no | yes | very high |
| implicit estimators | partial | partial | moderate |

No estimator is uniformly best.

The choice depends on distribution structure and computational constraints.

Variance Reduction

Monte Carlo gradient estimators are noisy.

Variance reduction is therefore central in probabilistic differentiation.

Baselines

Subtract a constant b:

$$\mathbb{E}[(\ell(x)-b)\,\nabla_\theta \log p_\theta(x)].$$

The estimator remains unbiased because

$$\mathbb{E}[\nabla_\theta \log p_\theta(x)] = 0.$$

A good baseline reduces variance dramatically.
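A quick NumPy check (reusing the assumed example ℓ(x) = x² with x ~ N(θ, 1)) shows that subtracting a baseline near E[ℓ(x)] leaves the mean unchanged while shrinking the variance:

```python
import numpy as np

def score_grad_stats(theta, baseline=0.0, n=2_000, seed=0):
    """Score-function estimate of d/dtheta E_{x ~ N(theta,1)}[x^2] with baseline b.
    Returns (mean, per-sample variance) of the estimator terms."""
    rng = np.random.default_rng(seed)
    x = rng.normal(theta, 1.0, size=n)
    terms = (x**2 - baseline) * (x - theta)   # (ell(x) - b) * score
    return terms.mean(), terms.var()

g_plain, v_plain = score_grad_stats(2.0, baseline=0.0)
g_base,  v_base  = score_grad_stats(2.0, baseline=5.0)  # b near E[ell(x)] = 5
# Both means estimate 2*theta = 4, but the baseline cuts the variance.
```

In practice the baseline is usually estimated online, for example as a running average of observed losses, rather than fixed ahead of time.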

Control variates

Introduce correlated auxiliary estimators with known expectation.

Antithetic sampling

Use negatively correlated samples.

Rao-Blackwellization

Integrate analytically over some variables instead of sampling them.

These methods are essential for practical stochastic gradient estimation.

Discrete Random Variables

Discrete sampling is difficult because sampled values change discontinuously.

Suppose

$$x \sim \operatorname{Categorical}(p_\theta).$$

A tiny parameter perturbation may abruptly change the sampled category.

Ordinary pathwise differentiation fails because:

$$\frac{\partial x}{\partial \theta}$$

does not exist in the classical sense.

The score-function estimator still works because it differentiates the probability distribution rather than the sampled value.

Relaxed Distributions

A common workaround replaces discrete variables with continuous approximations.

For categorical sampling, the Gumbel-Softmax trick uses:

$$y_i = \frac{\exp((\log p_i + g_i)/\tau)}{\sum_j \exp((\log p_j + g_j)/\tau)},$$

where:

$$g_i \sim \operatorname{Gumbel}(0,1).$$

As temperature τ approaches zero, the relaxed sample approaches a one-hot discrete sample.

For finite temperature, the sample remains differentiable.

This allows approximate pathwise differentiation through discrete choices.
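The relaxation above can be sketched in a few lines of NumPy; the probabilities [0.2, 0.3, 0.5] and the two temperatures are arbitrary illustrative values:

```python
import numpy as np

def gumbel_softmax(log_p, tau, rng):
    """Draw one relaxed categorical sample on the simplex (Gumbel-Softmax)."""
    g = rng.gumbel(size=log_p.shape)   # g_i ~ Gumbel(0, 1)
    z = (log_p + g) / tau
    z = z - z.max()                    # stabilize the softmax numerically
    y = np.exp(z)
    return y / y.sum()

rng = np.random.default_rng(0)
log_p = np.log(np.array([0.2, 0.3, 0.5]))
y_smooth = gumbel_softmax(log_p, tau=1.0, rng=rng)   # soft, differentiable
y_sharp = gumbel_softmax(log_p, tau=0.01, rng=rng)   # typically near one-hot
```

As τ → 0 the argmax of log p_i + g_i recovers exact categorical sampling (the Gumbel-max trick), so sharp samples follow the original probabilities; finite τ trades this exactness for differentiability.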

Probabilistic Computational Graphs

A probabilistic program can be represented as a graph containing both deterministic and stochastic nodes.

Example:

theta -> z ~ p(z|theta)
z -> x ~ p(x|z)
x -> loss

Differentiation propagates through:

| Node type | Gradient rule |
| --- | --- |
| deterministic node | ordinary chain rule |
| stochastic node | estimator-specific rule |

Modern probabilistic programming systems combine AD with stochastic estimators to differentiate entire probabilistic models.

Variational Inference

Variational inference optimizes an approximate distribution

$$q_\phi(z)$$

to approximate a target posterior.

The evidence lower bound (ELBO) is

$$\mathcal{L}(\phi) = \mathbb{E}_{z \sim q_\phi}[\log p(x,z) - \log q_\phi(z)].$$

Gradients require differentiating expectations over learned distributions.

Reparameterization gradients made deep variational models practical.

Variational autoencoders are a canonical example.

Variational Autoencoders

A variational autoencoder defines:

| Component | Role |
| --- | --- |
| encoder | parameterizes latent distribution |
| latent variable | sampled representation |
| decoder | reconstructs data |

The encoder predicts:

$$\mu(x), \quad \sigma(x).$$

A latent sample is drawn using reparameterization:

$$z = \mu + \sigma\epsilon.$$

The decoder computes reconstruction loss.

Reverse-mode AD then differentiates the entire stochastic pipeline.

Without reparameterization, efficient training would be much harder.

Probabilistic Programs

A probabilistic program includes random choices:

z = sample(normal(mu, sigma))
x = sample(decoder(z))
observe(data, x)

The program defines a probability distribution over execution traces.

Differentiation may involve:

| Quantity | Meaning |
| --- | --- |
| log probability | likelihood gradient |
| posterior expectation | inference objective |
| sampled trajectory | simulation sensitivity |

Probabilistic AD systems combine tracing, sampling, and reverse-mode differentiation.

Monte Carlo Differentiation

Suppose

$$L(\theta) = \mathbb{E}[f_\theta(X)].$$

Monte Carlo estimates use samples:

$$L_N(\theta) = \frac{1}{N} \sum_{i=1}^N f_\theta(X_i).$$

Differentiating gives

$$\nabla_\theta L_N = \frac{1}{N} \sum_{i=1}^N \nabla_\theta f_\theta(X_i).$$

This estimator itself becomes random.

Thus optimization uses stochastic gradients of stochastic objectives.

Understanding variance propagation becomes critical.

Stochastic Differential Equations

Probabilistic dynamics often use stochastic differential equations:

$$dz = f(z,t)\,dt + g(z,t)\,dW_t,$$

where W_t is Brownian motion.

These systems appear in:

| Domain | Example |
| --- | --- |
| diffusion models | generative modeling |
| finance | stochastic volatility |
| physics | thermal noise |
| biology | random population dynamics |

Differentiation through SDEs requires handling stochastic integrals and noise-dependent trajectories.
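The simplest way to simulate such a system is the Euler-Maruyama scheme. The NumPy sketch below discretizes the SDE on a fixed grid; the Ornstein-Uhlenbeck drift and diffusion coefficients at the bottom are an assumed example:

```python
import numpy as np

def euler_maruyama(f, g, z0, t_grid, rng):
    """Simulate one path of dz = f(z,t) dt + g(z,t) dW_t on a fixed time grid."""
    z = np.empty(len(t_grid))
    z[0] = z0
    for k in range(len(t_grid) - 1):
        dt = t_grid[k + 1] - t_grid[k]
        dW = rng.normal(0.0, np.sqrt(dt))   # Brownian increment, Var = dt
        z[k + 1] = z[k] + f(z[k], t_grid[k]) * dt + g(z[k], t_grid[k]) * dW
    return z

# Ornstein-Uhlenbeck process as an example: dz = -z dt + 0.1 dW_t
rng = np.random.default_rng(0)
t = np.linspace(0.0, 5.0, 501)
path = euler_maruyama(lambda z, t: -z, lambda z, t: 0.1, 1.0, t, rng)
```

Because each step is a differentiable function of the state plus an injected Gaussian increment, this discretization is itself a reparameterized computation: reverse-mode AD can propagate through the whole path once the noise is treated as fixed input.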

Diffusion Models

Modern generative diffusion models evolve data through noisy dynamics.

Forward diffusion adds noise:

$$dx = -\frac{1}{2}\beta(t)\,x\,dt + \sqrt{\beta(t)}\,dW_t.$$

Reverse diffusion learns to invert the stochastic process.

Training involves expectations over noisy trajectories and repeated stochastic sampling.

Probabilistic AD is fundamental to these systems.

Measure-Theoretic Issues

Differentiating probabilistic systems introduces mathematical subtleties.

Questions include:

| Question | Issue |
| --- | --- |
| Can the derivative move inside the expectation? | dominated convergence |
| Does a density exist? | measure regularity |
| Is the estimator unbiased? | interchange of limits |
| Does the variance exist? | integrability |

Many practical estimators rely on assumptions that may fail in heavy-tailed or discontinuous systems.

Stochastic Control Flow

Programs with random branching are especially difficult.

Example:

if sample(bernoulli(p)):
    y = f1(theta)
else:
    y = f2(theta)

The execution trace itself becomes stochastic.

The derivative depends on both:

| Component | Effect |
| --- | --- |
| branch probability | score-function term |
| branch computation | pathwise term |

Hybrid estimators are often required.
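A hybrid estimator for a branching program like the one above can be sketched in NumPy. The concrete choices f1(θ) = θ², f2(θ) = θ, and branch probability p = sigmoid(θ) are assumptions for illustration; each sample combines a pathwise term for the taken branch with a score-function term for the branch probability:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def hybrid_branch_grad(theta, n=200_000, seed=0):
    """Hybrid estimate of d/dtheta E[y] for a randomly branching program:
    y = theta**2 if Bernoulli(sigmoid(theta)) fires, else y = theta."""
    rng = np.random.default_rng(seed)
    p = sigmoid(theta)
    branch = rng.random(n) < p                   # stochastic control flow
    y = np.where(branch, theta**2, theta)        # value of the taken branch
    pathwise = np.where(branch, 2 * theta, 1.0)  # d f_i / d theta per branch
    score = np.where(branch, 1.0 - p, -p)        # d/dtheta log P(branch taken)
    return np.mean(pathwise + y * score)

# Closed form to check against: L = p*theta^2 + (1-p)*theta with p = sigmoid(theta).
```

The score term accounts for how θ shifts the branch probability, and the pathwise term for how θ moves the value inside the chosen branch; dropping either one biases the estimate.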

Gradient Estimator Bias

Some estimators are unbiased:

$$\mathbb{E}[\hat{g}] = \nabla_\theta L.$$

Others trade bias for lower variance.

A low-variance biased estimator may outperform a theoretically correct unbiased estimator in optimization.

This creates a central engineering tradeoff:

| Goal | Cost |
| --- | --- |
| unbiasedness | high variance |
| low variance | possible bias |

Practical probabilistic learning often prefers stable optimization over exact gradient fidelity.

Probabilistic AD Systems

Modern systems combine:

| Capability | Purpose |
| --- | --- |
| reverse-mode AD | deterministic differentiation |
| stochastic estimators | random variables |
| trace graphs | probabilistic execution |
| Monte Carlo sampling | expectation approximation |
| symbolic density tracking | log-likelihood computation |

Examples include probabilistic programming frameworks and differentiable simulators.

Failure Modes

Probabilistic differentiation introduces many instability sources.

High variance

Gradient estimates may fluctuate wildly.

Rare-event instability

Extreme samples dominate gradients.

Discontinuous sampling

Discrete variables create undefined pathwise derivatives.

Monte Carlo noise

Optimization may become noisy or biased.

Numerical underflow

Tiny probabilities destabilize log-likelihoods.

Correlated randomness

Dependent samples complicate variance analysis.

These issues often dominate runtime behavior.

Conceptual Shift

Classical AD differentiates functions.

Probabilistic AD differentiates distributions, expectations, and stochastic processes.

The chain rule alone is no longer sufficient. Gradient estimation becomes a statistical problem as well as a computational one.

This changes the meaning of differentiation itself.

Instead of asking:

$$\frac{dy}{d\theta},$$

we ask:

$$\nabla_\theta \mathbb{E}[y].$$

The derivative becomes an expectation over random trajectories.

Summary

Probabilistic automatic differentiation extends AD into stochastic systems.

Differentiation may occur through expectations, random variables, Monte Carlo estimators, stochastic differential equations, or probabilistic programs.

The two dominant techniques are:

| Method | Core idea |
| --- | --- |
| score-function estimators | differentiate probabilities |
| reparameterization estimators | differentiate sampled paths |

These methods make modern probabilistic machine learning practical, including variational inference, stochastic simulation, diffusion models, and probabilistic programming.

The central challenge is no longer only correctness of the chain rule. It is managing variance, bias, stochasticity, and numerical stability while preserving useful gradients through random computation.