Probabilistic Automatic Differentiation

Classical automatic differentiation computes derivatives of deterministic programs.

A probabilistic program instead describes random variables, probability distributions, and stochastic transformations. The output is not a single value but a distribution, expectation, likelihood, or sampled trajectory.

Probabilistic automatic differentiation studies how derivatives propagate through such systems.

This includes:

| Problem | Example |
| --- | --- |
| differentiating expectations | stochastic optimization |
| differentiating sampling procedures | variational inference |
| differentiating probabilistic programs | Bayesian learning |
| differentiating Monte Carlo estimators | simulation gradients |
| differentiating stochastic dynamics | diffusion models |

The central challenge is that randomness introduces discontinuities, variance, and estimator bias into the differentiation process.

Deterministic vs Stochastic Computation

A deterministic computation defines

$$y = f(x,\theta).$$

A stochastic computation introduces random variables:

$$y = f(x,\theta,\omega),$$

where

$$\omega \sim p(\omega).$$

The quantity of interest is often an expectation:

$$L(\theta) = \mathbb{E}_{\omega \sim p_\theta}[\ell(f(\theta,\omega))].$$

The derivative becomes

$$\nabla_\theta L(\theta).$$

The difficulty is that both the sampled value and the distribution itself may depend on θ.

Differentiating Expectations

Suppose

$$L(\theta) = \mathbb{E}_{x \sim p_\theta(x)}[\ell(x)].$$

Expanding the expectation:

$$L(\theta) = \int \ell(x)\,p_\theta(x)\,dx.$$

Differentiate under the integral:

$$\nabla_\theta L = \int \ell(x)\,\nabla_\theta p_\theta(x)\,dx.$$

Using

$$\nabla_\theta p_\theta(x) = p_\theta(x)\,\nabla_\theta \log p_\theta(x),$$

we obtain

$$\nabla_\theta L = \mathbb{E}_{x \sim p_\theta}[\ell(x)\,\nabla_\theta \log p_\theta(x)].$$

This is the score-function estimator.

It is also called:

| Name | Context |
| --- | --- |
| REINFORCE estimator | reinforcement learning |
| likelihood-ratio estimator | statistics |
| score-function gradient | probabilistic inference |

The estimator does not require differentiating through the sampled value itself.

Score-Function Estimator

The score-function estimator is

$$\nabla_\theta \mathbb{E}_{x \sim p_\theta}[\ell(x)] = \mathbb{E}[\ell(x)\,\nabla_\theta \log p_\theta(x)].$$

Monte Carlo approximation gives

$$\nabla_\theta L \approx \frac{1}{N} \sum_{i=1}^N \ell(x_i)\,\nabla_\theta \log p_\theta(x_i).$$

This estimator is general.

It works even when:

| Situation | Supported |
| --- | --- |
| discrete random variables | yes |
| non-differentiable samples | yes |
| black-box simulators | yes |

However, it often has very high variance.

Large variance leads to unstable optimization and slow convergence.
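As a minimal NumPy sketch (the quadratic loss ℓ(x) = x² and the Gaussian family p_θ = N(θ, 1) are illustrative assumptions, not fixed by the text), the Monte Carlo score-function estimator can be written directly:

```python
import numpy as np

def score_function_grad(theta, n=100_000, seed=0):
    """Score-function (REINFORCE) estimate of d/dtheta E_{x ~ N(theta,1)}[x^2]."""
    rng = np.random.default_rng(seed)
    x = rng.normal(theta, 1.0, size=n)   # samples from p_theta
    score = x - theta                    # grad_theta log N(x | theta, 1)
    return np.mean(x**2 * score)         # ell(x) * score, averaged

# For ell(x) = x^2, E[x^2] = theta^2 + 1, so the true gradient is 2*theta.
est = score_function_grad(2.0)
```

Note that the loss ℓ is only evaluated, never differentiated: the gradient enters through the log-density, which is what makes the estimator applicable to black-box simulators.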

Reparameterization Trick

Suppose samples can be written as

$$x = g(\theta,\epsilon), \qquad \epsilon \sim p(\epsilon),$$

where the randomness is independent of θ.

Then

$$L(\theta) = \mathbb{E}_{\epsilon}[\ell(g(\theta,\epsilon))].$$

Now the expectation is over a fixed distribution. The derivative becomes

$$\nabla_\theta L = \mathbb{E}_{\epsilon}[\nabla_\theta \ell(g(\theta,\epsilon))].$$

This allows ordinary reverse-mode AD through the sampled computation.

This is the reparameterization estimator.

Gaussian Example

Suppose

$$x \sim \mathcal{N}(\mu,\sigma^2).$$

Reparameterize:

$$x = \mu + \sigma\epsilon, \qquad \epsilon \sim \mathcal{N}(0,1).$$

Then

$$L(\mu,\sigma) = \mathbb{E}_\epsilon[\ell(\mu+\sigma\epsilon)].$$

Gradients become

$$\nabla_\mu L = \mathbb{E}[\nabla_x \ell(x)],$$

and

$$\nabla_\sigma L = \mathbb{E}[\epsilon\,\nabla_x \ell(x)].$$

The stochasticity is isolated in ε. The remaining computation is differentiable.

This estimator usually has much lower variance than the score-function estimator.
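A small NumPy sketch makes the variance gap concrete. The objective E[x²] under x ~ N(μ, σ²) is an assumed example with known gradient 2μ with respect to μ; both estimators below target the same quantity:

```python
import numpy as np

def reparam_grad_mu(mu, sigma, n=10_000, seed=0):
    """Pathwise estimate of d/dmu E_{x ~ N(mu,sigma^2)}[x^2] via x = mu + sigma*eps."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n)   # eps ~ N(0, 1), independent of mu
    x = mu + sigma * eps           # differentiable transformation
    return np.mean(2.0 * x)        # grad_x ell(x) * dx/dmu = 2x * 1

def score_grad_mu(mu, sigma, n=10_000, seed=0):
    """Score-function estimate of the same gradient, for comparison."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu, sigma, size=n)
    return np.mean(x**2 * (x - mu) / sigma**2)

# Both estimate 2*mu; the pathwise version typically has far lower variance.
```

In an AD framework the pathwise version needs no hand-derived gradient at all: reverse mode differentiates `mu + sigma * eps` and the loss automatically.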

Pathwise Derivatives

The reparameterization trick is also called the pathwise derivative estimator.

The derivative propagates through the sampled path itself:

epsilon -> sample x -> loss

The stochastic node becomes a differentiable transformation.

This makes probabilistic programs compatible with ordinary reverse-mode AD systems.

Comparison of Gradient Estimators

| Estimator | Requires differentiable sample path | Supports discrete variables | Variance |
| --- | --- | --- | --- |
| score-function | no | yes | high |
| reparameterization | yes | usually no | lower |
| finite differences | no | yes | very high |
| implicit estimators | partial | partial | moderate |

No estimator is uniformly best.

The choice depends on distribution structure and computational constraints.

Variance Reduction

Monte Carlo gradient estimators are noisy.

Variance reduction is therefore central in probabilistic differentiation.

Baselines

Subtract a constant b:

$$\mathbb{E}[(\ell(x)-b)\,\nabla_\theta \log p_\theta(x)].$$

The estimator remains unbiased because

$$\mathbb{E}[\nabla_\theta \log p_\theta(x)] = 0.$$

A good baseline reduces variance dramatically.
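A quick NumPy check (reusing the assumed example ℓ(x) = x² with x ~ N(θ, 1)) shows that subtracting a baseline near E[ℓ(x)] leaves the mean unchanged while shrinking the variance:

```python
import numpy as np

def score_grad_stats(theta, baseline=0.0, n=2_000, seed=0):
    """Score-function estimate of d/dtheta E_{x ~ N(theta,1)}[x^2] with baseline b.
    Returns (mean, per-sample variance) of the estimator terms."""
    rng = np.random.default_rng(seed)
    x = rng.normal(theta, 1.0, size=n)
    terms = (x**2 - baseline) * (x - theta)   # (ell(x) - b) * score
    return terms.mean(), terms.var()

g_plain, v_plain = score_grad_stats(2.0, baseline=0.0)
g_base,  v_base  = score_grad_stats(2.0, baseline=5.0)  # b near E[ell(x)] = 5
# Both means estimate 2*theta = 4, but the baseline cuts the variance.
```

In practice the baseline is usually estimated online, for example as a running average of observed losses, rather than fixed ahead of time.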

Control variates

Introduce correlated auxiliary estimators with known expectation.

Antithetic sampling

Use negatively correlated samples.

Rao-Blackwellization

Integrate analytically over some variables instead of sampling them.

These methods are essential for practical stochastic gradient estimation.

Discrete Random Variables

Discrete sampling is difficult because sampled values change discontinuously.

Suppose

$$x \sim \operatorname{Categorical}(p_\theta).$$

A tiny parameter perturbation may abruptly change the sampled category.

Ordinary pathwise differentiation fails because:

$$\frac{\partial x}{\partial \theta}$$

does not exist in the classical sense.

The score-function estimator still works because it differentiates the probability distribution rather than the sampled value.

Relaxed Distributions

A common workaround replaces discrete variables with continuous approximations.

For categorical sampling, the Gumbel-Softmax trick uses:

$$y_i = \frac{\exp((\log p_i + g_i)/\tau)}{\sum_j \exp((\log p_j + g_j)/\tau)},$$

where:

$$g_i \sim \operatorname{Gumbel}(0,1).$$

As temperature τ approaches zero, the relaxed sample approaches a one-hot discrete sample.

For finite temperature, the sample remains differentiable.

This allows approximate pathwise differentiation through discrete choices.
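The relaxation above can be sketched in a few lines of NumPy; the probabilities [0.2, 0.3, 0.5] and the two temperatures are arbitrary illustrative values:

```python
import numpy as np

def gumbel_softmax(log_p, tau, rng):
    """Draw one relaxed categorical sample on the simplex (Gumbel-Softmax)."""
    g = rng.gumbel(size=log_p.shape)   # g_i ~ Gumbel(0, 1)
    z = (log_p + g) / tau
    z = z - z.max()                    # stabilize the softmax numerically
    y = np.exp(z)
    return y / y.sum()

rng = np.random.default_rng(0)
log_p = np.log(np.array([0.2, 0.3, 0.5]))
y_smooth = gumbel_softmax(log_p, tau=1.0, rng=rng)   # soft, differentiable
y_sharp = gumbel_softmax(log_p, tau=0.01, rng=rng)   # typically near one-hot
```

As τ → 0 the argmax of log p_i + g_i recovers exact categorical sampling (the Gumbel-max trick), so sharp samples follow the original probabilities; finite τ trades this exactness for differentiability.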

Probabilistic Computational Graphs

A probabilistic program can be represented as a graph containing both deterministic and stochastic nodes.

Example:

theta -> z ~ p(z|theta)
z -> x ~ p(x|z)
x -> loss

Differentiation propagates through:

| Node type | Gradient rule |
| --- | --- |
| deterministic node | ordinary chain rule |
| stochastic node | estimator-specific rule |

Modern probabilistic programming systems combine AD with stochastic estimators to differentiate entire probabilistic models.

Variational Inference

Variational inference optimizes an approximate distribution

$$q_\phi(z)$$

to approximate a target posterior.

The evidence lower bound (ELBO) is

$$\mathcal{L}(\phi) = \mathbb{E}_{z \sim q_\phi}[\log p(x,z) - \log q_\phi(z)].$$

Gradients require differentiating expectations over learned distributions.

Reparameterization gradients made deep variational models practical.

Variational autoencoders are a canonical example.

Variational Autoencoders

A variational autoencoder defines:

| Component | Role |
| --- | --- |
| encoder | parameterizes latent distribution |
| latent variable | sampled representation |
| decoder | reconstructs data |

The encoder predicts:

$$\mu(x), \quad \sigma(x).$$

A latent sample is drawn using reparameterization:

$$z = \mu + \sigma\epsilon.$$

The decoder computes reconstruction loss.

Reverse-mode AD then differentiates the entire stochastic pipeline.

Without reparameterization, efficient training would be much harder.

Probabilistic Programs

A probabilistic program includes random choices:

z = sample(normal(mu, sigma))
x = sample(decoder(z))
observe(data, x)

The program defines a probability distribution over execution traces.

Differentiation may involve:

| Quantity | Meaning |
| --- | --- |
| log probability | likelihood gradient |
| posterior expectation | inference objective |
| sampled trajectory | simulation sensitivity |

Probabilistic AD systems combine tracing, sampling, and reverse-mode differentiation.

Monte Carlo Differentiation

Suppose

$$L(\theta) = \mathbb{E}[f_\theta(X)].$$

Monte Carlo estimates use samples:

$$L_N(\theta) = \frac{1}{N} \sum_{i=1}^N f_\theta(X_i).$$

Differentiating gives

$$\nabla_\theta L_N = \frac{1}{N} \sum_{i=1}^N \nabla_\theta f_\theta(X_i).$$

This estimator itself becomes random.

Thus optimization uses stochastic gradients of stochastic objectives.

Understanding variance propagation becomes critical.

Stochastic Differential Equations

Probabilistic dynamics often use stochastic differential equations:

$$dz = f(z,t)\,dt + g(z,t)\,dW_t,$$

where W_t is Brownian motion.

These systems appear in:

| Domain | Example |
| --- | --- |
| diffusion models | generative modeling |
| finance | stochastic volatility |
| physics | thermal noise |
| biology | random population dynamics |

Differentiation through SDEs requires handling stochastic integrals and noise-dependent trajectories.
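The simplest way to simulate such a system is the Euler-Maruyama scheme. The NumPy sketch below discretizes the SDE on a fixed grid; the Ornstein-Uhlenbeck drift and diffusion coefficients at the bottom are an assumed example:

```python
import numpy as np

def euler_maruyama(f, g, z0, t_grid, rng):
    """Simulate one path of dz = f(z,t) dt + g(z,t) dW_t on a fixed time grid."""
    z = np.empty(len(t_grid))
    z[0] = z0
    for k in range(len(t_grid) - 1):
        dt = t_grid[k + 1] - t_grid[k]
        dW = rng.normal(0.0, np.sqrt(dt))   # Brownian increment, Var = dt
        z[k + 1] = z[k] + f(z[k], t_grid[k]) * dt + g(z[k], t_grid[k]) * dW
    return z

# Ornstein-Uhlenbeck process as an example: dz = -z dt + 0.1 dW_t
rng = np.random.default_rng(0)
t = np.linspace(0.0, 5.0, 501)
path = euler_maruyama(lambda z, t: -z, lambda z, t: 0.1, 1.0, t, rng)
```

Because each step is a differentiable function of the state plus an injected Gaussian increment, this discretization is itself a reparameterized computation: reverse-mode AD can propagate through the whole path once the noise is treated as fixed input.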

Diffusion Models

Modern generative diffusion models evolve data through noisy dynamics.

Forward diffusion adds noise:

$$dx = -\frac{1}{2}\beta(t)\,x\,dt + \sqrt{\beta(t)}\,dW_t.$$

Reverse diffusion learns to invert the stochastic process.

Training involves expectations over noisy trajectories and repeated stochastic sampling.

Probabilistic AD is fundamental to these systems.

Measure-Theoretic Issues

Differentiating probabilistic systems introduces mathematical subtleties.

Questions include:

| Question | Issue |
| --- | --- |
| Can the derivative move inside the expectation? | dominated convergence |
| Does a density exist? | measure regularity |
| Is the estimator unbiased? | interchange of limits |
| Does the variance exist? | integrability |

Many practical estimators rely on assumptions that may fail in heavy-tailed or discontinuous systems.

Stochastic Control Flow

Programs with random branching are especially difficult.

Example:

if sample(bernoulli(p)):
    y = f1(theta)
else:
    y = f2(theta)

The execution trace itself becomes stochastic.

The derivative depends on both:

| Component | Effect |
| --- | --- |
| branch probability | score-function term |
| branch computation | pathwise term |

Hybrid estimators are often required.
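A hybrid estimator for a branching program like the one above can be sketched in NumPy. The concrete choices f1(θ) = θ², f2(θ) = θ, and branch probability p = sigmoid(θ) are assumptions for illustration; each sample combines a pathwise term for the taken branch with a score-function term for the branch probability:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def hybrid_branch_grad(theta, n=200_000, seed=0):
    """Hybrid estimate of d/dtheta E[y] for a randomly branching program:
    y = theta**2 if Bernoulli(sigmoid(theta)) fires, else y = theta."""
    rng = np.random.default_rng(seed)
    p = sigmoid(theta)
    branch = rng.random(n) < p                   # stochastic control flow
    y = np.where(branch, theta**2, theta)        # value of the taken branch
    pathwise = np.where(branch, 2 * theta, 1.0)  # d f_i / d theta per branch
    score = np.where(branch, 1.0 - p, -p)        # d/dtheta log P(branch taken)
    return np.mean(pathwise + y * score)

# Closed form to check against: L = p*theta^2 + (1-p)*theta with p = sigmoid(theta).
```

The score term accounts for how θ shifts the branch probability, and the pathwise term for how θ moves the value inside the chosen branch; dropping either one biases the estimate.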

Gradient Estimator Bias

Some estimators are unbiased:

$$\mathbb{E}[\hat{g}] = \nabla_\theta L.$$

Others trade bias for lower variance.

A low-variance biased estimator may outperform a theoretically correct unbiased estimator in optimization.

This creates a central engineering tradeoff:

| Goal | Cost |
| --- | --- |
| unbiasedness | high variance |
| low variance | possible bias |

Practical probabilistic learning often prefers stable optimization over exact gradient fidelity.

Probabilistic AD Systems

Modern systems combine:

| Capability | Purpose |
| --- | --- |
| reverse-mode AD | deterministic differentiation |
| stochastic estimators | random variables |
| trace graphs | probabilistic execution |
| Monte Carlo sampling | expectation approximation |
| symbolic density tracking | log-likelihood computation |

Examples include probabilistic programming frameworks and differentiable simulators.

Failure Modes

Probabilistic differentiation introduces many instability sources.

High variance

Gradient estimates may fluctuate wildly.

Rare-event instability

Extreme samples dominate gradients.

Discontinuous sampling

Discrete variables create undefined pathwise derivatives.

Monte Carlo noise

Optimization may become noisy or biased.

Numerical underflow

Tiny probabilities destabilize log-likelihoods.

Correlated randomness

Dependent samples complicate variance analysis.

These issues often dominate runtime behavior.

Conceptual Shift

Classical AD differentiates functions.

Probabilistic AD differentiates distributions, expectations, and stochastic processes.

The chain rule alone is no longer sufficient. Gradient estimation becomes a statistical problem as well as a computational one.

This changes the meaning of differentiation itself.

Instead of asking:

$$\frac{dy}{d\theta},$$

we ask:

$$\nabla_\theta \mathbb{E}[y].$$

The derivative becomes an expectation over random trajectories.

Summary

Probabilistic automatic differentiation extends AD into stochastic systems.

Differentiation may occur through expectations, random variables, Monte Carlo estimators, stochastic differential equations, or probabilistic programs.

The two dominant techniques are:

| Method | Core idea |
| --- | --- |
| score-function estimators | differentiate probabilities |
| reparameterization estimators | differentiate sampled paths |

These methods make modern probabilistic machine learning practical, including variational inference, stochastic simulation, diffusion models, and probabilistic programming.

The central challenge is no longer only correctness of the chain rule. It is managing variance, bias, stochasticity, and numerical stability while preserving useful gradients through random computation.