Classical automatic differentiation computes derivatives of deterministic programs.
A probabilistic program instead describes random variables, probability distributions, and stochastic transformations. The output is not a single value but a distribution, expectation, likelihood, or sampled trajectory.
Probabilistic automatic differentiation studies how derivatives propagate through such systems.
This includes:
| Problem | Example |
|---|---|
| differentiating expectations | stochastic optimization |
| differentiating sampling procedures | variational inference |
| differentiating probabilistic programs | Bayesian learning |
| differentiating Monte Carlo estimators | simulation gradients |
| differentiating stochastic dynamics | diffusion models |
The central challenge is that randomness introduces discontinuities, variance, and estimator bias into the differentiation process.
Deterministic vs Stochastic Computation
A deterministic computation defines an output as a function of its parameters:

$$y = f(\theta)$$

A stochastic computation introduces random variables:

$$x \sim p(x; \theta), \qquad y = f(x, \theta),$$

where the distribution $p$ itself may depend on $\theta$.

The quantity of interest is often an expectation:

$$J(\theta) = \mathbb{E}_{x \sim p(x;\theta)}[f(x, \theta)]$$

The derivative becomes

$$\nabla_\theta J(\theta) = \nabla_\theta \, \mathbb{E}_{x \sim p(x;\theta)}[f(x, \theta)]$$
The difficulty is that both the sampled value and the distribution itself may depend on θ.
Differentiating Expectations
Suppose

$$J(\theta) = \mathbb{E}_{x \sim p(x;\theta)}[f(x)]$$

Expanding the expectation:

$$J(\theta) = \int p(x; \theta)\, f(x)\, dx$$

Differentiate under the integral:

$$\nabla_\theta J(\theta) = \int \nabla_\theta p(x; \theta)\, f(x)\, dx$$

Using the log-derivative identity

$$\nabla_\theta p(x; \theta) = p(x; \theta)\, \nabla_\theta \log p(x; \theta),$$

we obtain

$$\nabla_\theta J(\theta) = \mathbb{E}_{x \sim p(x;\theta)}\big[f(x)\, \nabla_\theta \log p(x; \theta)\big]$$
This is the score-function estimator.
It is also called:
| Name | Context |
|---|---|
| REINFORCE estimator | reinforcement learning |
| likelihood-ratio estimator | statistics |
| score-function gradient | probabilistic inference |
The estimator does not require differentiating through the sampled value itself.
Score-Function Estimator
The score-function estimator is

$$\nabla_\theta J(\theta) = \mathbb{E}_{x \sim p(x;\theta)}\big[f(x)\, \nabla_\theta \log p(x; \theta)\big]$$

Monte Carlo approximation gives

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i)\, \nabla_\theta \log p(x_i; \theta), \qquad x_i \sim p(x; \theta)$$
This estimator is general.
It works even when:
| Situation | Supported |
|---|---|
| discrete random variables | yes |
| non-differentiable samples | yes |
| black-box simulators | yes |
However, it often has very high variance.
Large variance leads to unstable optimization and slow convergence.
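As a concrete illustration, the following is a minimal NumPy sketch of the Monte Carlo score-function estimator for the toy objective $\mathbb{E}_{x \sim \mathcal{N}(\theta, 1)}[x^2]$; the Gaussian distribution and the choice $f(x) = x^2$ are purely illustrative.

```python
import numpy as np

def score_function_gradient(theta, n_samples=10_000, seed=0):
    """Estimate d/dtheta E_{x ~ N(theta, 1)}[x^2] with the score-function estimator."""
    rng = np.random.default_rng(seed)
    x = rng.normal(loc=theta, scale=1.0, size=n_samples)  # samples from p(x; theta)
    f = x ** 2                                            # f(x); never differentiated
    score = x - theta                                     # d/dtheta log N(x; theta, 1)
    return np.mean(f * score)

# True gradient: d/dtheta (theta^2 + 1) = 2 * theta
print(score_function_gradient(1.5))  # roughly 3.0, but with noticeable Monte Carlo noise
```

The scatter of the estimate across seeds is exactly the high variance discussed above.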
Reparameterization Trick
Suppose samples can be written as

$$x = g(\epsilon, \theta), \qquad \epsilon \sim p(\epsilon),$$

where the randomness $\epsilon$ is drawn from a fixed distribution independent of $\theta$.

Then

$$\mathbb{E}_{x \sim p(x;\theta)}[f(x)] = \mathbb{E}_{\epsilon \sim p(\epsilon)}\big[f(g(\epsilon, \theta))\big]$$

Now the expectation is over a fixed distribution. The derivative becomes

$$\nabla_\theta \, \mathbb{E}_{\epsilon}\big[f(g(\epsilon, \theta))\big] = \mathbb{E}_{\epsilon}\big[\nabla_\theta f(g(\epsilon, \theta))\big]$$
This allows ordinary reverse-mode AD through the sampled computation.
This is the reparameterization estimator.
Gaussian Example
Suppose

$$x \sim \mathcal{N}(\mu, \sigma^2)$$

Reparameterize:

$$\epsilon \sim \mathcal{N}(0, 1)$$

Then

$$x = \mu + \sigma \epsilon$$

Gradients become

$$\frac{\partial x}{\partial \mu} = 1$$

and

$$\frac{\partial x}{\partial \sigma} = \epsilon$$
The stochasticity is isolated in ε. The remaining computation is differentiable.
This estimator usually has much lower variance than the score-function estimator.
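The same toy objective $\mathbb{E}[x^2]$ can be differentiated with the reparameterization estimator. A minimal sketch, assuming PyTorch is available; reverse-mode AD propagates gradients straight through the sampled values.

```python
import torch

# Parameters of the sampling distribution, tracked by reverse-mode AD.
mu = torch.tensor(1.5, requires_grad=True)
sigma = torch.tensor(0.8, requires_grad=True)

eps = torch.randn(10_000)      # fixed noise, independent of the parameters
x = mu + sigma * eps           # reparameterized sample: x ~ N(mu, sigma^2)
loss = (x ** 2).mean()         # Monte Carlo estimate of E[x^2] = mu^2 + sigma^2

loss.backward()                # pathwise gradients flow through the samples
print(mu.grad, sigma.grad)     # close to 2*mu and 2*sigma
```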
Pathwise Derivatives
The reparameterization trick is also called the pathwise derivative estimator.
The derivative propagates through the sampled path itself:
```
epsilon -> sample x -> loss
```

The stochastic node becomes a differentiable transformation.
This makes probabilistic programs compatible with ordinary reverse-mode AD systems.
Comparison of Gradient Estimators
| Estimator | Requires differentiable sample path | Supports discrete variables | Variance |
|---|---|---|---|
| score-function | no | yes | high |
| reparameterization | yes | usually no | lower |
| finite differences | no | yes | very high |
| implicit estimators | partial | partial | moderate |
No estimator is uniformly best.
The choice depends on distribution structure and computational constraints.
Variance Reduction
Monte Carlo gradient estimators are noisy.
Variance reduction is therefore central in probabilistic differentiation.
Baselines
Subtract a constant $b$:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \big(f(x_i) - b\big)\, \nabla_\theta \log p(x_i; \theta)$$

The estimator remains unbiased because

$$\mathbb{E}_{x \sim p(x;\theta)}\big[\nabla_\theta \log p(x; \theta)\big] = \nabla_\theta \int p(x; \theta)\, dx = 0$$
A good baseline reduces variance dramatically.
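Continuing the toy Gaussian example, the following NumPy sketch shows that a constant baseline leaves the estimate unchanged in expectation while shrinking its variance; the baseline value used here is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 1.5, 100_000

x = rng.normal(theta, 1.0, size=n)
f = x ** 2
score = x - theta              # d/dtheta log N(x; theta, 1)

b = 3.0                        # constant baseline, e.g. a running average of f from earlier steps
plain = f * score              # per-sample score-function terms
with_baseline = (f - b) * score

print(plain.mean(), with_baseline.mean())  # both estimate ~2*theta = 3.0
print(plain.var(), with_baseline.var())    # variance drops sharply with the baseline
```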
Control variates
Introduce correlated auxiliary estimators with known expectation.
Antithetic sampling
Use negatively correlated samples.
Rao-Blackwellization
Integrate analytically over some variables instead of sampling them.
These methods are essential for practical stochastic gradient estimation.
Discrete Random Variables
Discrete sampling is difficult because sampled values change discontinuously.
Suppose

$$z \sim \mathrm{Categorical}\big(\pi_1(\theta), \ldots, \pi_K(\theta)\big)$$
A tiny parameter perturbation may abruptly change the sampled category.
Ordinary pathwise differentiation fails because

$$\frac{\partial z}{\partial \theta}$$

does not exist in the classical sense.
The score-function estimator still works because it differentiates the probability distribution rather than the sampled value.
Relaxed Distributions
A common workaround replaces discrete variables with continuous approximations.
For categorical sampling, the Gumbel-Softmax trick uses:

$$z_i = \frac{\exp\big((\log \pi_i + g_i)/\tau\big)}{\sum_{j} \exp\big((\log \pi_j + g_j)/\tau\big)},$$

where the $g_i$ are independent Gumbel(0, 1) samples and $\tau > 0$ is a temperature.
As temperature τ approaches zero, the relaxed sample approaches a one-hot discrete sample.
For finite temperature, the sample remains differentiable.
This allows approximate pathwise differentiation through discrete choices.
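A minimal NumPy sketch of Gumbel-Softmax sampling; the class probabilities and temperatures below are illustrative.

```python
import numpy as np

def gumbel_softmax_sample(logits, tau, rng):
    """Draw a relaxed (differentiable) one-hot sample from a categorical distribution."""
    u = rng.uniform(size=logits.shape)
    g = -np.log(-np.log(u))            # Gumbel(0, 1) noise
    y = (logits + g) / tau
    y = y - y.max()                    # shift for numerical stability
    e = np.exp(y)
    return e / e.sum()                 # softmax: smooth in the logits

rng = np.random.default_rng(0)
logits = np.log(np.array([0.2, 0.3, 0.5]))
print(gumbel_softmax_sample(logits, tau=0.1, rng=rng))  # nearly one-hot
print(gumbel_softmax_sample(logits, tau=5.0, rng=rng))  # nearly uniform
```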
Probabilistic Computational Graphs
A probabilistic program can be represented as a graph containing both deterministic and stochastic nodes.
Example:
```
theta -> z ~ p(z|theta)
z -> x ~ p(x|z)
x -> loss
```

Differentiation propagates through:
| Node type | Gradient rule |
|---|---|
| deterministic node | ordinary chain rule |
| stochastic node | estimator-specific rule |
Modern probabilistic programming systems combine AD with stochastic estimators to differentiate entire probabilistic models.
Variational Inference
Variational inference optimizes an approximate distribution $q_\phi(z)$ to approximate a target posterior $p(z \mid x)$.

The evidence lower bound (ELBO) is

$$\mathrm{ELBO}(\phi) = \mathbb{E}_{z \sim q_\phi(z)}\big[\log p(x, z) - \log q_\phi(z)\big]$$
Gradients require differentiating expectations over learned distributions.
Reparameterization gradients made deep variational models practical.
Variational autoencoders are a canonical example.
Variational Autoencoders
A variational autoencoder defines:
| Component | Role |
|---|---|
| encoder | parameterizes latent distribution |
| latent variable | sampled representation |
| decoder | reconstructs data |
The encoder predicts:

$$\mu(x), \quad \sigma(x)$$

A latent sample is drawn using reparameterization:

$$z = \mu(x) + \sigma(x) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
The decoder computes reconstruction loss.
Reverse-mode AD then differentiates the entire stochastic pipeline.
Without reparameterization, efficient training would be much harder.
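A minimal sketch of this stochastic pipeline, assuming PyTorch; the single linear encoder and decoder and the layer sizes are stand-ins for a real model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Linear(784, 2 * 16)   # predicts mean and log-variance of q(z | x)
decoder = nn.Linear(16, 784)

x = torch.rand(32, 784)            # stand-in data batch
mu, log_var = encoder(x).chunk(2, dim=-1)

eps = torch.randn_like(mu)                   # noise independent of the parameters
z = mu + torch.exp(0.5 * log_var) * eps      # reparameterized latent sample

recon = torch.sigmoid(decoder(z))
recon_loss = F.binary_cross_entropy(recon, x, reduction="sum")
kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())

(recon_loss + kl).backward()       # reverse-mode AD through the entire stochastic pipeline
```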
Probabilistic Programs
A probabilistic program includes random choices:
```
z = sample(normal(mu, sigma))
x = sample(decoder(z))
observe(data, x)
```

The program defines a probability distribution over execution traces.
Differentiation may involve:
| Quantity | Meaning |
|---|---|
| log probability | likelihood gradient |
| posterior expectation | inference objective |
| sampled trajectory | simulation sensitivity |
Probabilistic AD systems combine tracing, sampling, and reverse-mode differentiation.
Monte Carlo Differentiation
Suppose

$$J(\theta) = \mathbb{E}_{x \sim p(x;\theta)}[f(x, \theta)]$$

Monte Carlo estimates use samples:

$$\hat{J}(\theta) = \frac{1}{N} \sum_{i=1}^{N} f(x_i, \theta), \qquad x_i \sim p(x; \theta)$$

Differentiating gives, for reparameterized samples $x_i = g(\epsilon_i, \theta)$,

$$\nabla_\theta \hat{J}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta f\big(g(\epsilon_i, \theta), \theta\big)$$
This estimator itself becomes random.
Thus optimization uses stochastic gradients of stochastic objectives.
Understanding variance propagation becomes critical.
Stochastic Differential Equations
Probabilistic dynamics often use stochastic differential equations:

$$dx_t = f(x_t, \theta)\, dt + g(x_t, \theta)\, dW_t$$
where W_t is Brownian motion.
These systems appear in:
| Domain | Example |
|---|---|
| diffusion models | generative modeling |
| finance | stochastic volatility |
| physics | thermal noise |
| biology | random population dynamics |
Differentiation through SDEs requires handling stochastic integrals and noise-dependent trajectories.
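A minimal sketch of pathwise differentiation through an Euler-Maruyama simulation, assuming PyTorch; the linear drift and constant diffusion are illustrative choices.

```python
import torch

# Simulate dX_t = theta * X_t dt + sigma dW_t and differentiate E[X_T^2] pathwise.
theta = torch.tensor(-0.5, requires_grad=True)
sigma = torch.tensor(0.3, requires_grad=True)

n_paths, n_steps, dt = 1_000, 100, 0.01
x = torch.ones(n_paths)
for _ in range(n_steps):
    dW = torch.randn(n_paths) * dt ** 0.5   # reparameterized Brownian increments
    x = x + theta * x * dt + sigma * dW     # one Euler-Maruyama step

loss = (x ** 2).mean()                      # Monte Carlo estimate of E[X_T^2]
loss.backward()                             # gradients w.r.t. drift and diffusion parameters
print(theta.grad, sigma.grad)
```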
Diffusion Models
Modern generative diffusion models evolve data through noisy dynamics.
Forward diffusion adds noise:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
Reverse diffusion learns to invert the stochastic process.
Training involves expectations over noisy trajectories and repeated stochastic sampling.
Probabilistic AD is fundamental to these systems.
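A minimal NumPy sketch of the forward noising step, assuming a simple linear variance schedule; the schedule, step count, and data dimension are illustrative.

```python
import numpy as np

def forward_diffuse(x0, t, alpha_bar, rng):
    """Sample x_t from q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

betas = np.linspace(1e-4, 0.02, 1000)   # toy linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x0 = rng.standard_normal(64)            # stand-in "data" vector
x_t, eps = forward_diffuse(x0, t=500, alpha_bar=alpha_bar, rng=rng)
```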
Measure-Theoretic Issues
Differentiating probabilistic systems introduces mathematical subtleties.
Questions include:
| Question | Issue |
|---|---|
| can derivative move inside expectation? | dominated convergence |
| does density exist? | measure regularity |
| is estimator unbiased? | interchange of limits |
| does variance exist? | integrability |
Many practical estimators rely on assumptions that may fail in heavy-tailed or discontinuous systems.
Stochastic Control Flow
Programs with random branching are especially difficult.
Example:
```
if sample(bernoulli(p)):
    y = f1(theta)
else:
    y = f2(theta)
```

The execution trace itself becomes stochastic.
The derivative depends on both:
| Component | Effect |
|---|---|
| branch probability | score-function term |
| branch computation | pathwise term |
Hybrid estimators are often required.
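A minimal NumPy sketch of such a hybrid estimator for the branching program above, assuming the illustrative choices p = sigmoid(theta), f1(theta) = theta^2, and f2(theta) = -theta.

```python
import numpy as np

def branch_gradient(theta, n=100_000, seed=0):
    """Hybrid score-function + pathwise estimate of d/dtheta E[y] for the branching program."""
    rng = np.random.default_rng(seed)
    p = 1.0 / (1.0 + np.exp(-theta))          # branch probability p = sigmoid(theta)
    took_first = rng.uniform(size=n) < p      # sampled branch per execution trace

    y = np.where(took_first, theta ** 2, -theta)        # value along the sampled branch
    dlogp = np.where(took_first, 1.0 - p, -p)           # d/dtheta log P(sampled branch)
    pathwise = np.where(took_first, 2.0 * theta, -1.0)  # derivative of the chosen branch

    return np.mean(y * dlogp + pathwise)      # score-function term + pathwise term

# Analytic check: E[y] = p * theta^2 - (1 - p) * theta
print(branch_gradient(0.3))
```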
Gradient Estimator Bias
Some estimators are unbiased:

$$\mathbb{E}[\hat{g}] = \nabla_\theta J(\theta)$$
Others trade bias for lower variance.
A low-variance biased estimator may outperform a theoretically correct unbiased estimator in optimization.
This creates a central engineering tradeoff:
| Goal | Cost |
|---|---|
| unbiasedness | high variance |
| low variance | possible bias |
Practical probabilistic learning often prefers stable optimization over exact gradient fidelity.
Probabilistic AD Systems
Modern systems combine:
| Capability | Purpose |
|---|---|
| reverse-mode AD | deterministic differentiation |
| stochastic estimators | random variables |
| trace graphs | probabilistic execution |
| Monte Carlo sampling | expectation approximation |
| symbolic density tracking | log-likelihood computation |
Examples include probabilistic programming frameworks and differentiable simulators.
Failure Modes
Probabilistic differentiation introduces many instability sources.
High variance
Gradient estimates may fluctuate wildly.
Rare-event instability
Extreme samples dominate gradients.
Discontinuous sampling
Discrete variables create undefined pathwise derivatives.
Monte Carlo noise
Optimization may become noisy or biased.
Numerical underflow
Tiny probabilities destabilize log-likelihoods.
Correlated randomness
Dependent samples complicate variance analysis.
These issues often dominate runtime behavior.
Conceptual Shift
Classical AD differentiates functions.
Probabilistic AD differentiates distributions, expectations, and stochastic processes.
The chain rule alone is no longer sufficient. Gradient estimation becomes a statistical problem as well as a computational one.
This changes the meaning of differentiation itself.
Instead of asking how a function's output changes as its input changes, we ask how an expected outcome changes as the parameters of the underlying distribution change.
The derivative becomes an expectation over random trajectories.
Summary
Probabilistic automatic differentiation extends AD into stochastic systems.
Differentiation may occur through expectations, random variables, Monte Carlo estimators, stochastic differential equations, or probabilistic programs.
The two dominant techniques are:
| Method | Core idea |
|---|---|
| score-function estimators | differentiate probabilities |
| reparameterization estimators | differentiate sampled paths |
These methods make modern probabilistic machine learning practical, including variational inference, stochastic simulation, diffusion models, and probabilistic programming.
The central challenge is no longer only correctness of the chain rule. It is managing variance, bias, stochasticity, and numerical stability while preserving useful gradients through random computation.