# Probabilistic Automatic Differentiation

Classical automatic differentiation computes derivatives of deterministic programs.

A probabilistic program instead describes random variables, probability distributions, and stochastic transformations. The output is not a single value but a distribution, expectation, likelihood, or sampled trajectory.

Probabilistic automatic differentiation studies how derivatives propagate through such systems.

This includes:

| Problem | Example |
|---|---|
| differentiating expectations | stochastic optimization |
| differentiating sampling procedures | variational inference |
| differentiating probabilistic programs | Bayesian learning |
| differentiating Monte Carlo estimators | simulation gradients |
| differentiating stochastic dynamics | diffusion models |

The central challenge is that randomness introduces discontinuities, variance, and estimator bias into the differentiation process.

## Deterministic vs Stochastic Computation

A deterministic computation defines

$$
y = f(x,\theta).
$$

A stochastic computation introduces random variables:

$$
y = f(x,\theta,\omega),
$$

where

$$
\omega \sim p(\omega).
$$

The quantity of interest is often an expectation:

$$
L(\theta) =
\mathbb{E}_{\omega \sim p_\theta(\omega)}
[\ell(f(x,\theta,\omega))].
$$

The derivative becomes

$$
\nabla_\theta L(\theta).
$$

The difficulty is that both the sampled value and the distribution itself may depend on `θ`.

## Differentiating Expectations

Suppose

$$
L(\theta)=\mathbb{E}_{x\sim p_\theta(x)}[\ell(x)].
$$

Expanding the expectation:

$$
L(\theta) =
\int \ell(x)p_\theta(x)\,dx.
$$

Differentiate under the integral:

$$
\nabla_\theta L =
\int \ell(x)\nabla_\theta p_\theta(x)\,dx.
$$

Using

$$
\nabla_\theta p_\theta(x) =
p_\theta(x)\nabla_\theta \log p_\theta(x),
$$

we obtain

$$
\nabla_\theta L =
\mathbb{E}_{x\sim p_\theta}
[
\ell(x)\nabla_\theta \log p_\theta(x)
].
$$

This is the score-function estimator.

It is also called:

| Name | Context |
|---|---|
| REINFORCE estimator | reinforcement learning |
| likelihood-ratio estimator | statistics |
| score-function gradient | probabilistic inference |

The estimator does not require differentiating through the sampled value itself.
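
For a concrete instance (an illustrative choice, not part of the derivation above), take a unit-variance Gaussian with mean `θ`. Its score has a simple closed form, and the estimator reduces to

$$
\nabla_\theta \log p_\theta(x) = x - \theta,
\qquad
\nabla_\theta L =
\mathbb{E}_{x\sim\mathcal{N}(\theta,1)}
[\ell(x)(x-\theta)].
$$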

## Score-Function Estimator

The score-function estimator is

$$
\nabla_\theta
\mathbb{E}_{x\sim p_\theta}
[\ell(x)] =
\mathbb{E}
[
\ell(x)\nabla_\theta \log p_\theta(x)
].
$$

Monte Carlo approximation gives

$$
\nabla_\theta L
\approx
\frac{1}{N}
\sum_{i=1}^N
\ell(x_i)\nabla_\theta \log p_\theta(x_i).
$$

This estimator is general.

It works even when:

| Situation | Supported |
|---|---|
| discrete random variables | yes |
| non-differentiable samples | yes |
| black-box simulators | yes |

However, it often has very high variance.

Large variance leads to unstable optimization and slow convergence.
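
A minimal Monte Carlo sketch of this estimator in NumPy; the Gaussian `N(θ, 1)` and the loss `ℓ(x) = x²` are illustrative choices made here so the exact gradient `2θ` is known:

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(x):
    # Example loss; any black-box function of the sample works.
    return x ** 2

def score_function_grad(theta, n_samples=10_000):
    """Score-function (REINFORCE) estimate of d/dtheta E_{x~N(theta,1)}[loss(x)]."""
    x = rng.normal(theta, 1.0, size=n_samples)
    # For N(theta, 1), the score is d/dtheta log p_theta(x) = x - theta.
    score = x - theta
    return np.mean(loss(x) * score)

theta = 1.5
print(score_function_grad(theta))   # noisy Monte Carlo estimate
print(2 * theta)                    # exact gradient: E[x^2] = theta^2 + 1, so d/dtheta = 2*theta
```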

## Reparameterization Trick

Suppose samples can be written as

$$
x = g(\theta,\epsilon),
\qquad
\epsilon \sim p(\epsilon),
$$

where the randomness is independent of `θ`.

Then

$$
L(\theta) =
\mathbb{E}_{\epsilon}
[
\ell(g(\theta,\epsilon))
].
$$

Now the expectation is over a fixed distribution. The derivative becomes

$$
\nabla_\theta L =
\mathbb{E}_{\epsilon}
[
\nabla_\theta \ell(g(\theta,\epsilon))
].
$$

This allows ordinary reverse-mode AD through the sampled computation.

This is the reparameterization estimator.

## Gaussian Example

Suppose

$$
x \sim \mathcal{N}(\mu,\sigma^2).
$$

Reparameterize:

$$
x = \mu + \sigma \epsilon,
\qquad
\epsilon \sim \mathcal{N}(0,1).
$$

Then

$$
L(\mu,\sigma) =
\mathbb{E}_\epsilon
[
\ell(\mu+\sigma\epsilon)
].
$$

Gradients become

$$
\nabla_\mu L =
\mathbb{E}
[
\nabla_x \ell(x)
],
$$

and

$$
\nabla_\sigma L =
\mathbb{E}
[
\epsilon \nabla_x \ell(x)
].
$$

The stochasticity is isolated in `ε`. The remaining computation is differentiable.

This estimator usually has much lower variance than the score-function estimator.
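
A minimal numerical check of these two formulas, again with the illustrative loss `ℓ(x) = x²` so that the exact gradients `2μ` and `2σ` are known in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.8
eps = rng.standard_normal(10_000)

# Reparameterized samples and the pathwise derivative of loss(x) = x**2,
# which is d loss / dx = 2 * x.
x = mu + sigma * eps
dloss_dx = 2 * x

grad_mu = np.mean(dloss_dx)           # estimates d/dmu    E[x^2] = 2*mu
grad_sigma = np.mean(eps * dloss_dx)  # estimates d/dsigma E[x^2] = 2*sigma

print(grad_mu, 2 * mu)
print(grad_sigma, 2 * sigma)
```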

## Pathwise Derivatives

The reparameterization trick is also called the pathwise derivative estimator.

The derivative propagates through the sampled path itself:

```text
epsilon -> sample x -> loss
```

The stochastic node becomes a differentiable transformation.

This makes probabilistic programs compatible with ordinary reverse-mode AD systems.

## Comparison of Gradient Estimators

| Estimator | Requires differentiable sample path | Supports discrete variables | Variance |
|---|---|---|---|
| score-function | no | yes | high |
| reparameterization | yes | usually no | lower |
| finite differences | no | yes | very high |
| implicit estimators | partial | partial | moderate |

No estimator is uniformly best.

The choice depends on distribution structure and computational constraints.

## Variance Reduction

Monte Carlo gradient estimators are noisy.

Variance reduction is therefore central in probabilistic differentiation.

### Baselines

Subtract a constant `b`:

$$
\mathbb{E}
[
(\ell(x)-b)\nabla_\theta \log p_\theta(x)
].
$$

The estimator remains unbiased because

$$
\mathbb{E}[\nabla_\theta \log p_\theta(x)] = 0.
$$

A good baseline reduces variance dramatically.
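
A minimal sketch of the effect, using a variant of the earlier Gaussian toy problem; the large constant offset in the loss and the baseline choice are illustrative, picked so the variance reduction is easy to see:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5

def loss(x):
    # A large constant offset mimics objectives whose values sit far from zero,
    # the situation in which baselines help the most.
    return x ** 2 + 100.0

def grad_estimate(baseline, n=1_000):
    """One score-function estimate of d/dtheta E_{x~N(theta,1)}[loss(x)]."""
    x = rng.normal(theta, 1.0, size=n)
    score = x - theta                              # score of N(theta, 1)
    return np.mean((loss(x) - baseline) * score)

# Repeat the estimate many times to compare variances; both versions stay unbiased.
no_baseline = [grad_estimate(baseline=0.0) for _ in range(200)]
with_baseline = [grad_estimate(baseline=loss(theta)) for _ in range(200)]  # simple choice: loss at the mean

print(np.mean(no_baseline), np.mean(with_baseline))   # both near the true gradient 2*theta = 3
print(np.var(no_baseline), np.var(with_baseline))     # the baseline version has far smaller variance
```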

### Control variates

Introduce correlated auxiliary estimators with known expectation.

### Antithetic sampling

Use negatively correlated samples.

### Rao-Blackwellization

Integrate analytically over some variables instead of sampling them.

These methods are essential for practical stochastic gradient estimation.

## Discrete Random Variables

Discrete sampling is difficult because sampled values change discontinuously.

Suppose

$$
x \sim \operatorname{Categorical}(p_\theta).
$$

A tiny parameter perturbation may abruptly change the sampled category.

Ordinary pathwise differentiation fails because:

$$
\frac{\partial x}{\partial \theta}
$$

does not exist in the classical sense.

The score-function estimator still works because it differentiates the probability distribution rather than the sampled value.

## Relaxed Distributions

A common workaround replaces discrete variables with continuous approximations.

For categorical sampling, the Gumbel-Softmax trick uses:

$$
y_i =
\frac{
\exp((\log p_i + g_i)/\tau)
}{
\sum_j
\exp((\log p_j + g_j)/\tau)
},
$$

where:

$$
g_i \sim \operatorname{Gumbel}(0,1).
$$

As temperature `τ` approaches zero, the relaxed sample approaches a one-hot discrete sample.

For finite temperature, the sample remains differentiable.

This allows approximate pathwise differentiation through discrete choices.
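
A minimal sketch of drawing one relaxed sample; the three-way distribution and the temperatures below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax_sample(log_p, tau):
    """Draw a relaxed (continuous) categorical sample via the Gumbel-Softmax trick."""
    # Gumbel(0, 1) noise: g = -log(-log(u)) for u ~ Uniform(0, 1).
    u = rng.uniform(size=log_p.shape)
    g = -np.log(-np.log(u))
    logits = (log_p + g) / tau
    # Numerically stabilized softmax, matching the formula above.
    z = np.exp(logits - logits.max())
    return z / z.sum()

log_p = np.log(np.array([0.2, 0.5, 0.3]))
print(gumbel_softmax_sample(log_p, tau=1.0))   # soft, differentiable sample
print(gumbel_softmax_sample(log_p, tau=0.1))   # nearly one-hot
```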

## Probabilistic Computational Graphs

A probabilistic program can be represented as a graph containing both deterministic and stochastic nodes.

Example:

```text
theta -> z ~ p(z|theta)
z -> x ~ p(x|z)
x -> loss
```

Differentiation propagates through:

| Node type | Gradient rule |
|---|---|
| deterministic node | ordinary chain rule |
| stochastic node | estimator-specific rule |

Modern probabilistic programming systems combine AD with stochastic estimators to differentiate entire probabilistic models.

## Variational Inference

Variational inference optimizes an approximate distribution

$$
q_\phi(z)
$$

to approximate a target posterior.

The evidence lower bound (ELBO) is

$$
\mathcal{L}(\phi) =
\mathbb{E}_{z\sim q_\phi}
[
\log p(x,z)-\log q_\phi(z)
].
$$

Gradients require differentiating expectations over learned distributions.

Reparameterization gradients made deep variational models practical.
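
A minimal single-sample sketch of such an estimate, for an illustrative one-dimensional model with prior `p(z) = N(0, 1)`, likelihood `p(x|z) = N(z, 1)`, and variational family `q_φ(z) = N(m, s²)` (none of these choices come from the text above):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(x, mean, std):
    return -0.5 * np.log(2 * np.pi * std ** 2) - 0.5 * ((x - mean) / std) ** 2

def elbo_estimate(m, s, x_obs):
    """Single-sample reparameterized ELBO for p(z)=N(0,1), p(x|z)=N(z,1), q(z)=N(m,s^2)."""
    eps = rng.standard_normal()
    z = m + s * eps                                             # reparameterized sample from q
    log_joint = log_normal(z, 0.0, 1.0) + log_normal(x_obs, z, 1.0)
    log_q = log_normal(z, m, s)
    return log_joint - log_q

# Because z = m + s*eps is a differentiable function of (m, s), this estimate can be
# pushed through reverse-mode AD; here we only evaluate it.
print(elbo_estimate(m=0.5, s=0.8, x_obs=1.2))
```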

Variational autoencoders are a canonical example.

## Variational Autoencoders

A variational autoencoder defines:

| Component | Role |
|---|---|
| encoder | parameterizes latent distribution |
| latent variable | sampled representation |
| decoder | reconstructs data |

The encoder predicts:

$$
\mu(x),\quad \sigma(x).
$$

A latent sample is drawn using reparameterization:

$$
z=\mu+\sigma\epsilon.
$$

The decoder computes reconstruction loss.

Reverse-mode AD then differentiates the entire stochastic pipeline.

Without reparameterization, efficient training would be much harder.

## Probabilistic Programs

A probabilistic program includes random choices:

```text
z = sample(normal(mu, sigma))
x = sample(decoder(z))
observe(data, x)
```

The program defines a probability distribution over execution traces.

Differentiation may involve:

| Quantity | Meaning |
|---|---|
| log probability | likelihood gradient |
| posterior expectation | inference objective |
| sampled trajectory | simulation sensitivity |

Probabilistic AD systems combine tracing, sampling, and reverse-mode differentiation.
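
As a sketch of the first row of this table, the log probability of one trace of the program above can be accumulated statement by statement; the Gaussian observation model and the `decode` function below are hypothetical stand-ins for `decoder(z)`:

```python
import numpy as np

def log_normal(x, mean, std):
    return -0.5 * np.log(2 * np.pi * std ** 2) - 0.5 * ((x - mean) / std) ** 2

def trace_log_prob(mu, sigma, z, data, decode):
    """Log joint probability of one execution trace of the program above,
    assuming (illustratively) that decoder(z) is a Gaussian with unit noise."""
    log_p = log_normal(z, mu, sigma)            # z = sample(normal(mu, sigma))
    log_p += log_normal(data, decode(z), 1.0)   # observe(data, x) with x ~ N(decode(z), 1)
    return log_p

decode = lambda z: 2.0 * z   # hypothetical decoder, for illustration only
print(trace_log_prob(mu=0.0, sigma=1.0, z=0.3, data=1.1, decode=decode))
```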

## Monte Carlo Differentiation

Suppose

$$
L(\theta)=\mathbb{E}[f_\theta(X)].
$$

Monte Carlo estimates use samples:

$$
L_N(\theta) =
\frac{1}{N}
\sum_{i=1}^N
f_\theta(X_i).
$$

Differentiating gives

$$
\nabla_\theta L_N =
\frac{1}{N}
\sum_i
\nabla_\theta f_\theta(X_i).
$$

This estimator itself becomes random.

Thus optimization uses stochastic gradients of stochastic objectives.

Understanding variance propagation becomes critical.

## Stochastic Differential Equations

Probabilistic dynamics often use stochastic differential equations:

$$
dz = f(z,t)\,dt + g(z,t)\,dW_t,
$$

where `W_t` is Brownian motion.

These systems appear in:

| Domain | Example |
|---|---|
| diffusion models | generative modeling |
| finance | stochastic volatility |
| physics | thermal noise |
| biology | random population dynamics |

Differentiation through SDEs requires handling stochastic integrals and noise-dependent trajectories.
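
A minimal Euler-Maruyama sketch that makes the pathwise view concrete: drawing the Brownian increments up front turns the trajectory into a deterministic, differentiable function of the parameters. The linear drift `-θz` and constant diffusion below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(theta, z0=1.0, T=1.0, n_steps=100, noise=None):
    """Euler-Maruyama for dz = -theta * z dt + 0.1 dW (an illustrative linear SDE)."""
    dt = T / n_steps
    if noise is None:
        noise = rng.standard_normal(n_steps) * np.sqrt(dt)   # Brownian increments
    z = z0
    for dW in noise:
        z = z + (-theta * z) * dt + 0.1 * dW
    return z

# Holding the noise fixed makes z(T) a smooth function of theta, so a pathwise
# derivative (here via central finite differences) is well defined.
noise = rng.standard_normal(100) * np.sqrt(0.01)
h = 1e-5
print((simulate(1.0 + h, noise=noise) - simulate(1.0 - h, noise=noise)) / (2 * h))
```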

## Diffusion Models

Modern generative diffusion models evolve data through noisy dynamics.

Forward diffusion adds noise:

$$
dx = -\frac{1}{2}\beta(t)x\,dt + \sqrt{\beta(t)}\,dW_t.
$$
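
One Euler-Maruyama step of this forward process looks like the following sketch; the constant `β` stands in for a real noise schedule:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_noising_step(x, beta, dt):
    """One Euler-Maruyama step of dx = -0.5*beta*x dt + sqrt(beta) dW."""
    noise = rng.standard_normal(x.shape) * np.sqrt(dt)
    return x - 0.5 * beta * x * dt + np.sqrt(beta) * noise

x = np.array([1.0, -0.5, 2.0])
for _ in range(1000):
    x = forward_noising_step(x, beta=0.5, dt=0.01)
print(x)   # after many steps the data is driven toward approximately N(0, 1) noise
```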

Reverse diffusion learns to invert the stochastic process.

Training involves expectations over noisy trajectories and repeated stochastic sampling.

Probabilistic AD is fundamental to these systems.

## Measure-Theoretic Issues

Differentiating probabilistic systems introduces mathematical subtleties.

Questions include:

| Question | Issue |
|---|---|
| can derivative move inside expectation? | dominated convergence |
| does density exist? | measure regularity |
| is estimator unbiased? | interchange of limits |
| does variance exist? | integrability |

Many practical estimators rely on assumptions that may fail in heavy-tailed or discontinuous systems.

## Stochastic Control Flow

Programs with random branching are especially difficult.

Example:

```text
if sample(bernoulli(p)):
    y = f1(theta)
else:
    y = f2(theta)
```

The execution trace itself becomes stochastic.

The derivative depends on both:

| Component | Effect |
|---|---|
| branch probability | score-function term |
| branch computation | pathwise term |

Hybrid estimators are often required.
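
For the two-branch example above, assuming the branch probability `p` itself depends on `θ`, the combined gradient contains both kinds of terms:

$$
\nabla_\theta \mathbb{E}[y]
= p\,\nabla_\theta f_1(\theta)
+ (1-p)\,\nabla_\theta f_2(\theta)
+ \bigl(f_1(\theta)-f_2(\theta)\bigr)\,\nabla_\theta p.
$$

The first two terms are pathwise contributions from the branch computations; the last is the score-function-style contribution from the branch probability.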

## Gradient Estimator Bias

Some estimators are unbiased:

$$
\mathbb{E}[\hat{g}] = \nabla_\theta L.
$$

Others trade bias for lower variance.

A low-variance biased estimator may outperform a theoretically correct unbiased estimator in optimization.

This creates a central engineering tradeoff:

| Goal | Cost |
|---|---|
| unbiasedness | high variance |
| low variance | possible bias |

Practical probabilistic learning often prefers stable optimization over exact gradient fidelity.

## Probabilistic AD Systems

Modern systems combine:

| Capability | Purpose |
|---|---|
| reverse-mode AD | deterministic differentiation |
| stochastic estimators | random variables |
| trace graphs | probabilistic execution |
| Monte Carlo sampling | expectation approximation |
| symbolic density tracking | log-likelihood computation |

Examples include probabilistic programming frameworks and differentiable simulators.

## Failure Modes

Probabilistic differentiation introduces many instability sources.

### High variance

Gradient estimates may fluctuate wildly.

### Rare-event instability

Extreme samples dominate gradients.

### Discontinuous sampling

Discrete variables create undefined pathwise derivatives.

### Monte Carlo noise

Optimization may become noisy or biased.

### Numerical underflow

Tiny probabilities destabilize log-likelihoods.

### Correlated randomness

Dependent samples complicate variance analysis.

These issues often dominate runtime behavior.

## Conceptual Shift

Classical AD differentiates functions.

Probabilistic AD differentiates distributions, expectations, and stochastic processes.

The chain rule alone is no longer sufficient. Gradient estimation becomes a statistical problem as well as a computational one.

This changes the meaning of differentiation itself.

Instead of asking:

$$
\frac{dy}{d\theta},
$$

we ask:

$$
\nabla_\theta
\mathbb{E}[y].
$$

The derivative becomes an expectation over random trajectories.

## Summary

Probabilistic automatic differentiation extends AD into stochastic systems.

Differentiation may occur through expectations, random variables, Monte Carlo estimators, stochastic differential equations, or probabilistic programs.

The two dominant techniques are:

| Method | Core idea |
|---|---|
| score-function estimators | differentiate probabilities |
| reparameterization estimators | differentiate sampled paths |

These methods make modern probabilistic machine learning practical, including variational inference, stochastic simulation, diffusion models, and probabilistic programming.

The central challenge is no longer only correctness of the chain rule. It is managing variance, bias, stochasticity, and numerical stability while preserving useful gradients through random computation.

