# Probabilistic Programming

Probabilistic programming represents uncertainty using executable probabilistic models. A probabilistic program defines a probability distribution over its variables rather than a single deterministic computation.

Automatic differentiation is important because most modern probabilistic inference methods rely on gradients of probability densities, likelihoods, or variational objectives.

A probabilistic model typically defines:

$$
x \sim p(x \mid \theta),
$$

where:

| Symbol | Meaning |
|---|---|
| $x$ | latent variables |
| $\theta$ | model parameters |
| $p$ | probability distribution |

Given observations $y$, inference computes the posterior:

$$
p(x,\theta \mid y).
$$

AD enables efficient optimization and sampling in high-dimensional probabilistic systems.

### Probabilistic Programs as Computation Graphs

A probabilistic program mixes deterministic computation with random sampling.

Example:

```text
latent variables
    -> deterministic transforms
    -> likelihood computation
    -> log probability
    -> inference objective
```

The deterministic parts are ordinary differentiable programs. The probabilistic parts introduce distributions, expectations, and stochasticity.

Inference algorithms often differentiate quantities such as:

$$
\log p(y \mid \theta),
$$

$$
\log p(x,\theta,y),
$$

or variational objectives.

### Log Probability and Gradients

Most gradient-based inference methods work with log densities.

Suppose

$$
p(x \mid y,\theta)
\propto
p(y \mid x,\theta)p(x \mid \theta).
$$

Then the log posterior is

$$
\log p(x \mid y,\theta) =
\log p(y \mid x,\theta)
+
\log p(x \mid \theta) -
\log Z.
$$

The normalization constant $Z$ is usually ignored during gradient computation because it does not depend on $x$.

AD computes gradients such as

$$
\nabla_x \log p(x \mid y,\theta).
$$

These gradients drive modern inference algorithms.
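
As a concrete sketch, the snippet below uses JAX (one AD framework among several) to differentiate an unnormalized log posterior; the one-dimensional conjugate Gaussian model and the observations are assumptions chosen purely for illustration.

```python
import jax.numpy as jnp
from jax import grad

y = jnp.array([0.8, 1.2, 0.5])                # hypothetical observations

def log_posterior(x):
    # Unnormalized log posterior log p(y | x) + log p(x);
    # the constant log Z is dropped since it does not depend on x.
    log_prior = -0.5 * x**2                   # prior x ~ N(0, 1)
    log_lik = -0.5 * jnp.sum((y - x) ** 2)    # likelihood y_i ~ N(x, 1)
    return log_prior + log_lik

grad_log_posterior = grad(log_posterior)
print(grad_log_posterior(0.0))                # nabla_x log p(x | y) at x = 0
```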

### Hamiltonian Monte Carlo

Hamiltonian Monte Carlo augments the latent variables $x$ with auxiliary momentum variables (the symbol $p$ is reused for momentum here, following standard HMC notation):

$$
(x,p).
$$

The Hamiltonian is

$$
H(x,p) =
U(x)+K(p),
$$

where

$$
U(x)=-\log p(x \mid y),
$$

and

$$
K(p)=\frac{1}{2}p^\top M^{-1}p.
$$

The dynamics are

$$
\frac{dx}{dt}=\frac{\partial H}{\partial p},
\qquad
\frac{dp}{dt}=-\frac{\partial H}{\partial x}.
$$

AD computes

$$
\nabla_x U(x),
$$

which is the gradient of the negative log posterior.

Without AD, deriving gradients for large probabilistic models would be extremely tedious.
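
A minimal leapfrog integrator shows where $\nabla_x U$ enters. The sketch below assumes $M = I$ and a user-supplied potential `U`; it illustrates the integrator only, without the Metropolis accept/reject step of a full HMC implementation.

```python
import jax.numpy as jnp
from jax import grad

def leapfrog(x, p, U, step_size, n_steps):
    # Simulate Hamiltonian dynamics, with AD supplying nabla_x U.
    grad_U = grad(U)
    p = p - 0.5 * step_size * grad_U(x)       # initial half step for momentum
    for _ in range(n_steps - 1):
        x = x + step_size * p                 # full position step (M = I)
        p = p - step_size * grad_U(x)         # full momentum step
    x = x + step_size * p
    p = p - 0.5 * step_size * grad_U(x)       # final half step for momentum
    return x, -p                              # momentum flip keeps reversibility

# Standard normal target: U(x) = -log p(x) = x^2 / 2 up to a constant.
x_new, p_new = leapfrog(jnp.array(1.0), jnp.array(0.5),
                        lambda x: 0.5 * x**2, step_size=0.1, n_steps=20)
```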

### Variational Inference

Variational inference approximates the posterior using a simpler distribution

$$
q_\phi(x).
$$

The objective is usually the evidence lower bound:

$$
\mathcal{L}(\phi) =
\mathbb{E}_{q_\phi(x)}
\left[
\log p(x,y)-\log q_\phi(x)
\right].
$$

The goal is

$$
\max_\phi \mathcal{L}(\phi).
$$

AD computes gradients of this objective with respect to variational parameters.

### Reparameterization Trick

Direct differentiation through random sampling is difficult because a sample is not expressed as a deterministic function of the parameters.

Suppose

$$
x \sim q_\phi(x).
$$

Instead of sampling $x$ directly, we write

$$
x = g(\epsilon,\phi),
\qquad
\epsilon \sim p(\epsilon),
$$

where the randomness is separated from the parameters.

For a Gaussian:

$$
x = \mu + \sigma \epsilon,
\qquad
\epsilon \sim \mathcal{N}(0,1).
$$

Now expectations become

$$
\mathbb{E}_{q_\phi(x)}[f(x)] =
\mathbb{E}_{p(\epsilon)}
[f(g(\epsilon,\phi))].
$$

AD can differentiate through $g$.

Where it applies, this typically produces lower-variance gradient estimators than score-function methods.
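
A minimal sketch of the resulting estimator, assuming a one-dimensional Gaussian variational family $q_\phi = \mathcal{N}(\mu, \sigma^2)$ and a toy log joint chosen only for illustration:

```python
import jax
import jax.numpy as jnp

def log_joint(x):
    # Hypothetical model: standard normal prior and one Gaussian
    # observation y = 1.3 with unit variance.
    return -0.5 * x**2 - 0.5 * (1.3 - x) ** 2

def elbo(phi, eps):
    mu, log_sigma = phi
    sigma = jnp.exp(log_sigma)
    x = mu + sigma * eps                      # reparameterized sample g(eps, phi)
    log_q = (-0.5 * ((x - mu) / sigma) ** 2
             - log_sigma - 0.5 * jnp.log(2 * jnp.pi))
    return jnp.mean(log_joint(x) - log_q)     # Monte Carlo ELBO estimate

key = jax.random.PRNGKey(0)
eps = jax.random.normal(key, (256,))          # eps ~ N(0, 1)
phi = jnp.array([0.0, 0.0])                   # (mu, log_sigma)
print(jax.grad(elbo)(phi, eps))               # gradient wrt phi through g
```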

### Score-Function Estimators

Some distributions cannot be reparameterized easily.

In that case, we use the identity

$$
\nabla_\phi
\mathbb{E}_{q_\phi(x)}[f(x)] =
\mathbb{E}_{q_\phi(x)}
\left[
f(x)\nabla_\phi \log q_\phi(x)
\right].
$$

This is the score-function estimator, also called REINFORCE.

It works for discrete variables and non-reparameterizable distributions, but it often has high variance.

| Method | Strength | Weakness |
|---|---|---|
| Reparameterization | Low variance | Requires differentiable transform |
| Score-function estimator | General | High variance |
| Hybrid methods | Flexible | More complex |

AD computes the deterministic derivatives inside these estimators.
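
One concrete way to combine AD with this identity is to differentiate a surrogate objective whose gradient coincides with the score-function estimate. A hedged sketch, assuming a Bernoulli latent variable and an arbitrary downstream cost `f`:

```python
import jax
import jax.numpy as jnp

def f(z):
    return (z - 0.7) ** 2                     # arbitrary downstream cost

def surrogate(theta, key, n=4096):
    # grad of E[stop_grad(f(z)) * log q(z)] equals the REINFORCE
    # estimate of grad E[f(z)]; no gradient flows through the samples.
    p = jax.nn.sigmoid(theta)                 # Bernoulli success probability
    z = jax.random.bernoulli(key, p, (n,)).astype(jnp.float32)
    log_q = z * jnp.log(p) + (1 - z) * jnp.log1p(-p)
    return jnp.mean(jax.lax.stop_gradient(f(z)) * log_q)

key = jax.random.PRNGKey(0)
print(jax.grad(surrogate)(0.0, key))          # estimate of d E[f] / d theta
```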

### Stochastic Computation Graphs

Probabilistic programs are stochastic computation graphs.

Nodes may represent:

| Node type | Meaning |
|---|---|
| Deterministic node | Ordinary differentiable computation |
| Random node | Sampling operation |
| Observation node | Conditioning on data |
| Objective node | Log probability or reward |

Gradient propagation depends on node type.

Deterministic paths use ordinary chain rule differentiation. Random paths require estimator identities.

### Discrete Random Variables

Discrete latent variables create differentiation problems because sampling is discontinuous.

Example:

$$
z \sim \operatorname{Categorical}(\pi).
$$

The sampled value changes discontinuously as $\pi$ changes.

Approaches include:

| Method | Idea |
|---|---|
| Score-function estimator | Differentiate probability, not sample |
| Gumbel-softmax | Continuous relaxation |
| Marginalization | Sum exactly over states |
| Straight-through estimator | Approximate backward pass |

The Gumbel-softmax relaxation replaces hard categories with differentiable soft assignments:

$$
y_i =
\frac{
\exp((\log \pi_i + g_i)/\tau)
}{
\sum_j \exp((\log \pi_j + g_j)/\tau)
}.
$$

As the temperature $\tau$ decreases toward zero, samples approach one-hot vectors and the relaxation recovers discrete behavior.
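
A minimal sampler for this relaxation, assuming hypothetical class probabilities $\pi$ and a temperature $\tau$:

```python
import jax
import jax.numpy as jnp

def gumbel_softmax(key, log_pi, tau):
    # Perturb the logits with Gumbel(0, 1) noise, then soften the argmax.
    g = jax.random.gumbel(key, log_pi.shape)
    return jax.nn.softmax((log_pi + g) / tau)

key = jax.random.PRNGKey(0)
log_pi = jnp.log(jnp.array([0.2, 0.5, 0.3]))  # hypothetical class probabilities
print(gumbel_softmax(key, log_pi, tau=0.5))   # differentiable soft one-hot sample
```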

### Probabilistic Graphical Models

Many probabilistic systems are structured graphical models.

Examples include:

- Bayesian networks,
- hidden Markov models,
- factor graphs,
- state-space models,
- probabilistic circuits.

Inference often reduces to repeated local message computations.

AD is useful for:

- learning parameters,
- differentiable message passing,
- variational optimization,
- differentiable filtering and smoothing.

### State-Space Models

A probabilistic state-space model evolves latent states:

$$
x_{t+1}\sim p(x_{t+1}\mid x_t),
$$

$$
y_t\sim p(y_t\mid x_t).
$$

Inference computes posterior trajectories:

$$
p(x_{0:T}\mid y_{0:T}).
$$

Differentiable filtering and smoothing algorithms allow gradient-based learning of:

- transition dynamics,
- observation models,
- noise covariances,
- control parameters.

Kalman filters are differentiable because they consist of standard differentiable linear algebra operations. Particle filters are more difficult because resampling is discrete.
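
As an illustration, the sketch below evaluates the log-likelihood of a scalar linear-Gaussian state-space model with a Kalman filter and differentiates it with respect to the transition coefficient `a`; the noise variances and the data are assumptions for the example.

```python
import jax
import jax.numpy as jnp

def kalman_loglik(a, ys, q=0.1, r=0.1):
    # Scalar model: x_{t+1} = a x_t + N(0, q), y_t = x_t + N(0, r).
    def step(carry, y):
        m, P, ll = carry
        m_pred, P_pred = a * m, a * a * P + q               # predict
        S = P_pred + r                                      # innovation variance
        ll = ll - 0.5 * (jnp.log(2 * jnp.pi * S) + (y - m_pred) ** 2 / S)
        K = P_pred / S                                      # Kalman gain
        return (m_pred + K * (y - m_pred), (1 - K) * P_pred, ll), None

    init = (jnp.array(0.0), jnp.array(1.0), jnp.array(0.0))  # mean, var, log-lik
    (_, _, ll), _ = jax.lax.scan(step, init, ys)
    return ll

ys = jnp.array([0.5, 0.4, 0.6, 0.3])                        # hypothetical data
print(jax.grad(kalman_loglik)(0.9, ys))                     # d log-lik / d a
```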

### Particle Filters

Particle filters approximate distributions using weighted samples.

A resampling step selects particles according to weights:

$$
w_i \propto p(y_t \mid x_i).
$$

Resampling is discrete and therefore non-differentiable in the usual sense.

Differentiable particle filtering methods use:

| Method | Idea |
|---|---|
| Soft resampling | Smooth weight redistribution |
| Relaxed categorical sampling | Continuous approximation |
| Gradient estimators | Differentiate expectation |
| Deterministic transport | Replace stochastic resampling |

This remains an active research area.
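
As one example of the soft-resampling idea, the sketch below samples from a mixture of the particle weights and a uniform distribution, then applies an importance correction so a gradient path through the weights survives resampling. This follows one published variant; details differ across methods.

```python
import jax
import jax.numpy as jnp

def soft_resample(key, particles, weights, alpha=0.5):
    # Sample indices from a mixture of `weights` and a uniform,
    # then reweight so the estimator stays unbiased and the new
    # weights remain a differentiable function of the old ones.
    n = weights.shape[0]
    mix = alpha * weights + (1 - alpha) / n
    idx = jax.random.categorical(key, jnp.log(mix), shape=(n,))
    new_w = weights[idx] / mix[idx]           # importance correction
    return particles[idx], new_w / jnp.sum(new_w)
```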

### Bayesian Neural Networks

A Bayesian neural network places distributions over weights:

$$
w \sim p(w).
$$

Predictions marginalize over weights:

$$
p(y \mid x) =
\int p(y \mid x,w)p(w)\,dw.
$$

Variational inference approximates the posterior over weights.

AD computes gradients through:

- neural network computations,
- variational objectives,
- stochastic parameter samples.

This combines probabilistic inference with deep learning infrastructure.

### Normalizing Flows

A normalizing flow transforms a simple distribution into a complex one using invertible mappings.

Suppose

$$
z_0 \sim p_0(z),
$$

and

$$
z_K = f_K \circ \cdots \circ f_1(z_0).
$$

The density becomes

$$
\log p(z_K) =
\log p_0(z_0) -
\sum_k
\log
\left|
\det
\frac{\partial f_k}{\partial z_{k-1}}
\right|.
$$

AD computes:

- Jacobian determinants,
- parameter gradients,
- inverse transform sensitivities.

Flows are fundamentally differentiable probabilistic systems.
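
A single affine layer makes the change-of-variables bookkeeping concrete. The sketch below assumes hypothetical parameters $(s, b)$ and the forward map $z_K = e^s z_0 + b$, for which $\log|\det \partial f/\partial z_0| = s$.

```python
import jax
import jax.numpy as jnp

def flow_logpdf(params, z_k):
    s, b = params
    z0 = (z_k - b) * jnp.exp(-s)              # invert the flow
    log_p0 = -0.5 * z0**2 - 0.5 * jnp.log(2 * jnp.pi)  # base density N(0, 1)
    return log_p0 - s                         # change-of-variables correction

params = jnp.array([0.1, 0.0])                # hypothetical (s, b)
print(jax.grad(flow_logpdf)(params, 1.5))     # gradient wrt flow parameters
```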

### Probabilistic Programs with Control Flow

Probabilistic programs may contain loops, branching, and recursion.

Example:

```text
sample latent state
if condition:
    sample another variable
else:
    apply deterministic transform
accumulate log probability
```

Control flow creates challenges:

| Issue | Consequence |
|---|---|
| Dynamic graph structure | Variable execution paths |
| Discrete branching | Piecewise derivatives |
| Variable-dimensional latent space | Complex gradient semantics |
| Recursive stochastic structure | Dynamic memory requirements |

Modern probabilistic programming systems often use tracing or graph capture to represent execution.

### Log-Density Stability

Probabilistic inference is numerically sensitive.

Probabilities may underflow:

$$
p(x)\approx 10^{-300}.
$$

Therefore computations are usually performed in log space.

AD systems must preserve stable numerical forms such as:

$$
\log \sum_i e^{x_i}.
$$

Direct exponentiation and summation can produce catastrophic overflow or underflow.

Stable primitives should expose stable derivatives.
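
For reference, the standard max-shifted form of log-sum-exp is a one-line fix; the reimplementation below is for illustration only, since libraries such as JAX already provide it as `jax.scipy.special.logsumexp`.

```python
import jax.numpy as jnp

def logsumexp(x):
    # Shifting by the maximum keeps every exponent non-positive,
    # so exp never overflows and at least one term equals 1.
    m = jnp.max(x)
    return m + jnp.log(jnp.sum(jnp.exp(x - m)))

print(logsumexp(jnp.array([1000.0, 1000.5])))  # naive exp(1000) would overflow
```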

### Automatic Differentiation Variational Inference

Automatic differentiation variational inference, or ADVI, treats inference generically:

1. transform latent variables to unconstrained space,
2. define variational distribution,
3. estimate ELBO gradients,
4. optimize with stochastic gradient methods.

The key idea is that inference itself becomes a differentiable optimization problem.

The probabilistic model no longer needs manually derived updates.
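
A minimal ADVI-style loop, assuming an `elbo(phi, eps)` function like the reparameterized sketch earlier; the helper `advi` and its hyperparameters are illustrative, and plain gradient ascent stands in for the adaptive stochastic optimizers real systems use.

```python
import jax

def advi(elbo, phi, key, steps=500, lr=0.05, n_samples=64):
    # Stochastic gradient ascent on the ELBO with fresh noise per step.
    grad_elbo = jax.grad(elbo)
    for _ in range(steps):
        key, sub = jax.random.split(key)
        eps = jax.random.normal(sub, (n_samples,))
        phi = phi + lr * grad_elbo(phi, eps)
    return phi
```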

### Constraints and Transformations

Latent variables may have constraints:

| Variable type | Constraint |
|---|---|
| Variance | positive |
| Probability vector | nonnegative, sums to one |
| Covariance matrix | positive definite |
| Correlation | bounded |

Inference systems transform constrained variables into unconstrained coordinates.

Examples:

$$
x = \exp(z),
$$

$$
p_i =
\frac{\exp(z_i)}{\sum_j \exp(z_j)}.
$$

AD differentiates through these transforms automatically.
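
A sketch of the exponential transform with its Jacobian correction, assuming a positive variance parameter and a toy Gaussian likelihood; the correction term $\log|dx/dz| = z$ keeps the density consistent in unconstrained coordinates.

```python
import jax
import jax.numpy as jnp

def log_density_unconstrained(z, y):
    x = jnp.exp(z)                            # variance x > 0 by construction
    n = y.size
    log_p = -0.5 * jnp.sum(y**2) / x - 0.5 * n * jnp.log(2 * jnp.pi * x)
    return log_p + z                          # + log|dx/dz| Jacobian term

y = jnp.array([0.3, -0.8, 0.5])               # hypothetical observations
print(jax.grad(log_density_unconstrained)(0.0, y))  # gradient in z-space
```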

### Sparse and Structured Probabilistic Systems

Large probabilistic systems often have sparse dependence graphs.

Examples:

- sparse Gaussian processes,
- factor graphs,
- probabilistic graphical models,
- structured variational families.

Efficient differentiation exploits this structure.

Dense Jacobian construction is usually infeasible.

### Failure Modes

Differentiable probabilistic systems fail in characteristic ways.

| Failure mode | Cause |
|---|---|
| Exploding gradients | Sharp posterior geometry |
| Vanishing gradients | Saturated likelihoods |
| High-variance estimators | Stochastic gradient noise |
| Unstable ELBO optimization | Poor variational family |
| NaNs in log probability | Underflow or invalid parameters |
| Broken inference | Discrete non-differentiable operations |
| Memory explosion | Reverse mode through long stochastic traces |

Gradient correctness alone does not guarantee good inference behavior.

### Practical Architecture

A robust differentiable probabilistic system typically separates:

| Layer | Responsibility |
|---|---|
| Random variable representation | Sampling and constraints |
| Deterministic transforms | Differentiable computation |
| Log-density accumulation | Stable probability evaluation |
| Inference objective | ELBO, posterior log density, likelihood |
| Gradient estimator | Reparameterization or score-function |
| Optimizer or sampler | Variational optimization or MCMC |

Each layer needs carefully defined derivative semantics.

### Summary

Probabilistic programming turns inference into a differentiable computation problem. Automatic differentiation supplies gradients for posterior densities, variational objectives, stochastic simulators, and probabilistic transformations.

The main challenges are stochastic sampling, discrete variables, estimator variance, dynamic control flow, and numerical stability in probability computations. Effective systems combine AD with reparameterization, stable log-density algebra, structured inference algorithms, and carefully designed gradient estimators.

