
Probabilistic Circuits


Probabilistic circuits are tractable probabilistic models built from simple computational graphs. They represent probability distributions using a network of sum, product, and leaf nodes. The main goal is to keep probabilistic inference efficient while still allowing expressive models.

They are also known under related names such as sum-product networks, arithmetic circuits, and tractable probabilistic models. The exact terminology depends on the architecture and the literature, but the central idea is the same: represent a probability distribution as a circuit whose structure makes inference computationally tractable.

Probabilistic circuits occupy a different point in the generative modeling landscape from VAEs, GANs, flows, and diffusion models. They usually trade some neural-network flexibility for exact or efficient probabilistic computation.

Motivation

Many probabilistic models define useful distributions but make inference expensive. For example, in a general latent-variable model, computing a marginal probability may require summing or integrating over many hidden configurations.

A probabilistic circuit is designed so that common operations are efficient:

Operation                    Meaning
Marginal inference           Compute probability while leaving some variables unobserved
Conditional inference        Compute probability after observing evidence
Sampling                     Generate data from the model
Most probable explanation    Find a high-probability assignment
Likelihood evaluation        Compute p(x) exactly or efficiently

The price is structural discipline. The circuit must obey certain constraints, such as decomposability and smoothness, so that these operations remain tractable.

Variables and Distributions

Let

X = (X_1, X_2, \ldots, X_d)

be a collection of random variables. A probabilistic circuit defines a joint distribution

p(X_1, X_2, \ldots, X_d).

For one observed example

x = (x_1, x_2, \ldots, x_d),

the circuit computes a scalar value

p(x).

Unlike an energy-based model, this value is usually normalized by construction. Unlike a neural autoregressive model, it does not have to factor variables in a fixed left-to-right order.

Circuit Structure

A probabilistic circuit is a directed acyclic graph. It usually contains three kinds of nodes.

Node type       Function
Leaf node       Represents a simple distribution over one or more variables
Product node    Multiplies child distributions
Sum node        Forms a weighted mixture of child distributions

The output node computes the probability of the full observation.

A simple circuit may compute

p(x_1,x_2) = 0.6\,p_1(x_1)p_2(x_2) + 0.4\,p_3(x_1)p_4(x_2).

This expression describes a mixture of two factorized distributions. The first mixture component has weight 0.6, and the second has weight 0.4.
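
As a quick check, this mixture can be evaluated directly. The sketch below assumes the four leaves are Bernoulli distributions with made-up parameters and computes p(x_1 = 1, x_2 = 0).

# Minimal numeric check of the two-component mixture above.
# The leaf parameters (0.9, 0.2, 0.3, 0.7) are hypothetical.

def bernoulli(pi: float, x: int) -> float:
    """p(X = x) for a Bernoulli leaf with success probability pi."""
    return pi if x == 1 else 1.0 - pi

x1, x2 = 1, 0
p = 0.6 * bernoulli(0.9, x1) * bernoulli(0.2, x2) \
  + 0.4 * bernoulli(0.3, x1) * bernoulli(0.7, x2)
print(p)  # 0.6 * 0.9 * 0.8 + 0.4 * 0.3 * 0.3 = 0.468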

Leaf Nodes

Leaf nodes define simple local distributions. Examples include:

Variable type         Possible leaf distribution
Binary                Bernoulli
Categorical           Categorical distribution
Continuous            Gaussian
Count-valued          Poisson
Bounded continuous    Beta distribution

For a binary variable X_i, a Bernoulli leaf may compute

p(X_i = x_i) = \pi_i^{x_i}(1-\pi_i)^{1-x_i}.

For a continuous variable, a Gaussian leaf may compute

p(X_i=x_i) = \mathcal{N}(x_i;\mu_i,\sigma_i^2).

Leaf nodes are intentionally simple. Expressiveness comes from composing many simple leaves through sums and products.
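
As a small illustration, both leaf densities above can be evaluated with torch.distributions; the parameter values here (0.7 for the Bernoulli, mean 0 and standard deviation 1 for the Gaussian) are placeholders.

import torch
from torch.distributions import Bernoulli, Normal

# Bernoulli leaf for a binary X_i with pi_i = 0.7.
bernoulli_leaf = Bernoulli(probs=torch.tensor(0.7))
log_p_bin = bernoulli_leaf.log_prob(torch.tensor(1.0))   # log pi_i

# Gaussian leaf for a continuous X_i with mu_i = 0, sigma_i = 1.
gaussian_leaf = Normal(loc=torch.tensor(0.0), scale=torch.tensor(1.0))
log_p_cont = gaussian_leaf.log_prob(torch.tensor(0.5))   # log N(0.5; 0, 1)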

Product Nodes

A product node multiplies the values of its children:

S(x) = \prod_{j=1}^{m} S_j(x).

Product nodes represent factorization. If two child nodes depend on disjoint sets of variables, their product defines a joint distribution over the union of those variables.

For example, if one child models X_1 and another child models X_2, the product node computes

p(X_1=x_1)p(X_2=x_2).

This is valid when the circuit treats those variables as independent within that substructure.

Sum Nodes

A sum node computes a weighted sum of child values:

S(x) = \sum_{j=1}^{m} w_j S_j(x),

where

w_j \ge 0, \quad \sum_{j=1}^{m} w_j = 1.

Sum nodes represent mixtures. Each child corresponds to one alternative explanation or component.

A sum node can be interpreted as introducing a latent categorical variable. If the sum node has three children, the latent variable chooses one of three submodels.

Thus, sum nodes give probabilistic circuits latent-variable expressiveness without requiring expensive general inference.
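
Written out, this latent-variable reading of a sum node is

S(x) = \sum_{j=1}^{m} p(Z = j)\, S_j(x), \qquad p(Z = j) = w_j.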

Scope

The scope of a node is the set of variables that appear below that node.

For example:

  • a leaf over X_1 has scope \{X_1\},
  • a product of leaves over X_1 and X_2 has scope \{X_1, X_2\},
  • a sum over two children that both model X_1, X_2 has scope \{X_1, X_2\}.

Scope is central because tractability depends on how scopes combine through the circuit.

Decomposability

A product node is decomposable if its children have disjoint scopes.

For a product node with children S_1, \ldots, S_m, decomposability requires

\operatorname{scope}(S_i) \cap \operatorname{scope}(S_j) = \varnothing \quad \text{for } i \neq j.

This prevents the same variable from being multiplied against itself through different children.

For example, this product is decomposable:

p(X_1)p(X_2).

This product is not decomposable:

p_1(X_1)p_2(X_1).

Decomposability is one of the main reasons probabilistic circuits support efficient marginal inference.

Smoothness

A sum node is smooth if all of its children have the same scope.

For a sum node with children S_1, \ldots, S_m, smoothness requires

\operatorname{scope}(S_1) = \operatorname{scope}(S_2) = \cdots = \operatorname{scope}(S_m).

This ensures that the sum mixes distributions over the same variables.

For example, this sum is smooth:

0.5\,p_1(X_1,X_2) + 0.5\,p_2(X_1,X_2).

This sum is not smooth:

0.5\,p_1(X_1) + 0.5\,p_2(X_1,X_2).

Smoothness makes marginalization and likelihood evaluation well-defined throughout the circuit.
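
Both conditions can be checked mechanically from node scopes. Below is a minimal sketch; the dict-based node representation (keys type, scope, children) is an assumption made here for illustration, not a standard API.

# Structural checks, assuming each node is a dict with keys "type"
# ("leaf", "product", or "sum"), "scope" (a frozenset of variable indices,
# for leaves), and "children" (a list of child nodes). This node format
# is hypothetical and chosen only for this example.

def scope(node) -> frozenset:
    if node["type"] == "leaf":
        return node["scope"]
    return frozenset().union(*(scope(c) for c in node["children"]))

def is_decomposable(node) -> bool:
    """Every product node has children with pairwise disjoint scopes."""
    if node["type"] == "product":
        seen = set()
        for child in node["children"]:
            s = scope(child)
            if seen & s:
                return False
            seen |= s
    return all(is_decomposable(c) for c in node.get("children", []))

def is_smooth(node) -> bool:
    """Every sum node has children with identical scopes."""
    if node["type"] == "sum":
        scopes = {scope(c) for c in node["children"]}
        if len(scopes) > 1:
            return False
    return all(is_smooth(c) for c in node.get("children", []))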

Normalization

If the leaves are normalized distributions, sum weights are nonnegative and sum to one, product nodes are decomposable, and sum nodes are smooth, then the whole circuit defines a normalized probability distribution.

This is the key advantage of probabilistic circuits.

Many neural generative models define expressive functions but make exact normalization hard. Probabilistic circuits enforce structure so normalization can be maintained locally and composed globally.

Marginal Inference

Suppose we observe only some variables. Let

X = (X_{\text{obs}}, X_{\text{miss}}).

We may want

p(X_{\text{obs}} = x_{\text{obs}}) = \sum_{x_{\text{miss}}} p(x_{\text{obs}}, x_{\text{miss}}).

In a probabilistic circuit, this can often be computed by replacing missing-variable leaves with 1 and evaluating the circuit normally.

For example, if a Bernoulli leaf for X_i is unobserved, then

\sum_{x_i \in \{0,1\}} p(X_i=x_i)=1.

The rest of the circuit propagates this marginalization efficiently.

This gives probabilistic circuits a major advantage for missing-data problems.
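
One minimal way to realize this for a Bernoulli leaf is to pass a mask of observed entries, so that unobserved variables contribute log 1 = 0 to the log-probability. This is a sketch of one possible convention, not the only one.

import torch
from torch import nn

class MarginalizingBernoulliLeaf(nn.Module):
    """Bernoulli leaf that sums out any variable marked as unobserved."""

    def __init__(self, dim: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor, observed: torch.Tensor) -> torch.Tensor:
        # observed is a boolean (or 0/1) mask with the same shape as x.
        probs = torch.sigmoid(self.logits)
        log_p = x * torch.log(probs + 1e-8) + (1 - x) * torch.log(1 - probs + 1e-8)
        # Unobserved entries contribute log 1 = 0, i.e. they are marginalized out.
        return (observed * log_p).sum(dim=-1)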

Conditional Inference

Conditional probabilities can be computed from joint and marginal probabilities:

p(x_A \mid x_B) = \frac{p(x_A, x_B)}{p(x_B)}.

Because probabilistic circuits can compute both numerator and denominator efficiently, conditional inference is usually tractable.
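
Concretely, both terms come from the same circuit: one evaluation with x_A and x_B observed, and one with only x_B observed. A minimal sketch, assuming a circuit that accepts an observation mask as in the marginalization example above:

def conditional_log_prob(circuit, x, mask_A, mask_B):
    # log p(x_A | x_B) = log p(x_A, x_B) - log p(x_B).
    # mask_A and mask_B are boolean masks selecting the entries of x
    # that belong to x_A and x_B respectively.
    log_joint = circuit(x, mask_A | mask_B)   # numerator: both observed
    log_marginal = circuit(x, mask_B)         # denominator: only x_B observed
    return log_joint - log_marginal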

This is useful for:

  • imputation,
  • classification,
  • anomaly detection,
  • structured prediction,
  • uncertainty-aware systems.

Sampling

Sampling proceeds recursively.

At a leaf node, sample from the leaf distribution.

At a product node, sample independently from all children.

At a sum node, choose one child according to the mixture weights, then sample from that child.

This gives an efficient ancestral sampling procedure.

Unlike MCMC sampling in energy-based models, sampling from many probabilistic circuits does not require a long Markov chain.
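
A minimal recursive sampler along these lines, using the same hypothetical dict-based node format as the structural checks above and assuming Bernoulli leaves:

import random

def sample(node) -> dict:
    """Ancestral sampling; returns a {variable index: value} assignment."""
    if node["type"] == "leaf":
        # Assume a Bernoulli leaf over one variable, with success
        # probability stored under the hypothetical key node["pi"].
        (i,) = node["scope"]
        return {i: 1 if random.random() < node["pi"] else 0}
    if node["type"] == "product":
        # Children have disjoint scopes, so their samples can be merged.
        out = {}
        for child in node["children"]:
            out.update(sample(child))
        return out
    # Sum node: pick one child according to the mixture weights,
    # stored under the hypothetical key node["weights"].
    child = random.choices(node["children"], weights=node["weights"], k=1)[0]
    return sample(child)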

Learning Parameters

The parameters of a probabilistic circuit include:

  • leaf distribution parameters,
  • sum-node weights,
  • sometimes structure parameters.

If the structure is fixed, parameter learning is often performed by maximum likelihood.

Given data

\mathcal{D} = \{x^{(1)}, \ldots, x^{(N)}\},

we maximize

\sum_{n=1}^{N} \log p_\theta(x^{(n)}).

Gradient-based optimization can be used, and for some circuit classes expectation-maximization is also possible.

Learning Structure

Learning the structure of a probabilistic circuit is harder than learning parameters.

Structure learning decides:

  • which variables are grouped together,
  • where products are placed,
  • where mixtures are placed,
  • how many components are used,
  • what leaf distributions appear.

Several approaches exist:

Approach                    Description
Top-down partitioning       Recursively split variables and data
Clustering-based methods    Use data clusters to define sum nodes
Random structures           Sample valid structures and train parameters
Neural parameterization     Use neural networks inside or around the circuit
Hybrid models               Combine circuits with deep feature extractors

The structure determines the tradeoff between expressiveness and tractability.

Sum-Product Networks

Sum-product networks, or SPNs, are a widely studied form of probabilistic circuit.

An SPN is a directed acyclic graph with sum nodes, product nodes, and tractable leaves. Under the right structural conditions, SPNs support efficient exact inference.

SPNs became influential because they showed that deep probabilistic models could maintain tractable inference through architectural constraints.

They can be seen as deep mixture models with repeated factorization and recombination.

Arithmetic Circuits

Arithmetic circuits are closely related. They represent probability computations as arithmetic expressions.

For example, a Bayesian network can sometimes be compiled into an arithmetic circuit. Once compiled, repeated inference queries can be answered efficiently.

This gives a useful distinction:

Model type               Main role
Bayesian network         Human-readable graphical model
Arithmetic circuit       Efficient compiled inference form
Probabilistic circuit    Learnable tractable distribution model

The boundaries between these categories are not always strict.

Neural Probabilistic Circuits

Modern work sometimes combines probabilistic circuits with neural networks.

A neural network may produce:

  • leaf parameters,
  • mixture weights,
  • embeddings used by the circuit,
  • conditional distributions.

This hybrid approach tries to combine:

  • representation power from neural networks,
  • tractable inference from circuits.

For example, a CNN may encode an image into features, and a probabilistic circuit may model uncertainty over structured labels.
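
As a sketch of that pattern, the module below maps a feature vector (for example, a CNN embedding) to the parameters of a small mixture of factorized Bernoulli components. The layer names and sizes are illustrative, not a standard architecture.

import torch
import torch.nn.functional as F
from torch import nn

class ConditionalCircuitHead(nn.Module):
    """Neural network producing circuit parameters: mixture weights and leaf logits."""

    def __init__(self, feat_dim: int, n_components: int, n_vars: int):
        super().__init__()
        self.weight_net = nn.Linear(feat_dim, n_components)
        self.leaf_net = nn.Linear(feat_dim, n_components * n_vars)
        self.n_components = n_components
        self.n_vars = n_vars

    def forward(self, features: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # Mixture weights of a single sum node, produced by the network.
        log_w = torch.log_softmax(self.weight_net(features), dim=-1)        # (B, K)
        # Bernoulli logits for each component and each variable.
        logits = self.leaf_net(features).view(-1, self.n_components, self.n_vars)
        x_ = x.unsqueeze(1)                                                  # (B, 1, V)
        # Per-variable Bernoulli log-likelihood, written with softplus for stability.
        log_leaf = -x_ * F.softplus(-logits) - (1 - x_) * F.softplus(logits)
        comp_log_p = log_leaf.sum(dim=-1)                                    # (B, K)
        return torch.logsumexp(log_w + comp_log_p, dim=-1)                   # (B,)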

Comparison with Flow-Based Models

Probabilistic circuits                       Flow-based models
Tractability through graph constraints       Tractability through invertibility
Efficient marginal inference                 Exact likelihood and inverse mapping
Handles missing variables naturally          Usually expects full observations
Often discrete or mixed-variable friendly    Strong for continuous high-dimensional data
Structure learning is important              Architecture design is important

Both families aim for tractable probability computation, but they achieve it through different structural principles.

Comparison with Energy-Based Models

Probabilistic circuits                    Energy-based models
Usually normalized by construction        Often unnormalized
Efficient marginal inference              Marginal inference often hard
Sampling can be direct                    Sampling often MCMC-based
Less flexible without large structures    Very flexible energy functions
Structural constraints are central        Energy design and sampling are central

Probabilistic circuits prioritize tractability. Energy-based models prioritize flexibility.

Comparison with Autoregressive Models

Autoregressive models factor the joint distribution as

p(x) = \prod_{i=1}^{d} p(x_i \mid x_{<i}).

They give exact likelihoods but impose an ordering over variables.

Probabilistic circuits do not require a single fixed sequential factorization. Their structure can encode mixtures and decompositions over variable groups.

Probabilistic circuits            Autoregressive models
Structured factorization          Sequential factorization
Efficient marginal queries        Marginals often require summation or sampling
Good missing-data support         Missing-data handling can be awkward
Strong tractability guarantees    Strong sequence modeling scalability

Autoregressive transformers scale better for language and many modern generative tasks. Probabilistic circuits remain attractive when exact marginal and conditional inference are important.

PyTorch Sketch

A simple probabilistic circuit can be implemented as a differentiable computation graph.

Below is a small sum-product circuit for binary variables.

import torch
from torch import nn

class BernoulliLeaf(nn.Module):
    """Fully factorized Bernoulli leaf over `dim` binary variables."""

    def __init__(self, dim: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        probs = torch.sigmoid(self.logits)
        log_p = x * torch.log(probs + 1e-8)
        log_p = log_p + (1 - x) * torch.log(1 - probs + 1e-8)
        return log_p.sum(dim=-1)

class SumNode(nn.Module):
    """Weighted mixture of child circuits, computed in log space."""

    def __init__(self, children: list[nn.Module]):
        super().__init__()
        # Stored under a name that does not shadow nn.Module.children().
        self.child_nodes = nn.ModuleList(children)
        self.logits = nn.Parameter(torch.zeros(len(children)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        child_log_probs = torch.stack(
            [child(x) for child in self.child_nodes],
            dim=-1,
        )
        # Normalized mixture weights via softmax; logsumexp keeps the
        # mixture computation numerically stable in log space.
        log_weights = torch.log_softmax(self.logits, dim=-1)
        return torch.logsumexp(child_log_probs + log_weights, dim=-1)

This example uses log-space computation for numerical stability. In real probabilistic circuits, product nodes would combine disjoint scopes rather than always evaluating every leaf on the full vector.
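
To make that concrete, one way to add a decomposable product node to this sketch is to pair each child with the column indices it models; the scope convention used below is just one possible choice.

class ProductNode(nn.Module):
    """Product over children with disjoint scopes, computed in log space."""

    def __init__(self, children: list[nn.Module], scopes: list[list[int]]):
        super().__init__()
        # scopes[j] lists the columns of x handled by child j;
        # decomposability requires the scopes to be pairwise disjoint.
        self.child_nodes = nn.ModuleList(children)
        self.scopes = scopes

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Log of a product of factors is the sum of child log-probabilities.
        log_p = x.new_zeros(x.shape[0])
        for child, scope in zip(self.child_nodes, self.scopes):
            log_p = log_p + child(x[..., scope])
        return log_p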

A training loss is simply negative log-likelihood:

def nll(model: nn.Module, x: torch.Tensor) -> torch.Tensor:
    return -model(x).mean()

This illustrates a key advantage: if the circuit is normalized by construction, likelihood training is direct.
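
A minimal end-to-end usage sketch, combining the classes above into a smooth and decomposable circuit over four binary variables and fitting it to toy data (the structure and hyperparameters are arbitrary):

# Mixture of two components, each a product of Bernoulli leaves
# over the disjoint scopes {x_0, x_1} and {x_2, x_3}.
def make_component() -> nn.Module:
    return ProductNode(
        children=[BernoulliLeaf(2), BernoulliLeaf(2)],
        scopes=[[0, 1], [2, 3]],
    )

model = SumNode([make_component(), make_component()])
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

x = torch.randint(0, 2, (256, 4)).float()   # toy binary data
for step in range(100):
    optimizer.zero_grad()
    loss = nll(model, x)
    loss.backward()
    optimizer.step()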

Advantages

Probabilistic circuits have several strengths.

They support exact or efficient likelihood computation. They handle missing values naturally. They provide tractable marginal and conditional inference. They can represent mixtures and latent structure. They work well for domains where uncertainty, missing data, and structured inference matter.

They are especially useful when the question is not only “generate a sample” but also “answer probability queries reliably.”

Limitations

Probabilistic circuits also have limitations.

Their structural constraints can limit expressiveness. Very large circuits may be needed for complex high-dimensional data. Structure learning can be difficult. Neural alternatives often scale better for images, language, and audio. Hybrid models are promising but can be harder to design and analyze.

For this reason, probabilistic circuits are less dominant than transformers or diffusion models in mainstream deep learning. Their value is strongest in tractable probabilistic reasoning and uncertainty-aware modeling.

Role in Deep Learning

Probabilistic circuits provide a useful counterpoint to modern black-box neural models.

They show that tractable inference can be designed into the model architecture rather than approximated after the fact. This matters for applications where we need calibrated probabilities, missing-data inference, or reliable conditional reasoning.

In a deep learning book, probabilistic circuits are useful because they clarify a basic tradeoff:

  • unrestricted neural models are flexible but often intractable probabilistically,
  • tractable probabilistic models are efficient but structurally constrained,
  • hybrid systems try to combine the two.

Summary

Probabilistic circuits are tractable probabilistic models represented as computational graphs of leaves, sums, and products. Leaf nodes define simple distributions. Product nodes factor variables. Sum nodes form mixtures.

With structural conditions such as decomposability and smoothness, these circuits support efficient likelihood evaluation, marginal inference, conditional inference, and sampling.

They are less common than transformers, diffusion models, and VAEs in modern large-scale generative modeling, but they remain important for exact probabilistic reasoning, missing-data inference, and the design of neural-symbolic or uncertainty-aware systems.