Boltzmann Machines

A Boltzmann machine is a probabilistic neural network that defines a probability distribution over binary variables. It belongs to the family of energy-based models. Instead of computing an output directly from an input, it assigns an energy to each possible configuration of variables. Configurations with low energy receive high probability. Configurations with high energy receive low probability.

The central idea is simple: learning means shaping an energy surface so that observed data configurations have low energy, while unlikely or invalid configurations have high energy.

Energy-Based View

Let

x \in \{0,1\}^d

be a binary vector. A Boltzmann machine defines an energy function

E(x)

and converts energy into probability by

p(x) = \frac{\exp(-E(x))}{Z}.

The normalizing constant

Z = \sum_{x'} \exp(-E(x'))

is called the partition function. The sum ranges over all possible binary configurations x'.

For d binary variables, there are 2^d possible configurations. This exponential number of configurations is the main computational difficulty in Boltzmann machines.

Low energy gives high probability because of the negative sign:

E(x_1) < E(x_2) \quad\Rightarrow\quad p(x_1) > p(x_2).

Thus, the model learns by lowering the energy of training examples and raising the energy of competing configurations.

Visible and Hidden Units

A Boltzmann machine contains two kinds of variables.

Visible units represent observed data. Hidden units represent latent structure. If v denotes the visible variables and h denotes the hidden variables, the full state of the machine is

x = (v, h).

The model defines a joint distribution

p(v,h) = \frac{\exp(-E(v,h))}{Z}.

The probability of a visible vector is obtained by summing out the hidden variables:

p(v) = \sum_h p(v,h).

Hidden variables allow the model to represent dependencies among visible variables. Without hidden variables, the model can only express limited distributions. With hidden variables, the model can assign high probability to complex patterns, clusters, and latent factors.

Energy Function

A general Boltzmann machine uses pairwise interactions between units. For binary variables x_i \in \{0,1\}, one common energy function is

E(x) = -\sum_i b_i x_i - \sum_{i<j} w_{ij} x_i x_j.

Here:

Symbol     Meaning
x_i        Binary state of unit i
b_i        Bias of unit i
w_{ij}     Interaction weight between units i and j
E(x)       Energy of configuration x

The bias b_i controls whether unit i tends to be active. The weight w_{ij} controls whether units i and j tend to be active together.

If w_{ij} > 0, then activating both x_i and x_j lowers the energy because of the negative sign. This makes joint activation more likely.

If w_{ij} < 0, then activating both units raises the energy. This makes joint activation less likely.

Probability Distribution

The Boltzmann machine uses the Boltzmann distribution:

p(x) = \frac{\exp(-E(x))}{\sum_{x'} \exp(-E(x'))}.

This equation defines a valid probability distribution because every term is positive and all probabilities sum to one.

The partition function

Z = \sum_{x'} \exp(-E(x'))

is needed to normalize the distribution. However, computing Z exactly is generally intractable for large models. If the model has n binary units, the sum contains 2^n terms.

This difficulty affects both training and evaluation. We can compute the energy of one configuration easily, but computing its exact probability requires comparing it with all other possible configurations.
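
As a concrete illustration of this cost, the short sketch below (with hypothetical random weights and biases, not taken from the text above) enumerates every configuration of a tiny 4-unit model and normalizes exp(-E) by brute force; the same approach becomes hopeless once the number of units grows beyond a few dozen.

import itertools
import torch

# Brute-force normalization for a tiny pairwise model: feasible only because d is small.
d = 4
torch.manual_seed(0)
W = torch.triu(torch.randn(d, d) * 0.1, diagonal=1)  # keep only the i < j interaction terms
b = torch.randn(d) * 0.1

configs = torch.tensor(list(itertools.product([0.0, 1.0], repeat=d)))        # shape (2^d, d)
energies = -(configs @ b) - torch.einsum("ki,ij,kj->k", configs, W, configs)  # E(x) for every x
probs = torch.softmax(-energies, dim=0)  # exp(-E(x)) / Z, with Z summed over all 2^d states

print(probs.sum())     # 1.0: the probabilities are exactly normalized
print(probs.argmax())  # index of the lowest-energy (most probable) configuration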

Conditional Distributions

A key property of Boltzmann machines is that the conditional distribution of one unit given all others has a simple form.

Let x_{-i} denote all variables except x_i. Then

p(x_i = 1 \mid x_{-i}) = \sigma\left(b_i + \sum_{j\neq i} w_{ij}x_j\right),

where

\sigma(a) = \frac{1}{1+\exp(-a)}

is the logistic sigmoid function.

This formula means that each unit behaves like a stochastic neuron. Its probability of being active depends on its bias and the weighted states of the other units.

This conditional distribution enables Gibbs sampling. In Gibbs sampling, we repeatedly update one variable at a time by sampling from its conditional distribution. Over time, the samples approximate the model distribution.

Learning Objective

Given training data

\mathcal{D} = \{v^{(1)}, v^{(2)}, \ldots, v^{(N)}\},

we would like to maximize the log-likelihood

\sum_{n=1}^N \log p(v^{(n)}).

Equivalently, we minimize the negative log-likelihood:

-\sum_{n=1}^N \log p(v^{(n)}).

For one visible example v,

\log p(v) = \log \sum_h \exp(-E(v,h)) - \log Z.

This expression has two parts.

The first part rewards low energy for configurations compatible with the data. The second part accounts for all possible configurations through the partition function.

Training therefore has a positive phase and a negative phase.

Positive and Negative Phases

The gradient of the log-likelihood has a characteristic form:

\frac{\partial \log p(v)}{\partial \theta} = -\mathbb{E}_{p(h\mid v)} \left[ \frac{\partial E(v,h)}{\partial \theta} \right] + \mathbb{E}_{p(v,h)} \left[ \frac{\partial E(v,h)}{\partial \theta} \right].

Here θ denotes a parameter such as a bias or weight.

The first expectation is taken under the posterior distribution of hidden variables given the data. This is the positive phase. It lowers the energy of data-like configurations.

The second expectation is taken under the model distribution. This is the negative phase. It raises the energy of configurations that the model itself tends to generate.

For a weight w_{ij}, the gradient often takes the form

\frac{\partial \log p(v)}{\partial w_{ij}} = \mathbb{E}_{\text{data}}[x_i x_j] - \mathbb{E}_{\text{model}}[x_i x_j].

This has a direct interpretation. The model increases a weight when two units are correlated in the data more often than they are correlated under the model. It decreases a weight when the model produces that correlation too often.
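
The bias gradients have the same data-minus-model structure. For the energy function above,

\frac{\partial \log p(v)}{\partial b_i} = \mathbb{E}_{\text{data}}[x_i] - \mathbb{E}_{\text{model}}[x_i],

so a bias increases when its unit is active more often in the data than under the model.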

Sampling

Exact sampling from a Boltzmann machine is generally difficult, so training usually relies on Markov chain Monte Carlo.

A typical Gibbs sampling procedure is:

  1. Initialize all variables randomly.
  2. Pick a unit x_i.
  3. Compute p(x_i = 1 \mid x_{-i}).
  4. Sample a new value for x_i.
  5. Repeat for many steps.
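
As a minimal sketch of this procedure, the function below performs one full Gibbs sweep over a general Boltzmann machine, assuming a symmetric weight matrix W with zero diagonal, a bias vector b, and a binary state vector x (all illustrative names, not defined elsewhere in this text).

import torch

def gibbs_sweep(x: torch.Tensor, W: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # x: (n,) binary state, W: (n, n) symmetric weights with zero diagonal, b: (n,) biases.
    n = x.shape[0]
    for i in torch.randperm(n):           # visit every unit once, in random order
        activation = b[i] + W[i] @ x      # b_i + sum_j w_ij x_j (the w_ii term is zero)
        p_on = torch.sigmoid(activation)  # p(x_i = 1 | x_{-i})
        x[i] = torch.bernoulli(p_on)      # resample unit i from its conditional
    return x

# Example: run many sweeps from a random start to draw an approximate model sample.
n = 8
W = torch.randn(n, n) * 0.1
W = (W + W.T) / 2
W.fill_diagonal_(0.0)
b = torch.zeros(n)
x = torch.bernoulli(torch.full((n,), 0.5))
for _ in range(500):
    x = gibbs_sweep(x, W, b)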

In practice, many Gibbs updates may be needed before the samples approximate the true model distribution. This slow mixing is one of the main reasons full Boltzmann machines are rarely used in modern large-scale deep learning.

The model may become especially hard to sample from when the energy surface contains many separated low-energy regions. A Markov chain can become stuck in one region for a long time.

Why Full Boltzmann Machines Are Hard to Train

A full Boltzmann machine allows connections between visible units, hidden units, and pairs of hidden units. This gives the model expressive power, but it makes inference difficult.

There are three main computational obstacles.

First, the partition function is expensive. It requires summing over exponentially many states.

Second, posterior inference over hidden variables is difficult. Computing p(h \mid v) generally requires summing over many hidden configurations.

Third, model samples can be slow to obtain. The Markov chain may need many steps to reach high-probability regions.

Because of these issues, practical work often uses restricted architectures.

Restricted Boltzmann Machines

A restricted Boltzmann machine, or RBM, simplifies the architecture by removing visible-visible and hidden-hidden connections. It keeps only connections between visible and hidden units.

The energy function becomes

E(v,h) = -b^\top v - c^\top h - v^\top W h.

Here:

Symbol     Meaning
v          Visible vector
h          Hidden vector
b          Visible bias
c          Hidden bias
W          Visible-hidden weight matrix

The bipartite structure gives important conditional independence properties:

p(h\mid v) = \prod_j p(h_j\mid v),

and

p(v\mid h) = \prod_i p(v_i\mid h).

Thus, all hidden units can be sampled in parallel given the visible units, and all visible units can be sampled in parallel given the hidden units.
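
For the RBM energy above, each factor takes the same sigmoid form as in the general model:

p(h_j = 1 \mid v) = \sigma\left(c_j + \sum_i W_{ij} v_i\right), \qquad p(v_i = 1 \mid h) = \sigma\left(b_i + \sum_j W_{ij} h_j\right).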

This makes RBMs much easier to train than general Boltzmann machines.

Contrastive Divergence

Restricted Boltzmann machines are often trained using contrastive divergence.

The idea is to approximate the negative phase using a short Markov chain initialized at the data. For contrastive divergence with k steps, called CD-k, the procedure is:

  1. Start with a training example v_0.
  2. Sample h_0 \sim p(h \mid v_0).
  3. Alternate sampling v \mid h and h \mid v for k steps.
  4. Use the final sample (v_k, h_k) to approximate the model expectation.
  5. Update parameters using the difference between data statistics and reconstructed statistics.

For the weight matrix W, the update has the rough form

\Delta W \propto v_0 h_0^\top - v_k h_k^\top.

The first term lowers the energy of the observed data. The second term raises the energy of nearby reconstructions produced by the model.

Contrastive divergence is biased, but it is much cheaper than exact maximum likelihood. Historically, it made RBMs practical enough for pretraining deep belief networks and other early deep learning systems.

PyTorch Sketch

A binary RBM can be written compactly in PyTorch. The code below shows the main operations: sampling hidden units from visible units, sampling visible units from hidden units, and computing free energy.

import torch
from torch import nn

class RBM(nn.Module):
    def __init__(self, n_visible: int, n_hidden: int):
        super().__init__()
        self.W = nn.Parameter(torch.randn(n_visible, n_hidden) * 0.01)
        self.v_bias = nn.Parameter(torch.zeros(n_visible))
        self.h_bias = nn.Parameter(torch.zeros(n_hidden))

    def hidden_prob(self, v: torch.Tensor) -> torch.Tensor:
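        # p(h_j = 1 | v): sigmoid of each hidden unit's bias plus its weighted visible input.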
        return torch.sigmoid(v @ self.W + self.h_bias)

    def visible_prob(self, h: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(h @ self.W.T + self.v_bias)

    def sample_hidden(self, v: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        p = self.hidden_prob(v)
        h = torch.bernoulli(p)
        return p, h

    def sample_visible(self, h: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        p = self.visible_prob(h)
        v = torch.bernoulli(p)
        return p, v

    def free_energy(self, v: torch.Tensor) -> torch.Tensor:
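        # F(v) = -b^T v - sum_j softplus(c_j + (v W)_j); the hidden units are summed out analytically.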
        visible_term = v @ self.v_bias
        hidden_term = nn.functional.softplus(v @ self.W + self.h_bias).sum(dim=1)  # numerically stable log(1 + exp(.))
        return -visible_term - hidden_term

The free energy of a visible vector is

F(v) = -b^\top v - \sum_j \log\left(1+\exp(c_j + W_{:,j}^\top v)\right).

It satisfies

p(v) = \frac{\exp(-F(v))}{Z}.

The free energy is useful because it sums out the hidden units analytically in an RBM.

A simple CD-1 training step can be written as follows:

def cd1_loss(rbm: RBM, v0: torch.Tensor) -> torch.Tensor:
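    # One Gibbs step from the data: v0 -> h0 -> v1 provides the CD-1 negative sample.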
    _, h0 = rbm.sample_hidden(v0)
    _, v1 = rbm.sample_visible(h0)

    positive_free_energy = rbm.free_energy(v0)
    negative_free_energy = rbm.free_energy(v1.detach())

    return positive_free_energy.mean() - negative_free_energy.mean()

This code expresses the contrastive divergence objective. In a full training loop, we would minimize this quantity with an optimizer.

The call to detach() prevents gradients from flowing through the sampling chain. This is a common implementation choice for contrastive divergence.
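
Putting these pieces together, a minimal training loop might look like the following sketch; the synthetic data, layer sizes, learning rate, and batch size are illustrative placeholders rather than recommended settings.

# Minimal CD-1 training sketch on synthetic binary data, reusing RBM and cd1_loss from above.
torch.manual_seed(0)
data = torch.bernoulli(torch.full((1024, 64), 0.3))   # stand-in for a binarized dataset

rbm = RBM(n_visible=64, n_hidden=32)
optimizer = torch.optim.SGD(rbm.parameters(), lr=0.05)

for epoch in range(10):
    perm = torch.randperm(data.shape[0])
    for batch in data[perm].split(128):
        loss = cd1_loss(rbm, batch)        # free-energy gap between data and reconstructions
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()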

Relationship to Modern Deep Learning

Boltzmann machines are less common in modern production deep learning than transformers, diffusion models, and autoregressive models. Their main difficulties are sampling, partition-function estimation, and slow likelihood computation.

However, the ideas remain important.

First, Boltzmann machines introduced a clear energy-based view of learning. This view still appears in modern energy-based models, contrastive learning, score-based models, and some interpretations of diffusion models.

Second, they demonstrate the connection between probability, statistical physics, and neural networks. Concepts such as energy, temperature, partition functions, and Gibbs sampling continue to appear in probabilistic machine learning.

Third, RBMs were historically important for unsupervised pretraining. Before large labeled datasets, modern initialization methods, normalization layers, and powerful accelerators became common, layer-wise pretraining helped train deep models.

Fourth, Boltzmann machines make the distinction between local computation and global normalization explicit. Computing an energy is easy. Normalizing over all configurations is hard. This same tension appears in many probabilistic models.

Summary

A Boltzmann machine is an energy-based probabilistic model over binary variables. It assigns each configuration an energy and converts energies into probabilities using the Boltzmann distribution.

The model contains visible units for observed data and hidden units for latent structure. Learning lowers the energy of data-like configurations and raises the energy of configurations sampled from the model.

Full Boltzmann machines are expressive but hard to train because exact inference, exact sampling, and exact partition-function computation are generally intractable. Restricted Boltzmann machines simplify the graph structure, making conditional sampling efficient. Contrastive divergence gives a practical approximate training method.

The direct use of Boltzmann machines has declined, but their concepts remain foundational for understanding energy-based modeling, probabilistic generative models, and the statistical mechanics view of deep learning.