A restricted Boltzmann machine, or RBM, is a simplified Boltzmann machine with a bipartite structure. The restriction removes all connections between units of the same type. Visible units do not connect to other visible units, and hidden units do not connect to other hidden units.
This restriction makes inference and sampling tractable enough for practical training.
RBMs were historically important in early deep learning systems. They were used for unsupervised pretraining, feature learning, collaborative filtering, dimensionality reduction, and deep belief networks. Although modern deep learning relies more heavily on transformers and diffusion models, RBMs remain important for understanding probabilistic representation learning and energy-based models.
Architecture of an RBM
An RBM contains two layers:
| Layer | Purpose |
|---|---|
| Visible layer | Represents observed data |
| Hidden layer | Represents latent features |
Let $v$ denote the visible vector and $h$ denote the hidden vector.
The architecture forms a bipartite graph:
- Every visible unit connects to every hidden unit.
- No visible-visible connections exist.
- No hidden-hidden connections exist.
This structure gives the model conditional independence properties that greatly simplify learning.
Energy Function
The RBM energy function is
$$E(v, h) = -b^T v - c^T h - v^T W h$$
Here:
| Symbol | Meaning |
|---|---|
| $v$ | Visible vector |
| $h$ | Hidden vector |
| $W$ | Weight matrix |
| $b$ | Visible bias vector |
| $c$ | Hidden bias vector |
If visible unit $i$ and hidden unit $j$ are simultaneously active, the interaction term contributes $-w_{ij} v_i h_j$ to the energy.
A positive weight lowers the energy when both units activate together. This encourages correlated activations.
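To make the definition concrete, here is a minimal NumPy sketch that evaluates the energy for one configuration; the vectors and weights are arbitrary illustrative values:

```python
import numpy as np

def rbm_energy(v, h, W, b, c):
    """E(v, h) = -b^T v - c^T h - v^T W h."""
    return -b @ v - c @ h - v @ W @ h

# Tiny example with 2 visible and 2 hidden units.
v = np.array([1.0, 0.0])
h = np.array([1.0, 1.0])
W = np.array([[0.5, -0.2],
              [0.1,  0.3]])
b = np.array([0.1, 0.0])
c = np.array([0.0, 0.2])

print(rbm_energy(v, h, W, b, c))  # ≈ -0.6
```

The positive weight $w_{11} = 0.5$ lowers the energy here because both $v_1$ and $h_1$ are active.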
Probability Distribution
The RBM defines a joint probability distribution:

$$p(v, h) = \frac{e^{-E(v, h)}}{Z}$$

where

$$Z = \sum_{v, h} e^{-E(v, h)}$$

is the partition function.

The marginal probability of a visible vector is

$$p(v) = \frac{1}{Z} \sum_h e^{-E(v, h)}$$
The model therefore assigns probabilities to observed data by summing over all possible hidden configurations.
As in all energy-based models:
- low energy corresponds to high probability,
- high energy corresponds to low probability.
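For an RBM small enough to enumerate, the partition function can be computed by brute force, which makes the normalization concrete. A NumPy sketch with arbitrary illustrative parameters:

```python
import itertools
import numpy as np

def energy(v, h, W, b, c):
    return -b @ v - c @ h - v @ W @ h

# Tiny RBM: 2 visible and 2 hidden binary units with random parameters.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(2, 2))
b = rng.normal(scale=0.5, size=2)
c = rng.normal(scale=0.5, size=2)

states = [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=2)]

# Partition function: sum exp(-E) over all 16 joint configurations.
Z = sum(np.exp(-energy(v, h, W, b, c)) for v in states for h in states)

# The joint probabilities p(v, h) = exp(-E(v, h)) / Z sum to 1.
total = sum(np.exp(-energy(v, h, W, b, c)) / Z for v in states for h in states)
print(total)  # ≈ 1.0
```

Enumeration is only feasible for toy models; the number of configurations grows exponentially, which is why training relies on sampling instead.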
Conditional Independence
The bipartite structure produces a major simplification.
Given the visible vector, all hidden units become conditionally independent:

$$p(h \mid v) = \prod_j p(h_j \mid v)$$

Likewise, given the hidden vector, all visible units become conditionally independent:

$$p(v \mid h) = \prod_i p(v_i \mid h)$$
This means that all hidden units can be sampled simultaneously, and all visible units can also be sampled simultaneously.
This property makes Gibbs sampling efficient.
Hidden Unit Activation Probabilities
For binary hidden units, the activation probability is

$$p(h_j = 1 \mid v) = \sigma\left(c_j + \sum_i w_{ij} v_i\right)$$

where $\sigma(x) = 1 / (1 + e^{-x})$ is the sigmoid function.
Similarly, visible units satisfy
$$p(v_i = 1 \mid h) = \sigma\left(b_i + \sum_j w_{ij} h_j\right)$$
Each hidden unit therefore acts as a stochastic feature detector.
The hidden layer transforms the visible input into a latent representation. If certain visible patterns appear frequently together, the RBM can learn hidden units that respond strongly to those patterns.
Free Energy
The hidden units can be analytically marginalized out in an RBM. This produces the free energy function:
$$F(v) = -b^T v - \sum_j \log\left(1 + \exp\left(c_j + W_{:,j}^T v\right)\right)$$
The visible distribution becomes

$$p(v) = \frac{e^{-F(v)}}{Z}$$
The free energy gives a scalar score for a visible vector. Lower free energy corresponds to higher probability under the model.
Free energy is widely used during training because it avoids explicitly enumerating hidden configurations.
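The identity $e^{-F(v)} = \sum_h e^{-E(v, h)}$ can be verified numerically on a toy model. A NumPy sketch with arbitrary illustrative parameters:

```python
import itertools
import numpy as np

def energy(v, h, W, b, c):
    return -b @ v - c @ h - v @ W @ h

def free_energy(v, W, b, c):
    # F(v) = -b^T v - sum_j log(1 + exp(c_j + v^T W[:, j]))
    return -b @ v - np.sum(np.log1p(np.exp(c + v @ W)))

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 2))
b = rng.normal(size=3)
c = rng.normal(size=2)
v = np.array([1.0, 0.0, 1.0])

# Explicitly summing over all hidden configurations must match exp(-F(v)).
hidden_states = [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=2)]
marginal = sum(np.exp(-energy(v, h, W, b, c)) for h in hidden_states)
print(np.isclose(marginal, np.exp(-free_energy(v, W, b, c))))  # True
```

The closed form works because the sum over hidden configurations factorizes into a product of per-unit sums, one $(1 + e^{c_j + v^T W_{:,j}})$ factor per hidden unit.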
Gibbs Sampling
RBMs are commonly sampled using block Gibbs sampling.
The procedure alternates between:
- Sampling hidden units from visible units.
- Sampling visible units from hidden units.
The sampling chain proceeds as

$$v^{(0)} \to h^{(0)} \to v^{(1)} \to h^{(1)} \to \cdots$$

Since all hidden units are conditionally independent given $v$, they can be sampled in parallel. The same holds for visible units given $h$.
A Gibbs step therefore consists of two matrix operations and two Bernoulli sampling operations.
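One block Gibbs step can be sketched in NumPy as exactly those four operations; the parameter shapes and values here are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, b, c, rng):
    """One block Gibbs step: sample h given v, then a new v given h."""
    p_h = sigmoid(c + v @ W)                          # matrix operation 1
    h = (rng.random(p_h.shape) < p_h).astype(float)   # Bernoulli sample 1
    p_v = sigmoid(b + h @ W.T)                        # matrix operation 2
    v_new = (rng.random(p_v.shape) < p_v).astype(float)  # Bernoulli sample 2
    return v_new, h

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 3))
b = np.zeros(4)
c = np.zeros(3)
v = rng.integers(0, 2, size=4).astype(float)
v, h = gibbs_step(v, W, b, c, rng)
```

Because the conditionals factorize, every unit in a layer is sampled with one vectorized comparison rather than a loop.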
Learning Objective
The training objective is maximum likelihood estimation.
For training data

$$\{v^{(1)}, \dots, v^{(N)}\}$$

we maximize

$$\sum_{n=1}^{N} \log p(v^{(n)})$$

The gradient of the log-likelihood with respect to a weight $w_{ij}$ satisfies
$$\frac{\partial \log p(v)}{\partial w_{ij}} = \mathbb{E}_{\mathrm{data}}[v_i h_j] - \mathbb{E}_{\mathrm{model}}[v_i h_j]$$
The first expectation is computed using training data. The second expectation is computed using samples from the model distribution.
This equation has an intuitive interpretation:
- increase correlations observed in real data,
- decrease correlations generated excessively by the model.
Contrastive Divergence
Exact maximum likelihood training is expensive because the model expectation requires long Markov chains.
Contrastive divergence, or CD, approximates this expectation using short chains initialized at the data.
For CD-$k$:
- Start from a training example $v^{(0)}$.
- Sample hidden units $h^{(0)} \sim p(h \mid v^{(0)})$.
- Alternate Gibbs updates for $k$ steps.
- Obtain reconstruction $v^{(k)}$.
- Update parameters using the difference between data statistics and reconstruction statistics.
The weight update approximates

$$\Delta w_{ij} \propto \mathbb{E}_{\mathrm{data}}[v_i h_j] - \mathbb{E}_{\mathrm{recon}}[v_i h_j]$$
For CD-1, only one Gibbs step is used. Surprisingly, this approximation often works reasonably well in practice.
Persistent Contrastive Divergence
Contrastive divergence initializes the Markov chain from data each iteration. This introduces bias.
Persistent contrastive divergence instead maintains long-running chains across training iterations. The chains evolve continuously during optimization.
Rather than restarting from data, persistent methods continue each Markov chain from the state it reached at the previous parameter update.
This better approximates the true model distribution, because the chains have more time to mix toward the model's equilibrium.
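A minimal NumPy sketch of the persistent-chain idea; `PersistentChains` is a hypothetical helper, and the chain count and shapes are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class PersistentChains:
    """Maintains fantasy particles that survive across parameter updates (PCD)."""

    def __init__(self, n_chains, n_visible, seed=0):
        self.rng = np.random.default_rng(seed)
        # Chains are initialized once, not re-initialized from data each step.
        self.v = self.rng.integers(0, 2, size=(n_chains, n_visible)).astype(float)

    def step(self, W, b, c):
        # Continue each chain from where it stopped at the previous update.
        p_h = sigmoid(c + self.v @ W)
        h = (self.rng.random(p_h.shape) < p_h).astype(float)
        p_v = sigmoid(b + h @ W.T)
        self.v = (self.rng.random(p_v.shape) < p_v).astype(float)
        return self.v, h

chains = PersistentChains(n_chains=8, n_visible=4)
W = np.zeros((4, 3)); b = np.zeros(4); c = np.zeros(3)
v_model, h_model = chains.step(W, b, c)  # negative-phase statistics
```

The key design difference from CD is only the initialization: the Gibbs update itself is identical, but the chain state persists in `self.v` between calls.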
Binary and Gaussian RBMs
The standard RBM assumes binary visible units. This works naturally for binary images or Bernoulli data.
Real-valued data requires modified visible distributions.
A Gaussian-Bernoulli RBM uses:
- Gaussian visible units,
- binary hidden units.
The energy becomes

$$E(v, h) = \sum_i \frac{(v_i - b_i)^2}{2\sigma_i^2} - c^T h - \sum_{i, j} \frac{v_i}{\sigma_i} w_{ij} h_j$$
This allows the RBM to model continuous inputs such as pixel intensities or sensor values.
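Under the common unit-variance simplification ($\sigma_i = 1$), the conditionals can be sketched as follows; the parameter shapes and values are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_visible_gaussian(h, W, b, rng):
    """Unit-variance Gaussian visibles: p(v_i | h) = N(b_i + sum_j w_ij h_j, 1)."""
    mean = b + h @ W.T
    return mean + rng.normal(size=mean.shape)

def sample_hidden(v, W, c, rng):
    """Hidden units stay Bernoulli: p(h_j = 1 | v) = sigma(c_j + sum_i w_ij v_i)."""
    p = sigmoid(c + v @ W)
    return (rng.random(p.shape) < p).astype(float)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 2))
b = np.zeros(3)
c = np.zeros(2)
v = rng.normal(size=3)                     # real-valued input
h = sample_hidden(v, W, c, rng)
v_recon = sample_visible_gaussian(h, W, b, rng)
```

Only the visible conditional changes from the binary case: sampling becomes a Gaussian draw around a linear mean instead of a Bernoulli draw through a sigmoid.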
Other variants include:
| Variant | Description |
|---|---|
| Bernoulli-Bernoulli RBM | Binary visible and hidden units |
| Gaussian-Bernoulli RBM | Continuous visible units |
| Softmax RBM | Categorical variables |
| Replicated softmax RBM | Text modeling |
| Conditional RBM | Conditioned sequential modeling |
RBMs as Feature Learners
RBMs can learn distributed latent representations.
Suppose the visible units represent images. Hidden units may learn detectors for:
- edges,
- corners,
- textures,
- object parts.
A hidden unit activates strongly when its preferred visual pattern appears.
The representation learned by hidden units can then be used for:
- classification,
- clustering,
- dimensionality reduction,
- retrieval,
- pretraining deeper networks.
Before large-scale supervised learning became dominant, RBMs were widely used for unsupervised representation learning.
Deep Belief Networks
RBMs played a central role in deep belief networks, or DBNs.
A DBN stacks RBMs layer by layer:
- Train the first RBM on raw data.
- Use hidden activations as input to the next RBM.
- Repeat for deeper layers.
This layer-wise unsupervised pretraining was historically important because deep networks were difficult to optimize directly.
Pretraining initialized the network in a useful region of parameter space before supervised fine-tuning.
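The greedy layer-wise procedure can be sketched in NumPy; `train_rbm` here is a deliberately simplified illustrative CD-1 trainer, not a production implementation, and the layer sizes are arbitrary:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=5, lr=0.1, seed=0):
    """Simplified CD-1 training for one binary RBM layer."""
    rng = np.random.default_rng(seed)
    n_visible = data.shape[1]
    W = rng.normal(scale=0.01, size=(n_visible, n_hidden))
    b = np.zeros(n_visible)
    c = np.zeros(n_hidden)
    for _ in range(epochs):
        p_h0 = sigmoid(c + data @ W)                       # positive phase
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
        v1 = sigmoid(b + h0 @ W.T)                         # mean-field reconstruction
        p_h1 = sigmoid(c + v1 @ W)                         # negative phase
        W += lr * (data.T @ p_h0 - v1.T @ p_h1) / len(data)
        b += lr * (data - v1).mean(axis=0)
        c += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b, c

def greedy_pretrain(data, layer_sizes):
    """Stack RBMs: each layer trains on the previous layer's hidden activations."""
    layers, x = [], data
    for n_hidden in layer_sizes:
        W, b, c = train_rbm(x, n_hidden)
        layers.append((W, b, c))
        x = sigmoid(c + x @ W)  # deterministic activations feed the next RBM
    return layers

rng = np.random.default_rng(0)
data = (rng.random((32, 8)) < 0.5).astype(float)
stack = greedy_pretrain(data, [6, 4])
```

Each trained layer's weights would then initialize the corresponding layer of a feedforward network before supervised fine-tuning.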
Modern techniques such as:
- better initialization,
- normalization,
- residual networks,
- Adam optimization,
- large datasets,
- GPUs,
largely replaced RBM pretraining.
Relationship to Autoencoders
RBMs and autoencoders both learn latent representations, but they differ fundamentally.
| RBM | Autoencoder |
|---|---|
| Probabilistic model | Deterministic model |
| Energy-based | Reconstruction-based |
| Learns probability distribution | Learns encoding function |
| Uses sampling | Uses direct forward passes |
| Gradients estimated by sampling | Gradients computed exactly by backpropagation |
An RBM defines a joint probability distribution over visible and hidden variables. An autoencoder directly maps inputs to latent vectors and reconstructs them.
Despite these differences, both attempt to discover useful structure in unlabeled data.
PyTorch Implementation
A minimal RBM implementation uses matrix multiplications and Bernoulli sampling.
```python
import torch
from torch import nn


class RBM(nn.Module):
    def __init__(self, n_visible: int, n_hidden: int):
        super().__init__()
        # Small random weights and zero biases are a common initialization.
        self.W = nn.Parameter(torch.randn(n_visible, n_hidden) * 0.01)
        self.v_bias = nn.Parameter(torch.zeros(n_visible))
        self.h_bias = nn.Parameter(torch.zeros(n_hidden))

    def hidden_prob(self, v):
        # p(h_j = 1 | v) = sigma(c_j + sum_i w_ij v_i)
        return torch.sigmoid(v @ self.W + self.h_bias)

    def visible_prob(self, h):
        # p(v_i = 1 | h) = sigma(b_i + sum_j w_ij h_j)
        return torch.sigmoid(h @ self.W.T + self.v_bias)

    def sample_hidden(self, v):
        p = self.hidden_prob(v)
        return p, torch.bernoulli(p)

    def sample_visible(self, h):
        p = self.visible_prob(h)
        return p, torch.bernoulli(p)
```

A simple CD-1 training step:
```python
def cd1_step(rbm, v0):
    """Approximate log-likelihood gradient for W from one CD-1 step."""
    _, h0 = rbm.sample_hidden(v0)      # positive phase: data statistics
    _, v1 = rbm.sample_visible(h0)     # reconstruction
    _, h1 = rbm.sample_hidden(v1)      # negative phase: reconstruction statistics
    positive_grad = v0.T @ h0
    negative_grad = v1.T @ h1
    return positive_grad - negative_grad
```

This code illustrates the core learning idea: reinforce correlations in data and weaken correlations in reconstructions.
Limitations of RBMs
RBMs have several important limitations.
First, sampling-based training is slow compared with direct gradient-based optimization in feedforward networks.
Second, the partition function remains difficult to estimate.
Third, likelihood evaluation is expensive.
Fourth, Markov chains can mix slowly in high-dimensional spaces.
Fifth, modern architectures such as transformers and diffusion models scale more effectively to massive datasets.
As a result, RBMs are now used primarily for historical understanding, specialized probabilistic modeling, and research into energy-based learning.
Summary
Restricted Boltzmann machines simplify Boltzmann machines by using a bipartite graph structure. This restriction creates conditional independence properties that make inference and Gibbs sampling efficient.
An RBM defines an energy function over visible and hidden variables. Learning lowers the energy of training data and raises the energy of reconstructed samples. Contrastive divergence provides a practical approximate training method.
RBMs were historically important for unsupervised representation learning and deep belief networks. Although they are less common in modern large-scale deep learning, they remain foundational for understanding energy-based models, probabilistic latent-variable learning, and the evolution of deep neural networks.