
Restricted Boltzmann Machines


A restricted Boltzmann machine, or RBM, is a simplified Boltzmann machine with a bipartite structure. The restriction removes all connections between units of the same type. Visible units do not connect to other visible units, and hidden units do not connect to other hidden units.

This restriction makes inference and sampling tractable enough for practical training.

RBMs were historically important in early deep learning systems. They were used for unsupervised pretraining, feature learning, collaborative filtering, dimensionality reduction, and deep belief networks. Although modern deep learning relies more heavily on transformers and diffusion models, RBMs remain important for understanding probabilistic representation learning and energy-based models.

Architecture of an RBM

An RBM contains two layers:

Layer | Purpose
Visible layer | Represents observed data
Hidden layer | Represents latent features

Let

v \in \{0,1\}^{n_v}

denote the visible vector and

h \in \{0,1\}^{n_h}

denote the hidden vector.

The architecture forms a bipartite graph:

  • Every visible unit connects to every hidden unit.
  • No visible-visible connections exist.
  • No hidden-hidden connections exist.

This structure gives the model conditional independence properties that greatly simplify learning.

Energy Function

The RBM energy function is

E(v,h) = -b^\top v - c^\top h - v^\top W h.

Here:

Symbol | Meaning
v | Visible vector
h | Hidden vector
W | Weight matrix
b | Visible bias vector
c | Hidden bias vector

If visible unit v_i and hidden unit h_j are simultaneously active, the interaction term contributes

-w_{ij} v_i h_j.

A positive weight lowers the energy when both units activate together. This encourages correlated activations.
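This sign convention is easy to verify numerically. The toy example below (arbitrary hand-picked weights) computes the energy directly from the definition and shows that a positive weight lowers the energy when the connected units co-activate:

```python
import torch

def energy(v, h, W, b, c):
    # E(v, h) = -b^T v - c^T h - v^T W h
    return -(b @ v) - (c @ h) - (v @ W @ h)

# Toy RBM: 2 visible units, 2 hidden units, a single positive weight w_11.
b = torch.zeros(2)
c = torch.zeros(2)
W = torch.tensor([[1.0, 0.0],
                  [0.0, 0.0]])

v = torch.tensor([1.0, 0.0])
h_on = torch.tensor([1.0, 0.0])    # h_1 active together with v_1
h_off = torch.tensor([0.0, 0.0])   # h_1 inactive

e_on = energy(v, h_on, W, b, c)    # interaction term contributes -w_11
e_off = energy(v, h_off, W, b, c)  # no interaction, energy is zero
```

With these values e_on is -1 and e_off is 0: the co-activation is the lower-energy, and therefore more probable, configuration.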

Probability Distribution

The RBM defines a joint probability distribution:

p(v,h) = \frac{\exp(-E(v,h))}{Z},

where

Z = \sum_{v,h}\exp(-E(v,h))

is the partition function.

The marginal probability of a visible vector is

p(v) = \sum_h p(v,h).

The model therefore assigns probabilities to observed data by summing over all possible hidden configurations.

As in all energy-based models:

  • low energy corresponds to high probability,
  • high energy corresponds to low probability.
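For a model this small, Z can be evaluated by brute force, which makes the energy-probability relationship concrete. A sketch with arbitrary random parameters, checking that the marginals p(v) sum to one:

```python
import itertools
import torch

torch.manual_seed(0)
n_v, n_h = 3, 2
W = torch.randn(n_v, n_h) * 0.5
b = torch.randn(n_v) * 0.1
c = torch.randn(n_h) * 0.1

def energy(v, h):
    return -(b @ v) - (c @ h) - (v @ W @ h)

def states(n):
    return [torch.tensor(s, dtype=torch.float32)
            for s in itertools.product([0.0, 1.0], repeat=n)]

# Partition function: sum of exp(-E) over every (v, h) configuration.
Z = sum(torch.exp(-energy(v, h))
        for v in states(n_v) for h in states(n_h))

# Marginal p(v) = sum_h p(v, h); summed over all v it must equal one.
def p_v(v):
    return sum(torch.exp(-energy(v, h)) for h in states(n_h)) / Z

total = sum(p_v(v) for v in states(n_v))
```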

Conditional Independence

The bipartite structure produces a major simplification.

Given the visible vector, all hidden units become conditionally independent:

p(h\mid v) = \prod_j p(h_j\mid v).

Likewise, given the hidden vector, all visible units become conditionally independent:

p(v\mid h) = \prod_i p(v_i\mid h).

This means that all hidden units can be sampled simultaneously, and all visible units can also be sampled simultaneously.

This property makes Gibbs sampling efficient.

Hidden Unit Activation Probabilities

For binary hidden units, the activation probability is

p(h_j = 1 \mid v) = \sigma\left(c_j + \sum_i w_{ij}v_i\right),

where \sigma is the sigmoid function.

genui{“math_block_widget_always_prefetch_v2”:{“content”:“p(h_j=1\mid v)=\sigma\left(c_j+\sum_i w_{ij}v_i\right)”}}

Similarly, visible units satisfy

p(v_i = 1 \mid h) = \sigma\left(b_i + \sum_j w_{ij}h_j\right).

genui{“math_block_widget_always_prefetch_v2”:{“content”:“p(v_i=1\mid h)=\sigma\left(b_i+\sum_j w_{ij}h_j\right)”}}

Each hidden unit therefore acts as a stochastic feature detector.

The hidden layer transforms the visible input into a latent representation. If certain visible patterns appear frequently together, the RBM can learn hidden units that respond strongly to those patterns.
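The factorized conditional can be checked against exact enumeration on a toy model: the product of per-unit sigmoid probabilities reproduces p(h | v) computed from the joint distribution (random parameters, purely illustrative):

```python
import itertools
import torch

torch.manual_seed(0)
n_v, n_h = 3, 2
W = torch.randn(n_v, n_h)
b = torch.zeros(n_v)
c = torch.randn(n_h)
v = torch.tensor([1.0, 0.0, 1.0])

def energy(v, h):
    return -(b @ v) - (c @ h) - (v @ W @ h)

# Exact conditional p(h | v): normalize exp(-E) over hidden configurations.
hs = [torch.tensor(s, dtype=torch.float32)
      for s in itertools.product([0.0, 1.0], repeat=n_h)]
joint = torch.stack([torch.exp(-energy(v, h)) for h in hs])
p_h_given_v = joint / joint.sum()

# Factorized form: p(h_j = 1 | v) = sigmoid(c_j + sum_i w_ij v_i).
q = torch.sigmoid(c + v @ W)
factorized = torch.stack(
    [torch.prod(torch.where(h.bool(), q, 1 - q)) for h in hs]
)
```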

Free Energy

The hidden units can be analytically marginalized out in an RBM. This produces the free energy function:

F(v) = -b^\top v - \sum_j \log\left( 1+\exp(c_j + W_{:,j}^\top v) \right).

genui{“math_block_widget_always_prefetch_v2”:{“content”:“F(v)=-b^Tv-\sum_j\log\left(1+\exp(c_j+W_{:,j}^Tv)\right)”}}

The visible distribution becomes

p(v) = \frac{\exp(-F(v))}{Z}.

The free energy gives a scalar score for a visible vector. Lower free energy corresponds to higher probability under the model.

Free energy is widely used during training because it avoids explicitly enumerating hidden configurations.
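The identity exp(-F(v)) = Σ_h exp(-E(v, h)) can be verified numerically on a small model (random parameters; torch.log1p computes the softplus term):

```python
import itertools
import torch

torch.manual_seed(0)
n_v, n_h = 3, 3
W = torch.randn(n_v, n_h)
b = torch.randn(n_v)
c = torch.randn(n_h)
v = torch.tensor([1.0, 0.0, 1.0])

def energy(v, h):
    return -(b @ v) - (c @ h) - (v @ W @ h)

# Closed form: F(v) = -b^T v - sum_j log(1 + exp(c_j + W[:, j]^T v)).
F = -(b @ v) - torch.log1p(torch.exp(c + v @ W)).sum()

# Brute force: marginalize the hidden units explicitly.
total = sum(torch.exp(-energy(v, torch.tensor(h, dtype=torch.float32)))
            for h in itertools.product([0.0, 1.0], repeat=n_h))
```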

Gibbs Sampling

RBMs are commonly sampled using block Gibbs sampling.

The procedure alternates between:

  1. Sampling hidden units from visible units.
  2. Sampling visible units from hidden units.

The sampling chain proceeds as

v^{(0)} \rightarrow h^{(0)} \rightarrow v^{(1)} \rightarrow h^{(1)} \rightarrow \cdots

Since all hidden units are conditionally independent given v, they can be sampled in parallel. The same holds for visible units given h.

A Gibbs step therefore consists of two matrix operations and two Bernoulli sampling operations.
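A minimal sketch of one such step and a short chain (placeholder weights; in practice these would come from a trained model):

```python
import torch

torch.manual_seed(0)
n_v, n_h = 6, 4
W = torch.randn(n_v, n_h) * 0.1
b = torch.zeros(n_v)
c = torch.zeros(n_h)

def gibbs_step(v):
    # Two matrix operations and two Bernoulli sampling operations.
    p_h = torch.sigmoid(v @ W + c)     # p(h | v): all hidden units at once
    h = torch.bernoulli(p_h)
    p_v = torch.sigmoid(h @ W.T + b)   # p(v | h): all visible units at once
    return torch.bernoulli(p_v), h

# Run the chain v^(0) -> h^(0) -> v^(1) -> h^(1) -> ...
v = torch.bernoulli(torch.full((n_v,), 0.5))
for _ in range(10):
    v, h = gibbs_step(v)
```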

Learning Objective

The training objective is maximum likelihood estimation.

For training data

\mathcal{D}=\{v^{(1)},\ldots,v^{(N)}\},

we maximize

\sum_n \log p(v^{(n)}).

The gradient of the log-likelihood with respect to a weight satisfies

\frac{\partial \log p(v)}{\partial w_{ij}} = \mathbb{E}_{\text{data}}[v_i h_j] - \mathbb{E}_{\text{model}}[v_i h_j].

genui{“math_block_widget_always_prefetch_v2”:{“content”:"\frac{\partial \log p(v)}{\partial w_{ij}}=\mathbb{E}{\mathrm{data}}[v_i h_j]-\mathbb{E}{\mathrm{model}}[v_i h_j]"}}

The first expectation is computed using training data. The second expectation is computed using samples from the model distribution.

This equation has an intuitive interpretation:

  • increase correlations observed in real data,
  • decrease correlations generated excessively by the model.
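On a toy RBM the gradient identity can be verified directly: the data term uses p(h | v) for one observed vector, the model term is computed by exact enumeration, and their difference matches a finite-difference derivative of log p(v). All parameters below are arbitrary.

```python
import itertools
import torch

torch.set_default_dtype(torch.float64)  # finite differences need precision
torch.manual_seed(0)
n_v, n_h = 2, 2
W = torch.randn(n_v, n_h) * 0.5
b = torch.randn(n_v) * 0.1
c = torch.randn(n_h) * 0.1
v0 = torch.tensor([1.0, 0.0])

def states(n):
    return [torch.tensor(s, dtype=torch.float64)
            for s in itertools.product([0.0, 1.0], repeat=n)]

def log_p(v, W):
    def unnorm(u):
        # sum_h exp(-E(u, h))
        return sum(torch.exp((b @ u) + (c @ h) + (u @ W @ h))
                   for h in states(n_h))
    Z = sum(unnorm(u) for u in states(n_v))
    return torch.log(unnorm(v) / Z)

# Data expectation: E_{p(h|v0)}[v0 h^T] = v0 outer p(h = 1 | v0).
data_term = torch.outer(v0, torch.sigmoid(c + v0 @ W))

# Model expectation: E_{p(v,h)}[v h^T] by full enumeration.
pairs = [(v, h) for v in states(n_v) for h in states(n_h)]
weights = torch.stack([torch.exp((b @ v) + (c @ h) + (v @ W @ h))
                       for v, h in pairs])
probs = weights / weights.sum()
model_term = sum(p * torch.outer(v, h) for p, (v, h) in zip(probs, pairs))

grad = data_term - model_term  # analytic gradient of log p(v0) w.r.t. W

# Finite-difference check on a single weight w_00.
eps = 1e-6
W_plus, W_minus = W.clone(), W.clone()
W_plus[0, 0] += eps
W_minus[0, 0] -= eps
numeric = (log_p(v0, W_plus) - log_p(v0, W_minus)) / (2 * eps)
```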

Contrastive Divergence

Exact maximum likelihood training is expensive because the model expectation requires long Markov chains.

Contrastive divergence, or CD, approximates this expectation using short chains initialized at the data.

For CD-k:

  1. Start from a training example v_0.
  2. Sample hidden units h_0.
  3. Alternate Gibbs updates for k steps.
  4. Obtain reconstruction v_k.
  5. Update parameters using the difference between data statistics and reconstruction statistics.

The weight update approximates

\Delta W \propto v_0 h_0^\top - v_k h_k^\top.

For CD-1, only one Gibbs step is used. Surprisingly, this approximation often works reasonably well in practice.

Persistent Contrastive Divergence

Contrastive divergence initializes the Markov chain from data each iteration. This introduces bias.

Persistent contrastive divergence instead maintains long-running chains across training iterations. The chains evolve continuously during optimization.

Rather than restarting from data:

v_0 \sim \text{training example},

persistent methods use

v_0 \sim \text{previous chain state}.

This better approximates the true model distribution.
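A minimal sketch of the persistent variant (illustrative names and hyperparameters; bias updates omitted for brevity): the chain state persistent_v is initialized once and then only advanced, never reset to the data.

```python
import torch

torch.manual_seed(0)
n_v, n_h, lr = 8, 4, 0.05
W = torch.randn(n_v, n_h) * 0.01
b = torch.zeros(n_v)
c = torch.zeros(n_h)

# Persistent negative-phase chain: initialized once, never reset.
persistent_v = torch.bernoulli(torch.full((1, n_v), 0.5))

def sample_h(v):
    return torch.bernoulli(torch.sigmoid(v @ W + c))

def sample_v(h):
    return torch.bernoulli(torch.sigmoid(h @ W.T + b))

def pcd_step(v_data):
    global persistent_v, W
    h_data = sample_h(v_data)                        # positive phase: data
    persistent_v = sample_v(sample_h(persistent_v))  # advance the chain
    h_model = sample_h(persistent_v)                 # negative phase: chain
    W = W + lr * (v_data.T @ h_data - persistent_v.T @ h_model)

v_data = torch.bernoulli(torch.full((1, n_v), 0.8))
for _ in range(5):
    pcd_step(v_data)
```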

Binary and Gaussian RBMs

The standard RBM assumes binary visible units. This works naturally for binary images or Bernoulli data.

Real-valued data requires modified visible distributions.

A Gaussian-Bernoulli RBM uses:

  • Gaussian visible units,
  • binary hidden units.

The energy becomes

E(v,h) = \sum_i \frac{(v_i-b_i)^2}{2\sigma_i^2} - \sum_j c_j h_j - \sum_{ij} \frac{v_i}{\sigma_i^2} w_{ij} h_j.

This allows the RBM to model continuous inputs such as pixel intensities or sensor values.
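Completing the square in this energy shows that the visible conditional becomes Gaussian, p(v_i | h) = N(b_i + Σ_j w_ij h_j, σ_i²), while the hidden units remain binary. A sketch of the Gaussian visible sampling step (arbitrary parameters):

```python
import torch

torch.manual_seed(0)
n_v, n_h = 5, 3
W = torch.randn(n_v, n_h) * 0.1
b = torch.zeros(n_v)
sigma = torch.ones(n_v)  # per-unit standard deviations

def sample_visible_gaussian(h):
    # Completing the square gives p(v_i | h) = N(b_i + sum_j w_ij h_j, sigma_i^2).
    mean = b + h @ W.T
    return mean + sigma * torch.randn(n_v)

h = torch.bernoulli(torch.full((n_h,), 0.5))
v = sample_visible_gaussian(h)
```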

Other variants include:

Variant | Description
Bernoulli-Bernoulli RBM | Binary visible and hidden units
Gaussian-Bernoulli RBM | Continuous visible units
Softmax RBM | Categorical variables
Replicated softmax RBM | Text modeling
Conditional RBM | Conditioned sequential modeling

RBMs as Feature Learners

RBMs can learn distributed latent representations.

Suppose the visible units represent images. Hidden units may learn detectors for:

  • edges,
  • corners,
  • textures,
  • object parts.

A hidden unit activates strongly when its preferred visual pattern appears.

The representation learned by hidden units can then be used for:

  • classification,
  • clustering,
  • dimensionality reduction,
  • retrieval,
  • pretraining deeper networks.

Before large-scale supervised learning became dominant, RBMs were widely used for unsupervised representation learning.

Deep Belief Networks

RBMs played a central role in deep belief networks, or DBNs.

A DBN stacks RBMs layer by layer:

  1. Train the first RBM on raw data.
  2. Use hidden activations as input to the next RBM.
  3. Repeat for deeper layers.

This layer-wise unsupervised pretraining was historically important because deep networks were difficult to optimize directly.

Pretraining initialized the network in a useful region of parameter space before supervised fine-tuning.

Modern techniques such as:

  • better initialization,
  • normalization,
  • residual networks,
  • Adam optimization,
  • large datasets,
  • GPUs,

largely replaced RBM pretraining.

Relationship to Autoencoders

RBMs and autoencoders both learn latent representations, but they differ fundamentally.

RBM | Autoencoder
Probabilistic model | Deterministic model
Energy-based | Reconstruction-based
Learns a probability distribution | Learns an encoding function
Uses sampling | Uses direct forward passes
Gradients estimated by sampling | Gradients computed by backpropagation

An RBM defines a joint probability distribution over visible and hidden variables. An autoencoder directly maps inputs to latent vectors and reconstructs them.

Despite these differences, both attempt to discover useful structure in unlabeled data.

PyTorch Implementation

A minimal RBM implementation uses matrix multiplications and Bernoulli sampling.

import torch
from torch import nn

class RBM(nn.Module):
    def __init__(self, n_visible: int, n_hidden: int):
        super().__init__()

        # W has shape (n_visible, n_hidden); small random initialization
        # keeps early sigmoid activations near 0.5.
        self.W = nn.Parameter(
            torch.randn(n_visible, n_hidden) * 0.01
        )

        self.v_bias = nn.Parameter(torch.zeros(n_visible))
        self.h_bias = nn.Parameter(torch.zeros(n_hidden))

    def hidden_prob(self, v):
        # p(h_j = 1 | v) = sigmoid(c_j + sum_i w_ij v_i)
        return torch.sigmoid(v @ self.W + self.h_bias)

    def visible_prob(self, h):
        # p(v_i = 1 | h) = sigmoid(b_i + sum_j w_ij h_j)
        return torch.sigmoid(h @ self.W.T + self.v_bias)

    def sample_hidden(self, v):
        p = self.hidden_prob(v)
        h = torch.bernoulli(p)
        return p, h

    def sample_visible(self, h):
        p = self.visible_prob(h)
        v = torch.bernoulli(p)
        return p, v

A simple CD-1 training step:

def cd1_step(rbm, v0):
    # Positive phase: hidden sample driven by the data.
    _, h0 = rbm.sample_hidden(v0)
    # Negative phase: one Gibbs step produces a reconstruction.
    _, v1 = rbm.sample_visible(h0)
    _, h1 = rbm.sample_hidden(v1)

    positive_grad = v0.T @ h0
    negative_grad = v1.T @ h1

    # Approximates Delta W proportional to v_0 h_0^T - v_k h_k^T with k = 1;
    # the caller scales this by a learning rate (bias updates omitted here).
    return positive_grad - negative_grad

This code illustrates the core learning idea: reinforce correlations in data and weaken correlations in reconstructions.
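Putting the pieces together, a minimal training loop might apply these gradients with a small learning rate. The sketch below inlines the sampling steps instead of reusing the class above, adds the bias updates, and fits a single repeated binary pattern (all hyperparameters are illustrative):

```python
import torch

torch.manual_seed(0)
n_visible, n_hidden, lr = 6, 3, 0.1
W = torch.randn(n_visible, n_hidden) * 0.01
v_bias = torch.zeros(n_visible)
h_bias = torch.zeros(n_hidden)

# Toy "data": one binary pattern repeated as a batch.
data = torch.tensor([[1.0, 1.0, 0.0, 0.0, 1.0, 1.0]]).repeat(32, 1)

for epoch in range(50):
    v0 = data
    h0 = torch.bernoulli(torch.sigmoid(v0 @ W + h_bias))
    v1 = torch.bernoulli(torch.sigmoid(h0 @ W.T + v_bias))
    h1 = torch.bernoulli(torch.sigmoid(v1 @ W + h_bias))

    # CD-1 updates, averaged over the batch.
    batch = v0.shape[0]
    W += lr * (v0.T @ h0 - v1.T @ h1) / batch
    v_bias += lr * (v0 - v1).mean(dim=0)
    h_bias += lr * (h0 - h1).mean(dim=0)

# Reconstruction probabilities should now favor the trained pattern.
p_h = torch.sigmoid(data @ W + h_bias)
recon = torch.sigmoid(p_h @ W.T + v_bias)
```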

Limitations of RBMs

RBMs have several important limitations.

First, sampling-based training is slow compared with direct gradient-based optimization in feedforward networks.

Second, the partition function remains difficult to estimate.

Third, likelihood evaluation is expensive.

Fourth, Markov chains can mix slowly in high-dimensional spaces.

Fifth, modern architectures such as transformers and diffusion models scale more effectively to massive datasets.

As a result, RBMs are now used primarily for historical understanding, specialized probabilistic modeling, and research into energy-based learning.

Summary

Restricted Boltzmann machines simplify Boltzmann machines by using a bipartite graph structure. This restriction creates conditional independence properties that make inference and Gibbs sampling efficient.

An RBM defines an energy function over visible and hidden variables. Learning lowers the energy of training data and raises the energy of reconstructed samples. Contrastive divergence provides a practical approximate training method.

RBMs were historically important for unsupervised representation learning and deep belief networks. Although they are less common in modern large-scale deep learning, they remain foundational for understanding energy-based models, probabilistic latent-variable learning, and the evolution of deep neural networks.