# Restricted Boltzmann Machines

A restricted Boltzmann machine, or RBM, is a simplified Boltzmann machine with a bipartite structure. The restriction removes all connections between units of the same type. Visible units do not connect to other visible units, and hidden units do not connect to other hidden units.

This restriction makes inference and sampling tractable enough for practical training.

RBMs were historically important in early deep learning systems. They were used for unsupervised pretraining, feature learning, collaborative filtering, dimensionality reduction, and deep belief networks. Although modern deep learning relies more heavily on transformers and diffusion models, RBMs remain important for understanding probabilistic representation learning and energy-based models.

### Architecture of an RBM

An RBM contains two layers:

| Layer | Purpose |
|---|---|
| Visible layer | Represents observed data |
| Hidden layer | Represents latent features |

Let

$$
v \in \{0,1\}^{n_v}
$$

denote the visible vector and

$$
h \in \{0,1\}^{n_h}
$$

denote the hidden vector.

The architecture forms a bipartite graph:

- Every visible unit connects to every hidden unit.
- No visible-visible connections exist.
- No hidden-hidden connections exist.

This structure gives the model conditional independence properties that greatly simplify learning.

### Energy Function

The RBM energy function is

$$
E(v,h) =
-b^\top v
-c^\top h
-v^\top W h.
$$

genui{"math_block_widget_always_prefetch_v2":{"content":"E(v,h)=-b^Tv-c^Th-v^TWh"}}

Here:

| Symbol | Meaning |
|---|---|
| $v$ | Visible vector |
| $h$ | Hidden vector |
| $W$ | Weight matrix |
| $b$ | Visible bias vector |
| $c$ | Hidden bias vector |

If visible unit $v_i$ and hidden unit $h_j$ are simultaneously active, the interaction term contributes

$$
-w_{ij}v_i h_j.
$$

A positive weight lowers the energy when both units activate together. This encourages correlated activations.
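
As a sanity check, the energy translates into a few lines of PyTorch. This is a minimal sketch with illustrative sizes and randomly initialized parameters:

```python
import torch

n_visible, n_hidden = 6, 4
W = torch.randn(n_visible, n_hidden) * 0.01  # weight matrix
b = torch.zeros(n_visible)                   # visible biases
c = torch.zeros(n_hidden)                    # hidden biases

v = torch.bernoulli(torch.full((n_visible,), 0.5))  # random binary visible vector
h = torch.bernoulli(torch.full((n_hidden,), 0.5))   # random binary hidden vector

# E(v, h) = -b^T v - c^T h - v^T W h
energy = -(b @ v) - (c @ h) - (v @ W @ h)
```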

### Probability Distribution

The RBM defines a joint probability distribution:

$$
p(v,h) =
\frac{\exp(-E(v,h))}{Z},
$$

where

$$
Z =
\sum_{v,h}\exp(-E(v,h))
$$

is the partition function.

The marginal probability of a visible vector is

$$
p(v) =
\sum_h p(v,h).
$$

The model therefore assigns probabilities to observed data by summing over all possible hidden configurations.

As in all energy-based models:

- low energy corresponds to high probability,
- high energy corresponds to low probability.
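
For a toy model, both $Z$ and $p(v)$ can be computed exactly by brute-force enumeration, which makes the normalization concrete. A sketch, feasible only for very small $n_v$ and $n_h$:

```python
import itertools
import torch

n_visible, n_hidden = 3, 2
W = torch.randn(n_visible, n_hidden) * 0.1
b = torch.zeros(n_visible)
c = torch.zeros(n_hidden)

def energy(v, h):
    return -(b @ v) - (c @ h) - (v @ W @ h)

# All 2^{n_v} visible and 2^{n_h} hidden configurations.
vs = [torch.tensor(x) for x in itertools.product([0.0, 1.0], repeat=n_visible)]
hs = [torch.tensor(x) for x in itertools.product([0.0, 1.0], repeat=n_hidden)]

# Partition function: sum exp(-E) over every joint configuration.
Z = sum(torch.exp(-energy(v, h)) for v in vs for h in hs)

# Marginal p(v): sum the joint over all hidden configurations.
def p_v(v):
    return sum(torch.exp(-energy(v, h)) for h in hs) / Z

assert abs(sum(p_v(v) for v in vs) - 1.0) < 1e-5  # probabilities sum to one
```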

### Conditional Independence

The bipartite structure produces a major simplification.

Given the visible vector, all hidden units become conditionally independent:

$$
p(h\mid v) =
\prod_j p(h_j\mid v).
$$

Likewise, given the hidden vector, all visible units become conditionally independent:

$$
p(v\mid h) =
\prod_i p(v_i\mid h).
$$

This means that all hidden units can be sampled simultaneously given $v$, and likewise all visible units given $h$.

This property makes Gibbs sampling efficient.

### Hidden Unit Activation Probabilities

For binary hidden units, the activation probability is

$$
p(h_j = 1 \mid v) =
\sigma\left(c_j + \sum_i w_{ij}v_i\right),
$$

where $\sigma$ is the sigmoid function.

genui{"math_block_widget_always_prefetch_v2":{"content":"p(h_j=1\\mid v)=\\sigma\\left(c_j+\\sum_i w_{ij}v_i\\right)"}}

Similarly, visible units satisfy

$$
p(v_i = 1 \mid h) =
\sigma\left(b_i + \sum_j w_{ij}h_j\right).
$$

genui{"math_block_widget_always_prefetch_v2":{"content":"p(v_i=1\\mid h)=\\sigma\\left(b_i+\\sum_j w_{ij}h_j\\right)"}}

Each hidden unit therefore acts as a stochastic feature detector.

The hidden layer transforms the visible input into a latent representation. If certain visible patterns appear frequently together, the RBM can learn hidden units that respond strongly to those patterns.
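
A toy example makes the feature-detector view concrete. Below, a hypothetical hidden unit has weights matching a specific four-pixel pattern, so its activation probability is high exactly when that pattern is present:

```python
import torch

# Hypothetical weights for one hidden unit: prefers the first two
# pixels on and the last two off.
w_j = torch.tensor([2.0, 2.0, -2.0, -2.0])
c_j = -1.0  # hidden bias

v_match = torch.tensor([1.0, 1.0, 0.0, 0.0])
v_other = torch.tensor([0.0, 0.0, 1.0, 1.0])

print(torch.sigmoid(c_j + w_j @ v_match))  # ~0.95: pattern present
print(torch.sigmoid(c_j + w_j @ v_other))  # ~0.007: pattern absent
```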

### Free Energy

The hidden units can be analytically marginalized out in an RBM. This produces the free energy function:

$$
F(v) =
-b^\top v -
\sum_j
\log\left(
1+\exp(c_j + W_{:,j}^\top v)
\right).
$$

genui{"math_block_widget_always_prefetch_v2":{"content":"F(v)=-b^Tv-\\sum_j\\log\\left(1+\\exp(c_j+W_{:,j}^Tv)\\right)"}}

The visible distribution becomes

$$
p(v) =
\frac{\exp(-F(v))}{Z}.
$$

The free energy gives a scalar score for a visible vector. Lower free energy corresponds to higher probability under the model.

Free energy is widely used during training because it avoids explicitly enumerating hidden configurations.
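
In code, the inner $\log(1+\exp(\cdot))$ is exactly the softplus function, which PyTorch provides in a numerically stable form. A minimal sketch, assuming `v` holds a batch of visible vectors and the parameter names used above:

```python
import torch
import torch.nn.functional as F

def free_energy(v, W, b, c):
    # F(v) = -b^T v - sum_j log(1 + exp(c_j + W[:, j]^T v))
    pre_activation = v @ W + c  # shape: (batch, n_hidden)
    return -(v @ b) - F.softplus(pre_activation).sum(dim=-1)
```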

### Gibbs Sampling

RBMs are commonly sampled using block Gibbs sampling.

The procedure alternates between:

1. Sampling hidden units from visible units.
2. Sampling visible units from hidden units.

The sampling chain proceeds as

$$
v^{(0)}
\rightarrow
h^{(0)}
\rightarrow
v^{(1)}
\rightarrow
h^{(1)}
\rightarrow
\cdots
$$

Since all hidden units are conditionally independent given $v$, they can be sampled in parallel. The same holds for visible units given $h$.

A Gibbs step therefore consists of two matrix operations and two Bernoulli sampling operations.
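
A single block Gibbs step is therefore only a few lines. This sketch assumes binary units and the parameter shapes used throughout ($W$ of shape $n_v \times n_h$):

```python
import torch

def gibbs_step(v, W, b, c):
    # Sample every hidden unit in parallel given the visible layer.
    h = torch.bernoulli(torch.sigmoid(v @ W + c))
    # Sample every visible unit in parallel given the hidden layer.
    v_new = torch.bernoulli(torch.sigmoid(h @ W.T + b))
    return v_new, h
```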

### Learning Objective

The training objective is maximum likelihood estimation.

For training data

$$
\mathcal{D}=\{v^{(1)},\ldots,v^{(N)}\},
$$

we maximize

$$
\sum_n \log p(v^{(n)}).
$$

The gradient of the log-likelihood with respect to a weight satisfies

$$
\frac{\partial \log p(v)}{\partial w_{ij}} =
\mathbb{E}_{\text{data}}[v_i h_j] -
\mathbb{E}_{\text{model}}[v_i h_j].
$$

genui{"math_block_widget_always_prefetch_v2":{"content":"\\frac{\\partial \\log p(v)}{\\partial w_{ij}}=\\mathbb{E}_{\\mathrm{data}}[v_i h_j]-\\mathbb{E}_{\\mathrm{model}}[v_i h_j]"}}

The first expectation is computed using training data. The second expectation is computed using samples from the model distribution.

This equation has an intuitive interpretation:

- increase correlations observed in real data,
- decrease correlations generated excessively by the model.
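
In code, each expectation reduces to an averaged outer product over paired $(v, h)$ samples. A sketch, assuming `v_data`/`h_data` come from clamping the visibles to training examples and `v_model`/`h_model` come from model samples, however they are obtained:

```python
def weight_gradient(v_data, h_data, v_model, h_model):
    # Rows are paired samples; shapes (batch, n_visible) and (batch, n_hidden).
    positive = v_data.T @ h_data / v_data.shape[0]     # E_data[v_i h_j]
    negative = v_model.T @ h_model / v_model.shape[0]  # E_model[v_i h_j]
    return positive - negative  # ascent direction for the log-likelihood
```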

### Contrastive Divergence

Exact maximum likelihood training is expensive because the model expectation requires long Markov chains.

Contrastive divergence, or CD, approximates this expectation using short chains initialized at the data.

For CD-$k$:

1. Start from a training example $v_0$.
2. Sample hidden units $h_0$.
3. Alternate Gibbs updates for $k$ steps.
4. Obtain reconstruction $v_k$.
5. Update parameters using the difference between data statistics and reconstruction statistics.

The weight update approximates

$$
\Delta W
\propto
v_0 h_0^\top -
v_k h_k^\top.
$$

For CD-1, only one Gibbs step is used. Surprisingly, this approximation often works reasonably well in practice.
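
A direct translation of CD-$k$ into PyTorch, as a sketch with plain tensors (a class-based version of the $k=1$ case appears in the implementation section below):

```python
import torch

def cd_k(v0, W, b, c, k=1):
    # Positive phase: sample hiddens with visibles clamped to the data.
    h0 = torch.bernoulli(torch.sigmoid(v0 @ W + c))

    # Run k alternating Gibbs updates starting from the data.
    v, h = v0, h0
    for _ in range(k):
        v = torch.bernoulli(torch.sigmoid(h @ W.T + b))
        h = torch.bernoulli(torch.sigmoid(v @ W + c))

    # Data statistics minus reconstruction statistics.
    return (v0.T @ h0 - v.T @ h) / v0.shape[0]
```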

### Persistent Contrastive Divergence

Contrastive divergence restarts the Markov chain from the data at every update. Because a short chain started at the data never moves far from the training distribution, the resulting negative statistics are biased estimates of the true model expectation.

Persistent contrastive divergence instead maintains long-running chains across training iterations. The chains evolve continuously during optimization.

Rather than restarting from data:

$$
v_0 \sim \text{training example},
$$

persistent methods use

$$
v_0 \sim \text{previous chain state}.
$$

This better approximates the true model distribution.
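
A sketch of a persistent update, where the negative-phase chain state is carried between calls rather than reset to the data:

```python
import torch

def pcd_step(v_data, chain_v, W, b, c):
    # Positive phase: statistics from the training data, as in CD.
    h_data = torch.bernoulli(torch.sigmoid(v_data @ W + c))

    # Negative phase: advance the persistent chain one Gibbs step
    # instead of restarting it from the data.
    h = torch.bernoulli(torch.sigmoid(chain_v @ W + c))
    chain_v = torch.bernoulli(torch.sigmoid(h @ W.T + b))
    h_model = torch.bernoulli(torch.sigmoid(chain_v @ W + c))

    grad_W = (v_data.T @ h_data / v_data.shape[0]
              - chain_v.T @ h_model / chain_v.shape[0])
    return grad_W, chain_v  # caller stores chain_v for the next update
```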

### Binary and Gaussian RBMs

The standard RBM assumes binary visible units. This works naturally for binary images or Bernoulli data.

Real-valued data requires modified visible distributions.

A Gaussian-Bernoulli RBM uses:

- Gaussian visible units,
- binary hidden units.

The energy becomes

$$
E(v,h) =
\sum_i
\frac{(v_i-b_i)^2}{2\sigma_i^2} -
\sum_j c_j h_j -
\sum_{ij}
\frac{v_i}{\sigma_i^2}w_{ij}h_j.
$$

This allows the RBM to model continuous inputs such as pixel intensities or sensor values.
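
Completing the square in $v_i$ shows that, under this energy, each visible unit is conditionally Gaussian with mean $b_i + \sum_j w_{ij}h_j$ and standard deviation $\sigma_i$. A sketch of the corresponding visible sampling step:

```python
import torch

def sample_gaussian_visible(h, W, b, sigma):
    # p(v_i | h) is Gaussian: mean b_i + sum_j w_ij h_j, std sigma_i.
    mean = b + h @ W.T
    return mean + sigma * torch.randn_like(mean)
```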

Other variants include:

| Variant | Description |
|---|---|
| Bernoulli-Bernoulli RBM | Binary visible and hidden units |
| Gaussian-Bernoulli RBM | Continuous visible units |
| Softmax RBM | Categorical variables |
| Replicated softmax RBM | Text modeling |
| Conditional RBM | Sequence modeling with conditioning inputs |

### RBMs as Feature Learners

RBMs can learn distributed latent representations.

Suppose the visible units represent images. Hidden units may learn detectors for:

- edges,
- corners,
- textures,
- object parts.

A hidden unit activates strongly when its preferred visual pattern appears.

The representation learned by hidden units can then be used for:

- classification,
- clustering,
- dimensionality reduction,
- retrieval,
- pretraining deeper networks.

Before large-scale supervised learning became dominant, RBMs were widely used for unsupervised representation learning.

### Deep Belief Networks

RBMs played a central role in deep belief networks, or DBNs.

A DBN stacks RBMs layer by layer:

1. Train the first RBM on raw data.
2. Use hidden activations as input to the next RBM.
3. Repeat for deeper layers, as sketched below.
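
A sketch of the greedy stacking loop, assuming a hypothetical `train_rbm(data, n_hidden)` helper (built, for example, around the `cd1_step` function shown later) that returns trained parameters:

```python
import torch

def train_dbn(data, layer_sizes, train_rbm):
    # Greedy layer-wise pretraining: each RBM is trained on the hidden
    # activations produced by the layer below it.
    layers = []
    x = data
    for n_hidden in layer_sizes:
        W, b, c = train_rbm(x, n_hidden)  # hypothetical CD trainer
        layers.append((W, b, c))
        x = torch.sigmoid(x @ W + c)      # activations feed the next layer
    return layers
```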

This layer-wise unsupervised pretraining was historically important because deep networks were difficult to optimize directly.

Pretraining initialized the network in a useful region of parameter space before supervised fine-tuning.

Modern techniques such as:

- better initialization,
- normalization,
- residual networks,
- Adam optimization,
- large datasets,
- GPUs,

have largely replaced RBM pretraining.

### Relationship to Autoencoders

RBMs and autoencoders both learn latent representations, but they differ fundamentally.

| RBM | Autoencoder |
|---|---|
| Probabilistic model | Deterministic model |
| Energy-based | Reconstruction-based |
| Learns probability distribution | Learns encoding function |
| Uses sampling | Uses direct forward passes |
| Gradients estimated by sampling | Gradients computed exactly by backpropagation |

An RBM defines a joint probability distribution over visible and hidden variables. An autoencoder directly maps inputs to latent vectors and reconstructs them.

Despite these differences, both attempt to discover useful structure in unlabeled data.

### PyTorch Implementation

A minimal RBM implementation uses matrix multiplications and Bernoulli sampling.

```python id="m4xw9u"
import torch
from torch import nn

class RBM(nn.Module):
    def __init__(self, n_visible: int, n_hidden: int):
        super().__init__()

        # Small random weights; shape (n_visible, n_hidden).
        self.W = nn.Parameter(
            torch.randn(n_visible, n_hidden) * 0.01
        )

        self.v_bias = nn.Parameter(torch.zeros(n_visible))
        self.h_bias = nn.Parameter(torch.zeros(n_hidden))

    def hidden_prob(self, v):
        # p(h_j = 1 | v) for every hidden unit at once
        return torch.sigmoid(v @ self.W + self.h_bias)

    def visible_prob(self, h):
        # p(v_i = 1 | h) for every visible unit at once
        return torch.sigmoid(h @ self.W.T + self.v_bias)

    def sample_hidden(self, v):
        p = self.hidden_prob(v)
        h = torch.bernoulli(p)  # one parallel Bernoulli draw per unit
        return p, h

    def sample_visible(self, h):
        p = self.visible_prob(h)
        v = torch.bernoulli(p)
        return p, v
```

A simple CD-1 training step:

```python id="ebjg3g"
def cd1_step(rbm, v0):
    # Positive phase: clamp visibles to the data, sample hiddens.
    _, h0 = rbm.sample_hidden(v0)
    # Negative phase: one Gibbs step gives the reconstruction.
    _, v1 = rbm.sample_visible(h0)
    _, h1 = rbm.sample_hidden(v1)

    # Outer-product statistics from data and reconstruction.
    positive_grad = v0.T @ h0
    negative_grad = v1.T @ h1

    return positive_grad - negative_grad
```

This code illustrates the core learning idea: reinforce correlations in data and weaken correlations in reconstructions.
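
As a usage sketch, a hypothetical training loop could apply these statistics as a manual weight update, relying on the definitions above (bias updates, omitted from `cd1_step`, would follow the same pattern):

```python
def train(rbm, data, epochs=10, lr=0.01):
    # data: binary tensor of shape (num_examples, n_visible)
    for _ in range(epochs):
        grad = cd1_step(rbm, data)
        with torch.no_grad():
            rbm.W += lr * grad / data.shape[0]  # average over the batch
```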

### Limitations of RBMs

RBMs have several important limitations.

First, sampling-based training is slow compared with direct gradient-based optimization in feedforward networks.

Second, the partition function remains difficult to estimate.

Third, likelihood evaluation is expensive.

Fourth, Markov chains can mix slowly in high-dimensional spaces.

Fifth, modern architectures such as transformers and diffusion models scale more effectively to massive datasets.

As a result, RBMs are now used primarily for historical understanding, specialized probabilistic modeling, and research into energy-based learning.

### Summary

Restricted Boltzmann machines simplify Boltzmann machines by using a bipartite graph structure. This restriction creates conditional independence properties that make inference and Gibbs sampling efficient.

An RBM defines an energy function over visible and hidden variables. Learning lowers the energy of training data and raises the energy of reconstructed samples. Contrastive divergence provides a practical approximate training method.

RBMs were historically important for unsupervised representation learning and deep belief networks. Although they are less common in modern large-scale deep learning, they remain foundational for understanding energy-based models, probabilistic latent-variable learning, and the evolution of deep neural networks.

