# Energy-Based Models

Energy-based models, or EBMs, define probability distributions through energy functions rather than by directly outputting normalized probabilities. Instead of predicting probabilities with a softmax layer or an autoregressive factorization, an energy-based model assigns a scalar energy to each configuration of variables.

Low-energy configurations are treated as plausible. High-energy configurations are treated as unlikely.

Energy-based modeling is one of the most general frameworks in machine learning. Boltzmann machines, restricted Boltzmann machines, Hopfield networks, conditional random fields, score-based models, and some interpretations of diffusion models can all be viewed as energy-based systems.

The central idea is simple:

- data should have low energy,
- unrealistic configurations should have high energy.

### Energy Functions

Let $x$ denote an input configuration. An energy-based model defines a scalar-valued function

$$
E_\theta(x),
$$

where $\theta$ denotes model parameters.

The energy function maps each configuration to a real number:

$$
E_\theta : \mathcal{X} \rightarrow \mathbb{R}.
$$

Low energy corresponds to preferred states.

Unlike the output of a standard classifier, the energy itself carries no probabilistic meaning until it is normalized.

### From Energy to Probability

An energy function can be converted into a probability distribution using the Boltzmann distribution:

$$
p_\theta(x) =
\frac{\exp(-E_\theta(x))}{Z_\theta},
$$

where

$$
Z_\theta =
\int \exp(-E_\theta(x))\,dx
$$

or, in discrete spaces,

$$
Z_\theta =
\sum_x \exp(-E_\theta(x)).
$$

genui{"math_block_widget_always_prefetch_v2":{"content":"p_\\theta(x)=\\frac{\\exp(-E_\\theta(x))}{Z_\\theta}"}}

The quantity $Z_\theta$ is the partition function. It normalizes the distribution so that probabilities sum or integrate to one.

The partition function is usually difficult to compute because it depends on all possible configurations of $x$.

This difficulty is central to energy-based learning.
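
To make the normalization concrete, here is a minimal sketch that computes $Z_\theta$ by brute force over a toy discrete space of four configurations; real models rarely permit this enumeration.

```python
import torch

# Toy energy table over a discrete space of four configurations.
energies = torch.tensor([0.5, 2.0, 1.0, 3.0])

# Partition function: sum of exp(-E) over every configuration.
Z = torch.exp(-energies).sum()

# Normalized Boltzmann probabilities.
probs = torch.exp(-energies) / Z
print(probs, probs.sum())  # the probabilities sum to 1
```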

### Intuition Behind Energy Landscapes

An energy-based model defines an energy surface over the input space.

Imagine a landscape:

- valleys correspond to low-energy regions,
- hills correspond to high-energy regions.

Training attempts to reshape this landscape so that real data lies inside valleys.

For image modeling:

- realistic images should occupy low-energy regions,
- random noise should occupy high-energy regions.

For language modeling:

- grammatically coherent sequences should have low energy,
- nonsensical sequences should have high energy.

### Discriminative Energy-Based Models

Energy-based models do not have to model full probability distributions.

In discriminative settings, the energy depends on both an input and a label:

$$
E_\theta(x,y).
$$

The predicted label minimizes energy:

$$
\hat{y} =
\arg\min_y E_\theta(x,y).
$$

Instead of computing probabilities explicitly, the model simply selects the lowest-energy output.
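
As a minimal sketch of this inference rule, assuming an `energy_fn` that returns one scalar energy per $(x, y)$ pair over a small discrete label set:

```python
import torch

def predict(energy_fn, x, num_labels):
    # Score x against every candidate label (energy_fn is assumed
    # to return a scalar energy for each (x, y) pair).
    energies = torch.stack([energy_fn(x, y) for y in range(num_labels)])
    # Prediction = the label with the lowest energy.
    return torch.argmin(energies).item()
```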

This view unifies many learning systems.

For example, a softmax classifier can be interpreted as an energy model with

$$
E(x,y) = -f_y(x),
$$

where $f_y(x)$ is the logit score for class $y$.

### Learning Objectives

Energy-based models are trained by shaping the energy function.

A common objective is maximum likelihood estimation:

$$
\max_\theta
\sum_n \log p_\theta(x^{(n)}).
$$

Using the Boltzmann distribution,

$$
\log p_\theta(x) =
-E_\theta(x) -
\log Z_\theta.
$$

The gradient becomes

$$
\nabla_\theta \log p_\theta(x) = -
\nabla_\theta E_\theta(x)
+
\mathbb{E}_{p_\theta(x')}
[
\nabla_\theta E_\theta(x')
].
$$

genui{"math_block_widget_always_prefetch_v2":{"content":"\\nabla_\\theta \\log p_\\theta(x)=-\\nabla_\\theta E_\\theta(x)+\\mathbb{E}_{p_\\theta(x')}[\\nabla_\\theta E_\\theta(x')]"}}


This equation contains two opposing forces:

| Term | Effect |
|---|---|
| Data term | Lowers energy of real data |
| Model expectation term | Raises energy of model-generated states |

Training therefore pushes the model to distinguish observed data from alternative configurations.
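
In code, the two phases become a simple contrastive surrogate. The sketch below assumes negative samples `x_model` come from some approximate sampler, for example the Langevin procedure described later; when the negatives follow $p_\theta$, the gradient of this loss matches the maximum-likelihood gradient.

```python
def mle_surrogate_loss(model, x_data, x_model):
    # Positive phase: lower the energy of observed data.
    # Negative phase: raise the energy of model samples.
    return model(x_data).mean() - model(x_model).mean()
```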

### Contrastive Learning Interpretation

Many modern contrastive learning methods can be interpreted as energy-based learning.

Suppose we have:

- positive examples $(x,x^+)$,
- negative examples $(x,x^-)$.

An energy function assigns compatibility scores:

$$
E(x,x').
$$

Training minimizes energy for positive pairs and maximizes energy for negative pairs.

For example, InfoNCE-style objectives behave like softmax-normalized energy-based objectives, with negative similarity playing the role of energy.
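
A minimal InfoNCE-style sketch, where negative cosine similarity acts as the energy; the function names, shapes, and temperature below are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    # anchor, positive: (d,) embeddings; negatives: (k, d) embeddings.
    # Low energy = high similarity = compatible pair.
    pos_logit = F.cosine_similarity(anchor, positive, dim=-1) / temperature
    neg_logits = F.cosine_similarity(anchor.unsqueeze(0), negatives, dim=-1) / temperature
    logits = torch.cat([pos_logit.unsqueeze(0), neg_logits])
    # Softmax cross-entropy with the positive pair as class 0:
    # a normalized energy-based objective over one positive and k negatives.
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```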

Modern representation learning systems such as:

- SimCLR,
- MoCo,
- CLIP and other contrastive language-image models,

all admit natural energy-based interpretations.

### Margin-Based Energy Objectives

Some EBMs use margin losses instead of normalized likelihoods.

Suppose:

- $x^+$ is a positive sample,
- $x^-$ is a negative sample.

A margin objective is

$$
L =
\max(0, m + E(x^+) - E(x^-)),
$$

where $m$ is a margin constant.

genui{"math_block_widget_always_prefetch_v2":{"content":"L=\\max(0,m+E(x^+)-E(x^-))"}}

The model is penalized whenever a negative example's energy is lower than, or within the margin $m$ of, the corresponding positive example's energy.

This avoids computing partition functions entirely.
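
A direct PyTorch transcription of this hinge objective, as a sketch in which `model` is assumed to return one energy per example:

```python
import torch

def margin_loss(model, positive, negative, margin=1.0):
    pos_energy = model(positive)
    neg_energy = model(negative)
    # The loss is zero once each negative's energy exceeds the
    # corresponding positive's energy by at least the margin.
    return torch.clamp(margin + pos_energy - neg_energy, min=0).mean()
```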

### Sampling in Energy-Based Models

Many EBMs require sampling from the model distribution.

The probability distribution satisfies

$$
p(x)
\propto
\exp(-E(x)).
$$

Sampling therefore tends to move toward low-energy regions.

Common sampling methods include:

| Method | Idea |
|---|---|
| Gibbs sampling | Update variables conditionally |
| Langevin dynamics | Gradient-based stochastic sampling |
| Hamiltonian Monte Carlo | Momentum-based exploration |
| Metropolis-Hastings | Accept/reject transitions |

### Langevin Dynamics

Langevin dynamics is widely used in continuous EBMs.

The update rule is

$$
x_{t+1} =
x_t -
\alpha
\nabla_x E_\theta(x_t)
+
\sqrt{2\alpha}\,\epsilon_t,
$$

where:

| Symbol | Meaning |
|---|---|
| $\alpha$ | Step size |
| $\epsilon_t$ | Gaussian noise |

genui{"math_block_widget_always_prefetch_v2":{"content":"x_{t+1}=x_t-\\alpha\\nabla_x E_\\theta(x_t)+\\sqrt{2\\alpha}\\,\\epsilon_t"}}

The gradient term moves samples toward lower-energy regions. The noise term prevents collapse into a single mode.

Langevin sampling resembles gradient descent with stochastic perturbations.
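
The update rule translates into a short sampler. This is a sketch assuming a differentiable `model` that returns one energy per example:

```python
import torch

def langevin_sample(model, x, steps=100, alpha=1e-2):
    x = x.clone().detach()
    for _ in range(steps):
        x.requires_grad_(True)
        grad = torch.autograd.grad(model(x).sum(), x)[0]
        # Gradient step toward lower energy plus Gaussian exploration noise.
        x = (x - alpha * grad + (2 * alpha) ** 0.5 * torch.randn_like(x)).detach()
    return x
```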

### Score Functions

The score function of a probability distribution is

$$
\nabla_x \log p(x).
$$

For energy-based models,

$$
\log p(x) =
-E(x) -
\log Z.
$$

Differentiating with respect to $x$:

$$
\nabla_x \log p(x) =
-\nabla_x E(x).
$$

genui{"math_block_widget_always_prefetch_v2":{"content":"\\nabla_x \\log p(x)=-\\nabla_x E(x)"}}

This relationship is fundamental.

The gradient of the energy defines the score function up to sign.

Modern score-based generative models and diffusion models build directly on this idea.
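
In practice this means the score of an EBM is available through automatic differentiation, with no reference to $Z$; a minimal sketch assuming a differentiable `model`:

```python
import torch

def score(model, x):
    x = x.clone().detach().requires_grad_(True)
    # score(x) = grad_x log p(x) = -grad_x E(x); log Z drops out.
    return -torch.autograd.grad(model(x).sum(), x)[0]
```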

### Noise Contrastive Estimation

Noise contrastive estimation, or NCE, transforms density estimation into binary classification.

The model learns to distinguish:

- real data samples,
- noise samples.

Instead of explicitly computing the partition function, the model trains a discriminator between data and noise.

NCE was historically important because it allowed large unnormalized probabilistic models to be trained efficiently.

Several modern self-supervised methods inherit similar principles.
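
A simplified sketch of the NCE objective with one noise sample per data sample, assuming a known noise log-density `log_noise` and treating the unknown log partition function as a learnable scalar `log_z`:

```python
import torch
import torch.nn.functional as F

def nce_loss(model, log_z, x_data, x_noise, log_noise):
    # Classifier logit: log p_model(x) - log p_noise(x),
    # where log p_model(x) = -E(x) - log_z.
    logit_data = -model(x_data) - log_z - log_noise(x_data)
    logit_noise = -model(x_noise) - log_z - log_noise(x_noise)
    # Real data is labeled 1, noise is labeled 0.
    return (F.binary_cross_entropy_with_logits(logit_data, torch.ones_like(logit_data))
            + F.binary_cross_entropy_with_logits(logit_noise, torch.zeros_like(logit_noise)))
```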

### Score Matching

Score matching avoids computing partition functions entirely.

Rather than matching densities directly, the model matches score functions:

$$
\nabla_x \log p_\theta(x).
$$

The objective minimizes differences between model score fields and data score fields.

This idea became highly influential in:

- score-based generative models,
- diffusion probabilistic models,
- denoising score matching.
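
As a concrete instance, here is a minimal denoising score matching sketch, assuming Gaussian corruption with standard deviation `sigma`: the model's score at a noised point is regressed toward the score of the corruption kernel.

```python
import torch

def denoising_score_matching_loss(model, x, sigma=0.1):
    noise = torch.randn_like(x)
    x_noisy = (x + sigma * noise).detach().requires_grad_(True)
    # Model score at the noised point; create_graph=True keeps the
    # graph so the loss can be backpropagated to the parameters.
    model_score = -torch.autograd.grad(model(x_noisy).sum(), x_noisy,
                                       create_graph=True)[0]
    # Score of the Gaussian corruption kernel: (x - x_noisy) / sigma^2.
    target = -noise / sigma
    return ((model_score - target) ** 2).sum(dim=-1).mean()
```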

### Energy-Based Classification

Standard neural classifiers often use softmax outputs:

$$
p(y\mid x) =
\frac{\exp(f_y(x))}
{\sum_k \exp(f_k(x))}.
$$

This can be rewritten as an energy model:

$$
E(x,y) =
-f_y(x).
$$

The softmax distribution becomes

$$
p(y\mid x) =
\frac{\exp(-E(x,y))}
{\sum_k \exp(-E(x,k))}.
$$

Thus many discriminative neural networks already behave like energy-based systems.
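
The equivalence is easy to check numerically; a small sketch with arbitrary logits:

```python
import torch

logits = torch.tensor([2.0, -1.0, 0.5])   # f_y(x) for three classes
energies = -logits                        # E(x, y) = -f_y(x)

softmax_probs = torch.softmax(logits, dim=0)
ebm_probs = torch.exp(-energies) / torch.exp(-energies).sum()

print(torch.allclose(softmax_probs, ebm_probs))  # True
```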

### Continuous Energy-Based Models

Modern EBMs often use continuous variables rather than binary units.

A neural network parameterizes the energy:

$$
E_\theta(x) =
f_\theta(x).
$$

The network may take:

- images,
- sequences,
- latent vectors,
- trajectories,
- multimodal inputs.

Unlike RBMs, continuous EBMs can scale to high-dimensional spaces using deep architectures.

### Modern Neural Energy Models

Modern EBMs often use convolutional or transformer architectures.

Examples include:

| Domain | Energy function architecture |
|---|---|
| Vision | CNN-based energy network |
| NLP | Transformer encoder |
| Multimodal systems | Cross-attention networks |
| Reinforcement learning | State-action energy functions |

The output is a scalar energy score.

Unlike an autoregressive model, an EBM need not define a normalized conditional distribution at every generation step.

### Energy-Based Reinforcement Learning

Energy-based ideas also appear in reinforcement learning.

A policy can be represented as:

$$
\pi(a\mid s)
\propto
\exp(-E(s,a)).
$$

The energy measures the compatibility between states and actions.

This connects reinforcement learning to probabilistic inference and maximum-entropy optimization.
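
For a discrete action space this policy is a softmax over negative energies; a minimal sketch in which `energy_fn` is assumed to return one energy per action for a given state:

```python
import torch

def boltzmann_policy(energy_fn, state):
    energies = energy_fn(state)               # one energy per action
    probs = torch.softmax(-energies, dim=-1)  # pi(a|s) proportional to exp(-E(s, a))
    return torch.multinomial(probs, num_samples=1).item()
```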

### Relationship to Diffusion Models

Diffusion models are closely related to EBMs.

A diffusion model learns a score field:

$$
\nabla_x \log p(x).
$$

From the energy-based perspective:

$$
\nabla_x \log p(x) =
-\nabla_x E(x).
$$

Thus diffusion systems implicitly learn gradients of an energy landscape.

The reverse diffusion process can be interpreted as iterative movement toward lower-energy regions guided by learned score estimates.

### Advantages of Energy-Based Models

Energy-based models have several attractive properties.

#### Flexible Output Spaces

EBMs can model:

- structured outputs,
- graphs,
- sets,
- sequences,
- multimodal objects.

#### Unified Framework

Many probabilistic and discriminative models can be interpreted as EBMs.

#### Natural Representation Learning

Latent structure emerges through energy minimization.

#### Compatibility with Sampling

Sampling-based inference allows flexible generation mechanisms.

#### No Need for Explicit Normalized Outputs

The model only needs to define relative preference between configurations.

### Challenges of Energy-Based Models

EBMs also have important difficulties.

#### Partition Functions

Exact normalization is usually intractable.

#### Sampling Cost

Markov chain sampling may be slow.

#### Training Instability

Poorly shaped energy surfaces can produce unstable dynamics.

#### Mode Collapse

Sampling procedures may fail to explore all modes.

#### Scalability

Efficient large-scale EBM training remains difficult compared with autoregressive likelihood training.

### PyTorch Example

A simple neural EBM may look like this:

```python
import torch
from torch import nn

class EnergyModel(nn.Module):
    def __init__(self, input_dim):
        super().__init__()

        # MLP mapping each input vector to a single scalar energy.
        self.network = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, 1)
        )

    def forward(self, x):
        # Return one energy value per example in the batch.
        return self.network(x).squeeze(-1)
```

The network outputs a scalar energy for each input.

A simple contrastive objective:

```python
def contrastive_loss(model, positive, negative):
    pos_energy = model(positive)   # energies to push down
    neg_energy = model(negative)   # energies to push up

    # Note: this difference is unbounded below, so practical training
    # usually adds a margin or an energy-magnitude regularizer to keep
    # the energies from diverging.
    return (pos_energy - neg_energy).mean()
```

The model learns to assign lower energy to positive samples than to negative samples.

### Relationship to Modern Deep Learning

Energy-based thinking appears throughout modern AI.

| Area | Energy-based interpretation |
|---|---|
| Contrastive learning | Compatibility energies |
| Diffusion models | Score gradients |
| Reinforcement learning | Energy over state-action pairs |
| Structured prediction | Energy minimization |
| Vision-language models | Cross-modal compatibility |
| Retrieval systems | Similarity energies |

Even when modern systems are not explicitly called EBMs, many optimize implicit energy landscapes.

### Summary

Energy-based models represent probability and compatibility through scalar energy functions. Low-energy configurations correspond to plausible states, while high-energy configurations correspond to unlikely states.

EBMs provide a highly general framework connecting probabilistic modeling, discriminative learning, sampling, representation learning, and generative modeling. Boltzmann machines, contrastive learning systems, diffusion models, and many modern representation-learning methods all contain energy-based interpretations.

The framework is mathematically elegant and broadly expressive, but practical training remains challenging because normalization and sampling are often computationally expensive. Despite these difficulties, energy-based ideas continue to influence modern deep learning research across generative modeling, reinforcement learning, multimodal systems, and representation learning.

