Energy-based models, or EBMs, define probability distributions using energy functions rather than normalized output probabilities directly. Instead of predicting a probability with a softmax layer or autoregressive factorization, an energy-based model assigns a scalar energy to each configuration of variables.
Low-energy configurations are treated as plausible. High-energy configurations are treated as unlikely.
Energy-based modeling is one of the most general frameworks in machine learning. Boltzmann machines, restricted Boltzmann machines, Hopfield networks, conditional random fields, score-based models, and some interpretations of diffusion models can all be viewed as energy-based systems.
The central idea is simple:
- data should have low energy,
- unrealistic configurations should have high energy.
Energy Functions
Let $x$ denote an input configuration. An energy-based model defines a scalar-valued function $E_\theta(x)$, where $\theta$ denotes the model parameters.

The energy function maps each configuration to a real number:

$$E_\theta : \mathcal{X} \to \mathbb{R}$$

Low energy corresponds to preferred states.
Unlike standard classifiers, the energy itself has no probabilistic meaning unless normalized.
From Energy to Probability
An energy function can be converted into a probability distribution using the Boltzmann distribution:

$$p_\theta(x)=\frac{\exp(-E_\theta(x))}{Z_\theta}$$

The quantity

$$Z_\theta=\int \exp(-E_\theta(x))\,dx$$

or, in discrete spaces,

$$Z_\theta=\sum_x \exp(-E_\theta(x))$$

is the partition function. It normalizes the distribution so probabilities sum or integrate to one.

The partition function is usually difficult to compute because it depends on all possible configurations of $x$.

This difficulty is central to energy-based learning.
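As a toy illustration, the normalization can be computed exactly when the state space is small and discrete. The energies below are made-up values:

```python
import math

# Toy discrete state space with hand-picked energies (illustrative values).
energies = {"a": 0.0, "b": 1.0, "c": 3.0}

# Partition function: sum of exp(-E(x)) over all configurations.
Z = sum(math.exp(-e) for e in energies.values())

# Boltzmann probabilities: p(x) = exp(-E(x)) / Z.
probs = {x: math.exp(-e) / Z for x, e in energies.items()}

print(probs)  # lowest-energy state "a" gets the highest probability
```

In high-dimensional spaces this exhaustive sum is exactly what becomes intractable.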
Intuition Behind Energy Landscapes
An energy-based model defines an energy surface over the input space.
Imagine a landscape:
- valleys correspond to low-energy regions,
- hills correspond to high-energy regions.
Training attempts to reshape this landscape so that real data lies inside valleys.
For image modeling:
- realistic images should occupy low-energy regions,
- random noise should occupy high-energy regions.
For language modeling:
- grammatically coherent sequences should have low energy,
- nonsensical sequences should have high energy.
Discriminative Energy-Based Models
Energy-based models do not have to model full probability distributions.
In discriminative settings, the energy depends on both an input and a label:

$$E_\theta(x, y)$$

The predicted label minimizes energy:

$$\hat{y} = \arg\min_{y} E_\theta(x, y)$$
Instead of computing probabilities explicitly, the model simply selects the lowest-energy output.
This view unifies many learning systems.
For example, a softmax classifier can be interpreted as an energy model with

$$E_\theta(x, y) = -f_y(x)$$

where $f_y(x)$ is the logit score for class $y$.
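A minimal sketch of this equivalence, using hypothetical logit values: selecting the lowest energy and selecting the highest logit pick the same class.

```python
import torch

# Hypothetical logits from a classifier for one input (illustrative values).
logits = torch.tensor([2.0, -1.0, 0.5])

# Energy view: E(x, y) = -f_y(x), so prediction = argmin energy.
energies = -logits
pred_energy = torch.argmin(energies).item()

# Standard view: prediction = argmax logit.
pred_softmax = torch.argmax(logits).item()

print(pred_energy, pred_softmax)  # both select class 0
```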
Learning Objectives
Energy-based models are trained by shaping the energy function.
A common objective is maximum likelihood estimation:

$$\max_\theta \; \mathbb{E}_{x \sim p_{\text{data}}}[\log p_\theta(x)]$$

Using the Boltzmann distribution,

$$\log p_\theta(x) = -E_\theta(x) - \log Z_\theta$$

The gradient becomes

$$\nabla_\theta \log p_\theta(x)=-\nabla_\theta E_\theta(x)+\mathbb{E}_{x' \sim p_\theta}[\nabla_\theta E_\theta(x')]$$
This equation contains two opposing forces:
| Term | Effect |
|---|---|
| Data term | Lowers energy of real data |
| Model expectation term | Raises energy of model-generated states |
Training therefore pushes the model to distinguish observed data from alternative configurations.
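The two terms can be turned into a simple surrogate loss whose gradient matches the expression above, assuming negative samples come from some sampler. In this sketch, random noise stands in for model-generated samples:

```python
import torch
from torch import nn

# Minimal scalar energy network (illustrative architecture).
energy_net = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))

def mle_surrogate_loss(data_batch, model_batch):
    # Gradient of this loss matches the two-term maximum-likelihood gradient:
    # it lowers energy on data and raises energy on model samples.
    return energy_net(data_batch).mean() - energy_net(model_batch).mean()

data = torch.randn(16, 2) * 0.1   # stand-in for real data
negatives = torch.randn(16, 2)    # stand-in for samples from the model
loss = mle_surrogate_loss(data, negatives)
loss.backward()
```

In practice the negatives would be produced by an MCMC procedure such as Langevin dynamics rather than fixed noise.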
Contrastive Learning Interpretation
Many modern contrastive learning methods can be interpreted as energy-based learning.
Suppose we have:
- positive examples $x^+$,
- negative examples $x^-$.

An energy function assigns compatibility scores:

$$E_\theta(x, x')$$
Training minimizes energy for positive pairs and maximizes energy for negative pairs.
For example, InfoNCE-style objectives often behave like normalized energy-based objectives.
Modern representation learning systems such as:
- SimCLR,
- MoCo,
- CLIP and other contrastive language-image models,
all admit energy-based interpretations.
Margin-Based Energy Objectives
Some EBMs use margin losses instead of normalized likelihoods.
Suppose:
- $x^+$ is a positive sample,
- $x^-$ is a negative sample.

A margin objective is

$$L=\max(0,\, m+E(x^+)-E(x^-))$$

where $m$ is a margin constant.
The model is penalized when negative examples have energy too close to or lower than positive examples.
This avoids computing partition functions entirely.
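A direct PyTorch sketch of this margin objective, with illustrative energy values:

```python
import torch

def margin_loss(pos_energy, neg_energy, m=1.0):
    # Hinge penalty when E(x+) is not at least m below E(x-).
    return torch.clamp(m + pos_energy - neg_energy, min=0.0).mean()

# Illustrative energies for two positive/negative pairs.
pos = torch.tensor([0.2, 0.5])
neg = torch.tensor([2.0, 0.6])
loss = margin_loss(pos, neg)
print(loss)  # first pair satisfies the margin, second is penalized
```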
Sampling in Energy-Based Models
Many EBMs require sampling from the model distribution.
The probability distribution satisfies

$$p_\theta(x) \propto \exp(-E_\theta(x))$$
Sampling therefore tends to move toward low-energy regions.
Common sampling methods include:
| Method | Idea |
|---|---|
| Gibbs sampling | Update variables conditionally |
| Langevin dynamics | Gradient-based stochastic sampling |
| Hamiltonian Monte Carlo | Momentum-based exploration |
| Metropolis-Hastings | Accept/reject transitions |
Langevin Dynamics
Langevin dynamics is widely used in continuous EBMs.
The update rule is

$$x_{t+1}=x_t-\alpha\nabla_x E_\theta(x_t)+\sqrt{2\alpha}\,\epsilon_t$$

where:

| Symbol | Meaning |
|---|---|
| $\alpha$ | Step size |
| $\epsilon_t$ | Gaussian noise |
The gradient term moves samples toward lower-energy regions. The noise term prevents collapse into a single mode.
Langevin sampling resembles gradient descent with stochastic perturbations.
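A minimal Langevin sampler along these lines can be sketched for a quadratic toy energy, whose stationary distribution is a standard Gaussian:

```python
import torch

def langevin_sample(energy_fn, x0, steps=100, alpha=0.01):
    # Iterate: x <- x - alpha * grad E(x) + sqrt(2 * alpha) * noise.
    x = x0.clone()
    for _ in range(steps):
        x = x.detach().requires_grad_(True)
        grad, = torch.autograd.grad(energy_fn(x).sum(), x)
        x = x - alpha * grad + (2 * alpha) ** 0.5 * torch.randn_like(x)
    return x.detach()

# Quadratic energy E(x) = 0.5 * ||x||^2: samples concentrate near the origin.
samples = langevin_sample(lambda x: 0.5 * (x ** 2).sum(dim=-1),
                          torch.randn(256, 2) * 5.0, steps=200)
```

Real EBM training typically uses short chains with tuned step sizes; the values here are illustrative.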
Score Functions
The score function of a probability distribution is

$$s(x) = \nabla_x \log p(x)$$

For energy-based models,

$$\log p(x) = -E(x) - \log Z$$

Differentiating with respect to $x$:

$$\nabla_x \log p(x)=-\nabla_x E(x)$$
This relationship is fundamental.
The gradient of the energy defines the score function up to sign.
Modern score-based generative models and diffusion models build directly on this idea.
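This relationship means the score of an EBM can be obtained with automatic differentiation. A small sketch using a quadratic energy, where the exact score is known to be $-x$:

```python
import torch

def score_from_energy(energy_fn, x):
    # score(x) = -grad_x E(x), computed via autograd.
    x = x.detach().requires_grad_(True)
    grad, = torch.autograd.grad(energy_fn(x).sum(), x)
    return -grad

# For E(x) = 0.5 * ||x||^2 the score is exactly -x.
x = torch.randn(4, 3)
score = score_from_energy(lambda x: 0.5 * (x ** 2).sum(dim=-1), x)
```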
Noise Contrastive Estimation
Noise contrastive estimation, or NCE, transforms density estimation into binary classification.
The model learns to distinguish:
- real data samples,
- noise samples.
Instead of explicitly computing the partition function, the model trains a discriminator between data and noise.
NCE was historically important because it allowed large unnormalized probabilistic models to be trained efficiently.
Several modern self-supervised methods inherit similar principles.
Score Matching
Score matching avoids computing partition functions entirely.
Rather than matching densities directly, the model matches score functions:

$$\min_\theta \; \mathbb{E}_{x \sim p_{\text{data}}}\left[\left\| \nabla_x \log p_\theta(x) - \nabla_x \log p_{\text{data}}(x) \right\|^2\right]$$
The objective minimizes differences between model score fields and data score fields.
This idea became highly influential in:
- score-based generative models,
- diffusion probabilistic models,
- denoising score matching.
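A denoising variant of score matching can be sketched as follows, assuming a small score network and a single fixed noise level (both illustrative choices). For Gaussian corruption, the regression target for a perturbed point is the known quantity $-(\tilde{x}-x)/\sigma^2$:

```python
import torch
from torch import nn

# Illustrative score network mapping points to score vectors.
score_net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 2))

def denoising_score_matching_loss(x, sigma=0.1):
    # Perturb the data, then regress the predicted score at the perturbed
    # point toward the known Gaussian target -(x_tilde - x) / sigma^2.
    noise = torch.randn_like(x) * sigma
    x_tilde = x + noise
    target = -noise / sigma ** 2
    return ((score_net(x_tilde) - target) ** 2).sum(dim=-1).mean()

loss = denoising_score_matching_loss(torch.randn(32, 2))
loss.backward()
```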
Energy-Based Classification
Standard neural classifiers often use softmax outputs:

$$p(y \mid x) = \frac{\exp(f_y(x))}{\sum_{y'} \exp(f_{y'}(x))}$$

This can be rewritten as an energy model:

$$E(x, y) = -f_y(x)$$

The softmax distribution becomes

$$p(y \mid x) = \frac{\exp(-E(x, y))}{\sum_{y'} \exp(-E(x, y'))}$$
Thus many discriminative neural networks already behave like energy-based systems.
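The equivalence can be checked numerically. Assuming arbitrary logit values, explicit normalization of $\exp(-E)$ reproduces the softmax distribution:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([1.5, -0.3, 0.2])  # illustrative logits

# Standard view: softmax over logits.
p_softmax = F.softmax(logits, dim=-1)

# Energy view: E(x, y) = -f_y(x), normalized explicitly.
energies = -logits
p_energy = torch.exp(-energies) / torch.exp(-energies).sum()

print(torch.allclose(p_softmax, p_energy))  # True
```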
Continuous Energy-Based Models
Modern EBMs often use continuous variables rather than binary units.
A neural network parameterizes the energy:

$$E_\theta(x) = g_\theta(x)$$

where $g_\theta$ is a deep network with a scalar output.
The network may take:
- images,
- sequences,
- latent vectors,
- trajectories,
- multimodal inputs.
Unlike RBMs, continuous EBMs can scale to high-dimensional spaces using deep architectures.
Modern Neural Energy Models
Modern EBMs often use convolutional or transformer architectures.
Examples include:
| Domain | Energy function architecture |
|---|---|
| Vision | CNN-based energy network |
| NLP | Transformer encoder |
| Multimodal systems | Cross-attention networks |
| Reinforcement learning | State-action energy functions |
The output is a scalar energy score.
Unlike autoregressive models, the model need not define a normalized conditional probability at every step.
Energy-Based Reinforcement Learning
Energy-based ideas also appear in reinforcement learning.
A policy can be represented as:

$$\pi(a \mid s) \propto \exp(-E(s, a))$$
The energy measures the compatibility between states and actions.
This connects reinforcement learning to probabilistic inference and maximum-entropy optimization.
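A sketch of such a Boltzmann policy over a small discrete action set, with made-up state-action energies:

```python
import torch

# Hypothetical state-action energies for one state over 4 discrete actions.
action_energies = torch.tensor([0.1, 2.0, 0.5, 3.0])

# Policy: pi(a | s) proportional to exp(-E(s, a)).
policy = torch.softmax(-action_energies, dim=-1)

# Sample an action from the induced distribution.
action = torch.multinomial(policy, num_samples=1).item()
```

Low-energy actions receive the most probability mass, while the softmax keeps the policy stochastic, in line with maximum-entropy formulations.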
Relationship to Diffusion Models
Diffusion models are closely related to EBMs.
A diffusion model learns a score field:

$$s_\theta(x, t) \approx \nabla_x \log p_t(x)$$

From the energy-based perspective:

$$s_\theta(x, t) = -\nabla_x E_\theta(x, t)$$
Thus diffusion systems implicitly learn gradients of an energy landscape.
The reverse diffusion process can be interpreted as iterative movement toward lower-energy regions guided by learned score estimates.
Advantages of Energy-Based Models
Energy-based models have several attractive properties.
Flexible Output Spaces
EBMs can model:
- structured outputs,
- graphs,
- sets,
- sequences,
- multimodal objects.
Unified Framework
Many probabilistic and discriminative models can be interpreted as EBMs.
Natural Representation Learning
Latent structure emerges through energy minimization.
Compatibility with Sampling
Sampling-based inference allows flexible generation mechanisms.
No Need for Explicit Normalized Outputs
The model only needs to define relative preference between configurations.
Challenges of Energy-Based Models
EBMs also have important difficulties.
Partition Functions
Exact normalization is usually intractable.
Sampling Cost
Markov chain sampling may be slow.
Training Instability
Poorly shaped energy surfaces can produce unstable dynamics.
Mode Collapse
Sampling procedures may fail to explore all modes.
Scalability
Efficient large-scale EBM training remains difficult compared with autoregressive likelihood training.
PyTorch Example
A simple neural EBM may look like this:

```python
import torch
from torch import nn

class EnergyModel(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, 1)
        )

    def forward(self, x):
        return self.network(x).squeeze(-1)
```

The network outputs a scalar energy for each input.
A simple contrastive objective:

```python
def contrastive_loss(model, positive, negative):
    # Push energy down on positives and up on negatives.
    pos_energy = model(positive)
    neg_energy = model(negative)
    return (pos_energy - neg_energy).mean()
```

The model learns to assign lower energy to positive samples than to negative samples. Note that this raw difference is unbounded, so practical training typically adds a margin or regularization on the energy values.
Relationship to Modern Deep Learning
Energy-based thinking appears throughout modern AI.
| Area | Energy-based interpretation |
|---|---|
| Contrastive learning | Compatibility energies |
| Diffusion models | Score gradients |
| Reinforcement learning | Energy over state-action pairs |
| Structured prediction | Energy minimization |
| Vision-language models | Cross-modal compatibility |
| Retrieval systems | Similarity energies |
Even when modern systems are not explicitly called EBMs, many optimize implicit energy landscapes.
Summary
Energy-based models represent probability and compatibility through scalar energy functions. Low-energy configurations correspond to plausible states, while high-energy configurations correspond to unlikely states.
EBMs provide a highly general framework connecting probabilistic modeling, discriminative learning, sampling, representation learning, and generative modeling. Boltzmann machines, contrastive learning systems, diffusion models, and many modern representation-learning methods all contain energy-based interpretations.
The framework is mathematically elegant and broadly expressive, but practical training remains challenging because normalization and sampling are often computationally expensive. Despite these difficulties, energy-based ideas continue to influence modern deep learning research across generative modeling, reinforcement learning, multimodal systems, and representation learning.