Energy-Based Models

Energy-based models, or EBMs, define probability distributions using energy functions rather than normalized output probabilities directly. Instead of predicting a probability with a softmax layer or autoregressive factorization, an energy-based model assigns a scalar energy to each configuration of variables.

Low-energy configurations are treated as plausible. High-energy configurations are treated as unlikely.

Energy-based modeling is one of the most general frameworks in machine learning. Boltzmann machines, restricted Boltzmann machines, Hopfield networks, conditional random fields, score-based models, and some interpretations of diffusion models can all be viewed as energy-based systems.

The central idea is simple:

  • data should have low energy,
  • unrealistic configurations should have high energy.

Energy Functions

Let

x

denote an input configuration. An energy-based model defines a scalar-valued function

E_\theta(x),

where θ denotes the model parameters.

The energy function maps each configuration to a real number:

E_\theta : \mathcal{X} \rightarrow \mathbb{R}.

Low energy corresponds to preferred states.

Unlike the output of a standard classifier, the energy itself has no probabilistic meaning until it is normalized.

From Energy to Probability

An energy function can be converted into a probability distribution using the Boltzmann distribution:

p_\theta(x) = \frac{\exp(-E_\theta(x))}{Z_\theta},

where

Z_\theta = \int \exp(-E_\theta(x))\,dx

or, in discrete spaces,

Z_\theta = \sum_x \exp(-E_\theta(x)).

genui{“math_block_widget_always_prefetch_v2”:{“content”:“p_\theta(x)=\frac{\exp(-E_\theta(x))}{Z_\theta}”}}

The quantity

Z_\theta

is the partition function. It normalizes the distribution so probabilities sum or integrate to one.

The partition function is usually difficult to compute because it depends on all possible configurations of x.

This difficulty is central to energy-based learning.
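
When the configuration space is small and discrete, the normalization is tractable and the mapping from energies to probabilities can be computed directly. A minimal sketch, using arbitrary illustrative energy values:

import torch

# Illustrative energies for a tiny discrete space of four configurations.
energies = torch.tensor([0.5, 2.0, 1.0, 3.5])

# Boltzmann distribution: p(x) = exp(-E(x)) / Z.
unnormalized = torch.exp(-energies)
Z = unnormalized.sum()          # partition function, tractable only because the space is tiny
probs = unnormalized / Z

print(probs)        # the lowest-energy configuration receives the highest probability
print(probs.sum())  # 1.0

In realistic models the sum or integral defining Z_θ cannot be enumerated, which is why the training and sampling techniques below work around it.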

Intuition Behind Energy Landscapes

An energy-based model defines an energy surface over the input space.

Imagine a landscape:

  • valleys correspond to low-energy regions,
  • hills correspond to high-energy regions.

Training attempts to reshape this landscape so that real data lies inside valleys.

For image modeling:

  • realistic images should occupy low-energy regions,
  • random noise should occupy high-energy regions.

For language modeling:

  • grammatically coherent sequences should have low energy,
  • nonsensical sequences should have high energy.

Discriminative Energy-Based Models

Energy-based models do not have to model full probability distributions.

In discriminative settings, the energy depends on both an input and a label:

E_\theta(x, y).

The predicted label minimizes energy:

\hat{y} = \arg\min_y E_\theta(x, y).

Instead of computing probabilities explicitly, the model simply selects the lowest-energy output.
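
A minimal sketch of this inference rule, using a hypothetical network that outputs one energy per candidate label:

import torch
from torch import nn

class JointEnergy(nn.Module):
    """Hypothetical network producing E(x, y) for every candidate label y."""
    def __init__(self, input_dim, num_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        # Row i holds the energies E(x_i, y) for each label y.
        return self.net(x)

def predict(model, x):
    energies = model(x)             # shape: (batch, num_classes)
    return energies.argmin(dim=-1)  # y_hat = argmin_y E(x, y)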

This view unifies many learning systems.

For example, a softmax classifier can be interpreted as an energy model with

E(x, y) = -f_y(x),

where f_y(x) is the logit score for class y.

Learning Objectives

Energy-based models are trained by shaping the energy function.

A common objective is maximum likelihood estimation:

\max_\theta \sum_n \log p_\theta(x^{(n)}).

Using the Boltzmann distribution,

\log p_\theta(x) = -E_\theta(x) - \log Z_\theta.

The gradient becomes

\nabla_\theta \log p_\theta(x) = -\nabla_\theta E_\theta(x) + \mathbb{E}_{p_\theta(x')}\left[\nabla_\theta E_\theta(x')\right].

genui{“math_block_widget_always_prefetch_v2”:{“content”:"\nabla_\theta \log p_\theta(x)=-\nabla_\theta E_\theta(x)+\mathbb{E}{p\theta(x’)}[\nabla_\theta E_\theta(x’)]"}} 

This equation contains two opposing forces:

  • Data term: lowers the energy of real data.
  • Model expectation term: raises the energy of model-generated states.

Training therefore pushes the model to distinguish observed data from alternative configurations.
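
In practice the model expectation is approximated with samples drawn from the model, for example with the Langevin procedure described later. A minimal sketch of the resulting two-term loss, assuming `model_samples` come from such a sampler:

def maximum_likelihood_step(model, data_batch, model_samples):
    # Data term: push the energy of real data down.
    data_energy = model(data_batch).mean()
    # Model term: push the energy of model samples up.
    sample_energy = model(model_samples).mean()
    # Minimizing this difference follows the negative log-likelihood gradient,
    # with the partition-function term replaced by a sample estimate.
    return data_energy - sample_energy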

Contrastive Learning Interpretation

Many modern contrastive learning methods can be interpreted as energy-based learning.

Suppose we have:

  • positive examples (x, x^+),
  • negative examples (x, x^-).

An energy function assigns compatibility scores:

E(x, x').

Training minimizes energy for positive pairs and maximizes energy for negative pairs.

For example, InfoNCE-style objectives often behave like normalized energy-based objectives.
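
As an illustration, an InfoNCE-style loss can be written directly in terms of pairwise energies, with the negative energy acting as a compatibility logit. A sketch using negative cosine similarity as the energy (the temperature is an arbitrary illustrative value):

import torch
import torch.nn.functional as F

def info_nce_from_energies(z_a, z_b, temperature=0.1):
    """z_a, z_b: embedding batches whose i-th rows form positive pairs."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    # Pairwise energy: negative cosine similarity (low energy = compatible).
    energy = -(z_a @ z_b.t()) / temperature
    # Negative energies act as logits; the positive pair sits on the diagonal.
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(-energy, targets)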

Modern representation learning systems such as:

  • SimCLR,
  • CLIP,
  • MoCo,
  • contrastive language-image models,

all contain strong energy-based interpretations.

Margin-Based Energy Objectives

Some EBMs use margin losses instead of normalized likelihoods.

Suppose:

  • x^+ is a positive sample,
  • x^- is a negative sample.

A margin objective is

L = \max(0, m + E(x^+) - E(x^-)),

where m is a margin constant.

genui{“math_block_widget_always_prefetch_v2”:{“content”:“L=\max(0,m+E(x^+)-E(x^-))”}}

The model is penalized when negative examples have energy too close to or lower than positive examples.

This avoids computing partition functions entirely.
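
A direct translation of this objective compares only pairs of energies and never touches the partition function. A sketch (the margin value is an arbitrary choice):

import torch

def margin_loss(model, positive, negative, margin=1.0):
    pos_energy = model(positive)
    neg_energy = model(negative)
    # Penalized whenever the negative's energy is not at least `margin`
    # above the positive's energy.
    return torch.clamp(margin + pos_energy - neg_energy, min=0.0).mean()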

Sampling in Energy-Based Models

Many EBMs require sampling from the model distribution.

The probability distribution satisfies

p(x) \propto \exp(-E(x)).

Sampling therefore tends to move toward low-energy regions.

Common sampling methods include:

  • Gibbs sampling: update variables conditionally.
  • Langevin dynamics: gradient-based stochastic sampling.
  • Hamiltonian Monte Carlo: momentum-based exploration.
  • Metropolis-Hastings: accept/reject transitions.

Langevin Dynamics

Langevin dynamics is widely used in continuous EBMs.

The update rule is

x_{t+1} = x_t - \alpha \nabla_x E_\theta(x_t) + \sqrt{2\alpha}\,\epsilon_t,

where:

  • α is the step size,
  • ϵ_t is Gaussian noise.

genui{“math_block_widget_always_prefetch_v2”:{“content”:“x_{t+1}=x_t-\alpha\nabla_x E_\theta(x_t)+\sqrt{2\alpha}\,\epsilon_t”}}

The gradient term moves samples toward lower-energy regions. The noise term prevents collapse into a single mode.

Langevin sampling resembles gradient descent with stochastic perturbations.
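
A minimal sketch of this update rule, run for a fixed number of steps from noise initializations (the step size and step count are arbitrary illustrative values):

import torch

def langevin_sample(model, shape, steps=100, step_size=0.01):
    x = torch.randn(shape)  # start from noise
    for _ in range(steps):
        x = x.detach().requires_grad_(True)
        energy = model(x).sum()
        grad = torch.autograd.grad(energy, x)[0]
        noise = torch.randn_like(x)
        # x_{t+1} = x_t - alpha * grad E(x_t) + sqrt(2 * alpha) * noise
        x = x - step_size * grad + (2 * step_size) ** 0.5 * noise
    return x.detach()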

Score Functions

The score function of a probability distribution is

\nabla_x \log p(x).

For energy-based models,

\log p(x) = -E(x) - \log Z.

Differentiating with respect to x:

\nabla_x \log p(x) = -\nabla_x E(x).

genui{“math_block_widget_always_prefetch_v2”:{“content”:"\nabla_x \log p(x)=-\nabla_x E(x)"}}

This relationship is fundamental.

The gradient of the energy defines the score function up to sign.

Modern score-based generative models and diffusion models build directly on this idea.
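
Because the score is simply the negative gradient of the energy, it can be read off any differentiable energy network with automatic differentiation. A sketch:

import torch

def score(model, x):
    """Return -grad_x E(x), the model's score at x."""
    x = x.detach().requires_grad_(True)
    energy = model(x).sum()
    grad_x = torch.autograd.grad(energy, x)[0]
    return -grad_x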

Noise Contrastive Estimation

Noise contrastive estimation, or NCE, transforms density estimation into binary classification.

The model learns to distinguish:

  • real data samples,
  • noise samples.

Instead of explicitly computing the partition function, the model trains a discriminator between data and noise.
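
A schematic version of this idea: treat the unnormalized model as one side of a logistic classifier against a known noise distribution, folding the unknown normalizer into a learnable scalar `log_z` (one common way of handling it). The sketch below assumes a factorized noise distribution such as an elementwise torch.distributions.Normal:

import torch
import torch.nn.functional as F

def nce_loss(model, log_z, data, noise, noise_dist):
    """Binary NCE with one noise sample per data sample."""
    def logit(x):
        # log p_model(x) - log p_noise(x), with p_model = exp(-E(x) - log_z).
        return -model(x) - log_z - noise_dist.log_prob(x).sum(dim=-1)

    loss_data = F.binary_cross_entropy_with_logits(
        logit(data), torch.ones(data.size(0)))
    loss_noise = F.binary_cross_entropy_with_logits(
        logit(noise), torch.zeros(noise.size(0)))
    return loss_data + loss_noise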

NCE was historically important because it allowed large unnormalized probabilistic models to be trained efficiently.

Several modern self-supervised methods inherit similar principles.

Score Matching

Score matching avoids computing partition functions entirely.

Rather than matching densities directly, the model matches score functions:

xlogpθ(x). \nabla_x \log p_\theta(x).

The objective minimizes differences between model score fields and data score fields.
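
Denoising score matching is one practical instance: perturb the data with Gaussian noise and regress the model score toward the score of the perturbation kernel. A sketch reusing the `score` helper from the score-function section above (the noise scale is an arbitrary choice):

import torch

def denoising_score_matching_loss(model, x, sigma=0.1):
    # Perturb clean data with Gaussian noise of scale sigma.
    noise = torch.randn_like(x) * sigma
    x_noisy = x + noise
    # Score of the Gaussian perturbation kernel at x_noisy: -(x_noisy - x) / sigma^2.
    target = -noise / sigma ** 2
    predicted = score(model, x_noisy)  # -grad_x E_theta at the noisy point
    return ((predicted - target) ** 2).sum(dim=-1).mean()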

This idea became highly influential in:

  • score-based generative models,
  • diffusion probabilistic models,
  • denoising score matching.

Energy-Based Classification

Standard neural classifiers often use softmax outputs:

p(y \mid x) = \frac{\exp(f_y(x))}{\sum_k \exp(f_k(x))}.

This can be rewritten as an energy model:

E(x, y) = -f_y(x).

The softmax distribution becomes

p(y \mid x) = \frac{\exp(-E(x, y))}{\sum_k \exp(-E(x, k))}.

Thus many discriminative neural networks already behave like energy-based systems.
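
To make the correspondence concrete: the conditional distribution defined by per-class energies is a softmax over negated energies, so the usual cross-entropy loss is exactly the negative log-likelihood of this conditional EBM. A sketch:

import torch.nn.functional as F

def conditional_nll(energies, labels):
    """energies: (batch, num_classes) tensor of E(x, y); labels: (batch,) class indices."""
    # p(y | x) = softmax(-E(x, .))_y, so cross-entropy on negated energies
    # is the conditional negative log-likelihood.
    return F.cross_entropy(-energies, labels)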

Continuous Energy-Based Models

Modern EBMs often use continuous variables rather than binary units.

A neural network parameterizes the energy:

E_\theta(x) = f_\theta(x).

The network may take:

  • images,
  • sequences,
  • latent vectors,
  • trajectories,
  • multimodal inputs.

Unlike RBMs, continuous EBMs can scale to high-dimensional spaces using deep architectures.

Modern Neural Energy Models

Modern EBMs often use convolutional or transformer architectures.

Examples include:

  • Vision: CNN-based energy networks.
  • NLP: transformer encoders.
  • Multimodal systems: cross-attention networks.
  • Reinforcement learning: state-action energy functions.

The output is a scalar energy score.

Unlike an autoregressive model, an EBM need not define a normalized conditional probability at every step.

Energy-Based Reinforcement Learning

Energy-based ideas also appear in reinforcement learning.

A policy can be represented as:

\pi(a \mid s) \propto \exp(-E(s, a)).

The energy measures the compatibility between states and actions.

This connects reinforcement learning to probabilistic inference and maximum-entropy optimization.
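
For a discrete action space this is simply a softmax over negated state-action energies. A sketch, with a temperature parameter added as a common convention (it is not part of the expression above):

import torch

def boltzmann_policy(energies, temperature=1.0):
    """energies: (num_actions,) tensor of E(s, a) for a fixed state s."""
    # pi(a | s) proportional to exp(-E(s, a) / temperature).
    probs = torch.softmax(-energies / temperature, dim=-1)
    action = torch.multinomial(probs, num_samples=1)
    return action.item(), probs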

Relationship to Diffusion Models

Diffusion models are closely related to EBMs.

A diffusion model learns a score field:

\nabla_x \log p(x).

From the energy-based perspective:

\nabla_x \log p(x) = -\nabla_x E(x).

Thus diffusion systems implicitly learn gradients of an energy landscape.

The reverse diffusion process can be interpreted as iterative movement toward lower-energy regions guided by learned score estimates.

Advantages of Energy-Based Models

Energy-based models have several attractive properties.

Flexible Output Spaces

EBMs can model:

  • structured outputs,
  • graphs,
  • sets,
  • sequences,
  • multimodal objects.

Unified Framework

Many probabilistic and discriminative models can be interpreted as EBMs.

Natural Representation Learning

Latent structure emerges through energy minimization.

Compatibility with Sampling

Sampling-based inference allows flexible generation mechanisms.

No Need for Explicit Normalized Outputs

The model only needs to define relative preference between configurations.

Challenges of Energy-Based Models

EBMs also have important difficulties.

Partition Functions

Exact normalization is usually intractable.

Sampling Cost

Markov chain sampling may be slow.

Training Instability

Poorly shaped energy surfaces can produce unstable dynamics.

Mode Collapse

Sampling procedures may fail to explore all modes.

Scalability

Efficient large-scale EBM training remains difficult compared with autoregressive likelihood training.

PyTorch Example

A simple neural EBM may look like this:

import torch
from torch import nn

class EnergyModel(nn.Module):
    def __init__(self, input_dim):
        super().__init__()

        self.network = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, 1)
        )

    def forward(self, x):
        return self.network(x).squeeze(-1)

The network outputs a scalar energy for each input.

A simple contrastive objective:

def contrastive_loss(model, positive, negative):
    pos_energy = model(positive)
    neg_energy = model(negative)

    # Push positive energies below negative energies. In practice this raw
    # difference is often combined with a margin or an energy-regularization
    # term to keep the objective bounded.
    return (pos_energy - neg_energy).mean()

The model learns to assign lower energy to positive samples than to negative samples.
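
Putting the pieces together, one plausible training loop draws negatives from the model with the Langevin sampler sketched earlier and applies the energy-difference objective; the input dimension, learning rate, and `real_batches` data iterator are all assumed for illustration:

model = EnergyModel(input_dim=32)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for real in real_batches:                      # assumed iterator of (batch, 32) tensors
    fake = langevin_sample(model, real.shape)  # negatives sampled from the model
    loss = contrastive_loss(model, real, fake)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()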

Relationship to Modern Deep Learning

Energy-based thinking appears throughout modern AI.

  • Contrastive learning: compatibility energies.
  • Diffusion models: score gradients.
  • Reinforcement learning: energy over state-action pairs.
  • Structured prediction: energy minimization.
  • Vision-language models: cross-modal compatibility.
  • Retrieval systems: similarity energies.

Even when modern systems are not explicitly called EBMs, many optimize implicit energy landscapes.

Summary

Energy-based models represent probability and compatibility through scalar energy functions. Low-energy configurations correspond to plausible states, while high-energy configurations correspond to unlikely states.

EBMs provide a highly general framework connecting probabilistic modeling, discriminative learning, sampling, representation learning, and generative modeling. Boltzmann machines, contrastive learning systems, diffusion models, and many modern representation-learning methods all contain energy-based interpretations.

The framework is mathematically elegant and broadly expressive, but practical training remains challenging because normalization and sampling are often computationally expensive. Despite these difficulties, energy-based ideas continue to influence modern deep learning research across generative modeling, reinforcement learning, multimodal systems, and representation learning.