
Likelihood-Based Objectives

Many deep learning loss functions can be understood as likelihood maximization. Instead of viewing training as minimizing an arbitrary error measure, we model the probability distribution of the data and choose parameters that make the observed data likely under that distribution.

This viewpoint unifies regression, classification, sequence modeling, generative modeling, and probabilistic inference.

Suppose a model with parameters θ defines a probability distribution

p_\theta(y \mid x).

Given a dataset

\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n},

training seeks parameters that maximize the probability of the observed targets:

\max_\theta \prod_{i=1}^{n} p_\theta(y_i \mid x_i).

This quantity is called the likelihood.

Because products of probabilities can become extremely small, optimization is usually performed on the log-likelihood:

\max_\theta \sum_{i=1}^{n} \log p_\theta(y_i \mid x_i).

Deep learning frameworks conventionally minimize losses rather than maximize objectives, so we define the negative log-likelihood:

L(\theta) = - \sum_{i=1}^{n} \log p_\theta(y_i \mid x_i).

This becomes the training loss.

Why Likelihood Matters

Likelihood-based learning provides a principled connection between optimization and probability theory.

A neural network is not merely fitting outputs. It is estimating a probability distribution over possible outputs conditioned on the input.

This interpretation has several advantages:

| Advantage | Meaning |
| --- | --- |
| Statistical interpretation | Training corresponds to probabilistic inference |
| Uncertainty modeling | The model can express confidence |
| Unified framework | Regression, classification, and generation use the same principle |
| Theoretical grounding | Losses arise from distribution assumptions |
| Generative capability | The model can sample outputs |

Different losses correspond to different assumptions about the data distribution.

For example:

| Distribution assumption | Resulting loss |
| --- | --- |
| Gaussian noise | Mean squared error |
| Bernoulli outputs | Binary cross-entropy |
| Categorical outputs | Cross-entropy |
| Laplace noise | Mean absolute error |
| Poisson counts | Poisson negative log-likelihood |

Thus, choosing a loss implicitly chooses a probabilistic model of the data.

Maximum Likelihood Estimation

Maximum likelihood estimation (MLE) chooses parameters that maximize the probability of the observed data.

Suppose the dataset contains independent examples. The likelihood is

\mathcal{L}(\theta) = \prod_{i=1}^{n} p_\theta(y_i \mid x_i).

Taking logarithms gives

\log \mathcal{L}(\theta) = \sum_{i=1}^{n} \log p_\theta(y_i \mid x_i).

The logarithm converts multiplication into addition, which improves numerical stability and simplifies differentiation.

The negative log-likelihood loss is

L(\theta) = - \sum_{i=1}^{n} \log p_\theta(y_i \mid x_i).

Training minimizes this quantity.

In minibatch training, the average negative log-likelihood is often used:

L(\theta) = - \frac{1}{B} \sum_{i=1}^{B} \log p_\theta(y_i \mid x_i),

where B is the minibatch size. This is the standard objective for most deep learning systems.
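
As a minimal sketch, assuming the model already provides each example's log-likelihood as a tensor (the values below are made up for illustration), the minibatch objective is just the negative mean:

import torch

# Hypothetical per-example log-likelihoods log p_theta(y_i | x_i) for a minibatch of size B = 4.
log_probs = torch.tensor([-0.2, -1.5, -0.7, -0.1])

# Average negative log-likelihood over the minibatch.
loss = -log_probs.mean()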

Gaussian Likelihood and Mean Squared Error

Suppose the target variable is continuous and the model predicts a mean value:

\mu_\theta(x).

Assume the target distribution is Gaussian:

y \mid x \sim \mathcal{N}(\mu_\theta(x), \sigma^2).

The Gaussian density is

p(y \mid x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( - \frac{(y-\mu_\theta(x))^2}{2\sigma^2} \right).

The negative log-likelihood is

-\log p(y \mid x) = \frac{(y-\mu_\theta(x))^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2).

If σ² is constant, the second term does not depend on the model parameters. Minimizing the negative log-likelihood therefore becomes equivalent to minimizing squared error:

L \propto (y-\mu_\theta(x))^2.

Thus, mean squared error corresponds to maximum likelihood estimation under Gaussian noise.
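
A small numerical sketch (with an assumed fixed noise variance σ² = 1 and made-up predictions) shows that the Gaussian negative log-likelihood and the squared error differ only by a constant offset and a scale factor, so both are minimized by the same predictions:

import math
import torch

y = torch.tensor([1.0, 2.0, 3.0])     # observed targets
mu = torch.tensor([0.8, 2.5, 2.9])    # hypothetical model predictions
sigma2 = 1.0                          # assumed fixed noise variance

# Per-example Gaussian negative log-likelihood.
nll = (y - mu) ** 2 / (2 * sigma2) + 0.5 * math.log(2 * math.pi * sigma2)

# Per-example squared error. With fixed sigma^2,
# nll = mse / (2 * sigma2) + constant, so both objectives share the same minimizer.
mse = (y - mu) ** 2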

Bernoulli Likelihood and Binary Cross-Entropy

Suppose the target is binary:

y \in \{0, 1\}.

The model predicts a probability

p_\theta(x) = P(y = 1 \mid x).

The Bernoulli distribution defines

P(y \mid x) = p_\theta(x)^y (1-p_\theta(x))^{1-y}.

The log-likelihood is

\log P(y \mid x) = y\log p_\theta(x) + (1-y)\log(1-p_\theta(x)).

The negative log-likelihood becomes

L = - \left[ y\log p_\theta(x) + (1-y)\log(1-p_\theta(x)) \right].

This is binary cross-entropy.

In neural networks, the probability is usually produced using a sigmoid function:

p_\theta(x) = \sigma(z) = \frac{1}{1+\exp(-z)}.

PyTorch combines the sigmoid and binary cross-entropy into a stable implementation:

loss_fn = torch.nn.BCEWithLogitsLoss()
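
A minimal usage sketch with hypothetical logits and labels; BCEWithLogitsLoss expects raw pre-sigmoid scores and binary targets given as floats:

import torch

logits = torch.tensor([2.0, -1.0, 0.5])    # raw scores, before the sigmoid
targets = torch.tensor([1.0, 0.0, 1.0])    # binary labels as floats

loss_fn = torch.nn.BCEWithLogitsLoss()
loss = loss_fn(logits, targets)            # mean binary cross-entropy over the batch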

Categorical Likelihood and Cross-Entropy

For multiclass classification, suppose there are K possible classes. The model predicts a probability vector

p_\theta(y = j \mid x) = p_j.

The categorical distribution assigns probability

P(y = c \mid x) = p_c,

where c denotes the observed class.

The negative log-likelihood is

L = -\log p_c.

Using one-hot labels y_j, this becomes

L = - \sum_{j=1}^{K} y_j \log p_j.

This is the cross-entropy loss.

The probabilities are usually produced using softmax:

p_j = \frac{\exp(z_j)}{\sum_{k=1}^{K}\exp(z_k)}.

In PyTorch:

loss_fn = torch.nn.CrossEntropyLoss()
loss = loss_fn(logits, targets)  # targets are integer class indices, not one-hot vectors

Again, the framework combines softmax and log operations internally for numerical stability.
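
The equivalence can be checked directly: applying log_softmax and reading off the log-probability of the correct class reproduces the built-in loss. A small sketch with hypothetical logits and integer labels:

import torch
import torch.nn.functional as F

logits = torch.randn(8, 5)             # batch of 8 examples, K = 5 classes
targets = torch.randint(0, 5, (8,))    # integer class labels

# Built-in: softmax, log, and NLL fused in one numerically stable call.
loss_builtin = F.cross_entropy(logits, targets)

# Manual: -log p_c for each example, averaged over the batch.
log_probs = F.log_softmax(logits, dim=1)
loss_manual = -log_probs[torch.arange(8), targets].mean()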

Sequence Likelihoods

In sequence modeling, the objective is usually the probability of an entire sequence.

Suppose a sequence is

x = (x_1, x_2, \ldots, x_T).

Using the chain rule of probability,

P(x) = \prod_{t=1}^{T} P(x_t \mid x_{<t}),

where

x_{<t} = (x_1, \ldots, x_{t-1}).

The log-likelihood becomes

\log P(x) = \sum_{t=1}^{T} \log P(x_t \mid x_{<t}).

The training loss is the negative log-likelihood:

L = - \sum_{t=1}^{T} \log P(x_t \mid x_{<t}).

Transformer language models such as GPT are trained using this objective.

In practice, the model predicts logits over a vocabulary at every position.

For vocabulary size V, logits have shape

[B, T, V].

Targets have shape

[B, T].

PyTorch implementation:

B, T, V = logits.shape

# Flatten the batch and time dimensions so that every token position
# becomes one classification example over the vocabulary.
loss = torch.nn.CrossEntropyLoss()(
    logits.reshape(B * T, V),
    targets.reshape(B * T),
)

This objective trains the model to assign high probability to the observed next token.
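
Equivalently, the per-token log-probabilities can be gathered explicitly and summed over positions, which matches the sequence log-likelihood formula above. A sketch with hypothetical shapes:

import torch
import torch.nn.functional as F

B, T, V = 2, 4, 10
logits = torch.randn(B, T, V)              # hypothetical model outputs
targets = torch.randint(0, V, (B, T))      # observed next tokens

log_probs = F.log_softmax(logits, dim=-1)  # log P(x_t | x_<t) over the vocabulary

# Log-probability of each observed token, shape [B, T].
token_log_prob = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

# Negative log-likelihood of each sequence, shape [B].
sequence_nll = -token_log_prob.sum(dim=1)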

Likelihood and Probabilistic Outputs

Likelihood-based models output distributions rather than single values.

For example, instead of predicting one scalar, a regression model may predict both mean and variance:

\mu_\theta(x), \qquad \sigma_\theta^2(x).

The predictive distribution becomes

y \mid x \sim \mathcal{N}(\mu_\theta(x), \sigma_\theta^2(x)).

The negative log-likelihood, up to an additive constant, is

L = \frac{(y-\mu)^2}{2\sigma^2} + \frac{1}{2}\log\sigma^2.

This allows the model to represent uncertainty. Regions with high noise can receive larger predicted variance.

In PyTorch:

mean = model_mean(x)
log_var = model_logvar(x)

# Predicting the log-variance keeps the variance positive after exponentiation.
variance = torch.exp(log_var)

# Per-example Gaussian negative log-likelihood (constant terms dropped),
# averaged over the minibatch.
loss = (
    ((y - mean) ** 2) / (2 * variance)
    + 0.5 * log_var
).mean()

This approach is common in probabilistic deep learning and Bayesian modeling.
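
PyTorch also ships a built-in loss for this objective, torch.nn.GaussianNLLLoss. A minimal sketch, reusing the mean, variance, and target y from the block above; the arguments are the predicted mean, the target, and the predicted variance:

loss_fn = torch.nn.GaussianNLLLoss()
loss = loss_fn(mean, y, variance)  # (input, target, var)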

Likelihood and Energy Functions

Many probabilistic models define an energy function:

E_\theta(x).

Lower energy corresponds to higher probability. The probability distribution is defined as

p_\theta(x) = \frac{\exp(-E_\theta(x))}{Z_\theta},

where

Z_\theta = \sum_x \exp(-E_\theta(x))

is the partition function.

The negative log-likelihood becomes

-\log p_\theta(x) = E_\theta(x) + \log Z_\theta.

This framework appears in:

| Model type | Example |
| --- | --- |
| Boltzmann machines | Energy-based binary networks |
| Restricted Boltzmann machines | Layered stochastic networks |
| Contrastive energy models | Representation learning |
| Score-based diffusion models | Denoising objectives |

The partition function is often computationally expensive. Many energy-based methods therefore use approximations.
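
When x ranges over a small discrete set, however, the partition function is an explicit sum and the negative log-likelihood above can be computed exactly. A minimal sketch with a hypothetical table of energies:

import torch

# Hypothetical learnable energies E_theta(x) for 6 discrete states.
energies = torch.randn(6, requires_grad=True)

# log Z_theta = log sum_x exp(-E_theta(x)), computed stably.
log_Z = torch.logsumexp(-energies, dim=0)

# Negative log-likelihood of an observed state, say x = 2.
x_observed = 2
nll = energies[x_observed] + log_Z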

Likelihood in Variational Autoencoders

Variational autoencoders (VAEs) define a latent-variable model:

p_\theta(x, z) = p_\theta(x \mid z)\,p(z).

The likelihood of a data point is

p_\theta(x) = \int p_\theta(x \mid z)\,p(z)\,dz.

This integral is usually intractable. VAEs therefore optimize a lower bound called the evidence lower bound (ELBO):

\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z \mid x)}\left[ \log p_\theta(x \mid z) \right] - D_{\mathrm{KL}}\left( q_\phi(z \mid x) \,\|\, p(z) \right).

The first term is a reconstruction likelihood. The second term regularizes the latent distribution.

VAEs therefore combine likelihood maximization with probabilistic latent inference.
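
As a rough sketch of how the negative ELBO is assembled into a training loss, assuming a Gaussian encoder q_φ(z|x) that outputs a mean and log-variance, a Bernoulli decoder that outputs reconstruction logits, and a standard normal prior (these are common modeling choices, not the only ones):

import torch
import torch.nn.functional as F

def vae_loss(recon_logits, x, mu, log_var):
    # Reconstruction term: -E_q[log p_theta(x | z)] for a Bernoulli decoder.
    recon_nll = F.binary_cross_entropy_with_logits(recon_logits, x, reduction="sum")
    # KL(q_phi(z | x) || N(0, I)), available in closed form for a Gaussian encoder.
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    # Negative ELBO: minimizing it maximizes the evidence lower bound.
    return recon_nll + kl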

Likelihood and Diffusion Models

Diffusion models also use likelihood-related objectives, although the training procedure is indirect.

A diffusion process gradually adds noise:

x_0 \to x_1 \to x_2 \to \cdots \to x_T.

The model learns the reverse process:

p_\theta(x_{t-1} \mid x_t).

Training objectives can be derived from variational likelihood bounds. In practice, many diffusion models minimize denoising objectives equivalent to weighted likelihood optimization.

For example, a common diffusion loss is

L = \mathbb{E}\,\| \epsilon - \epsilon_\theta(x_t, t) \|^2.

Although this resembles mean squared error, it arises from probabilistic latent-variable modeling.
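
In code, one training step amounts to a mean-squared-error loss between the added noise and the model's noise prediction. A heavily simplified sketch, where the noise-schedule value and the placeholder noise-prediction network are assumptions standing in for the real components:

import torch
import torch.nn.functional as F

x_0 = torch.randn(16, 3, 32, 32)       # clean data batch
noise = torch.randn_like(x_0)          # the Gaussian noise epsilon that is added
alpha_bar = torch.tensor(0.5)          # assumed noise-schedule value at the sampled step t
t = torch.full((16,), 10)              # hypothetical timestep indices

# Forward (noising) process: x_t mixes the clean data with the noise.
x_t = alpha_bar.sqrt() * x_0 + (1 - alpha_bar).sqrt() * noise

# Placeholder for the noise-prediction network epsilon_theta(x_t, t);
# a real model would be a neural network conditioned on the timestep.
def eps_model(x, t):
    return torch.zeros_like(x)

# Denoising objective: the predicted noise should match the true noise.
loss = F.mse_loss(eps_model(x_t, t), noise)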

Numerical Stability

Likelihood objectives frequently involve logarithms and exponentials. Direct computation can become unstable.

For example, computing

\log \sum_i \exp(z_i)

naively may overflow when logits are large.

Stable implementations use the log-sum-exp trick:

\log \sum_i \exp(z_i) = m + \log \sum_i \exp(z_i - m),

where

m = \max_i z_i.

This subtraction prevents exponential overflow.
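
A small demonstration: exponentiating large logits directly overflows to infinity, while subtracting the maximum first (or calling torch.logsumexp, which applies the same trick internally) stays finite:

import torch

z = torch.tensor([1000.0, 1001.0, 1002.0])      # large logits

naive = torch.log(torch.exp(z).sum())           # overflows: exp(1000) is inf in float32

m = z.max()
stable = m + torch.log(torch.exp(z - m).sum())  # log-sum-exp trick, stays finite

builtin = torch.logsumexp(z, dim=0)             # built-in stable implementation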

PyTorch internally applies stable implementations for:

| Function | Stable implementation |
| --- | --- |
| CrossEntropyLoss | log_softmax + NLLLoss |
| BCEWithLogitsLoss | sigmoid + BCE |
| logsumexp | stable exponential normalization |

These details are important because unstable probability computations can produce NaNs during training.

Likelihood and Calibration

Likelihood optimization encourages calibrated probabilities.

A calibrated classifier outputs probabilities consistent with empirical frequencies. For example, among predictions with confidence 0.8, roughly 80% should be correct.

Cross-entropy strongly penalizes overconfident incorrect predictions. This often improves calibration relative to losses that ignore probability structure.

However, modern deep networks can still become miscalibrated, especially after extreme scaling or overtraining.

Calibration techniques include:

| Method | Purpose |
| --- | --- |
| Temperature scaling | Softens logits |
| Label smoothing | Reduces overconfidence |
| Ensembles | Improves uncertainty estimates |
| Bayesian inference | Models posterior uncertainty |

Likelihood-based training provides a natural framework for these methods.
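
For example, temperature scaling divides the logits by a scalar T before the softmax, where T is normally fit by minimizing the negative log-likelihood on held-out data. A minimal sketch with an assumed temperature value:

import torch
import torch.nn.functional as F

logits = torch.randn(8, 5)    # hypothetical validation logits
T = 1.5                       # assumed temperature; T > 1 softens the distribution

calibrated_probs = F.softmax(logits / T, dim=1)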

Generative Modeling Objectives

Generative models often optimize likelihood directly or approximately.

Examples include:

| Model | Objective |
| --- | --- |
| Autoregressive transformer | Sequence likelihood |
| VAE | Variational lower bound |
| Normalizing flow | Exact likelihood |
| Diffusion model | Variational denoising objective |
| Energy-based model | Energy likelihood |
| GAN | Adversarial divergence objective |

Not all generative models maximize exact likelihood. GANs, for example, optimize an adversarial objective instead.

However, likelihood remains one of the central principles of probabilistic deep learning.

Likelihood Versus Distance-Based Losses

Likelihood objectives differ from purely geometric error measures.

Suppose two predictions have the same error magnitude. A likelihood-based loss can still penalize them very differently, because the penalty also depends on the confidence the model assigned.

For example:

| Prediction | Confidence | Cross-entropy behavior |
| --- | --- | --- |
| Correct | Low | Moderate loss |
| Correct | High | Small loss |
| Wrong | High | Very large loss |

This probabilistic structure makes likelihood-based objectives especially effective for classification and generative modeling.

Distance-based losses such as MSE measure only numerical difference. Likelihood-based losses model uncertainty and probability structure.

Practical Guidelines

Use likelihood-based objectives whenever the model predicts probabilities or distributions.

| Task | Recommended objective |
| --- | --- |
| Regression with Gaussian assumptions | Mean squared error |
| Binary classification | BCEWithLogitsLoss |
| Multiclass classification | CrossEntropyLoss |
| Language modeling | Token-level cross-entropy |
| Probabilistic regression | Predict both mean and variance |
| Generative models | Derive the objective from the underlying probabilistic model whenever possible |

In modern deep learning, most major objectives are likelihood objectives in disguise. Cross-entropy, binary cross-entropy, negative log-likelihood, autoregressive token prediction, and many variational objectives all arise from the same principle:

maximize the probability of the observed data under the model.