
Likelihood-Based Objectives

Many deep learning loss functions can be understood as likelihood maximization. Instead of viewing training as minimizing an arbitrary error measure, we model the probability distribution of the data and choose parameters that make the observed data likely under that distribution.

This viewpoint unifies regression, classification, sequence modeling, generative modeling, and probabilistic inference.

Suppose a model with parameters θ defines a probability distribution

p_\theta(y \mid x).

Given a dataset

\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n},

training seeks parameters that maximize the probability of the observed targets:

\max_\theta \prod_{i=1}^{n} p_\theta(y_i \mid x_i).

This quantity is called the likelihood.

Because products of probabilities can become extremely small, optimization is usually performed on the log-likelihood:

\max_\theta \sum_{i=1}^{n} \log p_\theta(y_i \mid x_i).

Deep learning frameworks conventionally minimize losses rather than maximize objectives, so we define the negative log-likelihood:

L(\theta) = - \sum_{i=1}^{n} \log p_\theta(y_i \mid x_i).

This becomes the training loss.

Why Likelihood Matters

Likelihood-based learning provides a principled connection between optimization and probability theory.

A neural network is not merely fitting outputs. It is estimating a probability distribution over possible outputs conditioned on the input.

This interpretation has several advantages:

| Advantage | Meaning |
| --- | --- |
| Statistical interpretation | Training corresponds to probabilistic inference |
| Uncertainty modeling | The model can express confidence |
| Unified framework | Regression, classification, and generation use the same principle |
| Theoretical grounding | Losses arise from distribution assumptions |
| Generative capability | The model can sample outputs |

Different losses correspond to different assumptions about the data distribution.

For example:

| Distribution assumption | Resulting loss |
| --- | --- |
| Gaussian noise | Mean squared error |
| Bernoulli outputs | Binary cross-entropy |
| Categorical outputs | Cross-entropy |
| Laplace noise | Mean absolute error |
| Poisson counts | Poisson negative log-likelihood |

Thus, choosing a loss implicitly chooses a probabilistic model of the data.

Maximum Likelihood Estimation

Maximum likelihood estimation (MLE) chooses parameters that maximize the probability of the observed data.

Suppose the dataset contains independent examples. The likelihood is

\mathcal{L}(\theta) = \prod_{i=1}^{n} p_\theta(y_i \mid x_i).

Taking logarithms gives

\log \mathcal{L}(\theta) = \sum_{i=1}^{n} \log p_\theta(y_i \mid x_i).

The logarithm converts multiplication into addition, which improves numerical stability and simplifies differentiation.

The negative log-likelihood loss is

L(\theta) = - \sum_{i=1}^{n} \log p_\theta(y_i \mid x_i).

Training minimizes this quantity.

In minibatch training, the average negative log-likelihood is often used:

L(\theta) = - \frac{1}{B} \sum_{i=1}^{B} \log p_\theta(y_i \mid x_i),

where B is the minibatch size. This is the standard objective for most deep learning systems.
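
As a minimal sketch, assuming the model already provides each example's log-likelihood as a tensor (the values below are made up for illustration), the minibatch objective is just the negative mean:

import torch

# Hypothetical per-example log-likelihoods log p_theta(y_i | x_i) for a minibatch of size B = 4.
log_probs = torch.tensor([-0.2, -1.5, -0.7, -0.1])

# Average negative log-likelihood over the minibatch.
loss = -log_probs.mean()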

Gaussian Likelihood and Mean Squared Error

Suppose the target variable is continuous and the model predicts a mean value:

\mu_\theta(x).

Assume the target distribution is Gaussian:

y \mid x \sim \mathcal{N}(\mu_\theta(x), \sigma^2).

The Gaussian density is

p(y \mid x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( - \frac{(y-\mu_\theta(x))^2}{2\sigma^2} \right).

The negative log-likelihood is

-\log p(y \mid x) = \frac{(y-\mu_\theta(x))^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2).

If σ² is constant, the second term does not depend on the model parameters. Minimizing the negative log-likelihood therefore becomes equivalent to minimizing squared error:

L \propto (y-\mu_\theta(x))^2.

Thus, mean squared error corresponds to maximum likelihood estimation under Gaussian noise.
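
A small numerical sketch (with an assumed fixed noise variance σ² = 1 and made-up predictions) shows that the Gaussian negative log-likelihood and the squared error differ only by a constant offset and a scale factor, so both are minimized by the same predictions:

import math
import torch

y = torch.tensor([1.0, 2.0, 3.0])     # observed targets
mu = torch.tensor([0.8, 2.5, 2.9])    # hypothetical model predictions
sigma2 = 1.0                          # assumed fixed noise variance

# Per-example Gaussian negative log-likelihood.
nll = (y - mu) ** 2 / (2 * sigma2) + 0.5 * math.log(2 * math.pi * sigma2)

# Per-example squared error. With fixed sigma^2,
# nll = mse / (2 * sigma2) + constant, so both objectives share the same minimizer.
mse = (y - mu) ** 2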

Bernoulli Likelihood and Binary Cross-Entropy

Suppose the target is binary:

y \in \{0, 1\}.

The model predicts a probability

p_\theta(x) = P(y = 1 \mid x).

The Bernoulli distribution defines

P(y \mid x) = p_\theta(x)^y (1-p_\theta(x))^{1-y}.

The log-likelihood is

\log P(y \mid x) = y\log p_\theta(x) + (1-y)\log(1-p_\theta(x)).

The negative log-likelihood becomes

L = - \left[ y\log p_\theta(x) + (1-y)\log(1-p_\theta(x)) \right].

This is binary cross-entropy.

In neural networks, the probability is usually produced using a sigmoid function:

p_\theta(x) = \sigma(z) = \frac{1}{1+\exp(-z)}.

PyTorch combines the sigmoid and binary cross-entropy into a stable implementation:

loss_fn = torch.nn.BCEWithLogitsLoss()
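
A minimal usage sketch with hypothetical logits and labels; BCEWithLogitsLoss expects raw pre-sigmoid scores and binary targets given as floats:

import torch

logits = torch.tensor([2.0, -1.0, 0.5])    # raw scores, before the sigmoid
targets = torch.tensor([1.0, 0.0, 1.0])    # binary labels as floats

loss_fn = torch.nn.BCEWithLogitsLoss()
loss = loss_fn(logits, targets)            # mean binary cross-entropy over the batch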

Categorical Likelihood and Cross-Entropy

For multiclass classification, suppose there are K possible classes. The model predicts a probability vector

p_\theta(y = j \mid x) = p_j.

The categorical distribution assigns probability

P(y = c \mid x) = p_c,

where c denotes the observed class.

The negative log-likelihood is

L = -\log p_c.

Using one-hot labels y_j, this becomes

L = - \sum_{j=1}^{K} y_j \log p_j.

This is the cross-entropy loss.

The probabilities are usually produced using softmax:

p_j = \frac{\exp(z_j)}{\sum_{k=1}^{K}\exp(z_k)}.

In PyTorch:

loss_fn = torch.nn.CrossEntropyLoss()
loss = loss_fn(logits, targets)  # targets are integer class indices, not one-hot vectors

Again, the framework combines softmax and log operations internally for numerical stability.
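
The equivalence can be checked directly: applying log_softmax and reading off the log-probability of the correct class reproduces the built-in loss. A small sketch with hypothetical logits and integer labels:

import torch
import torch.nn.functional as F

logits = torch.randn(8, 5)             # batch of 8 examples, K = 5 classes
targets = torch.randint(0, 5, (8,))    # integer class labels

# Built-in: softmax, log, and NLL fused in one numerically stable call.
loss_builtin = F.cross_entropy(logits, targets)

# Manual: -log p_c for each example, averaged over the batch.
log_probs = F.log_softmax(logits, dim=1)
loss_manual = -log_probs[torch.arange(8), targets].mean()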

Sequence Likelihoods

In sequence modeling, the objective is usually the probability of an entire sequence.

Suppose a sequence is

x = (x_1, x_2, \ldots, x_T).

Using the chain rule of probability,

P(x) = \prod_{t=1}^{T} P(x_t \mid x_{<t}),

where

x_{<t} = (x_1, \ldots, x_{t-1}).

The log-likelihood becomes

\log P(x) = \sum_{t=1}^{T} \log P(x_t \mid x_{<t}).

The training loss is the negative log-likelihood:

L = - \sum_{t=1}^{T} \log P(x_t \mid x_{<t}).

Transformer language models such as GPT are trained using this objective.

In practice, the model predicts logits over a vocabulary at every position.

For vocabulary size V, logits have shape

[B, T, V].

Targets have shape

[B, T].

PyTorch implementation:

B, T, V = logits.shape

# Flatten the batch and time dimensions so that every token position
# becomes one classification example over the vocabulary.
loss = torch.nn.CrossEntropyLoss()(
    logits.reshape(B * T, V),
    targets.reshape(B * T),
)

This objective trains the model to assign high probability to the observed next token.
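
Equivalently, the per-token log-probabilities can be gathered explicitly and summed over positions, which matches the sequence log-likelihood formula above. A sketch with hypothetical shapes:

import torch
import torch.nn.functional as F

B, T, V = 2, 4, 10
logits = torch.randn(B, T, V)              # hypothetical model outputs
targets = torch.randint(0, V, (B, T))      # observed next tokens

log_probs = F.log_softmax(logits, dim=-1)  # log P(x_t | x_<t) over the vocabulary

# Log-probability of each observed token, shape [B, T].
token_log_prob = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

# Negative log-likelihood of each sequence, shape [B].
sequence_nll = -token_log_prob.sum(dim=1)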

Likelihood and Probabilistic Outputs

Likelihood-based models output distributions rather than single values.

For example, instead of predicting one scalar, a regression model may predict both mean and variance:

\mu_\theta(x), \qquad \sigma_\theta^2(x).

The predictive distribution becomes

y \mid x \sim \mathcal{N}(\mu_\theta(x), \sigma_\theta^2(x)).

The negative log-likelihood, up to an additive constant, is

L = \frac{(y-\mu)^2}{2\sigma^2} + \frac{1}{2}\log\sigma^2.

This allows the model to represent uncertainty. Regions with high noise can receive larger predicted variance.

In PyTorch:

mean = model_mean(x)
log_var = model_logvar(x)

# Predicting the log-variance keeps the variance positive after exponentiation.
variance = torch.exp(log_var)

# Per-example Gaussian negative log-likelihood (constant terms dropped),
# averaged over the minibatch.
loss = (
    ((y - mean) ** 2) / (2 * variance)
    + 0.5 * log_var
).mean()

This approach is common in probabilistic deep learning and Bayesian modeling.
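
PyTorch also ships a built-in loss for this objective, torch.nn.GaussianNLLLoss. A minimal sketch, reusing the mean, variance, and target y from the block above; the arguments are the predicted mean, the target, and the predicted variance:

loss_fn = torch.nn.GaussianNLLLoss()
loss = loss_fn(mean, y, variance)  # (input, target, var)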

Likelihood and Energy Functions

Many probabilistic models define an energy function:

E_\theta(x).

Lower energy corresponds to higher probability. The probability distribution is defined as

p_\theta(x) = \frac{\exp(-E_\theta(x))}{Z_\theta},

where

Z_\theta = \sum_x \exp(-E_\theta(x))

is the partition function.

The negative log-likelihood becomes

-\log p_\theta(x) = E_\theta(x) + \log Z_\theta.

This framework appears in:

| Model type | Example |
| --- | --- |
| Boltzmann machines | Energy-based binary networks |
| Restricted Boltzmann machines | Layered stochastic networks |
| Contrastive energy models | Representation learning |
| Score-based diffusion models | Denoising objectives |

The partition function is often computationally expensive. Many energy-based methods therefore use approximations.
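
When x ranges over a small discrete set, however, the partition function is an explicit sum and the negative log-likelihood above can be computed exactly. A minimal sketch with a hypothetical table of energies:

import torch

# Hypothetical learnable energies E_theta(x) for 6 discrete states.
energies = torch.randn(6, requires_grad=True)

# log Z_theta = log sum_x exp(-E_theta(x)), computed stably.
log_Z = torch.logsumexp(-energies, dim=0)

# Negative log-likelihood of an observed state, say x = 2.
x_observed = 2
nll = energies[x_observed] + log_Z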

Likelihood in Variational Autoencoders

Variational autoencoders (VAEs) define a latent-variable model:

p_\theta(x, z) = p_\theta(x \mid z)\,p(z).

The likelihood of a data point is

p_\theta(x) = \int p_\theta(x \mid z)\,p(z)\,dz.

This integral is usually intractable. VAEs therefore optimize a lower bound called the evidence lower bound (ELBO):

\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z \mid x)}\left[ \log p_\theta(x \mid z) \right] - D_{\mathrm{KL}}\left( q_\phi(z \mid x) \,\|\, p(z) \right).

The first term is a reconstruction likelihood. The second term regularizes the latent distribution.

VAEs therefore combine likelihood maximization with probabilistic latent inference.
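
As a rough sketch of how the negative ELBO is assembled into a training loss, assuming a Gaussian encoder q_φ(z|x) that outputs a mean and log-variance, a Bernoulli decoder that outputs reconstruction logits, and a standard normal prior (these are common modeling choices, not the only ones):

import torch
import torch.nn.functional as F

def vae_loss(recon_logits, x, mu, log_var):
    # Reconstruction term: -E_q[log p_theta(x | z)] for a Bernoulli decoder.
    recon_nll = F.binary_cross_entropy_with_logits(recon_logits, x, reduction="sum")
    # KL(q_phi(z | x) || N(0, I)), available in closed form for a Gaussian encoder.
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    # Negative ELBO: minimizing it maximizes the evidence lower bound.
    return recon_nll + kl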

Likelihood and Diffusion Models

Diffusion models also use likelihood-related objectives, although the training procedure is indirect.

A diffusion process gradually adds noise:

x_0 \to x_1 \to x_2 \to \cdots \to x_T.

The model learns the reverse process:

p_\theta(x_{t-1} \mid x_t).

Training objectives can be derived from variational likelihood bounds. In practice, many diffusion models minimize denoising objectives equivalent to weighted likelihood optimization.

For example, a common diffusion loss is

L = \mathbb{E}\,\| \epsilon - \epsilon_\theta(x_t, t) \|^2.

Although this resembles mean squared error, it arises from probabilistic latent-variable modeling.
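
In code, one training step amounts to a mean-squared-error loss between the added noise and the model's noise prediction. A heavily simplified sketch, where the noise-schedule value and the placeholder noise-prediction network are assumptions standing in for the real components:

import torch
import torch.nn.functional as F

x_0 = torch.randn(16, 3, 32, 32)       # clean data batch
noise = torch.randn_like(x_0)          # the Gaussian noise epsilon that is added
alpha_bar = torch.tensor(0.5)          # assumed noise-schedule value at the sampled step t
t = torch.full((16,), 10)              # hypothetical timestep indices

# Forward (noising) process: x_t mixes the clean data with the noise.
x_t = alpha_bar.sqrt() * x_0 + (1 - alpha_bar).sqrt() * noise

# Placeholder for the noise-prediction network epsilon_theta(x_t, t);
# a real model would be a neural network conditioned on the timestep.
def eps_model(x, t):
    return torch.zeros_like(x)

# Denoising objective: the predicted noise should match the true noise.
loss = F.mse_loss(eps_model(x_t, t), noise)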

Numerical Stability

Likelihood objectives frequently involve logarithms and exponentials. Direct computation can become unstable.

For example, computing

\log \sum_i \exp(z_i)

naively may overflow when logits are large.

Stable implementations use the log-sum-exp trick:

\log \sum_i \exp(z_i) = m + \log \sum_i \exp(z_i - m),

where

m = \max_i z_i.

This subtraction prevents exponential overflow.
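
A small demonstration: exponentiating large logits directly overflows to infinity, while subtracting the maximum first (or calling torch.logsumexp, which applies the same trick internally) stays finite:

import torch

z = torch.tensor([1000.0, 1001.0, 1002.0])      # large logits

naive = torch.log(torch.exp(z).sum())           # overflows: exp(1000) is inf in float32

m = z.max()
stable = m + torch.log(torch.exp(z - m).sum())  # log-sum-exp trick, stays finite

builtin = torch.logsumexp(z, dim=0)             # built-in stable implementation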

PyTorch internally applies stable implementations for:

| Function | Stable implementation |
| --- | --- |
| CrossEntropyLoss | log_softmax + NLLLoss |
| BCEWithLogitsLoss | sigmoid + BCE |
| logsumexp | stable exponential normalization |

These details are important because unstable probability computations can produce NaNs during training.

Likelihood and Calibration

Likelihood optimization encourages calibrated probabilities.

A calibrated classifier outputs probabilities consistent with empirical frequencies. For example, among predictions with confidence 0.8, roughly 80% should be correct.

Cross-entropy strongly penalizes overconfident incorrect predictions. This often improves calibration relative to losses that ignore probability structure.

However, modern deep networks can still become miscalibrated, especially after extreme scaling or overtraining.

Calibration techniques include:

| Method | Purpose |
| --- | --- |
| Temperature scaling | Softens logits |
| Label smoothing | Reduces overconfidence |
| Ensembles | Improves uncertainty estimates |
| Bayesian inference | Models posterior uncertainty |

Likelihood-based training provides a natural framework for these methods.
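
For example, temperature scaling divides the logits by a scalar T before the softmax, where T is normally fit by minimizing the negative log-likelihood on held-out data. A minimal sketch with an assumed temperature value:

import torch
import torch.nn.functional as F

logits = torch.randn(8, 5)    # hypothetical validation logits
T = 1.5                       # assumed temperature; T > 1 softens the distribution

calibrated_probs = F.softmax(logits / T, dim=1)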

Generative Modeling Objectives

Generative models often optimize likelihood directly or approximately.

Examples include:

| Model | Objective |
| --- | --- |
| Autoregressive transformer | Sequence likelihood |
| VAE | Variational lower bound |
| Normalizing flow | Exact likelihood |
| Diffusion model | Variational denoising objective |
| Energy-based model | Energy likelihood |
| GAN | Adversarial divergence objective |

Not all generative models maximize exact likelihood. GANs, for example, optimize an adversarial objective instead.

However, likelihood remains one of the central principles of probabilistic deep learning.

Likelihood Versus Distance-Based Losses

Likelihood objectives differ from purely geometric error measures.

Suppose two predictions have the same error magnitude. A likelihood-based loss can still penalize them very differently, because the penalty also depends on the confidence the model assigned.

For example:

| Prediction | Confidence | Cross-entropy behavior |
| --- | --- | --- |
| Correct | Low | Moderate loss |
| Correct | High | Small loss |
| Wrong | High | Very large loss |

This probabilistic structure makes likelihood-based objectives especially effective for classification and generative modeling.

Distance-based losses such as MSE measure only numerical difference. Likelihood-based losses model uncertainty and probability structure.

Practical Guidelines

Use likelihood-based objectives whenever the model predicts probabilities or distributions.

| Task | Recommended objective |
| --- | --- |
| Regression with Gaussian assumptions | Mean squared error |
| Binary classification | BCEWithLogitsLoss |
| Multiclass classification | CrossEntropyLoss |
| Language modeling | Token-level cross-entropy |
| Probabilistic regression | Predict both mean and variance |
| Generative models | Derive the objective from the underlying probabilistic model whenever possible |

In modern deep learning, most major objectives are likelihood objectives in disguise. Cross-entropy, binary cross-entropy, negative log-likelihood, autoregressive token prediction, and many variational objectives all arise from the same principle:

maximize the probability of the observed data under the model.