Many deep learning loss functions can be understood as likelihood maximization. Instead of viewing training as minimizing an arbitrary error measure, we model the probability distribution of the data and choose parameters that make the observed data likely under that distribution.
This viewpoint unifies regression, classification, sequence modeling, generative modeling, and probabilistic inference.
Suppose a model with parameters $\theta$ defines a probability distribution

$$p_\theta(y \mid x).$$

Given a dataset

$$\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N},$$

training seeks parameters that maximize the probability of the observed targets:

$$\theta^* = \arg\max_{\theta} \prod_{i=1}^{N} p_\theta(y_i \mid x_i).$$
This quantity is called the likelihood.
Because products of probabilities can become extremely small, optimization is usually performed on the log-likelihood:

$$\log L(\theta) = \sum_{i=1}^{N} \log p_\theta(y_i \mid x_i).$$

Deep learning frameworks conventionally minimize losses rather than maximize objectives, so we define the negative log-likelihood:

$$\mathcal{L}(\theta) = -\sum_{i=1}^{N} \log p_\theta(y_i \mid x_i).$$
This becomes the training loss.
Why Likelihood Matters
Likelihood-based learning provides a principled connection between optimization and probability theory.
A neural network is not merely fitting outputs. It is estimating a probability distribution over possible outputs conditioned on the input.
This interpretation has several advantages:
| Advantage | Meaning |
|---|---|
| Statistical interpretation | Training corresponds to probabilistic inference |
| Uncertainty modeling | The model can express confidence |
| Unified framework | Regression, classification, and generation use the same principle |
| Theoretical grounding | Losses arise from distribution assumptions |
| Generative capability | The model can sample outputs |
Different losses correspond to different assumptions about the data distribution.
For example:
| Distribution assumption | Resulting loss |
|---|---|
| Gaussian noise | Mean squared error |
| Bernoulli outputs | Binary cross-entropy |
| Categorical outputs | Cross-entropy |
| Laplace noise | Mean absolute error |
| Poisson counts | Poisson negative log-likelihood |
Thus, choosing a loss implicitly chooses a probabilistic model of the data.
Maximum Likelihood Estimation
Maximum likelihood estimation (MLE) chooses parameters that maximize the probability of the observed data.
Suppose the dataset contains $N$ independent examples. The likelihood is

$$L(\theta) = \prod_{i=1}^{N} p_\theta(y_i \mid x_i).$$

Taking logarithms gives

$$\log L(\theta) = \sum_{i=1}^{N} \log p_\theta(y_i \mid x_i).$$
The logarithm converts multiplication into addition, which improves numerical stability and simplifies differentiation.
The negative log-likelihood loss is

$$\mathcal{L}(\theta) = -\sum_{i=1}^{N} \log p_\theta(y_i \mid x_i).$$
Training minimizes this quantity.
In minibatch training, the average negative log-likelihood over a batch of size $B$ is often used:

$$\mathcal{L}(\theta) = -\frac{1}{B} \sum_{i=1}^{B} \log p_\theta(y_i \mid x_i).$$
This is the standard objective for most deep learning systems.
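As a concrete illustration, here is a minimal sketch of this averaged objective, assuming a hypothetical tensor log_probs holding the per-example log-probabilities $\log p_\theta(y_i \mid x_i)$ for one minibatch:

```python
import torch

# Hypothetical per-example log-probabilities log p(y_i | x_i) for one minibatch
log_probs = torch.tensor([-0.20, -1.50, -0.70, -0.05])

# Average negative log-likelihood over the batch
loss = -log_probs.mean()
print(loss)  # tensor(0.6125)
```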
Gaussian Likelihood and Mean Squared Error
Suppose the target variable is continuous and the model predicts a mean value:

$$\mu = f_\theta(x).$$

Assume the target distribution is Gaussian:

$$y \mid x \sim \mathcal{N}\!\left(f_\theta(x), \sigma^2\right).$$

The Gaussian density is

$$p_\theta(y \mid x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y - f_\theta(x))^2}{2\sigma^2}\right).$$

The negative log-likelihood is

$$-\log p_\theta(y \mid x) = \frac{(y - f_\theta(x))^2}{2\sigma^2} + \frac{1}{2}\log\!\left(2\pi\sigma^2\right).$$

If $\sigma^2$ is constant, the second term does not depend on the model parameters. Minimizing the negative log-likelihood is therefore equivalent to minimizing squared error:

$$\mathcal{L}(\theta) = \sum_{i=1}^{N} \left(y_i - f_\theta(x_i)\right)^2.$$
Thus, mean squared error corresponds to maximum likelihood estimation under Gaussian noise.
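A small sketch of this correspondence, using hypothetical tensors y_hat and y and a fixed variance: with $\sigma^2$ held constant, the Gaussian negative log-likelihood equals half the mean squared error plus a constant, so both objectives share the same minimizer.

```python
import math
import torch

y_hat = torch.tensor([1.0, 2.0, 3.0])   # predicted means f_theta(x)
y = torch.tensor([1.5, 1.0, 2.5])       # observed targets
sigma2 = 1.0                            # fixed, constant variance

# Gaussian negative log-likelihood, averaged over examples
nll = ((y - y_hat) ** 2 / (2 * sigma2) + 0.5 * math.log(2 * math.pi * sigma2)).mean()

# Mean squared error on the same predictions
mse = torch.nn.functional.mse_loss(y_hat, y)

# The two differ only by a constant, so they share the same minimizer
print(nll, 0.5 * mse + 0.5 * math.log(2 * math.pi * sigma2))
```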
Bernoulli Likelihood and Binary Cross-Entropy
Suppose the target is binary:

$$y \in \{0, 1\}.$$

The model predicts a probability

$$\hat{p} = p_\theta(y = 1 \mid x).$$

The Bernoulli distribution defines

$$p_\theta(y \mid x) = \hat{p}^{\,y} (1 - \hat{p})^{1 - y}.$$

The log-likelihood is

$$\log p_\theta(y \mid x) = y \log \hat{p} + (1 - y) \log(1 - \hat{p}).$$

The negative log-likelihood becomes

$$\mathcal{L} = -\left[ y \log \hat{p} + (1 - y) \log(1 - \hat{p}) \right].$$
This is binary cross-entropy.
In neural networks, the probability is usually produced by applying a sigmoid function to a logit $z$:

$$\hat{p} = \sigma(z) = \frac{1}{1 + e^{-z}}.$$
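For illustration, a minimal sketch that computes binary cross-entropy directly from sigmoid probabilities (hypothetical logits and labels); the direct formula is shown for clarity and is less numerically stable than the fused implementation introduced next.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, -1.0, 0.5])   # raw model outputs (logits)
y = torch.tensor([1.0, 0.0, 1.0])         # binary labels

p = torch.sigmoid(logits)                  # predicted probabilities
nll = -(y * torch.log(p) + (1 - y) * torch.log(1 - p)).mean()

# Matches the fused, numerically stable implementation up to floating-point error
print(nll, F.binary_cross_entropy_with_logits(logits, y))
```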
PyTorch combines the sigmoid and binary cross-entropy into a stable implementation:
loss_fn = torch.nn.BCEWithLogitsLoss()

Categorical Likelihood and Cross-Entropy
For multiclass classification, suppose there are $K$ possible classes. The model predicts a probability vector

$$\hat{p} = (\hat{p}_1, \ldots, \hat{p}_K), \qquad \sum_{k=1}^{K} \hat{p}_k = 1.$$

The categorical distribution assigns probability

$$p_\theta(y = k \mid x) = \hat{p}_k.$$

The negative log-likelihood is

$$\mathcal{L} = -\log \hat{p}_y.$$

Using one-hot labels $y_k$, this becomes

$$\mathcal{L} = -\sum_{k=1}^{K} y_k \log \hat{p}_k.$$
This is the cross-entropy loss.
The probabilities are usually produced by applying softmax to the logits $z_1, \ldots, z_K$:

$$\hat{p}_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}.$$
In PyTorch:
loss_fn = torch.nn.CrossEntropyLoss()
loss = loss_fn(logits, targets)

Again, the framework combines softmax and log operations internally for numerical stability.
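As a sanity check, a minimal sketch showing that cross-entropy equals the negative log-softmax probability of the correct class (small hypothetical logits and targets):

```python
import torch

logits = torch.randn(4, 3)            # 4 examples, 3 classes
targets = torch.tensor([0, 2, 1, 2])  # class indices

# Negative log-probability of the correct class under the softmax distribution
log_probs = torch.log_softmax(logits, dim=-1)
manual = -log_probs[torch.arange(4), targets].mean()

builtin = torch.nn.CrossEntropyLoss()(logits, targets)
print(manual, builtin)  # equal up to floating-point error
```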
Sequence Likelihoods
In sequence modeling, the objective is usually the probability of an entire sequence.
Suppose a sequence is

$$x = (x_1, x_2, \ldots, x_T).$$

Using the chain rule of probability,

$$p_\theta(x) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}),$$

where

$$x_{<t} = (x_1, \ldots, x_{t-1})$$

denotes the preceding tokens. The log-likelihood becomes

$$\log p_\theta(x) = \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}).$$

The training loss is the negative log-likelihood:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}).$$
Transformer language models such as GPT are trained using this objective.
In practice, the model predicts logits over a vocabulary at every position.
For batch size $B$, sequence length $T$, and vocabulary size $V$, the logits have shape

$$(B, T, V).$$

Targets have shape

$$(B, T).$$
PyTorch implementation:
B, T, V = logits.shape
loss = torch.nn.CrossEntropyLoss()(
    logits.reshape(B * T, V),
    targets.reshape(B * T),
)

This objective trains the model to assign high probability to the observed next token.
Likelihood and Probabilistic Outputs
Likelihood-based models output distributions rather than single values.
For example, instead of predicting one scalar, a regression model may predict both a mean and a variance:

$$\mu_\theta(x), \qquad \sigma_\theta^2(x).$$

The predictive distribution becomes

$$p_\theta(y \mid x) = \mathcal{N}\!\left(y;\ \mu_\theta(x),\ \sigma_\theta^2(x)\right).$$

The negative log-likelihood is

$$-\log p_\theta(y \mid x) = \frac{\left(y - \mu_\theta(x)\right)^2}{2\sigma_\theta^2(x)} + \frac{1}{2}\log \sigma_\theta^2(x) + \frac{1}{2}\log 2\pi.$$
This allows the model to represent uncertainty. Regions with high noise can receive larger predicted variance.
In PyTorch:
mean = model_mean(x)
log_var = model_logvar(x)
variance = torch.exp(log_var)
loss = (
    ((y - mean) ** 2) / (2 * variance)
    + 0.5 * log_var
).mean()

This approach is common in probabilistic deep learning and Bayesian modeling.
Likelihood and Energy Functions
Many probabilistic models define an energy function:

$$E_\theta(x).$$

Lower energy corresponds to higher probability. The probability distribution is defined as

$$p_\theta(x) = \frac{\exp\!\left(-E_\theta(x)\right)}{Z(\theta)},$$

where

$$Z(\theta) = \int \exp\!\left(-E_\theta(x)\right) dx$$

is the partition function (a sum rather than an integral for discrete $x$).

The negative log-likelihood becomes

$$-\log p_\theta(x) = E_\theta(x) + \log Z(\theta).$$
This framework appears in:
| Model type | Example |
|---|---|
| Boltzmann machines | Energy-based binary networks |
| Restricted Boltzmann machines | Layered stochastic networks |
| Contrastive energy models | Representation learning |
| Score-based diffusion models | Denoising objectives |
The partition function is often computationally expensive. Many energy-based methods therefore use approximations.
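As a toy illustration, a minimal sketch of the energy-based negative log-likelihood for a state space small enough that the partition function can be computed exactly (the energies are arbitrary hypothetical values):

```python
import torch

# Energies E(x) over a tiny discrete state space with five states
energies = torch.tensor([0.5, 1.2, 0.1, 2.0, 0.8], requires_grad=True)

# Exact log partition function: log sum_x exp(-E(x))
log_Z = torch.logsumexp(-energies, dim=0)

# Negative log-likelihood of an observed state: E(x) + log Z
observed_state = 2
nll = energies[observed_state] + log_Z
nll.backward()  # gradient lowers the observed state's energy relative to the rest
```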
Likelihood in Variational Autoencoders
Variational autoencoders (VAEs) define a latent-variable model:

$$p_\theta(x, z) = p_\theta(x \mid z)\, p(z).$$

The likelihood of a data point is

$$p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz.$$

This integral is usually intractable. VAEs therefore optimize a lower bound called the evidence lower bound (ELBO):

$$\log p_\theta(x) \;\geq\; \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right).$$
The first term is a reconstruction likelihood. The second term regularizes the latent distribution.
VAEs therefore combine likelihood maximization with probabilistic latent inference.
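A minimal sketch of the negative ELBO under common assumptions: a Gaussian encoder producing mu and log_var, a standard normal prior, and a Bernoulli decoder producing recon_logits; all tensors below are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

# Hypothetical encoder/decoder outputs for a batch of flattened binary inputs
x = torch.rand(8, 784).round()       # placeholder data
mu = torch.randn(8, 32)              # encoder mean
log_var = torch.randn(8, 32)         # encoder log-variance
recon_logits = torch.randn(8, 784)   # decoder logits

# Reconstruction term: Bernoulli negative log-likelihood of x under the decoder
recon_nll = F.binary_cross_entropy_with_logits(recon_logits, x, reduction="sum")

# Closed-form KL divergence between N(mu, sigma^2) and the standard normal prior
kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())

# Negative ELBO, averaged over the batch
loss = (recon_nll + kl) / x.shape[0]
```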
Likelihood and Diffusion Models
Diffusion models also use likelihood-related objectives, although the training procedure is indirect.
A diffusion process gradually adds noise:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\right).$$

The model learns the reverse process:

$$p_\theta(x_{t-1} \mid x_t).$$
Training objectives can be derived from variational likelihood bounds. In practice, many diffusion models minimize denoising objectives equivalent to weighted likelihood optimization.
For example, a common diffusion loss is

$$\mathcal{L} = \mathbb{E}_{t,\, x_0,\, \epsilon}\!\left[\left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2\right],$$

where $\epsilon$ is the noise added to the clean sample $x_0$ and $\epsilon_\theta$ predicts it from the noisy sample $x_t$.
Although this resembles mean squared error, it arises from probabilistic latent-variable modeling.
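A minimal sketch of such a denoising objective, assuming a hypothetical noise-prediction network eps_model and a precomputed cumulative schedule alpha_bar of length T:

```python
import torch

def diffusion_loss(eps_model, x0, alpha_bar):
    """Single-step DDPM-style denoising loss (eps_model and alpha_bar are assumed)."""
    T = alpha_bar.shape[0]
    t = torch.randint(0, T, (x0.shape[0],))   # random timestep per example
    eps = torch.randn_like(x0)                # Gaussian noise

    # Forward process: noisy sample x_t from clean data x0
    a = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps

    # Likelihood-derived denoising objective: predict the added noise
    return ((eps - eps_model(x_t, t)) ** 2).mean()
```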
Numerical Stability
Likelihood objectives frequently involve logarithms and exponentials. Direct computation can become unstable.
For example, computing

$$\log \sum_{k} e^{z_k}$$
naively may overflow when logits are large.
Stable implementations use the log-sum-exp trick:

$$\log \sum_{k} e^{z_k} = m + \log \sum_{k} e^{z_k - m},$$

where

$$m = \max_k z_k.$$
This subtraction prevents exponential overflow.
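A minimal sketch of the trick, using deliberately large hypothetical logits that overflow a naive float32 implementation:

```python
import torch

z = torch.tensor([1000.0, 999.0, 998.0])   # large logits

naive = torch.log(torch.exp(z).sum())       # exp overflows in float32 -> inf
m = z.max()
stable = m + torch.log(torch.exp(z - m).sum())

print(naive, stable, torch.logsumexp(z, dim=0))  # inf, ~1000.41, ~1000.41
```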
PyTorch internally applies stable implementations for:
| Function | Stable implementation |
|---|---|
| CrossEntropyLoss | log_softmax + NLLLoss |
| BCEWithLogitsLoss | sigmoid + BCE |
| logsumexp | stable exponential normalization |
These details are important because unstable probability computations can produce NaNs during training.
Likelihood and Calibration
Likelihood optimization encourages calibrated probabilities.
A calibrated classifier outputs probabilities consistent with empirical frequencies. For example, among predictions made with confidence $p$, roughly a fraction $p$ should be correct.
Cross-entropy strongly penalizes overconfident incorrect predictions. This often improves calibration relative to losses that ignore probability structure.
However, modern deep networks can still become miscalibrated, especially after extreme scaling or overtraining.
Calibration techniques include:
| Method | Purpose |
|---|---|
| Temperature scaling | Softens logits |
| Label smoothing | Reduces overconfidence |
| Ensembles | Improves uncertainty estimates |
| Bayesian inference | Models posterior uncertainty |
Likelihood-based training provides a natural framework for these methods.
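For example, a minimal sketch of temperature scaling (hypothetical logits; in practice the temperature T is fit on a held-out validation set):

```python
import torch

logits = torch.tensor([[4.0, 1.0, 0.5]])
T = 2.0  # temperature > 1 softens the predicted distribution

probs = torch.softmax(logits, dim=-1)
calibrated = torch.softmax(logits / T, dim=-1)

print(probs.max().item(), calibrated.max().item())  # confidence drops after scaling
```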
Generative Modeling Objectives
Generative models often optimize likelihood directly or approximately.
Examples include:
| Model | Objective |
|---|---|
| Autoregressive transformer | Sequence likelihood |
| VAE | Variational lower bound |
| Normalizing flow | Exact likelihood |
| Diffusion model | Variational denoising objective |
| Energy-based model | Energy likelihood |
| GAN | Adversarial divergence objective |
Not all generative models maximize exact likelihood. GANs, for example, optimize an adversarial objective instead.
However, likelihood remains one of the central principles of probabilistic deep learning.
Likelihood Versus Distance-Based Losses
Likelihood objectives differ from purely geometric error measures.
Suppose two models produce the same prediction error magnitude. A likelihood-based model may still assign different confidence levels.
For example:
| Prediction | Confidence | Cross-entropy behavior |
|---|---|---|
| Correct | Low | Moderate loss |
| Correct | High | Small loss |
| Wrong | High | Very large loss |
This probabilistic structure makes likelihood-based objectives especially effective for classification and generative modeling.
Distance-based losses such as MSE measure only numerical difference. Likelihood-based losses model uncertainty and probability structure.
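A small numeric sketch of this behavior, using hypothetical probabilities assigned to the true class:

```python
import torch

# Cross-entropy is the negative log-probability assigned to the true class
cases = [("correct, low confidence", 0.6),
         ("correct, high confidence", 0.99),
         ("wrong, high confidence", 0.01)]
for description, p_true in cases:
    print(description, -torch.log(torch.tensor(p_true)).item())
# ~0.51, ~0.01, ~4.61: confidently wrong predictions are penalized hardest
```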
Practical Guidelines
Use likelihood-based objectives whenever the model predicts probabilities or distributions.
| Task | Recommended objective |
|---|---|
| Regression with Gaussian assumptions | MSE |
| Binary classification | BCEWithLogitsLoss |
| Multiclass classification | CrossEntropyLoss |
| Language modeling | Token-level cross-entropy |
| Probabilistic regression | Predict both mean and variance |
| Generative models | Derive the objective from the underlying probabilistic model whenever possible |
In modern deep learning, most major objectives are likelihood objectives in disguise. Cross-entropy, binary cross-entropy, negative log-likelihood, autoregressive token prediction, and many variational objectives all arise from the same principle: choose parameters that make the observed data as probable as possible under the model,

$$\theta^* = \arg\max_{\theta} \log p_\theta(\mathcal{D}).$$