Bias and Variance

Bias and variance describe two different sources of prediction error. They are useful because they separate errors caused by an overly simple model from errors caused by an overly sensitive model.

A model with high bias makes strong simplifying assumptions. It tends to underfit. A model with high variance changes too much when the training data changes. It tends to overfit.

The practical goal is to find a model that is flexible enough to learn the true pattern but stable enough to generalize to unseen data.

Prediction Error

Assume there is an unknown relationship between input x and target y. A common regression model is

y = f(x) + \epsilon,

where f(x) is the true signal and ϵ is noise.

The learning algorithm sees a finite training set and produces an estimated function

\hat{f}(x).

The prediction error at a point x is

\mathbb{E}\left[(y-\hat{f}(x))^2\right].

This error has three conceptual sources:

| Source | Meaning |
| --- | --- |
| Bias | Error from wrong assumptions |
| Variance | Error from sensitivity to the training set |
| Irreducible noise | Randomness in the data itself |

The first two can be affected by model design and training. The third cannot be removed by a better model if it is truly random.

Bias

Bias measures how far the average learned model is from the true function.

If we trained many models on many different training sets from the same distribution, each model would learn a slightly different function. The average of those learned functions may still be far from the true function. That gap is bias.

A high-bias model is too rigid. It cannot represent the true relationship well.

Examples:

| Situation | Why bias is high |
| --- | --- |
| Linear model for nonlinear data | Model class is too simple |
| Very small neural network | Insufficient capacity |
| Excessive regularization | Model forced to be too smooth |
| Too few training epochs | Optimization stops too early |
| Poor feature representation | Important signal absent |

High bias usually causes high training error and high validation error.

Variance

Variance measures how much the learned model changes when the training set changes.

A high-variance model is too sensitive to details of the training data. It may fit real patterns, but it may also fit noise, rare examples, and accidental correlations.

Examples:

| Situation | Why variance is high |
| --- | --- |
| Very large model on small data | Too many degrees of freedom |
| Weak regularization | Model can fit noise |
| Training too long | Memorization increases |
| Noisy labels | Model learns label errors |
| Data leakage in validation | Selection becomes unstable |

High variance usually causes low training error and high validation error.

Bias-Variance Decomposition

For squared error regression, prediction error can be decomposed into bias, variance, and noise.

Let f̂(x) be the function learned from a random training set. Then:

\mathbb{E}\left[(y-\hat{f}(x))^2\right] = \text{Bias}[\hat{f}(x)]^2 + \text{Var}[\hat{f}(x)] + \sigma^2.

Here σ² is the variance of the irreducible noise.

The bias term is

\text{Bias}[\hat{f}(x)] = \mathbb{E}[\hat{f}(x)] - f(x).

The variance term is

\text{Var}[\hat{f}(x)] = \mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right].

This decomposition is exact under standard squared-error assumptions. For classification and deep neural networks, the same intuition remains useful, but the algebra is less direct.
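The decomposition can be checked numerically. The sketch below (assuming NumPy, a sine signal, and a deliberately rigid degree-1 polynomial model) trains many models on fresh training sets from the same distribution and estimates bias² and variance at a single input point:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)

sigma = 0.2          # noise standard deviation
x0 = 0.3             # evaluation point
n_trials, n_train = 2000, 20

preds = np.empty(n_trials)
for t in range(n_trials):
    # Draw a fresh training set from the same distribution.
    x = rng.uniform(0, 1, n_train)
    y = true_f(x) + rng.normal(0, sigma, n_train)
    # Fit a degree-1 polynomial: a deliberately rigid model class.
    coeffs = np.polyfit(x, y, 1)
    preds[t] = np.polyval(coeffs, x0)

bias_sq = (preds.mean() - true_f(x0)) ** 2
variance = preds.var()
# Expected squared error at x0 is approximately bias^2 + variance + sigma^2.
print(f"bias^2={bias_sq:.4f}  variance={variance:.4f}  noise={sigma**2:.4f}")
```

For this rigid model, the bias² term dominates: averaging the line fits does not recover the sine curve, while the fits themselves vary only modestly across training sets.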

Underfitting and Overfitting

Underfitting occurs when the model cannot fit the training data well. It usually indicates high bias.

Overfitting occurs when the model fits training data much better than validation data. It usually indicates high variance.

| Pattern | Training loss | Validation loss | Likely problem |
| --- | --- | --- | --- |
| Underfitting | High | High | High bias |
| Good fit | Low | Low | Balanced |
| Overfitting | Very low | High | High variance |
| Data or split problem | Low | Unstable or misleading | Leakage, shift, noise |

In practice, training and validation curves are the first diagnostic tool.

A high training loss means the model has not learned the training set. A large gap between training and validation loss means the model has learned the training set more than the underlying pattern.

Model Capacity

Model capacity is the ability of a model class to fit a wide range of functions.

A linear model has limited capacity. A deep neural network with many layers and parameters has much greater capacity.

Increasing capacity usually decreases bias and increases variance.

| Change | Bias | Variance |
| --- | --- | --- |
| Larger model | Decreases | Increases |
| More layers | Decreases | May increase |
| Stronger regularization | Increases | Decreases |
| More training data | Similar | Decreases |
| Better features | Decreases | May decrease |
| Early stopping | Increases | Decreases |

The classical bias-variance tradeoff suggests that one must choose between bias and variance. Modern deep learning complicates this picture because very large models can sometimes generalize well, especially with enough data, regularization, and suitable optimization.
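The capacity effect can be illustrated with a small experiment: fitting polynomials of increasing degree (a stand-in for model capacity, assuming NumPy) to noisy sine data and comparing training and validation error:

```python
import numpy as np

rng = np.random.default_rng(1)

def true_f(x):
    return np.sin(2 * np.pi * x)

# One fixed training set and a larger held-out validation set.
x_tr = rng.uniform(0, 1, 30)
y_tr = true_f(x_tr) + rng.normal(0, 0.2, 30)
x_va = rng.uniform(0, 1, 200)
y_va = true_f(x_va) + rng.normal(0, 0.2, 200)

results = {}
for deg in (1, 3, 9):
    coeffs = np.polyfit(x_tr, y_tr, deg)
    train_mse = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    val_mse = np.mean((np.polyval(coeffs, x_va) - y_va) ** 2)
    results[deg] = (train_mse, val_mse)
    print(f"degree={deg}  train_mse={train_mse:.3f}  val_mse={val_mse:.3f}")
```

Training error falls monotonically as capacity grows; validation error drops sharply once the model can represent the sine shape, then may rise again as higher degrees start fitting noise.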

Bias and Variance in Deep Learning

Classical theory often assumes models become worse after a certain capacity because variance grows. Deep learning often behaves differently.

Large neural networks can have enough parameters to fit the training data perfectly and still generalize well. This happens in many overparameterized regimes.

Several factors help explain this behavior:

| Factor | Effect |
| --- | --- |
| Large datasets | Reduce variance |
| Stochastic gradient descent | Introduces implicit regularization |
| Architecture design | Encodes useful structure |
| Data augmentation | Expands effective training data |
| Weight decay | Limits parameter growth |
| Normalization | Stabilizes optimization |
| Early stopping | Prevents excessive fitting |

Even so, the bias-variance language remains useful. When training loss is too high, increase capacity or improve optimization. When validation loss is too high relative to training loss, improve regularization, data quality, or splitting.
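Early stopping from the table above can be sketched as a simple patience rule: stop once the validation loss has failed to improve for a fixed number of epochs. The validation losses below are illustrative numbers, not real training output:

```python
best, patience, bad = float("inf"), 3, 0
val_losses = [1.9, 1.5, 1.2, 1.1, 1.15, 1.2, 1.3]  # illustrative validation curve
stopped_at = None

for epoch, v in enumerate(val_losses, start=1):
    if v < best:
        best, bad = v, 0          # improvement: record it and reset the counter
    else:
        bad += 1                  # no improvement this epoch
        if bad >= patience:
            stopped_at = epoch    # stop after `patience` epochs without improvement
            break

print(f"best_val={best}  stopped_at_epoch={stopped_at}")
```

In a real training loop, the model weights from the best-validation epoch would be saved and restored when stopping.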

Diagnosing High Bias

A model likely has high bias when:

| Symptom | Interpretation |
| --- | --- |
| Training loss remains high | Model cannot fit data |
| Validation loss remains high | Poor generalization follows poor fit |
| Both curves plateau early | Capacity or optimization limit |
| More training data does not help much | Model class may be wrong |
| Predictions are overly smooth | Model cannot represent detail |

Ways to reduce bias:

  1. Use a larger model.
  2. Train longer.
  3. Reduce excessive regularization.
  4. Use a better architecture.
  5. Improve input features or preprocessing.
  6. Use a more suitable loss function.
  7. Tune the learning rate and optimizer.

In PyTorch, a high-bias model may be a small MLP:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(784, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 10),
)
```

Increasing hidden width may reduce bias:

```python
model = torch.nn.Sequential(
    torch.nn.Linear(784, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
)
```

The second model has more capacity. It can represent more complex functions.

Diagnosing High Variance

A model likely has high variance when:

| Symptom | Interpretation |
| --- | --- |
| Training loss is very low | Model fits training data |
| Validation loss is much higher | Generalization gap |
| Validation metric fluctuates strongly | Model or dataset instability |
| Performance depends heavily on random seed | Training is sensitive |
| Small data changes alter results | Model depends on sample noise |

Ways to reduce variance:

  1. Add more training data.
  2. Use data augmentation.
  3. Increase weight decay.
  4. Add dropout.
  5. Use early stopping.
  6. Reduce model capacity.
  7. Improve label quality.
  8. Use ensembling.
  9. Use a better train-validation split.

Example with dropout and weight decay:

```python
model = torch.nn.Sequential(
    torch.nn.Linear(784, 512),
    torch.nn.ReLU(),
    torch.nn.Dropout(p=0.3),
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Dropout(p=0.3),
    torch.nn.Linear(512, 10),
)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    weight_decay=0.01,
)
```

Dropout injects noise into hidden activations during training. Weight decay discourages large parameter values. Both can reduce overfitting.
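One practical detail: torch.nn.Dropout is only active in training mode, so call model.eval() before computing validation metrics. A minimal demonstration on a standalone dropout layer:

```python
import torch

torch.manual_seed(0)

drop = torch.nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()       # dropout active: random units zeroed, the rest scaled by 1/(1-p)
print(drop(x))     # a mix of 0.0 and 2.0

drop.eval()        # dropout disabled at evaluation time
print(drop(x))     # input passes through unchanged
```

Forgetting model.eval() during validation makes validation loss noisy and pessimistic, which can be mistaken for a variance problem.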

Learning Curves

Learning curves plot performance as the amount of training data increases.

They help distinguish bias from variance.

A high-bias model usually has training and validation errors close together, both at poor values. Adding more data often gives little improvement.

A high-variance model usually has low training error and much higher validation error. Adding more data often helps.

| Learning curve pattern | Diagnosis |
| --- | --- |
| High train error, high validation error | High bias |
| Low train error, high validation error | High variance |
| Both improve with more data | Data helps |
| Validation error stops improving | Model, data, or objective limit |

Learning curves are useful because they show whether more data is likely to help. If the model already underfits, collecting more data may be less useful than improving the model.
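A learning curve can be computed by training the same model on growing subsets of the data. A minimal sketch (assuming NumPy, with a flexible degree-9 polynomial as the high-variance model):

```python
import numpy as np

rng = np.random.default_rng(2)

def make_data(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)

x_va, y_va = make_data(500)       # fixed validation set
x_all, y_all = make_data(200)     # pool of training data

results = {}
for n in (10, 25, 50, 100, 200):
    # Train the same flexible model on a growing prefix of the data.
    coeffs = np.polyfit(x_all[:n], y_all[:n], 9)
    train_mse = np.mean((np.polyval(coeffs, x_all[:n]) - y_all[:n]) ** 2)
    val_mse = np.mean((np.polyval(coeffs, x_va) - y_va) ** 2)
    results[n] = (train_mse, val_mse)
    print(f"n={n:3d}  train_mse={train_mse:.3f}  val_mse={val_mse:.3f}")
```

With only 10 points, the degree-9 model interpolates the training set almost exactly while validation error is large; as n grows, the gap closes, which is the high-variance signature described above.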

Validation Gap

The validation gap is the difference between validation error and training error.

For losses:

\text{gap} = L_{\text{val}} - L_{\text{train}}.

For accuracy:

\text{gap} = \text{Acc}_{\text{train}} - \text{Acc}_{\text{val}}.

A small gap with poor performance suggests underfitting. A large gap suggests overfitting.

In PyTorch-style training logs:

```
epoch 1: train_loss=1.95 val_loss=1.92
epoch 2: train_loss=1.70 val_loss=1.69
epoch 3: train_loss=1.55 val_loss=1.58
```

This looks like reasonable learning.

```
epoch 1: train_loss=1.30 val_loss=1.65
epoch 2: train_loss=0.72 val_loss=1.80
epoch 3: train_loss=0.25 val_loss=2.10
```

This suggests overfitting.
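The gap formula above can be applied directly to logged losses; the lists below restate the hypothetical overfitting log:

```python
# Hypothetical logged losses from the overfitting run above.
train_loss = [1.30, 0.72, 0.25]
val_loss = [1.65, 1.80, 2.10]

gaps = [round(v - t, 2) for t, v in zip(train_loss, val_loss)]
for epoch, gap in enumerate(gaps, start=1):
    print(f"epoch {epoch}: gap={gap:.2f}")
```

The gap widens every epoch (0.35, 1.08, 1.85), which is the signature of overfitting: the model keeps improving on the training set while getting worse on held-out data.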

Irreducible Error

Some error remains even with the best possible model.

Sources include:

| Source | Example |
| --- | --- |
| Measurement noise | Sensor error |
| Label ambiguity | Multiple valid labels |
| Hidden variables | Missing causal factors |
| Randomness | Truly stochastic outcomes |
| Human disagreement | Different annotators choose different labels |

If two expert annotators disagree on a medical image, the model may have no single target that is always correct.

Irreducible error sets a ceiling on performance. The right response may be better data collection, better labels, or probabilistic prediction rather than a larger model.

Practical Checklist

When performance is poor, inspect the training and validation metrics.

| Observation | Likely action |
| --- | --- |
| Training loss high | Increase capacity, train longer, tune optimizer |
| Training and validation both poor | Reduce bias |
| Training good, validation poor | Reduce variance |
| Validation unstable | Use more data, stratified split, repeated runs |
| Test worse than validation | Check test shift or validation overuse |
| All metrics poor despite large model | Inspect labels, preprocessing, objective |

Bias and variance are diagnostic tools. They do not replace error analysis, but they give a useful first map of the problem.

Summary

Bias is error from overly restrictive assumptions. Variance is error from excessive sensitivity to the training set.

High bias leads to underfitting. High variance leads to overfitting. Model capacity, data size, regularization, optimization, and architecture all affect the balance.

In deep learning, the classical tradeoff is modified by overparameterization, large datasets, and implicit regularization. The practical method remains the same: compare training and validation behavior, identify the dominant failure mode, and change the model or data accordingly.