Bias and Variance

Bias and variance describe two different sources of prediction error. They are useful because they separate errors caused by an overly simple model from errors caused by an overly sensitive model.

A model with high bias makes strong simplifying assumptions. It tends to underfit. A model with high variance changes too much when the training data changes. It tends to overfit.

The practical goal is to find a model that is flexible enough to learn the true pattern but stable enough to generalize to unseen data.

Prediction Error

Assume there is an unknown relationship between input x and target y. A common regression model is

y = f(x) + \epsilon,

where f(x) is the true signal and ϵ is noise.

The learning algorithm sees a finite training set and produces an estimated function

\hat{f}(x).

The prediction error at a point x is

\mathbb{E}\left[(y-\hat{f}(x))^2\right].

This error has three conceptual sources:

| Source | Meaning |
| --- | --- |
| Bias | Error from wrong assumptions |
| Variance | Error from sensitivity to the training set |
| Irreducible noise | Randomness in the data itself |

The first two can be affected by model design and training. The third cannot be removed by a better model if it is truly random.

Bias

Bias measures how far the average learned model is from the true function.

If we trained many models on many different training sets from the same distribution, each model would learn a slightly different function. The average of those learned functions may still be far from the true function. That gap is bias.

A high-bias model is too rigid. It cannot represent the true relationship well.

Examples:

| Situation | Why bias is high |
| --- | --- |
| Linear model for nonlinear data | Model class is too simple |
| Very small neural network | Insufficient capacity |
| Excessive regularization | Model forced to be too smooth |
| Too few training epochs | Optimization stops too early |
| Poor feature representation | Important signal absent |

High bias usually causes high training error and high validation error.

Variance

Variance measures how much the learned model changes when the training set changes.

A high-variance model is too sensitive to details of the training data. It may fit real patterns, but it may also fit noise, rare examples, and accidental correlations.

Examples:

| Situation | Why variance is high |
| --- | --- |
| Very large model on small data | Too many degrees of freedom |
| Weak regularization | Model can fit noise |
| Training too long | Memorization increases |
| Noisy labels | Model learns label errors |
| Data leakage in validation | Selection becomes unstable |

High variance usually causes low training error and high validation error.

Bias-Variance Decomposition

For squared error regression, prediction error can be decomposed into bias, variance, and noise.

Let f̂(x) be the function learned from a random training set. Then:

\mathbb{E}\left[(y-\hat{f}(x))^2\right] = \text{Bias}[\hat{f}(x)]^2 + \text{Var}[\hat{f}(x)] + \sigma^2.

Here σ² is the variance of the irreducible noise.

The bias term is

\text{Bias}[\hat{f}(x)] = \mathbb{E}[\hat{f}(x)] - f(x).

The variance term is

\text{Var}[\hat{f}(x)] = \mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right].

This decomposition is exact under standard squared-error assumptions. For classification and deep neural networks, the same intuition remains useful, but the algebra is less direct.
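The decomposition can be checked numerically. The sketch below (assuming NumPy, a sine signal, and a deliberately rigid degree-1 polynomial model) trains many models on fresh training sets from the same distribution and estimates bias² and variance at a single input point:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)

sigma = 0.2          # noise standard deviation
x0 = 0.3             # evaluation point
n_trials, n_train = 2000, 20

preds = np.empty(n_trials)
for t in range(n_trials):
    # Draw a fresh training set from the same distribution.
    x = rng.uniform(0, 1, n_train)
    y = true_f(x) + rng.normal(0, sigma, n_train)
    # Fit a degree-1 polynomial: a deliberately rigid model class.
    coeffs = np.polyfit(x, y, 1)
    preds[t] = np.polyval(coeffs, x0)

bias_sq = (preds.mean() - true_f(x0)) ** 2
variance = preds.var()
# Expected squared error at x0 is approximately bias^2 + variance + sigma^2.
print(f"bias^2={bias_sq:.4f}  variance={variance:.4f}  noise={sigma**2:.4f}")
```

For this rigid model, the bias² term dominates: averaging the line fits does not recover the sine curve, while the fits themselves vary only modestly across training sets.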

Underfitting and Overfitting

Underfitting occurs when the model cannot fit the training data well. It usually indicates high bias.

Overfitting occurs when the model fits training data much better than validation data. It usually indicates high variance.

| Pattern | Training loss | Validation loss | Likely problem |
| --- | --- | --- | --- |
| Underfitting | High | High | High bias |
| Good fit | Low | Low | Balanced |
| Overfitting | Very low | High | High variance |
| Data or split problem | Low | Unstable or misleading | Leakage, shift, noise |

In practice, training and validation curves are the first diagnostic tool.

A high training loss means the model has not learned the training set. A large gap between training and validation loss means the model has learned the training set more than the underlying pattern.

Model Capacity

Model capacity is the ability of a model class to fit a wide range of functions.

A linear model has limited capacity. A deep neural network with many layers and parameters has much greater capacity.

Increasing capacity usually decreases bias and increases variance.

| Change | Bias | Variance |
| --- | --- | --- |
| Larger model | Decreases | Increases |
| More layers | Decreases | May increase |
| Stronger regularization | Increases | Decreases |
| More training data | Similar | Decreases |
| Better features | Decreases | May decrease |
| Early stopping | Increases | Decreases |

The classical bias-variance tradeoff suggests that one must choose between bias and variance. Modern deep learning complicates this picture because very large models can sometimes generalize well, especially with enough data, regularization, and suitable optimization.
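The capacity effect can be illustrated with a small experiment: fitting polynomials of increasing degree (a stand-in for model capacity, assuming NumPy) to noisy sine data and comparing training and validation error:

```python
import numpy as np

rng = np.random.default_rng(1)

def true_f(x):
    return np.sin(2 * np.pi * x)

# One fixed training set and a larger held-out validation set.
x_tr = rng.uniform(0, 1, 30)
y_tr = true_f(x_tr) + rng.normal(0, 0.2, 30)
x_va = rng.uniform(0, 1, 200)
y_va = true_f(x_va) + rng.normal(0, 0.2, 200)

results = {}
for deg in (1, 3, 9):
    coeffs = np.polyfit(x_tr, y_tr, deg)
    train_mse = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    val_mse = np.mean((np.polyval(coeffs, x_va) - y_va) ** 2)
    results[deg] = (train_mse, val_mse)
    print(f"degree={deg}  train_mse={train_mse:.3f}  val_mse={val_mse:.3f}")
```

Training error falls monotonically as capacity grows; validation error drops sharply once the model can represent the sine shape, then may rise again as higher degrees start fitting noise.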

Bias and Variance in Deep Learning

Classical theory often assumes models become worse after a certain capacity because variance grows. Deep learning often behaves differently.

Large neural networks can have enough parameters to fit the training data perfectly and still generalize well. This happens in many overparameterized regimes.

Several factors help explain this behavior:

| Factor | Effect |
| --- | --- |
| Large datasets | Reduce variance |
| Stochastic gradient descent | Introduces implicit regularization |
| Architecture design | Encodes useful structure |
| Data augmentation | Expands effective training data |
| Weight decay | Limits parameter growth |
| Normalization | Stabilizes optimization |
| Early stopping | Prevents excessive fitting |

Even so, the bias-variance language remains useful. When training loss is too high, increase capacity or improve optimization. When validation loss is too high relative to training loss, improve regularization, data quality, or splitting.
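Early stopping from the table above can be sketched as a simple patience rule: stop once the validation loss has failed to improve for a fixed number of epochs. The validation losses below are illustrative numbers, not real training output:

```python
best, patience, bad = float("inf"), 3, 0
val_losses = [1.9, 1.5, 1.2, 1.1, 1.15, 1.2, 1.3]  # illustrative validation curve
stopped_at = None

for epoch, v in enumerate(val_losses, start=1):
    if v < best:
        best, bad = v, 0          # improvement: record it and reset the counter
    else:
        bad += 1                  # no improvement this epoch
        if bad >= patience:
            stopped_at = epoch    # stop after `patience` epochs without improvement
            break

print(f"best_val={best}  stopped_at_epoch={stopped_at}")
```

In a real training loop, the model weights from the best-validation epoch would be saved and restored when stopping.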

Diagnosing High Bias

A model likely has high bias when:

| Symptom | Interpretation |
| --- | --- |
| Training loss remains high | Model cannot fit data |
| Validation loss remains high | Poor generalization follows poor fit |
| Both curves plateau early | Capacity or optimization limit |
| More training data does not help much | Model class may be wrong |
| Predictions are overly smooth | Model cannot represent detail |

Ways to reduce bias:

  1. Use a larger model.
  2. Train longer.
  3. Reduce excessive regularization.
  4. Use a better architecture.
  5. Improve input features or preprocessing.
  6. Use a more suitable loss function.
  7. Tune the learning rate and optimizer.

In PyTorch, a high-bias model may be a small MLP:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(784, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 10),
)
```

Increasing hidden width may reduce bias:

```python
model = torch.nn.Sequential(
    torch.nn.Linear(784, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
)
```

The second model has more capacity. It can represent more complex functions.

Diagnosing High Variance

A model likely has high variance when:

| Symptom | Interpretation |
| --- | --- |
| Training loss is very low | Model fits training data |
| Validation loss is much higher | Generalization gap |
| Validation metric fluctuates strongly | Model or dataset instability |
| Performance depends heavily on random seed | Training is sensitive |
| Small data changes alter results | Model depends on sample noise |

Ways to reduce variance:

  1. Add more training data.
  2. Use data augmentation.
  3. Increase weight decay.
  4. Add dropout.
  5. Use early stopping.
  6. Reduce model capacity.
  7. Improve label quality.
  8. Use ensembling.
  9. Use a better train-validation split.

Example with dropout and weight decay:

```python
model = torch.nn.Sequential(
    torch.nn.Linear(784, 512),
    torch.nn.ReLU(),
    torch.nn.Dropout(p=0.3),
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Dropout(p=0.3),
    torch.nn.Linear(512, 10),
)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    weight_decay=0.01,
)
```

Dropout injects noise into hidden activations during training. Weight decay discourages large parameter values. Both can reduce overfitting.
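One practical detail: torch.nn.Dropout is only active in training mode, so call model.eval() before computing validation metrics. A minimal demonstration on a standalone dropout layer:

```python
import torch

torch.manual_seed(0)

drop = torch.nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()       # dropout active: random units zeroed, the rest scaled by 1/(1-p)
print(drop(x))     # a mix of 0.0 and 2.0

drop.eval()        # dropout disabled at evaluation time
print(drop(x))     # input passes through unchanged
```

Forgetting model.eval() during validation makes validation loss noisy and pessimistic, which can be mistaken for a variance problem.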

Learning Curves

Learning curves plot performance as the amount of training data increases.

They help distinguish bias from variance.

A high-bias model usually has training and validation errors close together, both at poor values. Adding more data often gives little improvement.

A high-variance model usually has low training error and much higher validation error. Adding more data often helps.

| Learning curve pattern | Diagnosis |
| --- | --- |
| High train error, high validation error | High bias |
| Low train error, high validation error | High variance |
| Both improve with more data | Data helps |
| Validation error stops improving | Model, data, or objective limit |

Learning curves are useful because they show whether more data is likely to help. If the model already underfits, collecting more data may be less useful than improving the model.
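A learning curve can be computed by training the same model on growing subsets of the data. A minimal sketch (assuming NumPy, with a flexible degree-9 polynomial as the high-variance model):

```python
import numpy as np

rng = np.random.default_rng(2)

def make_data(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)

x_va, y_va = make_data(500)       # fixed validation set
x_all, y_all = make_data(200)     # pool of training data

results = {}
for n in (10, 25, 50, 100, 200):
    # Train the same flexible model on a growing prefix of the data.
    coeffs = np.polyfit(x_all[:n], y_all[:n], 9)
    train_mse = np.mean((np.polyval(coeffs, x_all[:n]) - y_all[:n]) ** 2)
    val_mse = np.mean((np.polyval(coeffs, x_va) - y_va) ** 2)
    results[n] = (train_mse, val_mse)
    print(f"n={n:3d}  train_mse={train_mse:.3f}  val_mse={val_mse:.3f}")
```

With only 10 points, the degree-9 model interpolates the training set almost exactly while validation error is large; as n grows, the gap closes, which is the high-variance signature described above.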

Validation Gap

The validation gap is the difference between validation error and training error.

For losses:

\text{gap} = L_{\text{val}} - L_{\text{train}}.

For accuracy:

\text{gap} = \text{Acc}_{\text{train}} - \text{Acc}_{\text{val}}.

A small gap with poor performance suggests underfitting. A large gap suggests overfitting.

In PyTorch-style training logs:

```
epoch 1: train_loss=1.95 val_loss=1.92
epoch 2: train_loss=1.70 val_loss=1.69
epoch 3: train_loss=1.55 val_loss=1.58
```

This looks like reasonable learning.

```
epoch 1: train_loss=1.30 val_loss=1.65
epoch 2: train_loss=0.72 val_loss=1.80
epoch 3: train_loss=0.25 val_loss=2.10
```

This suggests overfitting.
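The gap formula above can be applied directly to logged losses; the lists below restate the hypothetical overfitting log:

```python
# Hypothetical logged losses from the overfitting run above.
train_loss = [1.30, 0.72, 0.25]
val_loss = [1.65, 1.80, 2.10]

gaps = [round(v - t, 2) for t, v in zip(train_loss, val_loss)]
for epoch, gap in enumerate(gaps, start=1):
    print(f"epoch {epoch}: gap={gap:.2f}")
```

The gap widens every epoch (0.35, 1.08, 1.85), which is the signature of overfitting: the model keeps improving on the training set while getting worse on held-out data.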

Irreducible Error

Some error remains even with the best possible model.

Sources include:

| Source | Example |
| --- | --- |
| Measurement noise | Sensor error |
| Label ambiguity | Multiple valid labels |
| Hidden variables | Missing causal factors |
| Randomness | Truly stochastic outcomes |
| Human disagreement | Different annotators choose different labels |

If two expert annotators disagree on a medical image, the model may have no single target that is always correct.

Irreducible error sets a ceiling on performance. The right response may be better data collection, better labels, or probabilistic prediction rather than a larger model.

Practical Checklist

When performance is poor, inspect the training and validation metrics.

| Observation | Likely action |
| --- | --- |
| Training loss high | Increase capacity, train longer, tune optimizer |
| Training and validation both poor | Reduce bias |
| Training good, validation poor | Reduce variance |
| Validation unstable | Use more data, stratified split, repeated runs |
| Test worse than validation | Check test shift or validation overuse |
| All metrics poor despite large model | Inspect labels, preprocessing, objective |

Bias and variance are diagnostic tools. They do not replace error analysis, but they give a useful first map of the problem.

Summary

Bias is error from overly restrictive assumptions. Variance is error from excessive sensitivity to the training set.

High bias leads to underfitting. High variance leads to overfitting. Model capacity, data size, regularization, optimization, and architecture all affect the balance.

In deep learning, the classical tradeoff is modified by overparameterization, large datasets, and implicit regularization. The practical method remains the same: compare training and validation behavior, identify the dominant failure mode, and change the model or data accordingly.