Bias and variance describe two different sources of prediction error. They are useful because they separate errors caused by an overly simple model from errors caused by an overly sensitive model.
A model with high bias makes strong simplifying assumptions. It tends to underfit. A model with high variance changes too much when the training data changes. It tends to overfit.
The practical goal is to find a model that is flexible enough to learn the true pattern but stable enough to generalize to unseen data.
## Prediction Error
Assume there is an unknown relationship between input $x$ and target $y$. A common regression model is

$$y = f(x) + \varepsilon,$$

where $f$ is the true signal and $\varepsilon$ is zero-mean noise with variance $\sigma^2$.

The learning algorithm sees a finite training set $D$ and produces an estimated function $\hat{f}_D$.

The prediction error at a point $x$ is $y - \hat{f}_D(x)$.
This error has three conceptual sources:
| Source | Meaning |
|---|---|
| Bias | Error from wrong assumptions |
| Variance | Error from sensitivity to the training set |
| Irreducible noise | Randomness in the data itself |
The first two can be affected by model design and training. The third cannot be removed by a better model if it is truly random.
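These three sources can be made concrete with a small simulation. The sketch below is illustrative and not from the text: it trains a deliberately rigid constant model on many resampled training sets, then checks that bias squared, variance, and noise together account for the measured squared error at one point.

```python
import math
import random

random.seed(0)

NOISE_SD = 0.3            # standard deviation of the label noise
x0 = 1.0                  # evaluation point

def true_f(x):
    return math.sin(x)

def sample_train_set(n=20):
    xs = [random.uniform(0, 2) for _ in range(n)]
    ys = [true_f(x) + random.gauss(0, NOISE_SD) for x in xs]
    return xs, ys

# A deliberately rigid model: predict the mean training target everywhere.
def fit_constant(xs, ys):
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

# Train on many independent training sets; record each model's prediction at x0.
preds = [fit_constant(*sample_train_set())(x0) for _ in range(2000)]

mean_pred = sum(preds) / len(preds)
bias_sq = (mean_pred - true_f(x0)) ** 2
variance = sum((p - mean_pred) ** 2 for p in preds) / len(preds)
noise = NOISE_SD ** 2

# Direct Monte Carlo estimate of the expected squared error at x0,
# using a fresh noisy label against each trained model.
direct = sum((true_f(x0) + random.gauss(0, NOISE_SD) - p) ** 2
             for p in preds) / len(preds)

print(f"bias^2={bias_sq:.4f} variance={variance:.4f} noise={noise:.4f}")
print(f"sum={bias_sq + variance + noise:.4f} direct estimate={direct:.4f}")
```

For this rigid model, noise dominates, bias is the next largest term, and the three components together closely match the directly measured error.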
## Bias
Bias measures how far the average learned model is from the true function.
If we trained many models on many different training sets from the same distribution, each model would learn a slightly different function. The average of those learned functions may still be far from the true function. That gap is bias.
A high-bias model is too rigid. It cannot represent the true relationship well.
Examples:
| Situation | Why bias is high |
|---|---|
| Linear model for nonlinear data | Model class is too simple |
| Very small neural network | Insufficient capacity |
| Excessive regularization | Model forced to be too smooth |
| Too few training epochs | Optimization stops too early |
| Poor feature representation | Important signal absent |
High bias usually causes high training error and high validation error.
## Variance
Variance measures how much the learned model changes when the training set changes.
A high-variance model is too sensitive to details of the training data. It may fit real patterns, but it may also fit noise, rare examples, and accidental correlations.
Examples:
| Situation | Why variance is high |
|---|---|
| Very large model on small data | Too many degrees of freedom |
| Weak regularization | Model can fit noise |
| Training too long | Memorization increases |
| Noisy labels | Model learns label errors |
| Data leakage in validation | Selection becomes unstable |
High variance usually causes low training error and high validation error.
## Bias-Variance Decomposition
For squared error regression, prediction error can be decomposed into bias, variance, and noise.
Let $\hat{f}_D$ be the function learned from a random training set $D$. Then:

$$\mathbb{E}\big[(y - \hat{f}_D(x))^2\big] = \big(\mathbb{E}_D[\hat{f}_D(x)] - f(x)\big)^2 + \mathbb{E}_D\big[(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)])^2\big] + \sigma^2.$$

Here $\sigma^2$ is irreducible noise.

The bias term is

$$\text{Bias}[\hat{f}_D(x)] = \mathbb{E}_D[\hat{f}_D(x)] - f(x).$$

The variance term is

$$\text{Var}[\hat{f}_D(x)] = \mathbb{E}_D\big[(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)])^2\big].$$
This decomposition is exact under standard squared-error assumptions. For classification and deep neural networks, the same intuition remains useful, but the algebra is less direct.
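The decomposition can be estimated numerically. The following sketch (an assumed setup, not from the text: noisy samples of $\sin(2\pi x)$, polynomial fits via `numpy.polyfit`) averages many training runs to estimate bias squared and variance for a low-degree and a high-degree model:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)

NOISE_SD = 0.4
N_TRAIN, N_TRIALS = 30, 500
x_grid = np.linspace(0, 1, 50)   # points at which learned models are evaluated

def bias_variance(degree):
    """Monte Carlo estimate of average bias^2 and variance over x_grid."""
    preds = np.empty((N_TRIALS, x_grid.size))
    for t in range(N_TRIALS):
        x = rng.uniform(0, 1, N_TRAIN)
        y = true_f(x) + rng.normal(0, NOISE_SD, N_TRAIN)
        preds[t] = np.polyval(np.polyfit(x, y, degree), x_grid)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - true_f(x_grid)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

for degree in (1, 7):
    b, v = bias_variance(degree)
    print(f"degree={degree}: bias^2={b:.4f} variance={v:.4f}")
```

The degree-1 model cannot represent the sine wave, so its bias term dominates; the degree-7 model fits the shape well but its predictions vary more across training sets.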
## Underfitting and Overfitting
Underfitting occurs when the model cannot fit the training data well. It usually indicates high bias.
Overfitting occurs when the model fits training data much better than validation data. It usually indicates high variance.
| Pattern | Training loss | Validation loss | Likely problem |
|---|---|---|---|
| Underfitting | High | High | High bias |
| Good fit | Low | Low | Balanced |
| Overfitting | Very low | High | High variance |
| Data or split problem | Low | Unstable or misleading | Leakage, shift, noise |
In practice, training and validation curves are the first diagnostic tool.
A high training loss means the model has not learned the training set. A large gap between training and validation loss means the model has learned the training set more than the underlying pattern.
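That rule of thumb can be written down as a tiny helper. The threshold values below are arbitrary placeholders; real cutoffs depend on the task and the scale of the loss.

```python
def diagnose(train_loss, val_loss, fit_threshold=0.5, gap_threshold=0.2):
    """Rough bias/variance diagnosis from final losses.

    The thresholds are problem-dependent placeholders, not universal values.
    """
    if train_loss > fit_threshold:
        return "high bias: model has not fit the training set"
    if val_loss - train_loss > gap_threshold:
        return "high variance: large generalization gap"
    return "no obvious bias/variance problem"

print(diagnose(1.4, 1.5))   # high training loss -> underfitting
print(diagnose(0.2, 0.9))   # large gap -> overfitting
print(diagnose(0.3, 0.4))   # neither symptom
```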
## Model Capacity
Model capacity is the ability of a model class to fit a wide range of functions.
A linear model has limited capacity. A deep neural network with many layers and parameters has much greater capacity.
Increasing capacity usually decreases bias and increases variance.
| Change | Bias | Variance |
|---|---|---|
| Larger model | Decreases | Increases |
| More layers | Decreases | May increase |
| Stronger regularization | Increases | Decreases |
| More training data | Similar | Decreases |
| Better features | Decreases | May decrease |
| Early stopping | Increases | Decreases |
The classical bias-variance tradeoff suggests that one must trade bias against variance: reducing one tends to increase the other. Modern deep learning complicates this picture, because very large models can sometimes generalize well, especially with enough data, regularization, and suitable optimization.
## Bias and Variance in Deep Learning
Classical theory often assumes models become worse after a certain capacity because variance grows. Deep learning often behaves differently.
Large neural networks can have enough parameters to fit the training data perfectly and still generalize well. This happens in many overparameterized regimes.
Several factors help explain this behavior:
| Factor | Effect |
|---|---|
| Large datasets | Reduce variance |
| Stochastic gradient descent | Introduces implicit regularization |
| Architecture design | Encodes useful structure |
| Data augmentation | Expands effective training data |
| Weight decay | Limits parameter growth |
| Normalization | Stabilizes optimization |
| Early stopping | Prevents excessive fitting |
Even so, the bias-variance language remains useful. When training loss is too high, increase capacity or improve optimization. When validation loss is too high relative to training loss, improve regularization, data quality, or splitting.
## Diagnosing High Bias
A model likely has high bias when:
| Symptom | Interpretation |
|---|---|
| Training loss remains high | Model cannot fit data |
| Validation loss remains high | Poor generalization follows poor fit |
| Both curves plateau early | Capacity or optimization limit |
| More training data does not help much | Model class may be wrong |
| Predictions are overly smooth | Model cannot represent detail |
Ways to reduce bias:
- Use a larger model.
- Train longer.
- Reduce excessive regularization.
- Use a better architecture.
- Improve input features or preprocessing.
- Use a more suitable loss function.
- Tune the learning rate and optimizer.
In PyTorch, a high-bias model may be a small MLP:
```python
model = torch.nn.Sequential(
    torch.nn.Linear(784, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 10),
)
```

Increasing hidden width may reduce bias:
```python
model = torch.nn.Sequential(
    torch.nn.Linear(784, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
)
```

The second model has more capacity. It can represent more complex functions.
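The capacity difference is easy to quantify by counting parameters: a `Linear(n_in, n_out)` layer has `n_in * n_out` weights plus `n_out` biases. A quick back-of-envelope check:

```python
def linear_params(n_in, n_out):
    # weights plus biases of one fully connected layer
    return n_in * n_out + n_out

# Small MLP: 784 -> 32 -> 10
small = linear_params(784, 32) + linear_params(32, 10)
# Larger MLP: 784 -> 512 -> 512 -> 10
large = linear_params(784, 512) + linear_params(512, 512) + linear_params(512, 10)

print(small)  # 25450
print(large)  # 669706
```

The larger network has roughly 26 times as many parameters, which is one crude proxy for its greater capacity.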
## Diagnosing High Variance
A model likely has high variance when:
| Symptom | Interpretation |
|---|---|
| Training loss is very low | Model fits training data |
| Validation loss is much higher | Generalization gap |
| Validation metric fluctuates strongly | Model or dataset instability |
| Performance depends heavily on random seed | Training is sensitive |
| Small data changes alter results | Model depends on sample noise |
Ways to reduce variance:
- Add more training data.
- Use data augmentation.
- Increase weight decay.
- Add dropout.
- Use early stopping.
- Reduce model capacity.
- Improve label quality.
- Use ensembling.
- Use a better train-validation split.
Example with dropout and weight decay:
```python
model = torch.nn.Sequential(
    torch.nn.Linear(784, 512),
    torch.nn.ReLU(),
    torch.nn.Dropout(p=0.3),
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Dropout(p=0.3),
    torch.nn.Linear(512, 10),
)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    weight_decay=0.01,
)
```

Dropout injects noise into hidden activations during training. Weight decay discourages large parameter values. Both can reduce overfitting.
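Early stopping, listed above as another variance-reduction tool, amounts to a patience rule on validation loss. A minimal sketch with synthetic per-epoch losses (the numbers are illustrative, not from a real run):

```python
def train_with_early_stopping(val_losses, patience=3):
    """Return (best_epoch, best_loss) under a simple patience rule.

    `val_losses` stands in for per-epoch validation losses; in a real
    training loop each value would be computed after an epoch.
    """
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` epochs
    return best_epoch, best

# Validation loss improves, then degrades as the model starts to overfit.
losses = [1.9, 1.4, 1.1, 0.95, 0.97, 1.05, 1.20, 1.40]
epoch, loss = train_with_early_stopping(losses)
print(epoch, loss)
```

Training halts shortly after the validation minimum at epoch 3, and the checkpoint from that epoch would be kept.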
## Learning Curves
Learning curves plot performance as the amount of training data increases.
They help distinguish bias from variance.
A high-bias model usually has training and validation errors close together, both at poor values. Adding more data often gives little improvement.
A high-variance model usually has low training error and much higher validation error. Adding more data often helps.
| Learning curve pattern | Diagnosis |
|---|---|
| High train error, high validation error | High bias |
| Low train error, high validation error | High variance |
| Both improve with more data | Data helps |
| Validation error stops improving | Model, data, or objective limit |
Learning curves are useful because they show whether more data is likely to help. If the model already underfits, collecting more data may be less useful than improving the model.
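A learning curve can be simulated directly. The sketch below uses an assumed setup (a flexible degree-7 polynomial fit to noisy samples of $\sin(2\pi x)$) and measures average train and validation error as the training set grows; for this high-variance model, validation error should fall as data is added.

```python
import numpy as np

rng = np.random.default_rng(1)

def true_f(x):
    return np.sin(2 * np.pi * x)

def errors_at_size(n_train, degree=7, noise_sd=0.3, n_val=500, n_trials=50):
    """Average train/validation MSE of a polynomial fit at one dataset size."""
    tr, va = [], []
    for _ in range(n_trials):
        x = rng.uniform(0, 1, n_train)
        y = true_f(x) + rng.normal(0, noise_sd, n_train)
        coeffs = np.polyfit(x, y, degree)
        xv = rng.uniform(0, 1, n_val)
        yv = true_f(xv) + rng.normal(0, noise_sd, n_val)
        tr.append(np.mean((np.polyval(coeffs, x) - y) ** 2))
        va.append(np.mean((np.polyval(coeffs, xv) - yv) ** 2))
    return np.mean(tr), np.mean(va)

for n in (15, 60, 240):
    t, v = errors_at_size(n)
    print(f"n={n}: train={t:.3f} val={v:.3f}")
```

Training error sits below the noise floor at every size, while validation error shrinks toward it as the sample grows, which is the signature of a variance-limited model that more data helps.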
## Validation Gap
The validation gap is the difference between validation error and training error.
For losses:

$$\text{gap} = L_{\text{val}} - L_{\text{train}}.$$

For accuracy:

$$\text{gap} = \text{acc}_{\text{train}} - \text{acc}_{\text{val}}.$$
A small gap with poor performance suggests underfitting. A large gap suggests overfitting.
In PyTorch-style training logs:
```
epoch 1: train_loss=1.95 val_loss=1.92
epoch 2: train_loss=1.70 val_loss=1.69
epoch 3: train_loss=1.55 val_loss=1.58
```

This looks like reasonable learning.
```
epoch 1: train_loss=1.30 val_loss=1.65
epoch 2: train_loss=0.72 val_loss=1.80
epoch 3: train_loss=0.25 val_loss=2.10
```

This suggests overfitting.
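The widening gap in the second log can be computed mechanically. This sketch parses the synthetic log lines above and prints the per-epoch gap:

```python
import re

log_lines = [
    "epoch 1: train_loss=1.30 val_loss=1.65",
    "epoch 2: train_loss=0.72 val_loss=1.80",
    "epoch 3: train_loss=0.25 val_loss=2.10",
]

gaps = []
for line in log_lines:
    # Extract the two loss values; train always precedes val in these lines.
    train, val = map(float, re.findall(r"loss=([\d.]+)", line))
    gaps.append(round(val - train, 2))

print(gaps)  # [0.35, 1.08, 1.85]
```

A gap that grows epoch over epoch, as here, is the quantitative form of the overfitting pattern described above.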
## Irreducible Error
Some error remains even with the best possible model.
Sources include:
| Source | Example |
|---|---|
| Measurement noise | Sensor error |
| Label ambiguity | Multiple valid labels |
| Hidden variables | Missing causal factors |
| Randomness | Truly stochastic outcomes |
| Human disagreement | Different annotators choose different labels |
If two expert annotators disagree on a medical image, the model may have no single target that is always correct.
Irreducible error sets a ceiling on performance. The right response may be better data collection, better labels, or probabilistic prediction rather than a larger model.
## Practical Checklist
When performance is poor, inspect the training and validation metrics.
| Observation | Likely action |
|---|---|
| Training loss high | Increase capacity, train longer, tune optimizer |
| Training and validation both poor | Reduce bias |
| Training good, validation poor | Reduce variance |
| Validation unstable | Use more data, stratified split, repeated runs |
| Test worse than validation | Check test shift or validation overuse |
| All metrics poor despite large model | Inspect labels, preprocessing, objective |
Bias and variance are diagnostic tools. They do not replace error analysis, but they give a useful first map of the problem.
## Summary
Bias is error from overly restrictive assumptions. Variance is error from excessive sensitivity to the training set.
High bias leads to underfitting. High variance leads to overfitting. Model capacity, data size, regularization, optimization, and architecture all affect the balance.
In deep learning, the classical tradeoff is modified by overparameterization, large datasets, and implicit regularization. The practical method remains the same: compare training and validation behavior, identify the dominant failure mode, and change the model or data accordingly.