Overfitting and underfitting describe two common ways a model can fail. A model underfits when it learns too little from the training data. A model overfits when it learns the training data too specifically and performs poorly on new data.
The goal is to find a model that captures the stable patterns in the data without memorizing accidental details.
The Central Problem
During training, a model minimizes loss on the training set:

$$\mathcal{L}_{\text{train}}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell\big(f_\theta(x_i),\, y_i\big)$$

But the real objective is good performance on unseen data drawn from the underlying distribution $\mathcal{D}$:

$$\mathcal{L}_{\text{test}}(\theta) = \mathbb{E}_{(x,\, y) \sim \mathcal{D}}\big[\ell\big(f_\theta(x),\, y\big)\big]$$
The training loss is directly optimized. The test loss is only estimated through validation and test sets.
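The difference between the two is the generalization gap:

$$\text{gap}(\theta) = \mathcal{L}_{\text{test}}(\theta) - \mathcal{L}_{\text{train}}(\theta)$$

A large gap is the signature of overfitting; high loss on both sets is the signature of underfitting.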
This creates the central tension. A model can reduce training loss by learning true structure, but it can also reduce training loss by memorizing noise, outliers, duplicate examples, or dataset artifacts.
Underfitting
Underfitting occurs when the model is too simple, poorly optimized, or incorrectly specified. In each case, it fails to capture the relationship between inputs and targets.
Common symptoms:
| Symptom | Meaning |
|---|---|
| Training loss is high | Model cannot fit training data |
| Validation loss is high | Poor fit carries over to unseen data |
| Training and validation curves are close | Model fails similarly on both |
| Predictions are too simple | Model misses important variation |
Example: fitting a straight line to strongly nonlinear data. The model may find the best possible line, but the line still cannot express the true pattern.
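The snippet below is a minimal sketch of this failure mode: it fits a single linear layer to synthetic sine-wave data. The data and hyperparameters are illustrative, not taken from a real task.

```python
import torch
import torch.nn as nn

# Illustrative toy data: a strongly nonlinear target with small noise.
torch.manual_seed(0)
x = torch.linspace(-3, 3, 200).unsqueeze(1)
y = torch.sin(2 * x) + 0.1 * torch.randn_like(x)

# A straight line cannot express sin(2x), so training loss plateaus high.
model = nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

for step in range(1000):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

print(f"final training loss: {loss.item():.3f}")  # stays far above the noise level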
In deep learning, underfitting may occur when a network has too few layers, too few hidden units, excessive regularization, poor features, a bad learning rate, or insufficient training time.
Overfitting
Overfitting occurs when the model fits the training data too closely. It learns patterns that do not generalize.
Common symptoms:
| Symptom | Meaning |
|---|---|
| Training loss is very low | Model fits training examples |
| Validation loss is much higher | Generalization gap |
| Validation loss starts increasing | Model begins fitting noise |
| Performance depends heavily on split or seed | Model is unstable |
A high-capacity model can memorize labels, rare examples, and spurious correlations. For example, an image classifier may learn background patterns rather than object shape. A medical model may learn scanner-specific artifacts rather than disease features.
Training and Validation Curves
Training and validation curves are the simplest diagnostic tool.
A healthy training run often shows both training and validation loss decreasing at first. Later, validation loss may stop improving while training loss continues to fall.
Typical patterns:
| Pattern | Diagnosis |
|---|---|
| High train loss, high validation loss | Underfitting |
| Low train loss, high validation loss | Overfitting |
| Low train loss, low validation loss | Good fit |
| Noisy validation loss | Small validation set, unstable training, or high variance |
For classification, the same idea applies to accuracy:
| Pattern | Diagnosis |
|---|---|
| Low train accuracy, low validation accuracy | Underfitting |
| High train accuracy, low validation accuracy | Overfitting |
| High train accuracy, high validation accuracy | Good fit |
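A minimal sketch of recording and plotting these curves with matplotlib, assuming hypothetical train_one_epoch and evaluate helpers that return average losses (similar to those used in the early-stopping example later in this section):

```python
import matplotlib.pyplot as plt

# Assumed helpers: train_one_epoch and evaluate each return an average loss.
train_losses, val_losses = [], []
for epoch in range(num_epochs):
    train_losses.append(train_one_epoch(model, train_loader))
    val_losses.append(evaluate(model, val_loader))

plt.plot(train_losses, label="train loss")
plt.plot(val_losses, label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()
```

A growing vertical distance between the two curves is the generalization gap from the table above.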
Causes of Underfitting
Underfitting usually means the model lacks capacity, the training process is not working, or the input representation does not expose the useful signal.
Common causes:
| Cause | Example |
|---|---|
| Model too small | Tiny MLP for image recognition |
| Training too short | Too few epochs |
| Learning rate too low | Optimization barely moves |
| Learning rate too high | Optimization fails to settle |
| Excessive regularization | Too much dropout or weight decay |
| Poor features | Important input signal missing |
| Wrong architecture | Linear model for structured sequence data |
| Poor loss function | Objective mismatched to task |
A model can underfit even if it has many parameters. Bad optimization, broken preprocessing, or incorrect targets can keep a large model from learning.
Causes of Overfitting
Overfitting occurs when the model has enough flexibility to fit unstable details in the training set.
Common causes:
| Cause | Example |
|---|---|
| Too little data | Large model trained on small dataset |
| Model too large | Excess capacity for the task |
| Weak regularization | No dropout or weight decay |
| Training too long | Memorization after useful learning |
| Noisy labels | Model learns label mistakes |
| Data leakage | Validation score becomes misleading |
| Duplicate examples | Same sample appears across splits |
| Spurious correlations | Background predicts class in training data |
Overfitting is often a data problem as much as a model problem. More data, cleaner labels, better splits, and stronger augmentation can matter more than changing the architecture.
Reducing Underfitting
To reduce underfitting, make the learning problem easier for the model or make the model more capable.
Useful interventions:
| Intervention | Effect |
|---|---|
| Increase model capacity | Allows richer functions |
| Train longer | Gives optimizer more time |
| Tune learning rate | Improves optimization |
| Reduce regularization | Allows closer fit |
| Improve preprocessing | Exposes useful signal |
| Use a better architecture | Adds the right inductive bias |
| Use pretrained models | Starts from useful representations |
PyTorch example: increasing capacity.

```python
import torch.nn as nn

# A small MLP that may underfit complex data.
small_model = nn.Sequential(
    nn.Linear(784, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)

# A wider, deeper MLP with more capacity.
larger_model = nn.Sequential(
    nn.Linear(784, 512),
    nn.ReLU(),
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
)
```

The larger model can represent more complex decision boundaries. This may reduce underfitting, but it may also increase overfitting if data is limited.
Reducing Overfitting
To reduce overfitting, make the model less sensitive to accidental training-set details.
Useful interventions:
| Intervention | Effect |
|---|---|
| Add more data | Reduces variance |
| Use data augmentation | Expands effective data |
| Add weight decay | Penalizes large weights |
| Add dropout | Reduces co-adaptation |
| Use early stopping | Stops before memorization dominates |
| Reduce model capacity | Limits memorization |
| Improve split quality | Prevents leakage |
| Clean labels | Removes misleading targets |
| Use ensembling | Averages unstable models |
PyTorch example: dropout and weight decay.

```python
import torch
import torch.nn as nn

# MLP with dropout inserted between hidden layers.
model = nn.Sequential(
    nn.Linear(784, 512),
    nn.ReLU(),
    nn.Dropout(p=0.3),
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Dropout(p=0.3),
    nn.Linear(512, 10),
)

# AdamW applies weight decay decoupled from the gradient update.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    weight_decay=1e-2,
)
```

Dropout randomly removes activations during training. Weight decay penalizes large parameter values. Both can improve generalization.
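In equation form, weight decay corresponds to adding an L2 penalty to the training objective (AdamW applies the decay directly to the weights rather than through the gradient, but the regularizing effect is similar):

$$\mathcal{L}_{\text{reg}}(\theta) = \mathcal{L}_{\text{train}}(\theta) + \lambda \lVert \theta \rVert_2^2$$

where $\lambda$ plays the role of the weight_decay argument above.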
Early Stopping
Early stopping is a simple and effective way to reduce overfitting.
The training procedure tracks validation loss after each epoch. If validation loss stops improving, training stops. The best checkpoint is usually the one with the lowest validation loss.
```python
# Assumes train_one_epoch and evaluate return average losses.
best_val_loss = float("inf")
patience = 5
bad_epochs = 0

for epoch in range(num_epochs):
    train_loss = train_one_epoch(model, train_loader)
    val_loss = evaluate(model, val_loader)
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        bad_epochs = 0
        # Snapshot the best weights on CPU so they survive later epochs.
        best_state = {
            k: v.detach().cpu().clone()
            for k, v in model.state_dict().items()
        }
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break

model.load_state_dict(best_state)
```

Early stopping uses the validation set for model selection. The test set should remain untouched until the final evaluation.
Regularization
Regularization refers to methods that improve generalization by restricting or stabilizing the learned function.
Common regularizers include:
| Method | Mechanism |
|---|---|
| Weight decay | Penalizes large weights |
| Dropout | Randomly removes activations |
| Data augmentation | Trains on transformed examples |
| Label smoothing | Softens hard labels |
| Mixup | Blends examples and labels |
| Stochastic depth | Randomly drops layers |
| Noise injection | Adds noise to inputs or activations |
Regularization can reduce overfitting, but too much regularization can cause underfitting. For example, very high dropout may prevent the model from fitting even the training set.
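As one concrete example, label smoothing is available directly in PyTorch's cross-entropy loss; the smoothing value below is illustrative:

```python
import torch.nn as nn

# Each incorrect class receives a small share of the target probability
# mass, which discourages overconfident logits.
loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)
```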
Data Augmentation
Data augmentation creates modified versions of training examples while preserving the label.
For images, common augmentations include:
| Augmentation | Meaning |
|---|---|
| Random crop | Changes framing |
| Horizontal flip | Mirrors image |
| Color jitter | Changes brightness or contrast |
| Rotation | Changes orientation |
| Cutout | Masks image regions |
| Mixup | Blends two images |
For text, augmentation is more delicate because small changes can alter meaning. For audio, augmentation may include noise injection, speed changes, and time masking.
Data augmentation teaches the model invariance. If a cat image is still a cat after cropping or color changes, the model should produce the same label.
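A minimal sketch of an image augmentation pipeline using torchvision; the specific transforms and their parameters are illustrative:

```python
from torchvision import transforms

# Applied independently to each training image on every epoch,
# so the model rarely sees the exact same pixels twice.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])
```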
Capacity Control
Capacity control means adjusting how flexible the model is.
A small model may underfit. A large model may overfit. The right capacity depends on data size, noise level, task difficulty, and regularization.
In classical machine learning, capacity is often controlled by choosing a model class. In deep learning, capacity is controlled by architecture and training choices:
| Control | Example |
|---|---|
| Width | Number of hidden units |
| Depth | Number of layers |
| Parameter sharing | Convolutions, recurrent weights |
| Sparsity | Mixture-of-experts routing |
| Regularization | Dropout, weight decay |
| Training duration | Early stopping |
Parameter sharing is especially important. Convolutional layers can generalize better than fully connected layers on images because they encode translation structure.
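One quick way to see the effect of parameter sharing is to count parameters. The sketch below compares a 3x3 convolution with a fully connected layer over a flattened 32x32 RGB input; the shapes are illustrative:

```python
import torch.nn as nn

def count_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

# A 3x3 conv reuses the same weights at every spatial position...
conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)
# ...while a dense layer over a flattened 32x32 RGB image does not.
dense = nn.Linear(3 * 32 * 32, 64)

print(count_params(conv))   # 3*3*3*64 + 64 = 1792
print(count_params(dense))  # 3072*64 + 64 = 196672
```

The convolution applies the same 1792 weights everywhere in the image, which is exactly the translation structure mentioned above.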
Double Descent
Classical bias-variance theory suggests that test error decreases at first, then increases as model capacity becomes too large. Modern deep learning often shows a different pattern called double descent.
As model capacity increases:
- Test error first decreases.
- Test error then rises near the interpolation threshold.
- Test error decreases again for highly overparameterized models.
The interpolation threshold is the point where the model can fit the training data nearly perfectly.
Double descent helps explain why very large neural networks can generalize well despite having enough parameters to memorize the training set. It does not mean larger models always perform better. Data quality, optimization, regularization, and architecture still matter.
Practical Diagnosis
A practical workflow:
| Observation | Diagnosis | Possible action |
|---|---|---|
| Train loss high, val loss high | Underfitting | Larger model, train longer, tune optimizer |
| Train loss low, val loss high | Overfitting | More data, regularization, augmentation |
| Train loss decreasing, val loss rising | Overfitting during training | Early stopping |
| Train and val both unstable | Optimization or data issue | Lower learning rate, inspect data |
| Test much worse than val | Validation overuse or distribution shift | Rebuild split, audit leakage |
| Train loss fails to decrease at all | Implementation bug | Check labels, shapes, loss, dtype |
The first response to poor performance should be measurement, not guessing. Plot curves. Inspect examples. Compare train and validation metrics. Check that the split matches the deployment setting.
PyTorch Evaluation Pattern
A reliable evaluation function avoids training-mode behavior and disables gradient tracking.
```python
import torch

def evaluate(model, dataloader, loss_fn, device):
    model.eval()  # switch dropout and batch norm to inference behavior
    total_loss = 0.0
    total_examples = 0
    correct = 0
    with torch.no_grad():  # disable gradient tracking during evaluation
        for x, y in dataloader:
            x = x.to(device)
            y = y.to(device)
            logits = model(x)
            loss = loss_fn(logits, y)
            batch_size = x.size(0)
            total_loss += loss.item() * batch_size
            total_examples += batch_size
            pred = logits.argmax(dim=-1)
            correct += (pred == y).sum().item()
    avg_loss = total_loss / total_examples
    accuracy = correct / total_examples
    return avg_loss, accuracy
```

The call `model.eval()` matters. Dropout and batch normalization behave differently during training and evaluation. Forgetting this call can produce misleading validation metrics.
Summary
Underfitting means the model learns too little. Training and validation performance are both poor. Overfitting means the model learns the training set too specifically. Training performance is strong, but validation performance is weak.
Reducing underfitting usually requires more capacity, better optimization, better features, or less regularization. Reducing overfitting usually requires more data, stronger regularization, better augmentation, early stopping, cleaner labels, or improved data splits.
Training and validation curves give the clearest first diagnosis.