Overfitting and Underfitting

Overfitting and underfitting describe two common ways a model can fail. A model underfits when it learns too little from the training data. A model overfits when it learns the training data too specifically and performs poorly on new data.

The goal is to find a model that captures the stable patterns in the data without memorizing accidental details.

The Central Problem

During training, a model minimizes loss on the training set:

L_{\text{train}}(\theta).

But the real objective is good performance on unseen data:

L_{\text{test}}(\theta).

The training loss is directly optimized. The test loss is only estimated through validation and test sets.

This creates the central tension. A model can reduce training loss by learning true structure, but it can also reduce training loss by memorizing noise, outliers, duplicate examples, or dataset artifacts.

Underfitting

Underfitting occurs when the model is too simple, poorly optimized, or incorrectly specified, so it cannot capture the relationship between inputs and targets.

Common symptoms:

| Symptom | Meaning |
| --- | --- |
| Training loss is high | Model cannot fit training data |
| Validation loss is high | Poor fit carries over to unseen data |
| Training and validation curves are close | Model fails similarly on both |
| Predictions are too simple | Model misses important variation |

Example: fitting a straight line to strongly nonlinear data. The model may find the best possible line, but the line still cannot express the true pattern.
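This example can be made concrete with a small sketch (illustrative numbers, assuming NumPy is available): a straight line fitted to quadratic data keeps a high training error no matter how well it is optimized, while a model of the right form does not.

```python
import numpy as np

# Quadratic ground truth, no noise: y = x^2 on a grid.
x = np.linspace(-1.0, 1.0, 101)
y = x ** 2

# Best possible straight line (degree-1 least-squares fit).
line = np.polyval(np.polyfit(x, y, deg=1), x)
line_mse = float(np.mean((y - line) ** 2))

# A quadratic fit matches the true pattern.
quad = np.polyval(np.polyfit(x, y, deg=2), x)
quad_mse = float(np.mean((y - quad) ** 2))

print(f"line MSE: {line_mse:.4f}")   # stays high: the line underfits
print(f"quad MSE: {quad_mse:.2e}")   # essentially zero
```

Even the optimal line cannot drive the error down; the residual comes from the model class, not from bad optimization.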

In deep learning, underfitting may occur when a network has too few layers, too few hidden units, excessive regularization, poor features, a bad learning rate, or insufficient training time.

Overfitting

Overfitting occurs when the model fits the training data too closely. It learns patterns that do not generalize.

Common symptoms:

| Symptom | Meaning |
| --- | --- |
| Training loss is very low | Model fits training examples |
| Validation loss is much higher | Generalization gap |
| Validation loss starts increasing | Model begins fitting noise |
| Performance depends heavily on split or seed | Model is unstable |

A high-capacity model can memorize labels, rare examples, and spurious correlations. For example, an image classifier may learn background patterns rather than object shape. A medical model may learn scanner-specific artifacts rather than disease features.
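The same kind of toy sketch shows memorization in action (illustrative setup, assuming NumPy): a high-degree polynomial fits noisy training points more closely than a modest one, yet does worse on fresh samples from the same distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(scale=0.3, size=n)  # true signal + noise
    return x, y

x_train, y_train = make_data(30)
x_test, y_test = make_data(200)

def mse(coeffs, x, y):
    return float(np.mean((y - np.polyval(coeffs, x)) ** 2))

coeffs3 = np.polyfit(x_train, y_train, deg=3)    # modest capacity
coeffs15 = np.polyfit(x_train, y_train, deg=15)  # enough capacity to chase noise

train3, test3 = mse(coeffs3, x_train, y_train), mse(coeffs3, x_test, y_test)
train15, test15 = mse(coeffs15, x_train, y_train), mse(coeffs15, x_test, y_test)

# The flexible fit typically wins on train and loses on test.
print(f"deg 3:  train {train3:.3f}  test {test3:.3f}")
print(f"deg 15: train {train15:.3f}  test {test15:.3f}")
```

The degree-15 fit partially memorizes the noise, so its training error drops below the noise floor while its error on new samples grows.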

Training and Validation Curves

Training and validation curves are the simplest diagnostic tool.

A healthy training run often shows both training and validation loss decreasing at first. Later, validation loss may stop improving while training loss continues to fall.

Typical patterns:

| Pattern | Diagnosis |
| --- | --- |
| High train loss, high validation loss | Underfitting |
| Low train loss, high validation loss | Overfitting |
| Low train loss, low validation loss | Good fit |
| Noisy validation loss | Small validation set, unstable training, or high variance |

For classification, the same idea applies to accuracy:

| Pattern | Diagnosis |
| --- | --- |
| Low train accuracy, low validation accuracy | Underfitting |
| High train accuracy, low validation accuracy | Overfitting |
| High train accuracy, high validation accuracy | Good fit |

Causes of Underfitting

Underfitting usually means the model, training process, or input representation lacks enough useful capacity.

Common causes:

| Cause | Example |
| --- | --- |
| Model too small | Tiny MLP for image recognition |
| Training too short | Too few epochs |
| Learning rate too low | Optimization barely moves |
| Learning rate too high | Optimization fails to settle |
| Excessive regularization | Too much dropout or weight decay |
| Poor features | Important input signal missing |
| Wrong architecture | Linear model for structured sequence data |
| Poor loss function | Objective mismatched to task |

A model can underfit even if it has many parameters. Bad optimization, broken preprocessing, or incorrect targets can keep a large model from learning.

Causes of Overfitting

Overfitting occurs when the model has enough flexibility to fit unstable details in the training set.

Common causes:

| Cause | Example |
| --- | --- |
| Too little data | Large model trained on small dataset |
| Model too large | Excess capacity for the task |
| Weak regularization | No dropout or weight decay |
| Training too long | Memorization after useful learning |
| Noisy labels | Model learns label mistakes |
| Data leakage | Validation score becomes misleading |
| Duplicate examples | Same sample appears across splits |
| Spurious correlations | Background predicts class in training data |

Overfitting is often a data problem as much as a model problem. More data, cleaner labels, better splits, and stronger augmentation can matter more than changing the architecture.
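One cheap safeguard against the duplicate-example and leakage causes above is to check the splits directly. A minimal sketch (the `train_items` and `val_items` lists are hypothetical; real data would be hashed from raw bytes on disk):

```python
import hashlib

def fingerprint(example: bytes) -> str:
    # Hash the raw bytes so exact duplicates collide regardless of filename.
    return hashlib.sha256(example).hexdigest()

def split_overlap(train_items, val_items):
    # Return every validation example whose content also appears in training.
    train_hashes = {fingerprint(x) for x in train_items}
    return [x for x in val_items if fingerprint(x) in train_hashes]

train_items = [b"cat_001", b"dog_002", b"cat_003"]
val_items = [b"dog_104", b"cat_003"]  # one exact duplicate leaked in

leaked = split_overlap(train_items, val_items)
print(len(leaked))  # → 1
```

Content hashing only catches exact duplicates; near-duplicates (re-encoded images, paraphrased text) need fuzzier fingerprints.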

Reducing Underfitting

To reduce underfitting, make the learning problem easier for the model or make the model more capable.

Useful interventions:

| Intervention | Effect |
| --- | --- |
| Increase model capacity | Allows richer functions |
| Train longer | Gives optimizer more time |
| Tune learning rate | Improves optimization |
| Reduce regularization | Allows closer fit |
| Improve preprocessing | Exposes useful signal |
| Use a better architecture | Adds the right inductive bias |
| Use pretrained models | Starts from useful representations |

PyTorch example: increasing capacity.

import torch.nn as nn

small_model = nn.Sequential(
    nn.Linear(784, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)

larger_model = nn.Sequential(
    nn.Linear(784, 512),
    nn.ReLU(),
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
)

The larger model can represent more complex decision boundaries. This may reduce underfitting, but it may also increase overfitting if data is limited.
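The capacity difference is easy to quantify by counting parameters (a back-of-the-envelope sketch: each `nn.Linear(in, out)` holds `in * out` weights plus `out` biases):

```python
def linear_params(n_in, n_out):
    # Weights plus biases of one fully connected layer.
    return n_in * n_out + n_out

# Mirrors the two architectures above.
small = linear_params(784, 64) + linear_params(64, 10)
larger = linear_params(784, 512) + linear_params(512, 512) + linear_params(512, 10)

print(small)   # 50890 parameters
print(larger)  # 669706 parameters
```

The larger network has roughly 13x the parameters, which is why the same change that fixes underfitting can create overfitting on a small dataset.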

Reducing Overfitting

To reduce overfitting, make the model less sensitive to accidental training-set details.

Useful interventions:

| Intervention | Effect |
| --- | --- |
| Add more data | Reduces variance |
| Use data augmentation | Expands effective data |
| Add weight decay | Penalizes large weights |
| Add dropout | Reduces co-adaptation |
| Use early stopping | Stops before memorization dominates |
| Reduce model capacity | Limits memorization |
| Improve split quality | Prevents leakage |
| Clean labels | Removes misleading targets |
| Use ensembling | Averages unstable models |

PyTorch example: dropout and weight decay.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 512),
    nn.ReLU(),
    nn.Dropout(p=0.3),
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Dropout(p=0.3),
    nn.Linear(512, 10),
)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    weight_decay=1e-2,
)

Dropout randomly removes activations during training. Weight decay penalizes large parameter values. Both can improve generalization.
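The weight-decay mechanism can be sketched in isolation (a simplified single-parameter update in the decoupled style AdamW uses, ignoring the adaptive moment estimates; the learning rate and decay values here are exaggerated for illustration):

```python
def decoupled_step(w, grad, lr, weight_decay):
    # Decoupled weight decay: shrink the weight toward zero directly,
    # separately from the gradient-based update.
    w = w - lr * weight_decay * w
    w = w - lr * grad
    return w

w = 2.0
for _ in range(1000):
    # With no gradient signal, only the decay term acts.
    w = decoupled_step(w, grad=0.0, lr=0.1, weight_decay=0.1)

print(w)  # shrinks toward zero
```

With a real gradient, the decay term competes with the loss: weights stay large only when the data keeps pushing them there.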

Early Stopping

Early stopping is a simple and effective way to reduce overfitting.

The training procedure tracks validation loss after each epoch. If validation loss stops improving, training stops. The best checkpoint is usually the one with the lowest validation loss.

best_val_loss = float("inf")
patience = 5   # epochs to wait for an improvement before stopping
bad_epochs = 0

for epoch in range(num_epochs):
    train_loss = train_one_epoch(model, train_loader)
    val_loss = evaluate(model, val_loader)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        bad_epochs = 0
        # Snapshot the best weights on CPU so later updates cannot overwrite them.
        best_state = {
            k: v.detach().cpu().clone()
            for k, v in model.state_dict().items()
        }
    else:
        bad_epochs += 1

    if bad_epochs >= patience:
        break

# Restore the checkpoint with the lowest validation loss.
model.load_state_dict(best_state)

Early stopping uses the validation set for model selection. The test set should remain untouched until the final evaluation.

Regularization

Regularization refers to methods that improve generalization by restricting or stabilizing the learned function.

Common regularizers include:

| Method | Mechanism |
| --- | --- |
| Weight decay | Penalizes large weights |
| Dropout | Randomly removes activations |
| Data augmentation | Trains on transformed examples |
| Label smoothing | Softens hard labels |
| Mixup | Blends examples and labels |
| Stochastic depth | Randomly drops layers |
| Noise injection | Adds noise to inputs or activations |

Regularization can reduce overfitting, but too much regularization can cause underfitting. For example, very high dropout may prevent the model from fitting even the training set.
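Label smoothing from the table above is simple to write out (a sketch; the `num_classes` and ε values are illustrative):

```python
def smooth_labels(true_class, num_classes, eps=0.1):
    # Replace a one-hot target with a softened distribution:
    # 1 - eps + eps/K on the true class, eps/K everywhere else.
    off = eps / num_classes
    target = [off] * num_classes
    target[true_class] += 1.0 - eps
    return target

target = smooth_labels(true_class=3, num_classes=10)
print(target[3])     # ≈ 0.91 instead of a hard 1.0
print(sum(target))   # still a valid distribution (sums to 1)
```

The model is never rewarded for pushing a logit to infinity, which discourages overconfident memorization of individual training labels.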

Data Augmentation

Data augmentation creates modified versions of training examples while preserving the label.

For images, common augmentations include:

| Augmentation | Meaning |
| --- | --- |
| Random crop | Changes framing |
| Horizontal flip | Mirrors image |
| Color jitter | Changes brightness or contrast |
| Rotation | Changes orientation |
| Cutout | Masks image regions |
| Mixup | Blends two images |

For text, augmentation is more delicate because small changes can alter meaning. For audio, augmentation may include noise injection, speed changes, and time masking.

Data augmentation teaches the model invariance. If a cat image is still a cat after cropping or color changes, the model should produce the same label.
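The simplest of these transforms can be written by hand (a toy sketch on a nested-list "image"; real pipelines would use a library such as torchvision):

```python
def horizontal_flip(image):
    # Mirror each row; the label is unchanged by the transform.
    return [row[::-1] for row in image]

image = [
    [0, 1, 2],
    [3, 4, 5],
]

flipped = horizontal_flip(image)
print(flipped)  # [[2, 1, 0], [5, 4, 3]]

# Flipping twice recovers the original image.
assert horizontal_flip(flipped) == image
```

The invariance argument is exactly this: the transform changes the pixels but not the label, so the model sees a new input with the same target.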

Capacity Control

Capacity control means adjusting how flexible the model is.

A small model may underfit. A large model may overfit. The right capacity depends on data size, noise level, task difficulty, and regularization.

In classical machine learning, capacity is often controlled by choosing a model class. In deep learning, capacity is controlled by architecture and training choices:

| Control | Example |
| --- | --- |
| Width | Number of hidden units |
| Depth | Number of layers |
| Parameter sharing | Convolutions, recurrent weights |
| Sparsity | Mixture-of-experts routing |
| Regularization | Dropout, weight decay |
| Training duration | Early stopping |

Parameter sharing is especially important. Convolutional layers can generalize better than fully connected layers on images because they encode translation structure.
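The effect of parameter sharing shows up directly in parameter counts (a rough sketch with illustrative shapes: a 3x3 convolution mapping 3 channels to 16, versus a dense layer connecting the same feature maps on a 32x32 input):

```python
def conv_params(k, c_in, c_out):
    # One k x k filter per (input channel, output channel) pair, plus biases;
    # the same filters are reused at every spatial position.
    return k * k * c_in * c_out + c_out

def dense_params(n_in, n_out):
    # One independent weight per (input unit, output unit) pair, plus biases.
    return n_in * n_out + n_out

conv = conv_params(3, 3, 16)                     # independent of image size
dense = dense_params(32 * 32 * 3, 32 * 32 * 16)  # grows with image size

print(conv)    # 448
print(dense)   # ~50 million
```

Sharing filters across positions is a hard-coded assumption that the same local pattern matters everywhere in the image, which is both a huge parameter saving and a useful inductive bias.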

Double Descent

Classical bias-variance theory suggests that test error decreases at first, then increases as model capacity becomes too large. Modern deep learning often shows a different pattern called double descent.

As model capacity increases:

  1. Test error decreases.
  2. Test error increases near the interpolation threshold.
  3. Test error decreases again for highly overparameterized models.

The interpolation threshold is the point where the model can fit the training data nearly perfectly.

Double descent helps explain why very large neural networks can generalize well despite having enough parameters to memorize the training set. It does not mean larger models always perform better. Data quality, optimization, regularization, and architecture still matter.

Practical Diagnosis

A practical workflow:

| Observation | Diagnosis | Possible action |
| --- | --- | --- |
| Train loss high, val loss high | Underfitting | Larger model, train longer, tune optimizer |
| Train loss low, val loss high | Overfitting | More data, regularization, augmentation |
| Train loss decreasing, val loss rising | Overfitting during training | Early stopping |
| Train and val both unstable | Optimization or data issue | Lower learning rate, inspect data |
| Test much worse than val | Validation overuse or distribution shift | Rebuild split, audit leakage |
| Train loss fails immediately | Implementation bug | Check labels, shapes, loss, dtype |

The first response to poor performance should be measurement, not guessing. Plot curves. Inspect examples. Compare train and validation metrics. Check that the split matches the deployment setting.
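The first rows of the table above can be folded into a crude triage helper (a sketch; the `high` and `gap` thresholds are illustrative and task-dependent, not universal constants):

```python
def triage(train_loss, val_loss, high=1.0, gap=0.5):
    # Crude rules of thumb mirroring the diagnosis table.
    if train_loss > high and val_loss > high:
        return "underfitting: larger model, train longer, tune optimizer"
    if train_loss <= high and val_loss - train_loss > gap:
        return "overfitting: more data, regularization, augmentation"
    return "looks healthy: keep monitoring curves"

print(triage(train_loss=2.1, val_loss=2.2))  # underfitting
print(triage(train_loss=0.1, val_loss=1.4))  # overfitting
print(triage(train_loss=0.3, val_loss=0.4))  # looks healthy
```

A helper like this is no substitute for plotting the curves, but it encodes the habit of comparing train and validation metrics before changing anything.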

PyTorch Evaluation Pattern

A reliable evaluation function avoids training-mode behavior and disables gradient tracking.

import torch

def evaluate(model, dataloader, loss_fn, device):
    model.eval()

    total_loss = 0.0
    total_examples = 0
    correct = 0

    with torch.no_grad():
        for x, y in dataloader:
            x = x.to(device)
            y = y.to(device)

            logits = model(x)
            loss = loss_fn(logits, y)

            batch_size = x.size(0)
            total_loss += loss.item() * batch_size
            total_examples += batch_size

            pred = logits.argmax(dim=-1)
            correct += (pred == y).sum().item()

    avg_loss = total_loss / total_examples
    accuracy = correct / total_examples

    return avg_loss, accuracy

The call model.eval() matters. Dropout and batch normalization behave differently during training and evaluation. Forgetting this call can produce misleading validation metrics.

Summary

Underfitting means the model learns too little. Training and validation performance are both poor. Overfitting means the model learns the training set too specifically. Training performance is strong, but validation performance is weak.

Reducing underfitting usually requires more capacity, better optimization, better features, or less regularization. Reducing overfitting usually requires more data, stronger regularization, better augmentation, early stopping, cleaner labels, or improved data splits.

Training and validation curves give the clearest first diagnosis.