Data Leakage and Experimental Design

Data leakage occurs when information that should be unavailable during training or evaluation enters the modeling process. It causes performance estimates to look better than they really are.

A model with leakage may appear accurate in experiments and fail after deployment. This is one of the most common reasons machine learning systems disappoint in production.

What Data Leakage Means

A clean experiment separates information by role.

Training data is used to fit parameters. Validation data is used to choose models and hyperparameters. Test data is used only for final evaluation.

Leakage breaks this separation.

For example, suppose we normalize a dataset using the mean and standard deviation of all examples before splitting into train, validation, and test sets. The test set has influenced preprocessing. The model has received information about the test distribution before evaluation.

Correct procedure:

# Compute normalization statistics from the training split only
mean = train_data.mean()
std = train_data.std()

# Apply the same statistics to every split
train_data = (train_data - mean) / std
val_data = (val_data - mean) / std
test_data = (test_data - mean) / std

The validation and test sets use statistics computed from the training set only.

Common Sources of Leakage

Leakage can be obvious or subtle.

| Leakage type | Example |
| --- | --- |
| Duplicate leakage | Same example appears in train and test |
| Preprocessing leakage | Statistics computed on the full dataset |
| Label leakage | Input feature directly encodes the target |
| Temporal leakage | Model uses future information |
| Group leakage | Same user, patient, document, or video appears in multiple splits |
| Hyperparameter leakage | Test set used repeatedly for model selection |
| Augmentation leakage | Augmented versions of the same sample split across train and test |

Duplicate leakage is especially common in web-scale datasets. A model may appear to generalize when it is partly recalling repeated examples.
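
A simple way to catch exact duplicates is to fingerprint each example and check whether any test fingerprint also appears in the training set. The sketch below assumes text examples and uses toy lists in place of real data; near-duplicates require fuzzier matching than this.

import hashlib

def fingerprint(example: str) -> str:
    # Hash a normalized form so whitespace or case differences
    # do not hide an exact duplicate.
    return hashlib.sha256(example.strip().lower().encode("utf-8")).hexdigest()

# Toy stand-ins for real train and test examples.
train_texts = ["the cat sat on the mat", "dogs bark at night"]
test_texts = ["The cat sat on the mat ", "fish swim in water"]

train_hashes = {fingerprint(x) for x in train_texts}
leaked = [x for x in test_texts if fingerprint(x) in train_hashes]
print(f"{len(leaked)} test example(s) also appear in the training set")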

Label Leakage

Label leakage happens when the input contains information derived from the target.

Suppose we predict whether a patient will be readmitted to a hospital. If the feature table includes “readmission billing code,” the model can solve the task using information that would only exist after the event.

Another example: predicting whether a user will cancel a subscription while including a feature called cancellation_date.

The model may achieve high validation accuracy, but the result is meaningless. The feature would not be available at prediction time.

A good experimental design asks:

At the moment of prediction, would this information actually be known?

If the answer is no, the feature should be removed.
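
One practical guard is to keep an explicit list of features that are known at prediction time and select only those columns. The sketch below uses pandas with hypothetical column names for the readmission example.

import pandas as pd

# Hypothetical feature table for readmission prediction.
records = pd.DataFrame({
    "age": [64, 71],
    "num_prior_visits": [2, 5],
    "readmission_billing_code": ["R01", None],  # only exists after the event
    "readmitted": [1, 0],                       # target
})

# Keep only features that would be known at prediction time.
available_at_prediction = ["age", "num_prior_visits"]
X = records[available_at_prediction]
y = records["readmitted"]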

Temporal Leakage

Temporal leakage occurs when training uses information from the future.

This is common in forecasting, recommendation systems, finance, logs, and user behavior modeling.

For example, suppose we train a recommender system using all user interactions from January to December, then evaluate predictions for March. The model has already seen behavior from April through December, which would not have existed in March.

A temporal split avoids this:

| Split | Time period |
| --- | --- |
| Training | January to August |
| Validation | September |
| Test | October |

For deployment-like evaluation, the model should train on past data and predict future data.
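
A minimal sketch of such a split, assuming an interaction log stored in a pandas DataFrame with a timestamp column; the column names and cutoff dates are illustrative.

import pandas as pd

# Hypothetical interaction log.
events = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 2],
    "item_id": [10, 11, 10, 12, 13],
    "timestamp": pd.to_datetime([
        "2024-02-03", "2024-05-20", "2024-07-11", "2024-09-14", "2024-10-02",
    ]),
})

# Split strictly by time: train on the past, validate and test on the future.
train = events[events["timestamp"] < "2024-09-01"]
val = events[(events["timestamp"] >= "2024-09-01") & (events["timestamp"] < "2024-10-01")]
test = events[events["timestamp"] >= "2024-10-01"]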

Group Leakage

Group leakage occurs when related examples are split across training and evaluation sets.

Examples:

| Domain | Group identifier |
| --- | --- |
| Medical imaging | Patient ID |
| Speech recognition | Speaker ID |
| Recommendation | User ID |
| Documents | Source document |
| Video classification | Video ID |
| Web classification | Domain or website |

If images from the same patient appear in both training and test sets, the model may learn patient-specific artifacts. This does not measure generalization to new patients.

Use group-based splitting when the deployment task requires generalization to new groups.

In Python, group splitting can be done with scikit-learn:

from sklearn.model_selection import GroupShuffleSplit

# Keep all examples that share a group (here, a patient) in the same split.
splitter = GroupShuffleSplit(
    n_splits=1,
    test_size=0.2,
    random_state=42,
)

train_idx, test_idx = next(
    splitter.split(X, y, groups=patient_ids)
)

Preprocessing Leakage

Preprocessing must be fit only on the training set.

This applies to:

| Preprocessing step | Fit using |
| --- | --- |
| Mean and standard deviation | Training set only |
| Vocabulary construction | Training set only |
| Feature selection | Training set only |
| Imputation values | Training set only |
| PCA components | Training set only |
| Tokenizer adaptation | Training set only |
| Class weights | Training set only |

For example, if missing values are filled using the median of the whole dataset, then test information leaks into training.

Correct pattern:

from sklearn.impute import SimpleImputer

# Fit the imputation values (here, medians) on the training split only
imputer = SimpleImputer(strategy="median")
imputer.fit(train_features)

train_features = imputer.transform(train_features)
val_features = imputer.transform(val_features)
test_features = imputer.transform(test_features)

The operation is fit on training data and applied to the other splits.
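
In scikit-learn, one way to make this pattern hard to break is to wrap preprocessing and the model in a Pipeline, so every step is refit inside each cross-validation fold. The sketch below uses synthetic data for illustration.

from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Preprocessing is refit on the training portion of each fold,
# so validation folds never influence the fitted statistics.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipeline, X, y, cv=5)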

Leakage Through Model Selection

The test set should not guide model choice.

If we evaluate ten models on the test set and choose the one with the best test score, the test set has become a validation set. The selected model’s test score is biased upward.

Correct workflow:

  1. Train candidate models on the training set.
  2. Compare candidates on the validation set.
  3. Select one final model.
  4. Evaluate once on the test set.

If the test set is used repeatedly, create a new held-out test set or report that the original test score is no longer a clean final estimate.
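
A minimal sketch of this workflow with scikit-learn, using synthetic data and two arbitrary candidate models; the point is that the test set is touched exactly once, after selection.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20, random_state=0)

# Three-way split: train for fitting, validation for selection, test for the final score.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(random_state=0),
}

# Compare candidates on the validation set only.
val_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_scores[name] = accuracy_score(y_val, model.predict(X_val))

best = max(val_scores, key=val_scores.get)

# Evaluate the selected model once on the test set.
test_score = accuracy_score(y_test, candidates[best].predict(X_test))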

Experimental Design

Experimental design defines how evidence is produced. A good experiment answers a clear question under controlled conditions.

For deep learning, an experiment should specify:

| Component | Example |
| --- | --- |
| Dataset | Source, size, filters, split rule |
| Task | Classification, regression, retrieval |
| Inputs | Available features at prediction time |
| Target | Label definition and time horizon |
| Model | Architecture and parameter count |
| Loss | Training objective |
| Metrics | Primary and diagnostic metrics |
| Baselines | Simple and strong comparisons |
| Random seeds | Repeated runs when needed |
| Compute budget | Training steps, hardware, precision |
| Selection rule | How the final model is chosen |

Without this information, a reported score is hard to interpret.
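
One lightweight way to keep this information attached to a run is to record it in a small, serializable config object. The sketch below is a hypothetical layout using a Python dataclass; the field names and values are illustrative.

import json
from dataclasses import asdict, dataclass

@dataclass
class ExperimentConfig:
    dataset: str = "imagenet-1k-2012"
    split_rule: str = "stratified 80/10/10"
    task: str = "classification"
    model: str = "resnet50"
    loss: str = "cross_entropy"
    primary_metric: str = "top1_accuracy"
    learning_rate: float = 3e-4
    batch_size: int = 256
    epochs: int = 90
    seeds: tuple = (0, 1, 2)
    selection_rule: str = "best validation top-1"

# Store the config alongside the results of every run.
config = ExperimentConfig()
print(json.dumps(asdict(config), indent=2))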

Baselines

A baseline is a simpler system used for comparison.

A deep model should be compared against reasonable baselines. Otherwise, improvement is difficult to judge.

Examples:

| Task | Baseline |
| --- | --- |
| Classification | Majority class, logistic regression |
| Regression | Predict the mean, linear regression |
| Image classification | Small CNN, pretrained ResNet |
| Text classification | Bag-of-words linear model |
| Retrieval | BM25 |
| Forecasting | Last-value predictor |
| Recommendation | Popularity ranking |

A baseline prevents false progress. If a large neural network barely beats a simple model, the added complexity may not be justified.
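
A majority-class baseline takes only a few lines with scikit-learn's DummyClassifier; the synthetic, imbalanced dataset below is only for illustration.

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Imbalanced toy problem: the majority class alone scores around 80%.
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("baseline accuracy:", baseline.score(X_test, y_test))
print("model accuracy:   ", model.score(X_test, y_test))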

Ablation Studies

An ablation study removes or changes one component at a time to measure its contribution.

For example, suppose a model uses:

  1. Data augmentation
  2. Dropout
  3. Weight decay
  4. Pretraining

An ablation study might train variants without each component:

| Variant | Purpose |
| --- | --- |
| Full model | Reference system |
| Without augmentation | Measure augmentation contribution |
| Without dropout | Measure dropout contribution |
| Without weight decay | Measure weight decay contribution |
| Without pretraining | Measure pretraining contribution |

Ablations help separate real improvements from accidental effects.
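
In code, an ablation is often just a loop over configurations that differ from the full model in exactly one switch. The sketch below assumes a hypothetical train_and_evaluate(config) function that returns a validation score.

# Full configuration with every component enabled.
full_config = {
    "augmentation": True,
    "dropout": True,
    "weight_decay": True,
    "pretraining": True,
}

variants = {"full_model": full_config}
for component in full_config:
    variant = dict(full_config)
    variant[component] = False  # switch off exactly one component
    variants[f"without_{component}"] = variant

# Hypothetical training call, one run (or several seeds) per variant:
# results = {name: train_and_evaluate(cfg) for name, cfg in variants.items()}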

Reproducibility

Reproducibility means that another run, or another researcher, can obtain the same result within expected variation.

Deep learning experiments are affected by random initialization, data order, augmentation, nondeterministic kernels, and hardware differences.

A basic reproducibility setup:

import random
import numpy as np
import torch

seed = 42

# Seed every source of randomness used by Python, NumPy, and PyTorch
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

For stronger determinism:

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

Full determinism can reduce performance and may not be possible for every operation. Still, recording seeds and environment details is essential.
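
A small sketch of recording the seed and environment details alongside a run; the fields shown are illustrative and use standard PyTorch version attributes.

import platform
import torch

seed = 42  # the same seed used above

# Save this record with every run so results can be compared later.
run_info = {
    "seed": seed,
    "python": platform.python_version(),
    "torch": torch.__version__,
    "cuda": torch.version.cuda,
    "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
}
print(run_info)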

Reporting Results

A result should include enough detail to be checked.

A minimal report includes:

| Field | Example |
| --- | --- |
| Dataset version | imagenet-1k-2012 |
| Split rule | Stratified 80/10/10 |
| Model | ResNet-50 |
| Optimizer | AdamW |
| Learning rate | 3 × 10⁻⁴ |
| Batch size | 256 |
| Epochs or steps | 90 epochs |
| Primary metric | Top-1 accuracy |
| Random seeds | 3 runs |
| Mean and variation | 76.2 ± 0.2% |
| Hardware | 8 A100 GPUs |
| Precision | bfloat16 mixed precision |

Single-run scores can be misleading, especially on small datasets. Repeated runs make results more credible.
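
Reporting the mean and variation across seeds is a one-liner once the per-run scores are collected; the numbers below are made up for illustration.

import numpy as np

# Hypothetical top-1 accuracies from three runs with different seeds.
scores = np.array([76.0, 76.2, 76.4])
print(f"{scores.mean():.1f} ± {scores.std(ddof=1):.1f}")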

Evaluation by Slice

Aggregate scores hide failures.

A model may perform well overall while failing on rare classes, long inputs, specific languages, new users, certain devices, or recent data.

Slice evaluation computes metrics on meaningful subsets.

| Slice type | Example |
| --- | --- |
| Class | Per-class accuracy |
| Length | Short versus long sequences |
| Source | Website, sensor, hospital |
| Time | Old versus recent examples |
| Geography | Region or country |
| Difficulty | Easy versus hard examples |

Slice evaluation is especially important when the model will be used in high-stakes or heterogeneous environments.
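
A minimal sketch of slice evaluation with pandas, assuming a table of per-example predictions with a column identifying the slice; the column names and values are illustrative.

import pandas as pd

# Hypothetical per-example predictions with slice metadata.
results = pd.DataFrame({
    "label":      [1, 0, 1, 1, 0, 1],
    "prediction": [1, 0, 0, 1, 0, 0],
    "source":     ["hospital_a", "hospital_a", "hospital_a",
                   "hospital_b", "hospital_b", "hospital_b"],
})

results["correct"] = results["label"] == results["prediction"]

# Accuracy per slice; a strong overall score can hide a weak slice.
print(results.groupby("source")["correct"].mean())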

Checklist for Clean Experiments

Before trusting a result, check:

| Question | Why it matters |
| --- | --- |
| Was the split created before preprocessing? | Prevents preprocessing leakage |
| Are duplicates removed across splits? | Prevents memorization |
| Are related examples grouped correctly? | Prevents group leakage |
| Does the split respect time? | Prevents future information leakage |
| Are test scores used only once? | Prevents model-selection bias |
| Is the baseline strong enough? | Prevents exaggerated claims |
| Are metrics aligned with costs? | Prevents optimizing the wrong behavior |
| Are random seeds recorded? | Supports reproducibility |
| Are failure slices inspected? | Reveals hidden weaknesses |

Summary

Data leakage gives models information they should not have. It can enter through duplicates, preprocessing, labels, time, groups, augmentation, or repeated test-set use.

Good experimental design prevents leakage, defines clear splits, uses proper baselines, reports reproducible details, and evaluates meaningful slices.

A deep learning result is only as trustworthy as the experiment that produced it.