Data leakage occurs when information that should be unavailable during training or evaluation enters the modeling process. It causes performance estimates to look better than they really are.
A model with leakage may appear accurate in experiments and fail after deployment. This is one of the most common reasons machine learning systems disappoint in production.
What Data Leakage Means
A clean experiment separates information by role.
Training data is used to fit parameters. Validation data is used to choose models and hyperparameters. Test data is used only for final evaluation.
Leakage breaks this separation.
For example, suppose we normalize a dataset using the mean and standard deviation of all examples before splitting into train, validation, and test sets. The test set has influenced preprocessing. The model has received information about the test distribution before evaluation.
Correct procedure:
```python
# Compute normalization statistics from the training set only.
mean = train_data.mean()
std = train_data.std()

# Apply the training-set statistics to every split.
train_data = (train_data - mean) / std
val_data = (val_data - mean) / std
test_data = (test_data - mean) / std
```

The validation and test sets use statistics computed from the training set only.
Common Sources of Leakage
Leakage can be obvious or subtle.
| Leakage type | Example |
|---|---|
| Duplicate leakage | Same example appears in train and test |
| Preprocessing leakage | Statistics computed on full dataset |
| Label leakage | Input feature directly encodes the target |
| Temporal leakage | Model uses future information |
| Group leakage | Same user, patient, document, or video appears in multiple splits |
| Hyperparameter leakage | Test set used repeatedly for model selection |
| Augmentation leakage | Augmented versions of same sample split across train and test |
Duplicate leakage is especially common in web-scale datasets. A model may appear to generalize, while it is partly recalling repeated examples.
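One simple mitigation is an exact-match check between splits before training. The sketch below assumes text examples stored in lists named train_texts and test_texts (hypothetical names); near-duplicates need fuzzier matching, such as n-gram overlap or embedding similarity.

```python
import hashlib

def fingerprint(example: str) -> str:
    # Hash of the lightly normalized example, used for exact-match deduplication.
    return hashlib.sha256(example.strip().lower().encode("utf-8")).hexdigest()

# Drop any test example whose fingerprint already appears in the training data.
train_hashes = {fingerprint(x) for x in train_texts}
test_texts = [x for x in test_texts if fingerprint(x) not in train_hashes]
```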
Label Leakage
Label leakage happens when the input contains information derived from the target.
Suppose we predict whether a patient will be readmitted to a hospital. If the feature table includes “readmission billing code,” the model can solve the task using information that would only exist after the event.
Another example: predicting whether a user will cancel a subscription while including a feature called cancellation_date.
The model may achieve high validation accuracy, but the result is meaningless. The feature would not be available at prediction time.
A good experimental design asks:
At the moment of prediction, would this information actually be known?
If the answer is no, the feature should be removed.
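In tabular work this often comes down to dropping columns that are only populated after the outcome. A minimal pandas sketch, assuming a DataFrame named features and hypothetical column names:

```python
# Columns that would not exist at prediction time (hypothetical names).
leaky_columns = ["readmission_billing_code", "cancellation_date"]

features = features.drop(columns=leaky_columns, errors="ignore")
```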
Temporal Leakage
Temporal leakage occurs when training uses information from the future.
This is common in forecasting, recommendation systems, finance, logs, and user behavior modeling.
For example, suppose we train a recommender system using all user interactions from January to December, then evaluate predictions for March. The model has already seen behavior from April through December, which would not have existed in March.
A temporal split avoids this:
| Split | Time period |
|---|---|
| Training | January to August |
| Validation | September |
| Test | October |
For deployment-like evaluation, the model should train on past data and predict future data.
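A temporal split can be as simple as filtering by timestamp. A sketch with pandas, assuming a DataFrame df with an event_time column and illustrative cutoff dates:

```python
import pandas as pd

# Cutoffs for the temporal split (assumed dates for illustration).
train_end = pd.Timestamp("2024-08-31")
val_end = pd.Timestamp("2024-09-30")

train_df = df[df["event_time"] <= train_end]
val_df = df[(df["event_time"] > train_end) & (df["event_time"] <= val_end)]
test_df = df[df["event_time"] > val_end]
```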
Group Leakage
Group leakage occurs when related examples are split across training and evaluation sets.
Examples:
| Domain | Group identifier |
|---|---|
| Medical imaging | Patient ID |
| Speech recognition | Speaker ID |
| Recommendation | User ID |
| Documents | Source document |
| Video classification | Video ID |
| Web classification | Domain or website |
If images from the same patient appear in both training and test sets, the model may learn patient-specific artifacts. This does not measure generalization to new patients.
Use group-based splitting when the deployment task requires generalization to new groups.
In Python, group splitting can be done with scikit-learn:
```python
from sklearn.model_selection import GroupShuffleSplit

# One train/test split in which no group (here, patient) appears on both sides.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_ids))
```

Preprocessing Leakage
Preprocessing must be fit only on the training set.
This applies to:
| Preprocessing step | Fit using |
|---|---|
| Mean and standard deviation | Training set only |
| Vocabulary construction | Training set only |
| Feature selection | Training set only |
| Imputation values | Training set only |
| PCA components | Training set only |
| Tokenizer adaptation | Training set only |
| Class weights | Training set only |
For example, if missing values are filled using the median of the whole dataset, then test information leaks into training.
Correct pattern:
```python
# Fit the imputer on the training split only ...
imputer.fit(train_features)

# ... then apply the same fitted transform to every split.
train_features = imputer.transform(train_features)
val_features = imputer.transform(val_features)
test_features = imputer.transform(test_features)
```

The operation is fit on training data and applied to the other splits.
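When cross-validation is used, wrapping preprocessing and model in a single pipeline preserves this guarantee inside every fold. A sketch with scikit-learn, assuming numeric features X and labels y:

```python
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# The imputer and scaler are re-fit on the training portion of each fold,
# so validation folds never influence the preprocessing statistics.
pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
scores = cross_val_score(pipeline, X, y, cv=5)
```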
Leakage Through Model Selection
The test set should not guide model choice.
If we evaluate ten models on the test set and choose the one with the best test score, the test set has become a validation set. The selected model’s test score is biased upward.
Correct workflow:
- Train candidate models on the training set.
- Compare candidates on the validation set.
- Select one final model.
- Evaluate once on the test set.
If the test set is used repeatedly, create a new held-out test set or report that the original test score is no longer a clean final estimate.
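A minimal sketch of this workflow, assuming a dict of candidate models with scikit-learn-style fit and score methods:

```python
# Compare candidates on the validation set only.
val_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_scores[name] = model.score(X_val, y_val)

# Select one final model, then touch the test set exactly once.
best_name = max(val_scores, key=val_scores.get)
test_score = candidates[best_name].score(X_test, y_test)
```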
Experimental Design
Experimental design defines how evidence is produced. A good experiment answers a clear question under controlled conditions.
For deep learning, an experiment should specify:
| Component | Example |
|---|---|
| Dataset | Source, size, filters, split rule |
| Task | Classification, regression, retrieval |
| Inputs | Available features at prediction time |
| Target | Label definition and time horizon |
| Model | Architecture and parameter count |
| Loss | Training objective |
| Metrics | Primary and diagnostic metrics |
| Baselines | Simple and strong comparisons |
| Random seeds | Repeated runs when needed |
| Compute budget | Training steps, hardware, precision |
| Selection rule | How the final model is chosen |
Without this information, a reported score is hard to interpret.
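One lightweight way to make the specification explicit is to store it as a structured config next to the results. A sketch with assumed field values and hypothetical names:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class ExperimentSpec:
    dataset: str = "imagenet-1k-2012"
    task: str = "classification"
    model: str = "resnet50"
    loss: str = "cross_entropy"
    primary_metric: str = "top1_accuracy"
    selection_rule: str = "best validation top-1"
    seeds: tuple = (0, 1, 2)

# Save the spec alongside the run so the reported score can be interpreted later.
with open("experiment_spec.json", "w") as f:
    json.dump(asdict(ExperimentSpec()), f, indent=2)
```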
Baselines
A baseline is a simpler system used for comparison.
A deep model should be compared against reasonable baselines. Otherwise, improvement is difficult to judge.
Examples:
| Task | Baseline |
|---|---|
| Classification | Majority class, logistic regression |
| Regression | Predict mean, linear regression |
| Image classification | Small CNN, pretrained ResNet |
| Text classification | Bag-of-words linear model |
| Retrieval | BM25 |
| Forecasting | Last-value predictor |
| Recommendation | Popularity ranking |
A baseline prevents false progress. If a large neural network barely beats a simple model, the added complexity may not be justified.
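For classification, scikit-learn's DummyClassifier gives the majority-class baseline almost for free, and a linear model makes a stronger second reference. A sketch, assuming arrays X_train, y_train, X_val, y_val:

```python
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

baselines = {
    "majority_class": DummyClassifier(strategy="most_frequent"),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# A deep model should clearly beat these numbers to justify its complexity.
for name, model in baselines.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_val, y_val))
```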
Ablation Studies
An ablation study removes or changes one component at a time to measure its contribution.
For example, suppose a model uses:
- Data augmentation
- Dropout
- Weight decay
- Pretraining
An ablation study might train variants without each component:
| Variant | Purpose |
|---|---|
| Full model | Reference system |
| Without augmentation | Measure augmentation contribution |
| Without dropout | Measure dropout contribution |
| Without weight decay | Measure weight decay contribution |
| Without pretraining | Measure pretraining contribution |
Ablations help separate real improvements from accidental effects.
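A common pattern is to hold the full configuration fixed and toggle one component per run. A sketch, assuming a hypothetical train_and_evaluate function that accepts these flags:

```python
full_config = {
    "augmentation": True,
    "dropout": True,
    "weight_decay": True,
    "pretraining": True,
}

# The full model plus one variant per removed component.
variants = {"full_model": dict(full_config)}
for component in full_config:
    variants[f"without_{component}"] = {**full_config, component: False}

results = {name: train_and_evaluate(**cfg) for name, cfg in variants.items()}
```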
Reproducibility
Reproducibility means that another run, or another researcher, can obtain the same result within expected variation.
Deep learning experiments are affected by random initialization, data order, augmentation, nondeterministic kernels, and hardware differences.
A basic reproducibility setup:
```python
import random

import numpy as np
import torch

seed = 42
random.seed(seed)                 # Python's built-in RNG
np.random.seed(seed)              # NumPy RNG
torch.manual_seed(seed)           # PyTorch CPU RNG
torch.cuda.manual_seed_all(seed)  # PyTorch RNGs on all GPUs
```

For stronger determinism:

```python
# Force deterministic cuDNN kernels and disable autotuning.
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```

Full determinism can reduce performance and may not be possible for every operation. Still, recording seeds and environment details is essential.
Reporting Results
A result should include enough detail to be checked.
A minimal report includes:
| Field | Example |
|---|---|
| Dataset version | imagenet-1k-2012 |
| Split rule | Stratified 80/10/10 |
| Model | ResNet-50 |
| Optimizer | AdamW |
| Learning rate | |
| Batch size | 256 |
| Epochs or steps | 90 epochs |
| Primary metric | Top-1 accuracy |
| Random seeds | 3 runs |
| Mean and variation | % |
| Hardware | 8 A100 GPUs |
| Precision | bfloat16 mixed precision |
Single-run scores can be misleading, especially on small datasets. Repeated runs make results more credible.
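Aggregating repeated runs is straightforward once per-seed scores are collected. A small sketch, with assumed score values for illustration:

```python
import numpy as np

# Scores from repeated runs with different seeds (assumed values).
scores = np.array([0.763, 0.758, 0.761])
print(f"{scores.mean():.3f} ± {scores.std(ddof=1):.3f} over {len(scores)} seeds")
```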
Evaluation by Slice
Aggregate scores hide failures.
A model may perform well overall while failing on rare classes, long inputs, specific languages, new users, certain devices, or recent data.
Slice evaluation computes metrics on meaningful subsets.
| Slice type | Example |
|---|---|
| Class | Per-class accuracy |
| Length | Short versus long sequences |
| Source | Website, sensor, hospital |
| Time | Old versus recent examples |
| Geography | Region or country |
| Difficulty | Easy versus hard examples |
Slice evaluation is especially important when the model will be used in high-stakes or heterogeneous environments.
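With predictions collected in a table, slice metrics reduce to a grouped aggregation. A sketch with pandas, assuming a DataFrame named results with hypothetical columns label, prediction, and source:

```python
import pandas as pd

# results holds one row per evaluated example.
results["correct"] = results["label"] == results["prediction"]

# Accuracy and example count per slice; small slices deserve extra caution.
per_source = results.groupby("source")["correct"].agg(["mean", "size"])
print(per_source.sort_values("mean"))
```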
Checklist for Clean Experiments
Before trusting a result, check:
| Question | Why it matters |
|---|---|
| Was the split created before preprocessing? | Prevents preprocessing leakage |
| Are duplicates removed across splits? | Prevents memorization |
| Are related examples grouped correctly? | Prevents group leakage |
| Does the split respect time? | Prevents future information leakage |
| Are test scores used only once? | Prevents model-selection bias |
| Is the baseline strong enough? | Prevents exaggerated claims |
| Are metrics aligned with costs? | Prevents optimizing the wrong behavior |
| Are random seeds recorded? | Supports reproducibility |
| Are failure slices inspected? | Reveals hidden weaknesses |
Summary
Data leakage gives models information they should not have. It can enter through duplicates, preprocessing, labels, time, groups, augmentation, or repeated test-set use.
Good experimental design prevents leakage, defines clear splits, uses proper baselines, reports reproducible details, and evaluates meaningful slices.
A deep learning result is only as trustworthy as the experiment that produced it.