A deep learning model does not train directly from files. It trains from tensors. The purpose of a data pipeline is to convert stored data into batches of tensors with consistent shapes, data types, and labels.
In PyTorch, the two central abstractions for this are Dataset and DataLoader.
A Dataset defines how to access one example. A DataLoader defines how to combine many examples into batches and feed them to the training loop.
The Role of a Dataset
A dataset represents a collection of examples. Each example may contain an input, a target, and sometimes extra metadata.
For supervised learning, one example usually has the form
$(x, y)$
where $x$ is the input and $y$ is the target.
Examples:
| Task | Input | Target |
|---|---|---|
| Image classification | Image tensor | Class index |
| Text classification | Token sequence | Class index |
| Regression | Feature vector | Real number |
| Segmentation | Image tensor | Pixel label mask |
| Language modeling | Token sequence | Shifted token sequence |
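As a concrete, purely illustrative instance of the first row, one image-classification example could be a float image tensor paired with a scalar integer class index; the specific shapes and values below are assumptions, not requirements.
import torch
# One hypothetical image-classification example: a 3-channel 224x224 image
# and an integer class index. Shapes and values are illustrative only.
x = torch.randn(3, 224, 224)   # input image tensor
y = torch.tensor(7)            # target class index
print(x.shape, x.dtype)        # torch.Size([3, 224, 224]) torch.float32
print(y.shape, y.dtype)        # torch.Size([]) torch.int64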
In PyTorch, a dataset should answer two basic questions:
len(dataset) # how many examples?
dataset[i] # what is example i?
A minimal custom dataset looks like this:
import torch
from torch.utils.data import Dataset
class TensorDataset(Dataset):
def __init__(self, x, y):
self.x = x
self.y = y
def __len__(self):
return len(self.x)
def __getitem__(self, index):
return self.x[index], self.y[index]
Usage:
x = torch.randn(1000, 10)
y = torch.randint(0, 2, (1000,))
dataset = TensorDataset(x, y)
sample_x, sample_y = dataset[0]
print(sample_x.shape) # torch.Size([10])
print(sample_y) # scalar class label
The dataset returns one example at a time. It does not decide the batch size, the shuffling policy, or the number of worker processes. Those belong to the DataLoader.
The Role of a DataLoader
A DataLoader wraps a dataset and produces batches.
from torch.utils.data import DataLoader
loader = DataLoader(
dataset,
batch_size=32,
shuffle=True,
)
Now the training loop can iterate over batches:
for batch_x, batch_y in loader:
print(batch_x.shape)
print(batch_y.shape)
break
For the previous dataset, this prints:
torch.Size([32, 10])
torch.Size([32])
The dataset returns individual examples of shape [10]. The dataloader stacks 32 examples into a batch of shape [32, 10].
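For same-shaped tensors, this stacking behaves like torch.stack applied per field; a small sketch of the equivalence, reusing the dataset defined above:
# Manually stack the first 32 examples along a new batch dimension.
# This mirrors what the dataloader's default collation does for these tensors.
manual_x = torch.stack([dataset[i][0] for i in range(32)])
manual_y = torch.stack([dataset[i][1] for i in range(32)])
print(manual_x.shape)  # torch.Size([32, 10])
print(manual_y.shape)  # torch.Size([32])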
This is the usual division of responsibility:
| Component | Responsibility |
|---|---|
| Dataset | Load or construct one example |
| DataLoader | Batch, shuffle, parallelize, and iterate |
| Training loop | Move tensors to device, run model, compute loss, update parameters |
Map-Style Datasets
The most common PyTorch dataset is a map-style dataset. It supports random access by index.
dataset[i]
Map-style datasets usually implement:
__len__
__getitem__
Example:
from torch.utils.data import Dataset
class NumberDataset(Dataset):
def __init__(self, n):
self.n = n
def __len__(self):
return self.n
def __getitem__(self, index):
x = torch.tensor([float(index)])
y = torch.tensor(float(index * index))
return x, y
This dataset maps an integer index to an example.
dataset = NumberDataset(5)
for i in range(len(dataset)):
print(dataset[i])
Map-style datasets work well when examples are stored in a list, table, directory, archive, or database where each item can be selected independently.
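As a sketch of the directory case, the following hypothetical dataset indexes a folder of tensors saved with torch.save; the file layout (files named like 0_cat.pt) and the label-from-filename rule are assumptions made only for this example.
from pathlib import Path
import torch
from torch.utils.data import Dataset

class TensorFileDataset(Dataset):
    # Hypothetical map-style dataset over files such as data/0_cat.pt, data/1_dog.pt.
    def __init__(self, root):
        self.paths = sorted(Path(root).glob("*.pt"))
        # Assume the class name follows the first underscore in the file name.
        classes = sorted({p.stem.split("_", 1)[1] for p in self.paths})
        self.class_to_index = {name: i for i, name in enumerate(classes)}

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        path = self.paths[index]
        x = torch.load(path)  # one input tensor per file
        y = torch.tensor(self.class_to_index[path.stem.split("_", 1)[1]])
        return x, y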
Iterable Datasets
Some datasets do not have efficient random access. Examples may arrive from a stream, a log, a socket, a web dataset, or a very large compressed file. In these cases, PyTorch provides IterableDataset.
An iterable dataset defines an iterator instead of indexed access.
from torch.utils.data import IterableDataset
class CountingDataset(IterableDataset):
def __init__(self, limit):
self.limit = limit
def __iter__(self):
for i in range(self.limit):
x = torch.tensor([float(i)])
y = torch.tensor(float(i * i))
yield x, y
Usage:
dataset = CountingDataset(100)
loader = DataLoader(dataset, batch_size=8)
for x, y in loader:
print(x.shape, y.shape)
break
Iterable datasets are useful for large-scale training, streaming corpora, generated data, and distributed pipelines. They require more care with shuffling and worker partitioning, because the data stream must be divided correctly across workers and devices.
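One common partitioning pattern, sketched below for a simple counting stream, is to split the stream inside __iter__ with torch.utils.data.get_worker_info() so that each worker yields a disjoint slice of the data.
import torch
from torch.utils.data import IterableDataset, get_worker_info

class ShardedCountingDataset(IterableDataset):
    def __init__(self, limit):
        self.limit = limit

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            start, step = 0, 1                       # single-process loading: see everything
        else:
            start, step = info.id, info.num_workers  # each worker takes a disjoint stride
        for i in range(start, self.limit, step):
            yield torch.tensor([float(i)]), torch.tensor(float(i * i))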
Batch Size
The batch size controls how many examples are processed in one training step.
If each input is a vector in $\mathbb{R}^{d}$, then a batch has shape
$[B, d]$
where $B$ is the batch size.
For images, if each example has shape
$[C, H, W]$
then a batch has shape
$[B, C, H, W]$
Example:
images = torch.randn(128, 3, 224, 224)
labels = torch.randint(0, 1000, (128,))
Here the batch size is 128.
A larger batch improves hardware utilization but uses more memory. A smaller batch uses less memory but gives noisier gradient estimates and may underuse the GPU.
In practice, batch size is constrained by GPU memory, model size, input size, and optimizer state.
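A rough way to estimate the input side of that budget is to count elements; a small sketch for the image batch above (activations, gradients, and optimizer state are not included):
import torch

images = torch.randn(128, 3, 224, 224)
# Bytes held by the input batch alone: number of elements times bytes per element.
input_bytes = images.nelement() * images.element_size()
print(input_bytes / 1024 ** 2)  # about 73.5 MiB for float32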
Shuffling
Training data is usually shuffled so that consecutive batches do not follow the original order of the dataset.
loader = DataLoader(
dataset,
batch_size=32,
shuffle=True,
)
Shuffling matters because stochastic gradient descent assumes that mini-batches give approximately unbiased estimates of the full training gradient. If the data is ordered by class, time, file name, or difficulty, then non-shuffled batches can produce unstable training.
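A small sketch of the problem, using PyTorch's built-in TensorDataset and toy data that is sorted by class: without shuffling, the first batch contains only one class.
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy data sorted by class: the first 500 examples are class 0, the rest class 1.
features = torch.randn(1000, 10)
targets = torch.cat([torch.zeros(500, dtype=torch.long), torch.ones(500, dtype=torch.long)])
sorted_dataset = TensorDataset(features, targets)

unshuffled = DataLoader(sorted_dataset, batch_size=32, shuffle=False)
shuffled = DataLoader(sorted_dataset, batch_size=32, shuffle=True)

print(next(iter(unshuffled))[1].bincount(minlength=2))  # tensor([32, 0]) -- one class only
print(next(iter(shuffled))[1].bincount(minlength=2))    # roughly balanced counts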
For validation and testing, shuffling is usually disabled:
val_loader = DataLoader(
val_dataset,
batch_size=64,
shuffle=False,
)
Evaluation should be deterministic unless there is a specific reason to randomize it.
Collation
A dataloader must combine individual examples into one batch. This process is called collation.
For simple tensors with the same shape, PyTorch’s default collation stacks them automatically.
If a dataset returns examples like this:
x.shape == torch.Size([10])
y.shape == torch.Size([])
then a batch of 32 examples becomes:
batch_x.shape == torch.Size([32, 10])
batch_y.shape == torch.Size([32])
The default collation works when all tensors in a field have the same shape.
For variable-length data, such as text sequences, custom collation is often required. Suppose each example is a token sequence of different length. These sequences cannot be stacked directly.
A common solution is padding:
from torch.nn.utils.rnn import pad_sequence
def collate_tokens(batch):
sequences, labels = zip(*batch)
sequences = [torch.tensor(seq, dtype=torch.long) for seq in sequences]
labels = torch.tensor(labels, dtype=torch.long)
padded = pad_sequence(
sequences,
batch_first=True,
padding_value=0,
)
return padded, labels
Use it in a dataloader:
loader = DataLoader(
dataset,
batch_size=32,
shuffle=True,
collate_fn=collate_tokens,
)
A custom collate_fn is common in NLP, detection, segmentation, graph learning, and multimodal models.
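As a quick check of collate_tokens, here is a hypothetical list of (token_list, label) examples of different lengths, in the form the dataloader would pass to the collate function:
# Three variable-length examples; the token values are arbitrary.
batch = [
    ([5, 2, 9], 0),
    ([7, 1], 1),
    ([3, 3, 3, 3, 8], 0),
]
padded, labels = collate_tokens(batch)
print(padded.shape)  # torch.Size([3, 5]), padded to the longest sequence
print(labels)        # tensor([0, 1, 0])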
A Complete Small Example
The following example creates a synthetic binary classification dataset and trains a small model from a dataloader.
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
class BinaryDataset(Dataset):
def __init__(self, n=1000):
self.x = torch.randn(n, 2)
# A simple linear rule with noise.
score = self.x[:, 0] - 0.5 * self.x[:, 1]
self.y = (score > 0).long()
def __len__(self):
return len(self.x)
def __getitem__(self, index):
return self.x[index], self.y[index]
dataset = BinaryDataset(n=1000)
loader = DataLoader(
dataset,
batch_size=32,
shuffle=True,
)
model = nn.Sequential(
nn.Linear(2, 16),
nn.ReLU(),
nn.Linear(16, 2),
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(5):
total_loss = 0.0
for x, y in loader:
logits = model(x)
loss = loss_fn(logits, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
total_loss += loss.item() * x.size(0)
avg_loss = total_loss / len(dataset)
print(f"epoch={epoch} loss={avg_loss:.4f}")This example contains the standard training structure:
- The dataset returns one example.
- The dataloader builds a batch.
- The model computes logits.
- The loss compares logits with labels.
- Backpropagation computes gradients.
- The optimizer updates parameters.
Moving Batches to a Device
The dataloader usually returns CPU tensors. If the model is on a GPU, the batch must be moved to the same device.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
for x, y in loader:
x = x.to(device)
y = y.to(device)
logits = model(x)
loss = loss_fn(logits, y)
Model parameters and input tensors must be on the same device. Otherwise PyTorch raises a device mismatch error.
For nested batches, it is useful to write a helper function:
def move_to_device(batch, device):
if torch.is_tensor(batch):
return batch.to(device)
if isinstance(batch, dict):
return {k: move_to_device(v, device) for k, v in batch.items()}
if isinstance(batch, list):
return [move_to_device(v, device) for v in batch]
if isinstance(batch, tuple):
return tuple(move_to_device(v, device) for v in batch)
return batch
Then:
for batch in loader:
batch = move_to_device(batch, device)
This pattern is useful when a batch contains tensors, masks, labels, strings, IDs, and metadata.
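For example, a hypothetical dictionary-style batch moves cleanly with the same helper; tensor fields change device, while the string field is returned unchanged.
batch = {
    "input_ids": torch.randint(0, 100, (4, 16)),
    "attention_mask": torch.ones(4, 16, dtype=torch.long),
    "labels": torch.randint(0, 2, (4,)),
    "ids": ["a", "b", "c", "d"],  # non-tensor metadata passes through untouched
}
batch = move_to_device(batch, device)
print(batch["input_ids"].device)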
Multiple Workers
Data loading can become a bottleneck. If the GPU waits for the CPU to load and preprocess data, training speed suffers.
PyTorch can use multiple worker processes:
loader = DataLoader(
dataset,
batch_size=64,
shuffle=True,
num_workers=4,
)
The num_workers argument controls how many subprocesses load data in parallel.
Common settings:
| Setting | Meaning |
|---|---|
| num_workers=0 | Load data in the main process |
| num_workers=2 | Use two worker processes |
| num_workers=4 or more | Useful for image decoding and heavy preprocessing |
The best value depends on CPU cores, disk speed, preprocessing cost, and batch size. Too many workers can increase memory use and process overhead.
When debugging a dataset, start with:
num_workers=0
This gives clearer error messages. After the dataset works, increase the number of workers.
Pinned Memory
When training on a CUDA GPU, setting pin_memory=True can speed up CPU-to-GPU transfers.
loader = DataLoader(
dataset,
batch_size=64,
shuffle=True,
num_workers=4,
pin_memory=True,
)
Pinned memory means page-locked CPU memory. CUDA can copy from pinned memory to GPU memory more efficiently.
When using pinned memory, the batch transfer can also be non-blocking:
x = x.to(device, non_blocking=True)
y = y.to(device, non_blocking=True)
This is mainly useful when the device is CUDA. It usually has little benefit for CPU-only training.
Dropping the Last Batch
If the dataset size is not divisible by the batch size, the last batch will be smaller.
For example, 1000 examples with batch size 128 gives seven full batches and one final batch of 104 examples.
By default, PyTorch keeps the final smaller batch. Sometimes this is undesirable, especially with batch normalization or distributed training. Use drop_last=True to discard it:
loader = DataLoader(
dataset,
batch_size=128,
shuffle=True,
drop_last=True,
)
This ensures every batch has the same size.
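The batch counts can be confirmed directly from len(loader); a small sketch with the 1000-example, batch-size-128 numbers from above, using PyTorch's built-in TensorDataset as a stand-in:
import torch
from torch.utils.data import DataLoader, TensorDataset

examples = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))
print(len(DataLoader(examples, batch_size=128, drop_last=False)))  # 8 batches, last one has 104 examples
print(len(DataLoader(examples, batch_size=128, drop_last=True)))   # 7 batches, the final 104 are dropped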
For validation and testing, drop_last=False is usually preferred, so every example is evaluated.
Reproducibility
Shuffling and data augmentation introduce randomness. For reproducible experiments, set random seeds.
import random
import numpy as np
import torch
seed = 123
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
For dataloader workers, define a worker initialization function:
def seed_worker(worker_id):
worker_seed = torch.initial_seed() % 2**32
np.random.seed(worker_seed)
random.seed(worker_seed)
Use a generator for deterministic shuffling:
g = torch.Generator()
g.manual_seed(123)
loader = DataLoader(
dataset,
batch_size=32,
shuffle=True,
worker_init_fn=seed_worker,
generator=g,
)
Exact reproducibility may still depend on hardware, CUDA kernels, library versions, and nondeterministic operations. Still, controlling dataloader randomness is a necessary first step.
Train, Validation, and Test Datasets
A complete project usually uses separate datasets for training, validation, and testing.
train_loader = DataLoader(
train_dataset,
batch_size=64,
shuffle=True,
)
val_loader = DataLoader(
val_dataset,
batch_size=128,
shuffle=False,
)
test_loader = DataLoader(
test_dataset,
batch_size=128,
shuffle=False,
)
The training set is used to update model parameters. The validation set is used to select hyperparameters and monitor generalization. The test set is used for final evaluation.
The test set should not influence model design, early stopping, checkpoint selection, or hyperparameter search.
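When the three parts come from one labeled dataset, torch.utils.data.random_split is a convenient way to create them; the 80/10/10 sizes below assume a 1000-example dataset such as the BinaryDataset defined earlier and are only an example.
import torch
from torch.utils.data import random_split

generator = torch.Generator().manual_seed(42)  # fixed seed so the split is reproducible
train_dataset, val_dataset, test_dataset = random_split(
    dataset,
    [800, 100, 100],
    generator=generator,
)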
Common Dataset Mistakes
Many training failures come from data pipeline errors rather than model architecture.
Common mistakes include:
| Mistake | Consequence |
|---|---|
| Labels have wrong dtype | Loss function error |
| Images have wrong shape order | Model receives invalid input |
| Inputs remain on CPU while model is on GPU | Device mismatch error |
| Validation data is shuffled with random augmentation | Noisy evaluation |
| Class indices start at 1 instead of 0 | Cross-entropy error or wrong learning |
| Variable-length examples use default collation | Stack size error |
| Dataset returns NumPy arrays with unexpected dtype | Silent type mismatch |
| Test data influences preprocessing statistics | Data leakage |
For classification with nn.CrossEntropyLoss, labels should be integer class indices:
labels.dtype == torch.long
labels.shape == torch.Size([B])
The model output should be raw logits:
logits.shape == torch.Size([B, num_classes])
Do not apply softmax before nn.CrossEntropyLoss; the loss function applies the required log-softmax internally.
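A tiny sanity check of these conventions, with a hypothetical batch size and class count:
import torch
from torch import nn

B, num_classes = 32, 5
logits = torch.randn(B, num_classes)          # raw logits, no softmax applied
labels = torch.randint(0, num_classes, (B,))  # integer class indices, dtype torch.long
loss = nn.CrossEntropyLoss()(logits, labels)
print(labels.dtype, loss.item())              # torch.int64 and a scalar loss value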
Practical DataLoader Template
A useful default template for many supervised learning projects is:
train_loader = DataLoader(
train_dataset,
batch_size=64,
shuffle=True,
num_workers=4,
pin_memory=True,
drop_last=True,
)
val_loader = DataLoader(
val_dataset,
batch_size=128,
shuffle=False,
num_workers=4,
pin_memory=True,
drop_last=False,
)
For debugging:
debug_loader = DataLoader(
train_dataset,
batch_size=4,
shuffle=False,
num_workers=0,
)
Inspect the first batch before training:
batch = next(iter(debug_loader))
x, y = batch
print(x.shape, x.dtype)
print(y.shape, y.dtype)
print(x.min(), x.max())
print(y[:10])
This simple inspection often catches errors early.
Summary
A Dataset describes how to retrieve one example. A DataLoader describes how to produce batches from a dataset.
The dataset owns example-level logic: reading files, parsing records, applying transforms, and returning tensors. The dataloader owns iteration-level logic: batching, shuffling, multiprocessing, collation, pinned memory, and dropping incomplete batches.
A correct data pipeline gives the training loop batches with predictable shapes, correct data types, and consistent semantics. Before improving a model, verify the data. In deep learning, many apparent model problems are tensor, label, batching, or preprocessing problems.