A deep learning model does not train directly from files. It trains from tensors. The purpose of a data pipeline is to convert stored data into batches of tensors with consistent shapes, data types, and labels.
In PyTorch, the two central abstractions for this are Dataset and DataLoader.
A Dataset defines how to access one example. A DataLoader defines how to combine many examples into batches and feed them to the training loop.
The Role of a Dataset
A dataset represents a collection of examples. Each example may contain an input, a target, and sometimes extra metadata.
For supervised learning, one example usually has the form
$(x, y)$
where $x$ is the input and $y$ is the target.
Examples:
| Task | Input | Target |
|---|---|---|
| Image classification | Image tensor | Class index |
| Text classification | Token sequence | Class index |
| Regression | Feature vector | Real number |
| Segmentation | Image tensor | Pixel label mask |
| Language modeling | Token sequence | Shifted token sequence |
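As a concrete, purely illustrative instance of the first row, one image-classification example could be a float image tensor paired with a scalar integer class index; the specific shapes and values below are assumptions, not requirements.
import torch
# One hypothetical image-classification example: a 3-channel 224x224 image
# and an integer class index. Shapes and values are illustrative only.
x = torch.randn(3, 224, 224)   # input image tensor
y = torch.tensor(7)            # target class index
print(x.shape, x.dtype)        # torch.Size([3, 224, 224]) torch.float32
print(y.shape, y.dtype)        # torch.Size([]) torch.int64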
In PyTorch, a dataset should answer two basic questions:
len(dataset) # how many examples?
dataset[i] # what is example i?
A minimal custom dataset looks like this:
import torch
from torch.utils.data import Dataset
class TensorDataset(Dataset):
def __init__(self, x, y):
self.x = x
self.y = y
def __len__(self):
return len(self.x)
def __getitem__(self, index):
return self.x[index], self.y[index]
Usage:
x = torch.randn(1000, 10)
y = torch.randint(0, 2, (1000,))
dataset = TensorDataset(x, y)
sample_x, sample_y = dataset[0]
print(sample_x.shape) # torch.Size([10])
print(sample_y) # scalar class label
The dataset returns one example at a time. It does not decide the batch size, the shuffling policy, or the number of worker processes. Those belong to the DataLoader.
The Role of a DataLoader
A DataLoader wraps a dataset and produces batches.
from torch.utils.data import DataLoader
loader = DataLoader(
dataset,
batch_size=32,
shuffle=True,
)
Now the training loop can iterate over batches:
for batch_x, batch_y in loader:
print(batch_x.shape)
print(batch_y.shape)
break
For the previous dataset, this prints:
torch.Size([32, 10])
torch.Size([32])
The dataset returns individual examples of shape [10]. The dataloader stacks 32 examples into a batch of shape [32, 10].
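For same-shaped tensors, this stacking behaves like torch.stack applied per field; a small sketch of the equivalence, reusing the dataset defined above:
# Manually stack the first 32 examples along a new batch dimension.
# This mirrors what the dataloader's default collation does for these tensors.
manual_x = torch.stack([dataset[i][0] for i in range(32)])
manual_y = torch.stack([dataset[i][1] for i in range(32)])
print(manual_x.shape)  # torch.Size([32, 10])
print(manual_y.shape)  # torch.Size([32])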
This is the usual division of responsibility:
| Component | Responsibility |
|---|---|
| Dataset | Load or construct one example |
| DataLoader | Batch, shuffle, parallelize, and iterate |
| Training loop | Move tensors to device, run model, compute loss, update parameters |
Map-Style Datasets
The most common PyTorch dataset is a map-style dataset. It supports random access by index.
dataset[i]
Map-style datasets usually implement:
__len__
__getitem__
Example:
from torch.utils.data import Dataset
class NumberDataset(Dataset):
def __init__(self, n):
self.n = n
def __len__(self):
return self.n
def __getitem__(self, index):
x = torch.tensor([float(index)])
y = torch.tensor(float(index * index))
return x, y
This dataset maps an integer index to an example.
dataset = NumberDataset(5)
for i in range(len(dataset)):
print(dataset[i])
Map-style datasets work well when examples are stored in a list, table, directory, archive, or database where each item can be selected independently.
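As a sketch of the directory case, the following hypothetical dataset indexes a folder of tensors saved with torch.save; the file layout (files named like 0_cat.pt) and the label-from-filename rule are assumptions made only for this example.
from pathlib import Path
import torch
from torch.utils.data import Dataset

class TensorFileDataset(Dataset):
    # Hypothetical map-style dataset over files such as data/0_cat.pt, data/1_dog.pt.
    def __init__(self, root):
        self.paths = sorted(Path(root).glob("*.pt"))
        # Assume the class name follows the first underscore in the file name.
        classes = sorted({p.stem.split("_", 1)[1] for p in self.paths})
        self.class_to_index = {name: i for i, name in enumerate(classes)}

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        path = self.paths[index]
        x = torch.load(path)  # one input tensor per file
        y = torch.tensor(self.class_to_index[path.stem.split("_", 1)[1]])
        return x, y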
Iterable Datasets
Some datasets do not have efficient random access. Examples may arrive from a stream, a log, a socket, a web dataset, or a very large compressed file. In these cases, PyTorch provides IterableDataset.
An iterable dataset defines an iterator instead of indexed access.
from torch.utils.data import IterableDataset
class CountingDataset(IterableDataset):
def __init__(self, limit):
self.limit = limit
def __iter__(self):
for i in range(self.limit):
x = torch.tensor([float(i)])
y = torch.tensor(float(i * i))
yield x, y
Usage:
dataset = CountingDataset(100)
loader = DataLoader(dataset, batch_size=8)
for x, y in loader:
print(x.shape, y.shape)
break
Iterable datasets are useful for large-scale training, streaming corpora, generated data, and distributed pipelines. They require more care with shuffling and worker partitioning, because the data stream must be divided correctly across workers and devices.
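One common partitioning pattern, sketched below for a simple counting stream, is to split the stream inside __iter__ with torch.utils.data.get_worker_info() so that each worker yields a disjoint slice of the data.
import torch
from torch.utils.data import IterableDataset, get_worker_info

class ShardedCountingDataset(IterableDataset):
    def __init__(self, limit):
        self.limit = limit

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            start, step = 0, 1                       # single-process loading: see everything
        else:
            start, step = info.id, info.num_workers  # each worker takes a disjoint stride
        for i in range(start, self.limit, step):
            yield torch.tensor([float(i)]), torch.tensor(float(i * i))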
Batch Size
The batch size controls how many examples are processed in one training step.
If each input is a vector in $\mathbb{R}^{d}$, then a batch has shape
$[B, d]$
where $B$ is the batch size.
For images, if each example has shape
$[C, H, W]$
then a batch has shape
$[B, C, H, W]$
Example:
images = torch.randn(128, 3, 224, 224)
labels = torch.randint(0, 1000, (128,))
Here the batch size is 128.
A larger batch improves hardware utilization but uses more memory. A smaller batch uses less memory but gives noisier gradient estimates and may underuse the GPU.
In practice, batch size is constrained by GPU memory, model size, input size, and optimizer state.
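A rough way to estimate the input side of that budget is to count elements; a small sketch for the image batch above (activations, gradients, and optimizer state are not included):
import torch

images = torch.randn(128, 3, 224, 224)
# Bytes held by the input batch alone: number of elements times bytes per element.
input_bytes = images.nelement() * images.element_size()
print(input_bytes / 1024 ** 2)  # about 73.5 MiB for float32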
Shuffling
Training data is usually shuffled so that consecutive batches do not follow the original order of the dataset.
loader = DataLoader(
dataset,
batch_size=32,
shuffle=True,
)
Shuffling matters because stochastic gradient descent assumes that mini-batches give approximately unbiased estimates of the full training gradient. If the data is ordered by class, time, file name, or difficulty, then non-shuffled batches can produce unstable training.
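A small sketch of the problem, using PyTorch's built-in TensorDataset and toy data that is sorted by class: without shuffling, the first batch contains only one class.
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy data sorted by class: the first 500 examples are class 0, the rest class 1.
features = torch.randn(1000, 10)
targets = torch.cat([torch.zeros(500, dtype=torch.long), torch.ones(500, dtype=torch.long)])
sorted_dataset = TensorDataset(features, targets)

unshuffled = DataLoader(sorted_dataset, batch_size=32, shuffle=False)
shuffled = DataLoader(sorted_dataset, batch_size=32, shuffle=True)

print(next(iter(unshuffled))[1].bincount(minlength=2))  # tensor([32, 0]) -- one class only
print(next(iter(shuffled))[1].bincount(minlength=2))    # roughly balanced counts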
For validation and testing, shuffling is usually disabled:
val_loader = DataLoader(
val_dataset,
batch_size=64,
shuffle=False,
)
Evaluation should be deterministic unless there is a specific reason to randomize it.
Collation
A dataloader must combine individual examples into one batch. This process is called collation.
For simple tensors with the same shape, PyTorch’s default collation stacks them automatically.
If a dataset returns examples like this:
x.shape == torch.Size([10])
y.shape == torch.Size([])
then a batch of 32 examples becomes:
batch_x.shape == torch.Size([32, 10])
batch_y.shape == torch.Size([32])
The default collation works when all tensors in a field have the same shape.
For variable-length data, such as text sequences, custom collation is often required. Suppose each example is a token sequence of different length. These sequences cannot be stacked directly.
A common solution is padding:
from torch.nn.utils.rnn import pad_sequence
def collate_tokens(batch):
sequences, labels = zip(*batch)
sequences = [torch.tensor(seq, dtype=torch.long) for seq in sequences]
labels = torch.tensor(labels, dtype=torch.long)
padded = pad_sequence(
sequences,
batch_first=True,
padding_value=0,
)
return padded, labels
Use it in a dataloader:
loader = DataLoader(
dataset,
batch_size=32,
shuffle=True,
collate_fn=collate_tokens,
)
A custom collate_fn is common in NLP, detection, segmentation, graph learning, and multimodal models.
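As a quick check of collate_tokens, here is a hypothetical list of (token_list, label) examples of different lengths, in the form the dataloader would pass to the collate function:
# Three variable-length examples; the token values are arbitrary.
batch = [
    ([5, 2, 9], 0),
    ([7, 1], 1),
    ([3, 3, 3, 3, 8], 0),
]
padded, labels = collate_tokens(batch)
print(padded.shape)  # torch.Size([3, 5]), padded to the longest sequence
print(labels)        # tensor([0, 1, 0])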
A Complete Small Example
The following example creates a synthetic binary classification dataset and trains a small model from a dataloader.
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
class BinaryDataset(Dataset):
def __init__(self, n=1000):
self.x = torch.randn(n, 2)
# A simple linear rule with noise.
score = self.x[:, 0] - 0.5 * self.x[:, 1]
self.y = (score > 0).long()
def __len__(self):
return len(self.x)
def __getitem__(self, index):
return self.x[index], self.y[index]
dataset = BinaryDataset(n=1000)
loader = DataLoader(
dataset,
batch_size=32,
shuffle=True,
)
model = nn.Sequential(
nn.Linear(2, 16),
nn.ReLU(),
nn.Linear(16, 2),
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(5):
total_loss = 0.0
for x, y in loader:
logits = model(x)
loss = loss_fn(logits, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
total_loss += loss.item() * x.size(0)
avg_loss = total_loss / len(dataset)
print(f"epoch={epoch} loss={avg_loss:.4f}")This example contains the standard training structure:
- The dataset returns one example.
- The dataloader builds a batch.
- The model computes logits.
- The loss compares logits with labels.
- Backpropagation computes gradients.
- The optimizer updates parameters.
Moving Batches to a Device
The dataloader usually returns CPU tensors. If the model is on a GPU, the batch must be moved to the same device.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
for x, y in loader:
x = x.to(device)
y = y.to(device)
logits = model(x)
loss = loss_fn(logits, y)
Model parameters and input tensors must be on the same device. Otherwise PyTorch raises a device mismatch error.
For nested batches, it is useful to write a helper function:
def move_to_device(batch, device):
if torch.is_tensor(batch):
return batch.to(device)
if isinstance(batch, dict):
return {k: move_to_device(v, device) for k, v in batch.items()}
if isinstance(batch, list):
return [move_to_device(v, device) for v in batch]
if isinstance(batch, tuple):
return tuple(move_to_device(v, device) for v in batch)
return batch
Then:
for batch in loader:
batch = move_to_device(batch, device)
This pattern is useful when a batch contains tensors, masks, labels, strings, IDs, and metadata.
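For example, a hypothetical dictionary-style batch moves cleanly with the same helper; tensor fields change device, while the string field is returned unchanged.
batch = {
    "input_ids": torch.randint(0, 100, (4, 16)),
    "attention_mask": torch.ones(4, 16, dtype=torch.long),
    "labels": torch.randint(0, 2, (4,)),
    "ids": ["a", "b", "c", "d"],  # non-tensor metadata passes through untouched
}
batch = move_to_device(batch, device)
print(batch["input_ids"].device)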
Multiple Workers
Data loading can become a bottleneck. If the GPU waits for the CPU to load and preprocess data, training speed suffers.
PyTorch can use multiple worker processes:
loader = DataLoader(
dataset,
batch_size=64,
shuffle=True,
num_workers=4,
)
The num_workers argument controls how many subprocesses load data in parallel.
Common settings:
| Setting | Meaning |
|---|---|
| num_workers=0 | Load data in the main process |
| num_workers=2 | Use two worker processes |
| num_workers=4 or more | Useful for image decoding and heavy preprocessing |
The best value depends on CPU cores, disk speed, preprocessing cost, and batch size. Too many workers can increase memory use and process overhead.
When debugging a dataset, start with:
num_workers=0
This gives clearer error messages. After the dataset works, increase the number of workers.
Pinned Memory
When training on a CUDA GPU, setting pin_memory=True can speed up CPU-to-GPU transfers.
loader = DataLoader(
dataset,
batch_size=64,
shuffle=True,
num_workers=4,
pin_memory=True,
)
Pinned memory means page-locked CPU memory. CUDA can copy from pinned memory to GPU memory more efficiently.
When using pinned memory, the batch transfer can also be non-blocking:
x = x.to(device, non_blocking=True)
y = y.to(device, non_blocking=True)
This is mainly useful when the device is CUDA. It usually has little benefit for CPU-only training.
Dropping the Last Batch
If the dataset size is not divisible by the batch size, the last batch will be smaller.
For example, 1000 examples with batch size 128 gives seven full batches and one final batch of 104 examples.
By default, PyTorch keeps the final smaller batch. Sometimes this is undesirable, especially with batch normalization or distributed training. Use drop_last=True to discard it:
loader = DataLoader(
dataset,
batch_size=128,
shuffle=True,
drop_last=True,
)
This ensures every batch has the same size.
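The batch counts can be confirmed directly from len(loader); a small sketch with the 1000-example, batch-size-128 numbers from above, using PyTorch's built-in TensorDataset as a stand-in:
import torch
from torch.utils.data import DataLoader, TensorDataset

examples = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))
print(len(DataLoader(examples, batch_size=128, drop_last=False)))  # 8 batches, last one has 104 examples
print(len(DataLoader(examples, batch_size=128, drop_last=True)))   # 7 batches, the final 104 are dropped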
For validation and testing, drop_last=False is usually preferred, so every example is evaluated.
Reproducibility
Shuffling and data augmentation introduce randomness. For reproducible experiments, set random seeds.
import random
import numpy as np
import torch
seed = 123
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
For dataloader workers, define a worker initialization function:
def seed_worker(worker_id):
worker_seed = torch.initial_seed() % 2**32
np.random.seed(worker_seed)
random.seed(worker_seed)
Use a generator for deterministic shuffling:
g = torch.Generator()
g.manual_seed(123)
loader = DataLoader(
dataset,
batch_size=32,
shuffle=True,
worker_init_fn=seed_worker,
generator=g,
)
Exact reproducibility may still depend on hardware, CUDA kernels, library versions, and nondeterministic operations. Still, controlling dataloader randomness is a necessary first step.
Train, Validation, and Test Datasets
A complete project usually uses separate datasets for training, validation, and testing.
train_loader = DataLoader(
train_dataset,
batch_size=64,
shuffle=True,
)
val_loader = DataLoader(
val_dataset,
batch_size=128,
shuffle=False,
)
test_loader = DataLoader(
test_dataset,
batch_size=128,
shuffle=False,
)
The training set is used to update model parameters. The validation set is used to select hyperparameters and monitor generalization. The test set is used for final evaluation.
The test set should not influence model design, early stopping, checkpoint selection, or hyperparameter search.
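When the three parts come from one labeled dataset, torch.utils.data.random_split is a convenient way to create them; the 80/10/10 sizes below assume a 1000-example dataset such as the BinaryDataset defined earlier and are only an example.
import torch
from torch.utils.data import random_split

generator = torch.Generator().manual_seed(42)  # fixed seed so the split is reproducible
train_dataset, val_dataset, test_dataset = random_split(
    dataset,
    [800, 100, 100],
    generator=generator,
)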
Common Dataset Mistakes
Many training failures come from data pipeline errors rather than model architecture.
Common mistakes include:
| Mistake | Consequence |
|---|---|
| Labels have wrong dtype | Loss function error |
| Images have wrong shape order | Model receives invalid input |
| Inputs remain on CPU while model is on GPU | Device mismatch error |
| Validation data is shuffled with random augmentation | Noisy evaluation |
| Class indices start at 1 instead of 0 | Cross-entropy error or wrong learning |
| Variable-length examples use default collation | Stack size error |
| Dataset returns NumPy arrays with unexpected dtype | Silent type mismatch |
| Test data influences preprocessing statistics | Data leakage |
For classification with nn.CrossEntropyLoss, labels should be integer class indices:
labels.dtype == torch.long
labels.shape == torch.Size([B])
The model output should be raw logits:
logits.shape == torch.Size([B, num_classes])
Do not apply softmax before nn.CrossEntropyLoss; the loss function applies the required log-softmax internally.
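A tiny sanity check of these conventions, with a hypothetical batch size and class count:
import torch
from torch import nn

B, num_classes = 32, 5
logits = torch.randn(B, num_classes)          # raw logits, no softmax applied
labels = torch.randint(0, num_classes, (B,))  # integer class indices, dtype torch.long
loss = nn.CrossEntropyLoss()(logits, labels)
print(labels.dtype, loss.item())              # torch.int64 and a scalar loss value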
Practical DataLoader Template
A useful default template for many supervised learning projects is:
train_loader = DataLoader(
train_dataset,
batch_size=64,
shuffle=True,
num_workers=4,
pin_memory=True,
drop_last=True,
)
val_loader = DataLoader(
val_dataset,
batch_size=128,
shuffle=False,
num_workers=4,
pin_memory=True,
drop_last=False,
)
For debugging:
debug_loader = DataLoader(
train_dataset,
batch_size=4,
shuffle=False,
num_workers=0,
)
Inspect the first batch before training:
batch = next(iter(debug_loader))
x, y = batch
print(x.shape, x.dtype)
print(y.shape, y.dtype)
print(x.min(), x.max())
print(y[:10])
This simple inspection often catches errors early.
Summary
A Dataset describes how to retrieve one example. A DataLoader describes how to produce batches from a dataset.
The dataset owns example-level logic: reading files, parsing records, applying transforms, and returning tensors. The dataloader owns iteration-level logic: batching, shuffling, multiprocessing, collation, pinned memory, and dropping incomplete batches.
A correct data pipeline gives the training loop batches with predictable shapes, correct data types, and consistent semantics. Before improving a model, verify the data. In deep learning, many apparent model problems are tensor, label, batching, or preprocessing problems.