# Datasets and DataLoaders

A deep learning model does not train directly from files. It trains from tensors. The purpose of a data pipeline is to convert stored data into batches of tensors with consistent shapes, data types, and labels.

In PyTorch, the two central abstractions for this are `Dataset` and `DataLoader`.

A `Dataset` defines how to access one example. A `DataLoader` defines how to combine many examples into batches and feed them to the training loop.

### The Role of a Dataset

A dataset represents a collection of examples. Each example may contain an input, a target, and sometimes extra metadata.

For supervised learning, one example usually has the form

$$
(x_i, y_i)
$$

where \(x_i\) is the input and \(y_i\) is the target.

Examples:

| Task | Input \(x_i\) | Target \(y_i\) |
|---|---|---|
| Image classification | Image tensor | Class index |
| Text classification | Token sequence | Class index |
| Regression | Feature vector | Real number |
| Segmentation | Image tensor | Pixel label mask |
| Language modeling | Token sequence | Shifted token sequence |

In PyTorch, a dataset should answer two basic questions:

```python
len(dataset)      # how many examples?
dataset[i]        # what is example i?
```

A minimal custom dataset looks like this. (PyTorch already ships `torch.utils.data.TensorDataset` with exactly this behavior; defining our own, under a different name to avoid shadowing the built-in, shows the mechanics.)

```python
import torch
from torch.utils.data import Dataset

class PairDataset(Dataset):
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __len__(self):
        return len(self.x)

    def __getitem__(self, index):
        return self.x[index], self.y[index]
```

Usage:

```python
x = torch.randn(1000, 10)
y = torch.randint(0, 2, (1000,))

dataset = PairDataset(x, y)

sample_x, sample_y = dataset[0]
print(sample_x.shape)  # torch.Size([10])
print(sample_y)        # scalar class label
```

The dataset returns one example at a time. It does not decide the batch size, the shuffling policy, or the number of worker processes. Those belong to the `DataLoader`.

### The Role of a DataLoader

A `DataLoader` wraps a dataset and produces batches.

```python
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
)
```

Now the training loop can iterate over batches:

```python
for batch_x, batch_y in loader:
    print(batch_x.shape)
    print(batch_y.shape)
    break
```

For the previous dataset, this prints:

```python
torch.Size([32, 10])
torch.Size([32])
```

The dataset returns individual examples of shape `[10]`. The dataloader stacks 32 examples into a batch of shape `[32, 10]`.
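This stacking step can be reproduced by hand: for same-shape tensors, the default collation is essentially `torch.stack` applied to each field of the batch.

```python
import torch

# 32 individual examples, each a vector of shape [10].
examples = [torch.randn(10) for _ in range(32)]

# Default collation for same-shape tensors is essentially a stack
# along a new leading batch dimension.
batch = torch.stack(examples)
print(batch.shape)  # torch.Size([32, 10])
```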

This is the usual division of responsibility:

| Component | Responsibility |
|---|---|
| `Dataset` | Load or construct one example |
| `DataLoader` | Batch, shuffle, parallelize, and iterate |
| Training loop | Move tensors to device, run model, compute loss, update parameters |

### Map-Style Datasets

The most common PyTorch dataset is a map-style dataset. It supports random access by index.

```python
dataset[i]
```

Map-style datasets usually implement:

```python
__len__
__getitem__
```

Example:

```python
import torch
from torch.utils.data import Dataset

class NumberDataset(Dataset):
    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, index):
        x = torch.tensor([float(index)])
        y = torch.tensor(float(index * index))
        return x, y
```

This dataset maps an integer index to an example.

```python
dataset = NumberDataset(5)

for i in range(len(dataset)):
    print(dataset[i])
```

Map-style datasets work well when examples are stored in a list, table, directory, archive, or database where each item can be selected independently.
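As a sketch of the directory case, the following hypothetical `FileDataset` reads one file per index. The file layout and naming scheme here are invented for illustration; the point is that `__getitem__` opens a file lazily, only when that example is requested.

```python
import os
import tempfile

import torch
from torch.utils.data import Dataset

class FileDataset(Dataset):
    """Map-style dataset: each index selects one file on disk."""

    def __init__(self, paths):
        self.paths = paths

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        # Lazy loading: the file is opened only when this example is requested.
        with open(self.paths[index]) as f:
            value = float(f.read())
        return torch.tensor([value])

# Create a few throwaway files to index into.
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(3):
    path = os.path.join(tmpdir, f"{i}.txt")
    with open(path, "w") as f:
        f.write(str(i))
    paths.append(path)

dataset = FileDataset(paths)
print(dataset[2])  # tensor([2.])
```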

### Iterable Datasets

Some datasets do not have efficient random access. Examples may arrive from a stream, a log, a socket, a web dataset, or a very large compressed file. In these cases, PyTorch provides `IterableDataset`.

An iterable dataset defines an iterator instead of indexed access.

```python
import torch
from torch.utils.data import IterableDataset

class CountingDataset(IterableDataset):
    def __init__(self, limit):
        self.limit = limit

    def __iter__(self):
        for i in range(self.limit):
            x = torch.tensor([float(i)])
            y = torch.tensor(float(i * i))
            yield x, y
```

Usage:

```python
dataset = CountingDataset(100)

loader = DataLoader(dataset, batch_size=8)

for x, y in loader:
    print(x.shape, y.shape)
    break
```

Iterable datasets are useful for large-scale training, streaming corpora, generated data, and distributed pipelines. They require more care with shuffling and worker partitioning, because the data stream must be divided correctly across workers and devices.
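One common partitioning pattern is to shard the stream inside `__iter__` using `torch.utils.data.get_worker_info`, so that each worker yields a disjoint subset. A minimal sketch, extending the counting example above:

```python
import torch
from torch.utils.data import IterableDataset, get_worker_info

class ShardedCountingDataset(IterableDataset):
    def __init__(self, limit):
        self.limit = limit

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            # Single-process loading: this iterator sees the whole stream.
            start, step = 0, 1
        else:
            # Each worker yields an interleaved, disjoint shard of the stream.
            start, step = info.id, info.num_workers
        for i in range(start, self.limit, step):
            yield torch.tensor([float(i)])
```

Without this guard, every worker would replay the full stream and each example would appear `num_workers` times per epoch.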

### Batch Size

The batch size controls how many examples are processed in one training step.

If each input is a vector in \(\mathbb{R}^d\), then a batch has shape

$$
[B, d]
$$

where \(B\) is the batch size.

For images, if each example has shape

$$
[C, H, W],
$$

then a batch has shape

$$
[B, C, H, W].
$$

Example:

```python
images = torch.randn(128, 3, 224, 224)
labels = torch.randint(0, 1000, (128,))
```

Here the batch size is 128.

A larger batch improves hardware utilization but uses more memory. A smaller batch uses less memory but gives noisier gradient estimates and may underuse the GPU.

In practice, batch size is constrained by GPU memory, model size, input size, and optimizer state.

### Shuffling

Training data is usually shuffled so that consecutive batches do not follow the original order of the dataset.

```python
loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
)
```

Shuffling matters because stochastic gradient descent assumes that mini-batches give approximately unbiased estimates of the full training gradient. If the data is ordered by class, time, file name, or difficulty, then non-shuffled batches can produce unstable training.
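The effect is easy to observe on a small ordered dataset. With `shuffle=True` (a seeded generator is used here only so the run is repeatable), the labels arrive as a permutation rather than in index order:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Ten examples whose labels equal their original positions.
x = torch.arange(10, dtype=torch.float32).unsqueeze(1)
y = torch.arange(10)
dataset = TensorDataset(x, y)

generator = torch.Generator()
generator.manual_seed(0)

loader = DataLoader(dataset, batch_size=10, shuffle=True, generator=generator)

batch_x, batch_y = next(iter(loader))
print(batch_y)  # a permutation of 0..9
```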

For validation and testing, shuffling is usually disabled:

```python
val_loader = DataLoader(
    val_dataset,
    batch_size=64,
    shuffle=False,
)
```

Evaluation should be deterministic unless there is a specific reason to randomize it.

### Collation

A dataloader must combine individual examples into one batch. This process is called collation.

For simple tensors with the same shape, PyTorch’s default collation stacks them automatically.

If a dataset returns examples like this:

```python
x.shape == torch.Size([10])
y.shape == torch.Size([])
```

then a batch of 32 examples becomes:

```python
batch_x.shape == torch.Size([32, 10])
batch_y.shape == torch.Size([32])
```

The default collation works when all tensors in a field have the same shape.

For variable-length data, such as text sequences, custom collation is often required. Suppose each example is a token sequence of different length. These sequences cannot be stacked directly.

A common solution is padding:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def collate_tokens(batch):
    sequences, labels = zip(*batch)

    sequences = [torch.tensor(seq, dtype=torch.long) for seq in sequences]
    labels = torch.tensor(labels, dtype=torch.long)

    padded = pad_sequence(
        sequences,
        batch_first=True,
        padding_value=0,
    )

    return padded, labels
```

Use it in a dataloader:

```python
loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    collate_fn=collate_tokens,
)
```

A custom `collate_fn` is common in NLP, detection, segmentation, graph learning, and multimodal models.
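To see what padding produces, `pad_sequence` can be applied to a toy batch directly:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

sequences = [torch.tensor([1, 2, 3]), torch.tensor([4, 5])]

# Pads every sequence to the length of the longest one in the batch.
padded = pad_sequence(sequences, batch_first=True, padding_value=0)
print(padded)
# tensor([[1, 2, 3],
#         [4, 5, 0]])
```

Downstream code must then distinguish real tokens from padding, typically with an attention or length mask.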

### A Complete Small Example

The following example creates a synthetic binary classification dataset and trains a small model from a dataloader.

```python
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader

class BinaryDataset(Dataset):
    def __init__(self, n=1000):
        self.x = torch.randn(n, 2)

        # A simple linear rule with noise.
        score = self.x[:, 0] - 0.5 * self.x[:, 1]
        self.y = (score > 0).long()

    def __len__(self):
        return len(self.x)

    def __getitem__(self, index):
        return self.x[index], self.y[index]

dataset = BinaryDataset(n=1000)

loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
)

model = nn.Sequential(
    nn.Linear(2, 16),
    nn.ReLU(),
    nn.Linear(16, 2),
)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(5):
    total_loss = 0.0

    for x, y in loader:
        logits = model(x)
        loss = loss_fn(logits, y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item() * x.size(0)

    avg_loss = total_loss / len(dataset)
    print(f"epoch={epoch} loss={avg_loss:.4f}")
```

This example contains the standard training structure:

1. The dataset returns one example.
2. The dataloader builds a batch.
3. The model computes logits.
4. The loss compares logits with labels.
5. Backpropagation computes gradients.
6. The optimizer updates parameters.

### Moving Batches to a Device

The dataloader usually returns CPU tensors. If the model is on a GPU, the batch must be moved to the same device.

```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = model.to(device)

for x, y in loader:
    x = x.to(device)
    y = y.to(device)

    logits = model(x)
    loss = loss_fn(logits, y)
```

Model parameters and input tensors must be on the same device. Otherwise PyTorch raises a device mismatch error.

For nested batches, it is useful to write a helper function:

```python
def move_to_device(batch, device):
    if torch.is_tensor(batch):
        return batch.to(device)

    if isinstance(batch, dict):
        return {k: move_to_device(v, device) for k, v in batch.items()}

    if isinstance(batch, list):
        return [move_to_device(v, device) for v in batch]

    if isinstance(batch, tuple):
        return tuple(move_to_device(v, device) for v in batch)

    return batch
```

Then:

```python
for batch in loader:
    batch = move_to_device(batch, device)
```

This pattern is useful when a batch contains tensors, masks, labels, strings, IDs, and metadata.

### Multiple Workers

Data loading can become a bottleneck. If the GPU waits for the CPU to load and preprocess data, training speed suffers.

PyTorch can use multiple worker processes:

```python
loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,
)
```

The `num_workers` argument controls how many subprocesses load data in parallel.

Common settings:

| Setting | Meaning |
|---|---|
| `num_workers=0` | Load data in the main process |
| `num_workers=2` | Use two worker processes |
| `num_workers=4` or more | Useful for image decoding and heavy preprocessing |

The best value depends on CPU cores, disk speed, preprocessing cost, and batch size. Too many workers can increase memory use and process overhead.

When debugging a dataset, start with:

```python
num_workers=0
```

This gives clearer error messages. After the dataset works, increase the number of workers.

### Pinned Memory

When training on a CUDA GPU, setting `pin_memory=True` can speed up CPU-to-GPU transfers.

```python
loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,
    pin_memory=True,
)
```

Pinned memory means page-locked CPU memory. CUDA can copy from pinned memory to GPU memory more efficiently.

When using pinned memory, the batch transfer can also be non-blocking:

```python
x = x.to(device, non_blocking=True)
y = y.to(device, non_blocking=True)
```

This is mainly useful when the device is CUDA. It usually has little benefit for CPU-only training.

### Dropping the Last Batch

If the dataset size is not divisible by the batch size, the last batch will be smaller.

For example, 1000 examples with batch size 128 gives seven full batches and one final batch of 104 examples.

By default, PyTorch keeps the final smaller batch. Sometimes this is undesirable, especially with batch normalization or distributed training. Use `drop_last=True` to discard it:

```python
loader = DataLoader(
    dataset,
    batch_size=128,
    shuffle=True,
    drop_last=True,
)
```

This ensures every batch has the same size.

For validation and testing, `drop_last=False` is usually preferred, so every example is evaluated.
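The batch counts can be checked directly, since `len(loader)` reports the number of batches for a map-style dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 10))

kept = DataLoader(dataset, batch_size=128, drop_last=False)
dropped = DataLoader(dataset, batch_size=128, drop_last=True)

print(len(kept))     # 8: seven full batches plus a final batch of 104
print(len(dropped))  # 7: the final partial batch is discarded
```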

### Reproducibility

Shuffling and data augmentation introduce randomness. For reproducible experiments, set random seeds.

```python
import random
import numpy as np
import torch

seed = 123

random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
```

For dataloader workers, define a worker initialization function:

```python
def seed_worker(worker_id):
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)
```

Use a generator for deterministic shuffling:

```python
g = torch.Generator()
g.manual_seed(123)

loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    worker_init_fn=seed_worker,
    generator=g,
)
```

Exact reproducibility may still depend on hardware, CUDA kernels, library versions, and nondeterministic operations. Still, controlling dataloader randomness is a necessary first step.

### Train, Validation, and Test Datasets

A complete project usually uses separate datasets for training, validation, and testing.

```python
train_loader = DataLoader(
    train_dataset,
    batch_size=64,
    shuffle=True,
)

val_loader = DataLoader(
    val_dataset,
    batch_size=128,
    shuffle=False,
)

test_loader = DataLoader(
    test_dataset,
    batch_size=128,
    shuffle=False,
)
```

The training set is used to update model parameters. The validation set is used to select hyperparameters and monitor generalization. The test set is used for final evaluation.

The test set should not influence model design, early stopping, checkpoint selection, or hyperparameter search.
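When the three splits come from a single dataset, `torch.utils.data.random_split` is a common way to create them; the 80/10/10 split below is just an example, and the seeded generator keeps the split reproducible across runs:

```python
import torch
from torch.utils.data import TensorDataset, random_split

full = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))

generator = torch.Generator()
generator.manual_seed(0)

train_dataset, val_dataset, test_dataset = random_split(
    full, [800, 100, 100], generator=generator
)
print(len(train_dataset), len(val_dataset), len(test_dataset))  # 800 100 100
```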

### Common Dataset Mistakes

Many training failures come from data pipeline errors rather than model architecture.

Common mistakes include:

| Mistake | Consequence |
|---|---|
| Labels have wrong dtype | Loss function error |
| Images have wrong shape order | Model receives invalid input |
| Inputs remain on CPU while model is on GPU | Device mismatch error |
| Validation data is shuffled with random augmentation | Noisy evaluation |
| Class indices start at 1 instead of 0 | Cross-entropy error or wrong learning |
| Variable-length examples use default collation | Stack size error |
| Dataset returns NumPy arrays with unexpected dtype | Silent type mismatch |
| Test data influences preprocessing statistics | Data leakage |

For classification with `nn.CrossEntropyLoss`, labels should be integer class indices:

```python
labels.dtype == torch.long
labels.shape == torch.Size([B])
```

The model output should be raw logits:

```python
logits.shape == torch.Size([B, num_classes])
```

Do not apply softmax before `nn.CrossEntropyLoss`; the loss function applies the required log-softmax internally.
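A quick sanity check of this contract, with a hypothetical batch size and class count:

```python
import torch
from torch import nn

B, num_classes = 4, 3

logits = torch.randn(B, num_classes)          # raw scores, no softmax
labels = torch.randint(0, num_classes, (B,))  # integer class indices, dtype long

loss = nn.CrossEntropyLoss()(logits, labels)
print(loss.item())  # a single non-negative scalar
```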

### Practical DataLoader Template

A useful default template for many supervised learning projects is:

```python
train_loader = DataLoader(
    train_dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,
    pin_memory=True,
    drop_last=True,
)

val_loader = DataLoader(
    val_dataset,
    batch_size=128,
    shuffle=False,
    num_workers=4,
    pin_memory=True,
    drop_last=False,
)
```

For debugging:

```python
debug_loader = DataLoader(
    train_dataset,
    batch_size=4,
    shuffle=False,
    num_workers=0,
)
```

Inspect the first batch before training:

```python
batch = next(iter(debug_loader))

x, y = batch

print(x.shape, x.dtype)
print(y.shape, y.dtype)
print(x.min(), x.max())
print(y[:10])
```

This simple inspection often catches errors early.

### Summary

A `Dataset` describes how to retrieve one example. A `DataLoader` describes how to produce batches from a dataset.

The dataset owns example-level logic: reading files, parsing records, applying transforms, and returning tensors. The dataloader owns iteration-level logic: batching, shuffling, multiprocessing, collation, pinned memory, and dropping incomplete batches.

A correct data pipeline gives the training loop batches with predictable shapes, correct data types, and consistent semantics. Before improving a model, verify the data. In deep learning, many apparent model problems are tensor, label, batching, or preprocessing problems.

