# Classification Pipelines

Image classification assigns one label, or a small set of labels, to an image. A model receives an image tensor as input and produces class scores as output. The class with the largest score is usually taken as the prediction.

A classification pipeline is the complete path from raw image files to trained model predictions. It includes data storage, preprocessing, batching, model definition, loss computation, optimization, validation, checkpointing, and inference.

In PyTorch, a good pipeline separates these concerns clearly. The dataset reads examples. The transform prepares tensors. The data loader builds batches. The model computes predictions. The loss function measures error. The optimizer updates parameters. The training loop coordinates the process.

### The Classification Problem

Let an image be represented by a tensor

$$
x \in \mathbb{R}^{C \times H \times W},
$$

where $C$ is the number of channels, $H$ is height, and $W$ is width.

For RGB images,

$$
C = 3.
$$

A classification dataset contains pairs

$$
(x_i, y_i),
$$

where $x_i$ is an image and $y_i$ is its class label. If there are $K$ classes, then

$$
y_i \in \{0,1,\ldots,K-1\}.
$$

A neural network classifier defines a function

$$
f_\theta : \mathbb{R}^{C \times H \times W} \to \mathbb{R}^{K},
$$

where $\theta$ denotes the learned parameters.

The output

$$
z = f_\theta(x)
$$

is a vector of logits. A logit is an unnormalized class score. Larger logits correspond to more likely classes, but logits are not probabilities.
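The relationship between logits and probabilities can be checked directly with `torch.softmax`. The logit values below are hypothetical, chosen only for illustration:

```python
import torch

# Hypothetical logits for a single image over K = 3 classes.
z = torch.tensor([2.0, -1.0, 0.5])

# Softmax maps logits to probabilities that sum to 1.
p = torch.softmax(z, dim=0)

print(p.sum().item())     # 1.0 (up to floating-point error)
print(p.argmax().item())  # 0, the class with the largest logit
```

Softmax is monotone, so the largest logit and the largest probability always pick out the same class.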

For a batch of $B$ images, the input has shape

$$
X \in \mathbb{R}^{B \times C \times H \times W},
$$

and the output logits have shape

$$
Z \in \mathbb{R}^{B \times K}.
$$

In PyTorch, this means:

```python
images.shape  # [B, C, H, W]
logits.shape  # [B, K]
labels.shape  # [B]
```

When using `torch.nn.CrossEntropyLoss`, labels are stored as integer class indices, not as one-hot vectors.

### Components of a Pipeline

A typical image classification pipeline has these stages:

| Stage | Responsibility |
|---|---|
| Dataset | Locate images and labels |
| Transform | Convert images into normalized tensors |
| DataLoader | Build shuffled mini-batches |
| Model | Map images to class logits |
| Loss | Compare logits with labels |
| Optimizer | Update model parameters |
| Scheduler | Adjust learning rate during training |
| Evaluator | Measure validation performance |
| Checkpoint | Save model state |
| Inference code | Run predictions on new images |

The pipeline should make each stage explicit. This reduces hidden coupling and makes experiments easier to repeat.

A minimal PyTorch training pipeline has this form:

```python
for images, labels in train_loader:
    images = images.to(device)
    labels = labels.to(device)

    logits = model(images)
    loss = loss_fn(logits, labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

This loop is small, but it contains the essential training mechanism. The model computes logits. The loss computes a scalar error. Backpropagation computes gradients. The optimizer updates parameters.

### Dataset Layout

A common image dataset layout stores one directory per class:

```text
dataset/
  train/
    cat/
      img001.jpg
      img002.jpg
    dog/
      img003.jpg
      img004.jpg
  val/
    cat/
      img101.jpg
    dog/
      img102.jpg
```

PyTorch can read this layout with `torchvision.datasets.ImageFolder`.

```python
from torchvision.datasets import ImageFolder

train_set = ImageFolder(
    root="dataset/train",
    transform=train_transform,
)

val_set = ImageFolder(
    root="dataset/val",
    transform=val_transform,
)
```

`ImageFolder` assigns an integer index to each class based on directory names. The mapping is stored in:

```python
train_set.class_to_idx
```

For example:

```python
{
    "cat": 0,
    "dog": 1,
}
```

This mapping matters. The model outputs logits in this class-index order. If the mapping changes between training and inference, predictions will be interpreted incorrectly.
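One way to keep the mapping stable is to persist it next to the checkpoint. The following sketch uses a hypothetical `class_to_idx.json` file and the example mapping above:

```python
import json

# The mapping normally comes from train_set.class_to_idx.
class_to_idx = {"cat": 0, "dog": 1}

# Persist the mapping alongside the checkpoint at training time.
with open("class_to_idx.json", "w") as f:
    json.dump(class_to_idx, f)

# At inference time, reload it and invert it to map predicted
# indices back to class names.
with open("class_to_idx.json") as f:
    class_to_idx = json.load(f)

idx_to_class = {idx: name for name, idx in class_to_idx.items()}
print(idx_to_class[0])  # "cat"
```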

### Image Transforms

Raw images have different sizes, color encodings, and numeric ranges. Neural networks require tensors with consistent shape and scale.

A standard validation transform resizes the image, crops it, converts it to a tensor, and normalizes the channels:

```python
from torchvision import transforms

val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])
```

`ToTensor()` converts an image from integer pixel values in $[0,255]$ to floating-point values in $[0,1]$. `Normalize` then applies channel-wise normalization:

$$
x'_{c,h,w} = \frac{x_{c,h,w} - \mu_c}{\sigma_c}.
$$

Here $\mu_c$ and $\sigma_c$ are the mean and standard deviation for channel $c$.
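The formula can be verified by hand on a single pixel. The sketch below applies the channel-wise normalization directly, using the ImageNet statistics from the transform above and a made-up pixel value:

```python
import torch

# One hypothetical pixel value per channel, already scaled to [0, 1]
# as ToTensor() would produce.
x = torch.tensor([0.5, 0.5, 0.5]).view(3, 1, 1)

mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

# Channel-wise (x - mu_c) / sigma_c, broadcast over H and W.
x_norm = (x - mean) / std

print(x_norm.flatten())
```

This is exactly what `transforms.Normalize` computes for each pixel of each channel.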

Training transforms usually include random augmentation:

```python
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])
```

The validation transform should be deterministic. The training transform may be random. This distinction is important because validation should measure the model, not randomness in preprocessing.

### DataLoaders

A `DataLoader` turns a dataset into mini-batches.

```python
from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_set,
    batch_size=64,
    shuffle=True,
    num_workers=4,
    pin_memory=True,
)

val_loader = DataLoader(
    val_set,
    batch_size=64,
    shuffle=False,
    num_workers=4,
    pin_memory=True,
)
```

For training, `shuffle=True` prevents the model from seeing examples in a fixed order. For validation, `shuffle=False` makes evaluation deterministic.

The data loader returns batches:

```python
images, labels = next(iter(train_loader))

print(images.shape)  # [64, 3, 224, 224]
print(labels.shape)  # [64]
```

The first dimension is the batch size. If the dataset size is not divisible by the batch size, the last batch may be smaller.
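If a uniform batch size is required, for example when batch statistics are sensitive to batch size, `DataLoader` accepts a `drop_last` flag that discards the final partial batch. A small sketch with a hypothetical 10-example dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# A hypothetical dataset of 10 examples, batched in groups of 4.
data = TensorDataset(torch.arange(10).float().unsqueeze(1), torch.arange(10))

loader = DataLoader(data, batch_size=4, shuffle=False)
print([labels.size(0) for _, labels in loader])  # [4, 4, 2]

loader = DataLoader(data, batch_size=4, shuffle=False, drop_last=True)
batch_sizes = [labels.size(0) for _, labels in loader]
print(batch_sizes)  # [4, 4]
```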

### The Classifier Model

A classifier maps an image batch to class logits.

For example, a small convolutional classifier may be written as:

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()

        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),

            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),

            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1)),
        )

        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.features(x)      # [B, 128, 1, 1]
        x = x.flatten(1)          # [B, 128]
        x = self.classifier(x)    # [B, num_classes]
        return x
```

The final layer produces logits. It should not apply softmax during training when using `CrossEntropyLoss`, because that loss function internally combines `log_softmax` and negative log likelihood.

```python
model = SmallCNN(num_classes=10)
logits = model(images)

print(logits.shape)  # [B, 10]
```

### Cross-Entropy Loss

For single-label classification, the standard loss is cross-entropy.

Given logits

$$
z \in \mathbb{R}^{K},
$$

the softmax probability for class $k$ is

$$
p_k = \frac{\exp(z_k)}{\sum_{j=1}^{K}\exp(z_j)}.
$$

If the true class is $y$, the cross-entropy loss is

$$
L = -\log p_y.
$$

For a batch, the loss is usually averaged across examples.

In PyTorch:

```python
loss_fn = nn.CrossEntropyLoss()

logits = model(images)      # [B, K]
loss = loss_fn(logits, labels)
```

The required shapes are:

```python
logits.shape  # [B, K]
labels.shape  # [B]
```

The labels must contain class indices:

```python
labels.dtype  # torch.int64
```

A common mistake is to pass one-hot labels into `CrossEntropyLoss`. Recent PyTorch versions also accept class-probability targets, but for the standard single-label case the loss expects integer class indices.
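The definition $L = -\log p_y$, averaged over the batch, can be checked against the built-in loss. The logits and labels below are hypothetical:

```python
import torch
import torch.nn as nn

# Hypothetical logits for a batch of B = 2 images over K = 3 classes.
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.0, 1.0, 0.0]])
labels = torch.tensor([0, 2])

# Manual computation: L = -log p_y for each example, then the batch mean.
log_probs = torch.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(2), labels].mean()

builtin = nn.CrossEntropyLoss()(logits, labels)

print(torch.allclose(manual, builtin))  # True
```

This also makes concrete why the model should output raw logits: `CrossEntropyLoss` performs the `log_softmax` step itself.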

### Accuracy

Accuracy is the fraction of examples whose predicted class equals the true class.

The predicted class is

$$
\hat{y} = \arg\max_k z_k.
$$

In PyTorch:

```python
preds = logits.argmax(dim=1)
correct = (preds == labels).sum().item()
total = labels.numel()
accuracy = correct / total
```

For a full validation loop:

```python
def evaluate(model, loader, device):
    model.eval()

    total_loss = 0.0
    total_correct = 0
    total_count = 0

    loss_fn = nn.CrossEntropyLoss()

    with torch.no_grad():
        for images, labels in loader:
            images = images.to(device)
            labels = labels.to(device)

            logits = model(images)
            loss = loss_fn(logits, labels)

            preds = logits.argmax(dim=1)

            total_loss += loss.item() * labels.size(0)
            total_correct += (preds == labels).sum().item()
            total_count += labels.size(0)

    avg_loss = total_loss / total_count
    avg_acc = total_correct / total_count

    return avg_loss, avg_acc
```

`model.eval()` changes the behavior of layers such as dropout and batch normalization. `torch.no_grad()` disables gradient tracking and reduces memory use.

### Training Loop

A complete training loop includes training, validation, and checkpointing.

```python
import torch

def train_classifier(
    model,
    train_loader,
    val_loader,
    optimizer,
    loss_fn,
    device,
    epochs,
    checkpoint_path,
):
    best_val_acc = 0.0

    model.to(device)

    for epoch in range(epochs):
        model.train()

        total_loss = 0.0
        total_correct = 0
        total_count = 0

        for images, labels in train_loader:
            images = images.to(device)
            labels = labels.to(device)

            logits = model(images)
            loss = loss_fn(logits, labels)

            optimizer.zero_grad(set_to_none=True)
            loss.backward()
            optimizer.step()

            preds = logits.argmax(dim=1)

            total_loss += loss.item() * labels.size(0)
            total_correct += (preds == labels).sum().item()
            total_count += labels.size(0)

        train_loss = total_loss / total_count
        train_acc = total_correct / total_count

        val_loss, val_acc = evaluate(model, val_loader, device)

        print(
            f"epoch={epoch + 1} "
            f"train_loss={train_loss:.4f} "
            f"train_acc={train_acc:.4f} "
            f"val_loss={val_loss:.4f} "
            f"val_acc={val_acc:.4f}"
        )

        if val_acc > best_val_acc:
            best_val_acc = val_acc
            torch.save(
                {
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "epoch": epoch,
                    "val_acc": val_acc,
                },
                checkpoint_path,
            )
```

The validation accuracy is used to choose the best checkpoint. The training accuracy alone is insufficient, because a model may memorize the training set while performing poorly on unseen images.

### Optimizer and Scheduler

A basic optimizer is stochastic gradient descent with momentum:

```python
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,
    momentum=0.9,
    weight_decay=1e-4,
)
```

A common alternative is AdamW:

```python
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    weight_decay=1e-4,
)
```

A scheduler changes the learning rate during training:

```python
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=epochs,
)
```

The training loop then calls:

```python
scheduler.step()
```

usually once per epoch, though the correct call frequency depends on the scheduler.

The optimizer controls how parameters move. The scheduler controls how large those updates are over time by adjusting the learning rate. Large learning rates often help early exploration. Smaller learning rates often help late-stage convergence.
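The pieces fit together as sketched below. The tiny linear model, epoch count, and the placeholder comment stand in for the real training loop:

```python
import torch
import torch.nn as nn

# Placeholder model and hyperparameters for illustration only.
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
epochs = 5
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    # ... the per-batch training loop from above would go here ...
    optimizer.step()   # placeholder: normally called once per batch
    scheduler.step()   # once per epoch for CosineAnnealingLR
    print(epoch, scheduler.get_last_lr())
```

With `T_max=epochs`, the learning rate follows a cosine curve from its initial value down to zero by the final epoch.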

### Inference

Inference runs a trained model on new images.

```python
from PIL import Image

def predict_image(model, image_path, transform, class_names, device):
    model.eval()

    image = Image.open(image_path).convert("RGB")
    x = transform(image)
    x = x.unsqueeze(0).to(device)  # [1, C, H, W]

    with torch.no_grad():
        logits = model(x)
        probs = torch.softmax(logits, dim=1)
        confidence, pred = probs.max(dim=1)

    return {
        "class": class_names[pred.item()],
        "confidence": confidence.item(),
    }
```

The call to `unsqueeze(0)` adds the batch dimension. A single image has shape `[C, H, W]`; the model expects `[B, C, H, W]`.

During inference, the transform should match validation preprocessing. Random training augmentations should not be used for ordinary prediction.

### Common Failure Modes

Classification pipelines often fail for mundane reasons.

| Problem | Typical cause |
|---|---|
| Training loss does not decrease | Wrong labels, learning rate too high, frozen parameters |
| Training accuracy high, validation accuracy low | Overfitting, data leakage, weak augmentation |
| Validation accuracy unstable | Small validation set, random validation transforms |
| Runtime shape error | Missing batch axis, wrong image layout |
| Poor transfer learning result | Wrong normalization, bad learning rate |
| Predictions mapped to wrong names | Class index mapping changed |
| GPU underused | Slow data loading, small batch size |
| Loss is NaN | Learning rate too high, bad input values, unstable model |

The most useful debugging habit is to inspect one batch before training:

```python
images, labels = next(iter(train_loader))

print(images.shape)
print(images.dtype)
print(images.min().item(), images.max().item())
print(labels.shape)
print(labels[:10])
```

This confirms that the data has the expected shape, type, scale, and labels.

### Minimal End-to-End Example

The following example puts the pieces together.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

device = "cuda" if torch.cuda.is_available() else "cpu"

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])

val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])

train_set = datasets.ImageFolder("dataset/train", transform=train_transform)
val_set = datasets.ImageFolder("dataset/val", transform=val_transform)

train_loader = DataLoader(
    train_set,
    batch_size=64,
    shuffle=True,
    num_workers=4,
    pin_memory=True,
)

val_loader = DataLoader(
    val_set,
    batch_size=64,
    shuffle=False,
    num_workers=4,
    pin_memory=True,
)

num_classes = len(train_set.classes)

model = models.resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, num_classes)
model = model.to(device)

loss_fn = nn.CrossEntropyLoss()

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    weight_decay=1e-4,
)

train_classifier(
    model=model,
    train_loader=train_loader,
    val_loader=val_loader,
    optimizer=optimizer,
    loss_fn=loss_fn,
    device=device,
    epochs=20,
    checkpoint_path="classifier.pt",
)
```

This is a complete supervised classification pipeline. It loads images, prepares batches, defines a model, trains with cross-entropy, evaluates on validation data, and saves the best checkpoint.

### Design Principles

A robust classification pipeline follows a few rules.

Keep training transforms and validation transforms separate. Use randomness during training, but keep validation deterministic.

Store and reuse the class-index mapping. A trained model only outputs integer class indices. The mapping gives those indices semantic meaning.

Inspect tensor shapes at every boundary. The most common expected image batch shape in PyTorch is `[B, C, H, W]`.

Use logits for training. Apply softmax only when probabilities are needed for reporting or inference.

Evaluate on held-out data. Training metrics measure optimization progress. Validation metrics measure generalization.

Save checkpoints with model state, optimizer state, epoch number, and validation metric. A checkpoint should allow training to resume and allow the best model to be recovered.
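A minimal round-trip sketch of the checkpoint format used above, with a placeholder linear model and made-up epoch and accuracy values; in the real pipeline, the saved file comes from `train_classifier`:

```python
import torch
import torch.nn as nn

# Placeholder model and optimizer; the real ones must match training.
model = nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Save: the same dictionary layout as in train_classifier.
torch.save(
    {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "epoch": 7,
        "val_acc": 0.91,
    },
    "classifier.pt",
)

# Resume: rebuild the same model and optimizer, then restore state.
checkpoint = torch.load("classifier.pt", map_location="cpu")
model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])
start_epoch = checkpoint["epoch"] + 1  # continue from the next epoch
```

Restoring the optimizer state matters for optimizers such as AdamW, whose per-parameter statistics would otherwise restart from zero.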

A classification pipeline is a controlled experiment. The code should make the data, transforms, model, objective, optimizer, and evaluation protocol visible. This is what makes the result interpretable and reproducible.

