Classification Pipelines

Image classification assigns one label, or a small set of labels, to an image. A model receives an image tensor as input and produces class scores as output. The class with the largest score is usually taken as the prediction.

A classification pipeline is the complete path from raw image files to trained model predictions. It includes data storage, preprocessing, batching, model definition, loss computation, optimization, validation, checkpointing, and inference.

In PyTorch, a good pipeline separates these concerns clearly. The dataset reads examples. The transform prepares tensors. The data loader builds batches. The model computes predictions. The loss function measures error. The optimizer updates parameters. The training loop coordinates the process.

The Classification Problem

Let an image be represented by a tensor

x \in \mathbb{R}^{C \times H \times W},

where C is the number of channels, H is height, and W is width.

For RGB images,

C = 3.

A classification dataset contains pairs

(x_i, y_i),

where x_i is an image and y_i is its class label. If there are K classes, then

y_i \in \{0, 1, \ldots, K-1\}.

A neural network classifier defines a function

f_\theta : \mathbb{R}^{C \times H \times W} \to \mathbb{R}^{K},

where θ denotes the learned parameters.

The output

z = f_\theta(x)

is a vector of logits. A logit is an unnormalized class score. Larger logits correspond to more likely classes, but logits are not probabilities.
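
As a quick illustration (the values here are arbitrary), applying softmax turns a logit vector into a probability vector:

import torch

logits = torch.tensor([2.0, -1.0, 0.5])
probs = torch.softmax(logits, dim=0)

print(probs)        # the largest logit gets the largest probability
print(probs.sum())  # tensor(1.)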

For a batch of B images, the input has shape

X \in \mathbb{R}^{B \times C \times H \times W},

and the output logits have shape

Z \in \mathbb{R}^{B \times K}.

In PyTorch, this means:

images.shape  # [B, C, H, W]
logits.shape  # [B, K]
labels.shape  # [B]

The labels are stored as integer class indices, not as one-hot vectors, when using torch.nn.CrossEntropyLoss.

Components of a Pipeline

A typical image classification pipeline has these stages:

Stage           Responsibility
Dataset         Locate images and labels
Transform       Convert images into normalized tensors
DataLoader      Build shuffled mini-batches
Model           Map images to class logits
Loss            Compare logits with labels
Optimizer       Update model parameters
Scheduler       Adjust learning rate during training
Evaluator       Measure validation performance
Checkpoint      Save model state
Inference code  Run predictions on new images

The pipeline should make each stage explicit. This reduces hidden coupling and makes experiments easier to repeat.

A minimal PyTorch training pipeline has this form:

for images, labels in train_loader:
    images = images.to(device)
    labels = labels.to(device)

    logits = model(images)
    loss = loss_fn(logits, labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

This loop is small, but it contains the essential training mechanism. The model computes logits. The loss computes a scalar error. Backpropagation computes gradients. The optimizer updates parameters.

Dataset Layout

A common image dataset layout stores one directory per class:

dataset/
  train/
    cat/
      img001.jpg
      img002.jpg
    dog/
      img003.jpg
      img004.jpg
  val/
    cat/
      img101.jpg
    dog/
      img102.jpg

PyTorch can read this layout with torchvision.datasets.ImageFolder.

from torchvision.datasets import ImageFolder

train_set = ImageFolder(
    root="dataset/train",
    transform=train_transform,
)

val_set = ImageFolder(
    root="dataset/val",
    transform=val_transform,
)

ImageFolder assigns an integer index to each class based on directory names. The mapping is stored in:

train_set.class_to_idx

For example:

{
    "cat": 0,
    "dog": 1,
}

This mapping matters. The model outputs logits in this class-index order. If the mapping changes between training and inference, predictions will be interpreted incorrectly.
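
One way to guard against this is to persist the mapping next to the checkpoint. The sketch below assumes a JSON sidecar file named class_to_idx.json; the filename is arbitrary:

import json

# Save the mapping at training time.
with open("class_to_idx.json", "w") as f:
    json.dump(train_set.class_to_idx, f)

# Load and invert it at inference time to map indices back to names.
with open("class_to_idx.json") as f:
    class_to_idx = json.load(f)

idx_to_class = {idx: name for name, idx in class_to_idx.items()}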

Image Transforms

Raw images have different sizes, color encodings, and numeric ranges. Neural networks require tensors with consistent shape and scale.

A standard validation transform resizes the image, crops it, converts it to a tensor, and normalizes the channels:

from torchvision import transforms

val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])

ToTensor() converts an image from integer pixel values in [0, 255] to floating-point values in [0, 1]. Normalize then applies channel-wise normalization:

x'_{c,h,w} = \frac{x_{c,h,w} - \mu_c}{\sigma_c}.

Here μ_c and σ_c are the mean and standard deviation for channel c.
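
The broadcasting below sketches the same computation by hand, assuming the ImageNet statistics used above; it is illustrative, not a replacement for Normalize:

import torch

mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

x = torch.rand(3, 224, 224)   # values in [0, 1], as after ToTensor()
x_norm = (x - mean) / std     # broadcast over H and W, per channel

print(x_norm.shape)           # torch.Size([3, 224, 224])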

Training transforms usually include random augmentation:

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])

The validation transform should be deterministic. The training transform may be random. This distinction is important because validation should measure the model, not randomness in preprocessing.

DataLoaders

A DataLoader turns a dataset into mini-batches.

from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_set,
    batch_size=64,
    shuffle=True,
    num_workers=4,
    pin_memory=True,
)

val_loader = DataLoader(
    val_set,
    batch_size=64,
    shuffle=False,
    num_workers=4,
    pin_memory=True,
)

For training, shuffle=True prevents the model from seeing examples in a fixed order. For validation, shuffle=False makes evaluation deterministic.

The data loader returns batches:

images, labels = next(iter(train_loader))

print(images.shape)  # [64, 3, 224, 224]
print(labels.shape)  # [64]

The first dimension is the batch size. If the dataset size is not divisible by the batch size, the last batch may be smaller.
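
If every batch must have exactly batch_size examples, for instance when batch statistics are sensitive to batch size, the DataLoader can discard the final partial batch:

from torch.utils.data import DataLoader

# drop_last=True discards the final partial batch during training.
train_loader = DataLoader(
    train_set,
    batch_size=64,
    shuffle=True,
    drop_last=True,
)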

The Classifier Model

A classifier maps an image batch to class logits.

For example, a small convolutional classifier may be written as:

import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()

        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),

            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),

            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1)),
        )

        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.features(x)      # [B, 128, 1, 1]
        x = x.flatten(1)          # [B, 128]
        x = self.classifier(x)    # [B, num_classes]
        return x

The final layer produces logits. It should not apply softmax during training when using CrossEntropyLoss, because that loss function internally combines log_softmax and negative log likelihood.

model = SmallCNN(num_classes=10)
logits = model(images)

print(logits.shape)  # [B, 10]
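
The claim above can be checked directly. The following sketch, with arbitrary random inputs, shows that cross-entropy on raw logits matches negative log likelihood on log-softmax outputs:

import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))

a = F.cross_entropy(logits, labels)
b = F.nll_loss(F.log_softmax(logits, dim=1), labels)

print(torch.allclose(a, b))  # True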

Cross-Entropy Loss

For single-label classification, the standard loss is cross-entropy.

Given logits

z \in \mathbb{R}^{K},

the softmax probability for class k is

p_k = \frac{\exp(z_k)}{\sum_{j=1}^{K}\exp(z_j)}.

If the true class is y, the cross-entropy loss is

L = -\log p_y.

For a batch, the loss is usually averaged across examples.

In PyTorch:

loss_fn = nn.CrossEntropyLoss()

logits = model(images)      # [B, K]
loss = loss_fn(logits, labels)

The required shapes are:

logits.shape  # [B, K]
labels.shape  # [B]

The labels must contain class indices:

labels.dtype  # torch.int64

A common mistake is to pass one-hot labels into CrossEntropyLoss. For standard use, integer labels are expected.
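
If labels arrive one-hot encoded, argmax along the class dimension recovers the integer indices; a small illustrative example:

import torch

one_hot = torch.tensor([[0, 1, 0],
                        [1, 0, 0]])
labels = one_hot.argmax(dim=1)

print(labels)        # tensor([1, 0])
print(labels.dtype)  # torch.int64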

Accuracy

Accuracy is the fraction of examples whose predicted class equals the true class.

The predicted class is

\hat{y} = \arg\max_k z_k.

In PyTorch:

preds = logits.argmax(dim=1)
correct = (preds == labels).sum().item()
total = labels.numel()
accuracy = correct / total

For a full validation loop:

def evaluate(model, loader, device):
    model.eval()

    total_loss = 0.0
    total_correct = 0
    total_count = 0

    loss_fn = nn.CrossEntropyLoss()

    with torch.no_grad():
        for images, labels in loader:
            images = images.to(device)
            labels = labels.to(device)

            logits = model(images)
            loss = loss_fn(logits, labels)

            preds = logits.argmax(dim=1)

            total_loss += loss.item() * labels.size(0)
            total_correct += (preds == labels).sum().item()
            total_count += labels.size(0)

    avg_loss = total_loss / total_count
    avg_acc = total_correct / total_count

    return avg_loss, avg_acc

model.eval() changes the behavior of layers such as dropout and batch normalization. torch.no_grad() disables gradient tracking and reduces memory use.
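
A small demonstration of the mode switch, using a standalone dropout layer:

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()
print(drop(x))  # roughly half the entries zeroed, survivors scaled to 2.0

drop.eval()
print(drop(x))  # identity: all entries remain 1.0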

Training Loop

A complete training loop includes training, validation, and checkpointing.

import torch

def train_classifier(
    model,
    train_loader,
    val_loader,
    optimizer,
    loss_fn,
    device,
    epochs,
    checkpoint_path,
):
    best_val_acc = 0.0

    model.to(device)

    for epoch in range(epochs):
        model.train()

        total_loss = 0.0
        total_correct = 0
        total_count = 0

        for images, labels in train_loader:
            images = images.to(device)
            labels = labels.to(device)

            logits = model(images)
            loss = loss_fn(logits, labels)

            optimizer.zero_grad(set_to_none=True)
            loss.backward()
            optimizer.step()

            preds = logits.argmax(dim=1)

            total_loss += loss.item() * labels.size(0)
            total_correct += (preds == labels).sum().item()
            total_count += labels.size(0)

        train_loss = total_loss / total_count
        train_acc = total_correct / total_count

        val_loss, val_acc = evaluate(model, val_loader, device)

        print(
            f"epoch={epoch + 1} "
            f"train_loss={train_loss:.4f} "
            f"train_acc={train_acc:.4f} "
            f"val_loss={val_loss:.4f} "
            f"val_acc={val_acc:.4f}"
        )

        if val_acc > best_val_acc:
            best_val_acc = val_acc
            torch.save(
                {
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "epoch": epoch,
                    "val_acc": val_acc,
                },
                checkpoint_path,
            )

The validation accuracy is used to choose the best checkpoint. The training accuracy alone is insufficient, because a model may memorize the training set while performing poorly on unseen images.

Optimizer and Scheduler

A basic optimizer is stochastic gradient descent with momentum:

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,
    momentum=0.9,
    weight_decay=1e-4,
)

A common alternative is AdamW:

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    weight_decay=1e-4,
)

A scheduler changes the learning rate during training:

scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=epochs,
)

The training loop then calls:

scheduler.step()

usually once per epoch, depending on the scheduler.

The optimizer controls how parameters move. The scheduler controls how large the updates are over time. Large learning rates often help early exploration. Smaller learning rates often help late-stage convergence.
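
A minimal sketch of where the scheduler call fits, assuming an epoch-level scheduler such as CosineAnnealingLR; the model and data here are placeholders:

import torch
import torch.nn as nn

model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=5)

for epoch in range(5):
    # ... one full pass over train_loader goes here ...
    scheduler.step()                      # one scheduler step per epoch
    print(epoch, scheduler.get_last_lr())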

Inference

Inference runs a trained model on new images.

from PIL import Image

def predict_image(model, image_path, transform, class_names, device):
    model.eval()

    image = Image.open(image_path).convert("RGB")
    x = transform(image)
    x = x.unsqueeze(0).to(device)  # [1, C, H, W]

    with torch.no_grad():
        logits = model(x)
        probs = torch.softmax(logits, dim=1)
        confidence, pred = probs.max(dim=1)

    return {
        "class": class_names[pred.item()],
        "confidence": confidence.item(),
    }

The call to unsqueeze(0) adds the batch dimension. A single image has shape [C, H, W]; the model expects [B, C, H, W].

During inference, the transform should match validation preprocessing. Random training augmentations should not be used for ordinary prediction.

Common Failure Modes

Classification pipelines often fail for mundane reasons.

Problem                                          Typical cause
Training loss does not decrease                  Wrong labels, learning rate too high, frozen parameters
Training accuracy high, validation accuracy low  Overfitting, data leakage, weak augmentation
Validation accuracy unstable                     Small validation set, random validation transforms
Runtime shape error                              Missing batch axis, wrong image layout
Poor transfer learning result                    Wrong normalization, bad learning rate
Predictions mapped to wrong names                Class index mapping changed
GPU underused                                    Slow data loading, small batch size
Loss is NaN                                      Learning rate too high, bad input values, unstable model

The most useful debugging habit is to inspect one batch before training:

images, labels = next(iter(train_loader))

print(images.shape)
print(images.dtype)
print(images.min().item(), images.max().item())
print(labels.shape)
print(labels[:10])

This confirms that the data has the expected shape, type, scale, and labels.

Minimal End-to-End Example

The following example puts the pieces together.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

device = "cuda" if torch.cuda.is_available() else "cpu"

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])

val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])

train_set = datasets.ImageFolder("dataset/train", transform=train_transform)
val_set = datasets.ImageFolder("dataset/val", transform=val_transform)

train_loader = DataLoader(
    train_set,
    batch_size=64,
    shuffle=True,
    num_workers=4,
    pin_memory=True,
)

val_loader = DataLoader(
    val_set,
    batch_size=64,
    shuffle=False,
    num_workers=4,
    pin_memory=True,
)

num_classes = len(train_set.classes)

model = models.resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, num_classes)
model = model.to(device)

loss_fn = nn.CrossEntropyLoss()

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    weight_decay=1e-4,
)

train_classifier(
    model=model,
    train_loader=train_loader,
    val_loader=val_loader,
    optimizer=optimizer,
    loss_fn=loss_fn,
    device=device,
    epochs=20,
    checkpoint_path="classifier.pt",
)

This is a complete supervised classification pipeline. It loads images, prepares batches, defines a model, trains with cross-entropy, evaluates on validation data, and saves the best checkpoint.

Design Principles

A robust classification pipeline follows a few rules.

Keep training transforms and validation transforms separate. Use randomness during training, but keep validation deterministic.

Store and reuse the class-index mapping. A trained model only outputs integer class indices. The mapping gives those indices semantic meaning.

Inspect tensor shapes at every boundary. The most common expected image batch shape in PyTorch is [B, C, H, W].
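
A lightweight check at a boundary can fail fast with an informative message; this helper is illustrative, not part of any library:

import torch

def check_image_batch(images: torch.Tensor) -> None:
    # Fail fast if the batch does not have the [B, C, H, W] layout.
    assert images.ndim == 4, (
        f"expected [B, C, H, W], got shape {tuple(images.shape)}"
    )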

Use logits for training. Apply softmax only when probabilities are needed for reporting or inference.

Evaluate on held-out data. Training metrics measure optimization progress. Validation metrics measure generalization.

Save checkpoints with model state, optimizer state, epoch number, and validation metric. A checkpoint should allow training to resume and allow the best model to be recovered.
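
A resume sketch built on the checkpoint dictionary saved by train_classifier above; the keys match the ones written there:

checkpoint = torch.load("classifier.pt", map_location=device)

model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])

start_epoch = checkpoint["epoch"] + 1
best_val_acc = checkpoint["val_acc"]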

A classification pipeline is a controlled experiment. The code should make the data, transforms, model, objective, optimizer, and evaluation protocol visible. This is what makes the result interpretable and reproducible.