Classification Pipelines

Image classification assigns one label, or a small set of labels, to an image. A model receives an image tensor as input and produces class scores as output. The class with the largest score is usually taken as the prediction.

A classification pipeline is the complete path from raw image files to trained model predictions. It includes data storage, preprocessing, batching, model definition, loss computation, optimization, validation, checkpointing, and inference.

In PyTorch, a good pipeline separates these concerns clearly. The dataset reads examples. The transform prepares tensors. The data loader builds batches. The model computes predictions. The loss function measures error. The optimizer updates parameters. The training loop coordinates the process.

The Classification Problem

Let an image be represented by a tensor

x \in \mathbb{R}^{C \times H \times W},

where C is the number of channels, H is height, and W is width.

For RGB images,

C = 3.

A classification dataset contains pairs

(x_i, y_i),

where x_i is an image and y_i is its class label. If there are K classes, then

y_i \in \{0, 1, \ldots, K-1\}.

A neural network classifier defines a function

f_\theta : \mathbb{R}^{C \times H \times W} \to \mathbb{R}^{K},

where θ denotes the learned parameters.

The output

z = f_\theta(x)

is a vector of logits. A logit is an unnormalized class score. Larger logits correspond to more likely classes, but logits are not probabilities.
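
As a quick illustration (the values here are arbitrary), applying softmax turns a logit vector into a probability vector:

import torch

logits = torch.tensor([2.0, -1.0, 0.5])
probs = torch.softmax(logits, dim=0)

print(probs)        # the largest logit gets the largest probability
print(probs.sum())  # tensor(1.)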

For a batch of B images, the input has shape

X \in \mathbb{R}^{B \times C \times H \times W},

and the output logits have shape

Z \in \mathbb{R}^{B \times K}.

In PyTorch, this means:

images.shape  # [B, C, H, W]
logits.shape  # [B, K]
labels.shape  # [B]

The labels are stored as integer class indices, not as one-hot vectors, when using torch.nn.CrossEntropyLoss.

Components of a Pipeline

A typical image classification pipeline has these stages:

Stage           Responsibility
Dataset         Locate images and labels
Transform       Convert images into normalized tensors
DataLoader      Build shuffled mini-batches
Model           Map images to class logits
Loss            Compare logits with labels
Optimizer       Update model parameters
Scheduler       Adjust learning rate during training
Evaluator       Measure validation performance
Checkpoint      Save model state
Inference code  Run predictions on new images

The pipeline should make each stage explicit. This reduces hidden coupling and makes experiments easier to repeat.

A minimal PyTorch training pipeline has this form:

for images, labels in train_loader:
    images = images.to(device)
    labels = labels.to(device)

    logits = model(images)
    loss = loss_fn(logits, labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

This loop is small, but it contains the essential training mechanism. The model computes logits. The loss computes a scalar error. Backpropagation computes gradients. The optimizer updates parameters.

Dataset Layout

A common image dataset layout stores one directory per class:

dataset/
  train/
    cat/
      img001.jpg
      img002.jpg
    dog/
      img003.jpg
      img004.jpg
  val/
    cat/
      img101.jpg
    dog/
      img102.jpg

PyTorch can read this layout with torchvision.datasets.ImageFolder.

from torchvision.datasets import ImageFolder

train_set = ImageFolder(
    root="dataset/train",
    transform=train_transform,
)

val_set = ImageFolder(
    root="dataset/val",
    transform=val_transform,
)

ImageFolder assigns an integer index to each class based on directory names. The mapping is stored in:

train_set.class_to_idx

For example:

{
    "cat": 0,
    "dog": 1,
}

This mapping matters. The model outputs logits in this class-index order. If the mapping changes between training and inference, predictions will be interpreted incorrectly.
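
One way to guard against this is to persist the mapping next to the checkpoint. The sketch below assumes a JSON sidecar file named class_to_idx.json; the filename is arbitrary:

import json

# Save the mapping at training time.
with open("class_to_idx.json", "w") as f:
    json.dump(train_set.class_to_idx, f)

# Load and invert it at inference time to map indices back to names.
with open("class_to_idx.json") as f:
    class_to_idx = json.load(f)

idx_to_class = {idx: name for name, idx in class_to_idx.items()}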

Image Transforms

Raw images have different sizes, color encodings, and numeric ranges. Neural networks require tensors with consistent shape and scale.

A standard validation transform resizes the image, crops it, converts it to a tensor, and normalizes the channels:

from torchvision import transforms

val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])

ToTensor() converts an image from integer pixel values in [0, 255] to floating-point values in [0, 1]. Normalize then applies channel-wise normalization:

x'_{c,h,w} = \frac{x_{c,h,w} - \mu_c}{\sigma_c}.

Here μ_c and σ_c are the mean and standard deviation for channel c.
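
The broadcasting below sketches the same computation by hand, assuming the ImageNet statistics used above; it is illustrative, not a replacement for Normalize:

import torch

mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

x = torch.rand(3, 224, 224)   # values in [0, 1], as after ToTensor()
x_norm = (x - mean) / std     # broadcast over H and W, per channel

print(x_norm.shape)           # torch.Size([3, 224, 224])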

Training transforms usually include random augmentation:

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])

The validation transform should be deterministic. The training transform may be random. This distinction is important because validation should measure the model, not randomness in preprocessing.

DataLoaders

A DataLoader turns a dataset into mini-batches.

from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_set,
    batch_size=64,
    shuffle=True,
    num_workers=4,
    pin_memory=True,
)

val_loader = DataLoader(
    val_set,
    batch_size=64,
    shuffle=False,
    num_workers=4,
    pin_memory=True,
)

For training, shuffle=True prevents the model from seeing examples in a fixed order. For validation, shuffle=False makes evaluation deterministic.

The data loader returns batches:

images, labels = next(iter(train_loader))

print(images.shape)  # [64, 3, 224, 224]
print(labels.shape)  # [64]

The first dimension is the batch size. If the dataset size is not divisible by the batch size, the last batch may be smaller.
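
If every batch must have exactly batch_size examples, for instance when batch statistics are sensitive to batch size, the DataLoader can discard the final partial batch:

from torch.utils.data import DataLoader

# drop_last=True discards the final partial batch during training.
train_loader = DataLoader(
    train_set,
    batch_size=64,
    shuffle=True,
    drop_last=True,
)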

The Classifier Model

A classifier maps an image batch to class logits.

For example, a small convolutional classifier may be written as:

import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()

        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),

            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),

            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1)),
        )

        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.features(x)      # [B, 128, 1, 1]
        x = x.flatten(1)          # [B, 128]
        x = self.classifier(x)    # [B, num_classes]
        return x

The final layer produces logits. It should not apply softmax during training when using CrossEntropyLoss, because that loss function internally combines log_softmax and negative log likelihood.

model = SmallCNN(num_classes=10)
logits = model(images)

print(logits.shape)  # [B, 10]
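
The claim above can be checked directly. The following sketch, with arbitrary random inputs, shows that cross-entropy on raw logits matches negative log likelihood on log-softmax outputs:

import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))

a = F.cross_entropy(logits, labels)
b = F.nll_loss(F.log_softmax(logits, dim=1), labels)

print(torch.allclose(a, b))  # True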

Cross-Entropy Loss

For single-label classification, the standard loss is cross-entropy.

Given logits

z \in \mathbb{R}^{K},

the softmax probability for class k is

p_k = \frac{\exp(z_k)}{\sum_{j=1}^{K}\exp(z_j)}.

If the true class is y, the cross-entropy loss is

L = -\log p_y.

For a batch, the loss is usually averaged across examples.

In PyTorch:

loss_fn = nn.CrossEntropyLoss()

logits = model(images)      # [B, K]
loss = loss_fn(logits, labels)

The required shapes are:

logits.shape  # [B, K]
labels.shape  # [B]

The labels must contain class indices:

labels.dtype  # torch.int64

A common mistake is to pass one-hot labels into CrossEntropyLoss. For standard use, integer labels are expected.
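
If labels arrive one-hot encoded, argmax along the class dimension recovers the integer indices; a small illustrative example:

import torch

one_hot = torch.tensor([[0, 1, 0],
                        [1, 0, 0]])
labels = one_hot.argmax(dim=1)

print(labels)        # tensor([1, 0])
print(labels.dtype)  # torch.int64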

Accuracy

Accuracy is the fraction of examples whose predicted class equals the true class.

The predicted class is

\hat{y} = \arg\max_k z_k.

In PyTorch:

preds = logits.argmax(dim=1)
correct = (preds == labels).sum().item()
total = labels.numel()
accuracy = correct / total

For a full validation loop:

def evaluate(model, loader, device):
    model.eval()

    total_loss = 0.0
    total_correct = 0
    total_count = 0

    loss_fn = nn.CrossEntropyLoss()

    with torch.no_grad():
        for images, labels in loader:
            images = images.to(device)
            labels = labels.to(device)

            logits = model(images)
            loss = loss_fn(logits, labels)

            preds = logits.argmax(dim=1)

            total_loss += loss.item() * labels.size(0)
            total_correct += (preds == labels).sum().item()
            total_count += labels.size(0)

    avg_loss = total_loss / total_count
    avg_acc = total_correct / total_count

    return avg_loss, avg_acc

model.eval() changes the behavior of layers such as dropout and batch normalization. torch.no_grad() disables gradient tracking and reduces memory use.
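
A small demonstration of the mode switch, using a standalone dropout layer:

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()
print(drop(x))  # roughly half the entries zeroed, survivors scaled to 2.0

drop.eval()
print(drop(x))  # identity: all entries remain 1.0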

Training Loop

A complete training loop includes training, validation, and checkpointing.

import torch

def train_classifier(
    model,
    train_loader,
    val_loader,
    optimizer,
    loss_fn,
    device,
    epochs,
    checkpoint_path,
):
    best_val_acc = 0.0

    model.to(device)

    for epoch in range(epochs):
        model.train()

        total_loss = 0.0
        total_correct = 0
        total_count = 0

        for images, labels in train_loader:
            images = images.to(device)
            labels = labels.to(device)

            logits = model(images)
            loss = loss_fn(logits, labels)

            optimizer.zero_grad(set_to_none=True)
            loss.backward()
            optimizer.step()

            preds = logits.argmax(dim=1)

            total_loss += loss.item() * labels.size(0)
            total_correct += (preds == labels).sum().item()
            total_count += labels.size(0)

        train_loss = total_loss / total_count
        train_acc = total_correct / total_count

        val_loss, val_acc = evaluate(model, val_loader, device)

        print(
            f"epoch={epoch + 1} "
            f"train_loss={train_loss:.4f} "
            f"train_acc={train_acc:.4f} "
            f"val_loss={val_loss:.4f} "
            f"val_acc={val_acc:.4f}"
        )

        if val_acc > best_val_acc:
            best_val_acc = val_acc
            torch.save(
                {
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "epoch": epoch,
                    "val_acc": val_acc,
                },
                checkpoint_path,
            )

The validation accuracy is used to choose the best checkpoint. The training accuracy alone is insufficient, because a model may memorize the training set while performing poorly on unseen images.

Optimizer and Scheduler

A basic optimizer is stochastic gradient descent with momentum:

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,
    momentum=0.9,
    weight_decay=1e-4,
)

A common alternative is AdamW:

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    weight_decay=1e-4,
)

A scheduler changes the learning rate during training:

scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=epochs,
)

The training loop then calls:

scheduler.step()

usually once per epoch, depending on the scheduler.

The optimizer controls how parameters move. The scheduler controls how large the updates are over time. Large learning rates often help early exploration. Smaller learning rates often help late-stage convergence.
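
A minimal sketch of where the scheduler call fits, assuming an epoch-level scheduler such as CosineAnnealingLR; the model and data here are placeholders:

import torch
import torch.nn as nn

model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=5)

for epoch in range(5):
    # ... one full pass over train_loader goes here ...
    scheduler.step()                      # one scheduler step per epoch
    print(epoch, scheduler.get_last_lr())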

Inference

Inference runs a trained model on new images.

from PIL import Image

def predict_image(model, image_path, transform, class_names, device):
    model.eval()

    image = Image.open(image_path).convert("RGB")
    x = transform(image)
    x = x.unsqueeze(0).to(device)  # [1, C, H, W]

    with torch.no_grad():
        logits = model(x)
        probs = torch.softmax(logits, dim=1)
        confidence, pred = probs.max(dim=1)

    return {
        "class": class_names[pred.item()],
        "confidence": confidence.item(),
    }

The call to unsqueeze(0) adds the batch dimension. A single image has shape [C, H, W]; the model expects [B, C, H, W].

During inference, the transform should match validation preprocessing. Random training augmentations should not be used for ordinary prediction.

Common Failure Modes

Classification pipelines often fail for mundane reasons.

Problem                                          Typical cause
Training loss does not decrease                  Wrong labels, learning rate too high, frozen parameters
Training accuracy high, validation accuracy low  Overfitting, data leakage, weak augmentation
Validation accuracy unstable                     Small validation set, random validation transforms
Runtime shape error                              Missing batch axis, wrong image layout
Poor transfer learning result                    Wrong normalization, bad learning rate
Predictions mapped to wrong names                Class index mapping changed
GPU underused                                    Slow data loading, small batch size
Loss is NaN                                      Learning rate too high, bad input values, unstable model

The most useful debugging habit is to inspect one batch before training:

images, labels = next(iter(train_loader))

print(images.shape)
print(images.dtype)
print(images.min().item(), images.max().item())
print(labels.shape)
print(labels[:10])

This confirms that the data has the expected shape, type, scale, and labels.

Minimal End-to-End Example

The following example puts the pieces together.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

device = "cuda" if torch.cuda.is_available() else "cpu"

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])

val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])

train_set = datasets.ImageFolder("dataset/train", transform=train_transform)
val_set = datasets.ImageFolder("dataset/val", transform=val_transform)

train_loader = DataLoader(
    train_set,
    batch_size=64,
    shuffle=True,
    num_workers=4,
    pin_memory=True,
)

val_loader = DataLoader(
    val_set,
    batch_size=64,
    shuffle=False,
    num_workers=4,
    pin_memory=True,
)

num_classes = len(train_set.classes)

model = models.resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, num_classes)
model = model.to(device)

loss_fn = nn.CrossEntropyLoss()

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    weight_decay=1e-4,
)

train_classifier(
    model=model,
    train_loader=train_loader,
    val_loader=val_loader,
    optimizer=optimizer,
    loss_fn=loss_fn,
    device=device,
    epochs=20,
    checkpoint_path="classifier.pt",
)

This is a complete supervised classification pipeline. It loads images, prepares batches, defines a model, trains with cross-entropy, evaluates on validation data, and saves the best checkpoint.

Design Principles

A robust classification pipeline follows a few rules.

Keep training transforms and validation transforms separate. Use randomness during training, but keep validation deterministic.

Store and reuse the class-index mapping. A trained model only outputs integer class indices. The mapping gives those indices semantic meaning.

Inspect tensor shapes at every boundary. The most common expected image batch shape in PyTorch is [B, C, H, W].
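
A lightweight check at a boundary can fail fast with an informative message; this helper is illustrative, not part of any library:

import torch

def check_image_batch(images: torch.Tensor) -> None:
    # Fail fast if the batch does not have the [B, C, H, W] layout.
    assert images.ndim == 4, (
        f"expected [B, C, H, W], got shape {tuple(images.shape)}"
    )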

Use logits for training. Apply softmax only when probabilities are needed for reporting or inference.

Evaluate on held-out data. Training metrics measure optimization progress. Validation metrics measure generalization.

Save checkpoints with model state, optimizer state, epoch number, and validation metric. A checkpoint should allow training to resume and allow the best model to be recovered.
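
A resume sketch built on the checkpoint dictionary saved by train_classifier above; the keys match the ones written there:

checkpoint = torch.load("classifier.pt", map_location=device)

model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])

start_epoch = checkpoint["epoch"] + 1
best_val_acc = checkpoint["val_acc"]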

A classification pipeline is a controlled experiment. The code should make the data, transforms, model, objective, optimizer, and evaluation protocol visible. This is what makes the result interpretable and reproducible.