Image classification assigns one label, or a small set of labels, to an image. A model receives an image tensor as input and produces class scores as output. The class with the largest score is usually taken as the prediction.
A classification pipeline is the complete path from raw image files to trained model predictions. It includes data storage, preprocessing, batching, model definition, loss computation, optimization, validation, checkpointing, and inference.
In PyTorch, a good pipeline separates these concerns clearly. The dataset reads examples. The transform prepares tensors. The data loader builds batches. The model computes predictions. The loss function measures error. The optimizer updates parameters. The training loop coordinates the process.
The Classification Problem
Let an image be represented by a tensor $x \in \mathbb{R}^{C \times H \times W}$, where $C$ is the number of channels, $H$ is the height, and $W$ is the width. For RGB images, $C = 3$.

A classification dataset contains pairs $(x_i, y_i)$, where $x_i$ is an image and $y_i$ is its class label. If there are $K$ classes, then $y_i \in \{0, 1, \dots, K - 1\}$.

A neural network classifier defines a function $f_\theta : \mathbb{R}^{C \times H \times W} \to \mathbb{R}^K$, where $\theta$ denotes the learned parameters.

The output $z = f_\theta(x) \in \mathbb{R}^K$ is a vector of logits. A logit is an unnormalized class score. Larger logits correspond to more likely classes, but logits are not probabilities.

For a batch of $B$ images, the input has shape $[B, C, H, W]$ and the output logits have shape $[B, K]$.
In PyTorch, this means:

```python
images.shape  # [B, C, H, W]
logits.shape  # [B, K]
labels.shape  # [B]
```

The labels are stored as integer class indices, not as one-hot vectors, when using torch.nn.CrossEntropyLoss.
Components of a Pipeline
A typical image classification pipeline has these stages:
| Stage | Responsibility |
|---|---|
| Dataset | Locate images and labels |
| Transform | Convert images into normalized tensors |
| DataLoader | Build shuffled mini-batches |
| Model | Map images to class logits |
| Loss | Compare logits with labels |
| Optimizer | Update model parameters |
| Scheduler | Adjust learning rate during training |
| Evaluator | Measure validation performance |
| Checkpoint | Save model state |
| Inference code | Run predictions on new images |
The pipeline should make each stage explicit. This reduces hidden coupling and makes experiments easier to repeat.
A minimal PyTorch training pipeline has this form:
```python
for images, labels in train_loader:
    images = images.to(device)
    labels = labels.to(device)

    logits = model(images)
    loss = loss_fn(logits, labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

This loop is small, but it contains the essential training mechanism. The model computes logits. The loss computes a scalar error. Backpropagation computes gradients. The optimizer updates parameters.
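The same forward/loss/gradient/update mechanism can be sketched without PyTorch. The toy example below fits a single weight by gradient descent on a squared-error loss; the names (w, lr, data) and the dataset are illustrative only:

```python
# Toy gradient-descent loop mirroring the structure above:
# forward pass, scalar loss, gradient, parameter update.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # pairs (x, y) with y = 2x

w = 0.0    # single learnable parameter
lr = 0.05  # learning rate

for epoch in range(100):
    for x, y in data:
        pred = w * x               # the "model(images)" step
        loss = (pred - y) ** 2     # the "loss_fn(logits, labels)" step
        grad = 2 * (pred - y) * x  # the "loss.backward()" step
        w -= lr * grad             # the "optimizer.step()" step

print(round(w, 3))  # converges to 2.0
```

The loop structure is identical to the PyTorch version; autograd and the optimizer simply replace the hand-written gradient and update lines.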
Dataset Layout
A common image dataset layout stores one directory per class:
```
dataset/
    train/
        cat/
            img001.jpg
            img002.jpg
        dog/
            img003.jpg
            img004.jpg
    val/
        cat/
            img101.jpg
        dog/
            img102.jpg
```

PyTorch can read this layout with torchvision.datasets.ImageFolder.
```python
from torchvision.datasets import ImageFolder

train_set = ImageFolder(
    root="dataset/train",
    transform=train_transform,
)
val_set = ImageFolder(
    root="dataset/val",
    transform=val_transform,
)
```

ImageFolder assigns an integer index to each class based on directory names. The mapping is stored in:

```python
train_set.class_to_idx
```

For example:
```python
{
    "cat": 0,
    "dog": 1,
}
```

This mapping matters. The model outputs logits in this class-index order. If the mapping changes between training and inference, predictions will be interpreted incorrectly.
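One way to keep the mapping stable is to write it to disk at training time and reload it for inference. A minimal sketch using the standard library (the file name and the hard-coded mapping are illustrative; in practice the mapping comes from train_set.class_to_idx):

```python
import json
import os
import tempfile

# Example mapping as produced by ImageFolder.
class_to_idx = {"cat": 0, "dog": 1}

# Save the mapping next to the checkpoint at training time.
path = os.path.join(tempfile.gettempdir(), "class_to_idx.json")
with open(path, "w") as f:
    json.dump(class_to_idx, f)

# Reload it at inference time and invert it to map index -> class name.
with open(path) as f:
    loaded = json.load(f)

idx_to_class = {v: k for k, v in loaded.items()}
print(idx_to_class[0])  # cat
```

Because the file travels with the checkpoint, training and inference always agree on which logit position means which class.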
Image Transforms
Raw images have different sizes, color encodings, and numeric ranges. Neural networks require tensors with consistent shape and scale.
A standard validation transform resizes the image, crops it, converts it to a tensor, and normalizes the channels:
```python
from torchvision import transforms

val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])
```

ToTensor() converts an image from integer pixel values in $[0, 255]$ to floating-point values in $[0, 1]$. Normalize then applies channel-wise normalization:

$$x'_c = \frac{x_c - \mu_c}{\sigma_c}$$

Here $\mu_c$ and $\sigma_c$ are the mean and standard deviation for channel $c$.
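The per-channel arithmetic can be checked without torchvision. The sketch below normalizes a few raw pixel values by hand, using the ImageNet mean and standard deviation for the red channel:

```python
# Normalize one channel by hand: scale to [0, 1], then subtract the
# channel mean and divide by the channel standard deviation.
mean_r = 0.485  # ImageNet mean for the red channel
std_r = 0.229   # ImageNet std for the red channel

pixels = [0, 128, 255]  # raw 8-bit pixel values

scaled = [p / 255.0 for p in pixels]                 # ToTensor step
normalized = [(s - mean_r) / std_r for s in scaled]  # Normalize step

print([round(v, 3) for v in normalized])  # [-2.118, 0.074, 2.249]
```

Note that normalized values routinely fall outside $[0, 1]$; they are centered around zero in units of standard deviations.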
Training transforms usually include random augmentation:
```python
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])
```

The validation transform should be deterministic. The training transform may be random. This distinction is important because validation should measure the model, not randomness in preprocessing.
DataLoaders
A DataLoader turns a dataset into mini-batches.
```python
from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_set,
    batch_size=64,
    shuffle=True,
    num_workers=4,
    pin_memory=True,
)
val_loader = DataLoader(
    val_set,
    batch_size=64,
    shuffle=False,
    num_workers=4,
    pin_memory=True,
)
```

For training, shuffle=True prevents the model from seeing examples in a fixed order. For validation, shuffle=False makes evaluation deterministic.
The data loader returns batches:
```python
images, labels = next(iter(train_loader))
print(images.shape)  # [64, 3, 224, 224]
print(labels.shape)  # [64]
```

The first dimension is the batch size. If the dataset size is not divisible by the batch size, the last batch may be smaller.
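The batch arithmetic is easy to check directly. For a dataset of n examples and batch size b, a sketch of the batch sizes a loader yields when the last partial batch is kept (the default; drop_last=True would discard it):

```python
import math

def batch_sizes(n: int, b: int) -> list[int]:
    """Sizes of the batches a loader yields over n examples."""
    full, remainder = divmod(n, b)
    return [b] * full + ([remainder] if remainder else [])

# 150 examples with batch size 64: two full batches plus one partial batch.
sizes = batch_sizes(150, 64)
print(sizes)                # [64, 64, 22]
print(math.ceil(150 / 64))  # 3 batches in total
```

Code that accumulates metrics per batch should weight by the actual batch size, as the evaluation loop below does, so the smaller final batch does not skew averages.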
The Classifier Model
A classifier maps an image batch to class logits.
For example, a small convolutional classifier may be written as:
```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1)),
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.features(x)    # [B, 128, 1, 1]
        x = x.flatten(1)        # [B, 128]
        x = self.classifier(x)  # [B, num_classes]
        return x
```

The final layer produces logits. It should not apply softmax during training when using CrossEntropyLoss, because that loss function internally combines log_softmax and negative log likelihood.
```python
model = SmallCNN(num_classes=10)
logits = model(images)
print(logits.shape)  # [B, 10]
```

Cross-Entropy Loss
For single-label classification, the standard loss is cross-entropy.
Given logits $z \in \mathbb{R}^K$, the softmax probability for class $k$ is

$$p_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$

If the true class is $y$, the cross-entropy loss is

$$\mathcal{L} = -\log p_y$$

For a batch, the loss is usually averaged across examples.
In PyTorch:
```python
loss_fn = nn.CrossEntropyLoss()
logits = model(images)  # [B, K]
loss = loss_fn(logits, labels)
```

The required shapes are:

```python
logits.shape  # [B, K]
labels.shape  # [B]
```

The labels must contain class indices:

```python
labels.dtype  # torch.int64
```

A common mistake is to pass one-hot labels into CrossEntropyLoss. For standard use, integer labels are expected.
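The softmax and negative log-likelihood steps can be reproduced in plain Python to build intuition. A minimal sketch for a single example (no numerical-stability tricks, unlike the library implementation, which works in log space):

```python
import math

def cross_entropy(logits: list[float], label: int) -> float:
    """Softmax followed by negative log-likelihood for one example."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -math.log(probs[label])

# With uniform logits over K classes, every class has probability 1/K,
# so the loss equals log(K) regardless of which label is correct.
K = 4
loss = cross_entropy([0.0] * K, label=2)
print(round(loss, 4))  # log(4) = 1.3863
```

The uniform-logits case gives a useful sanity check at the start of training: an untrained K-class model should produce a loss near log(K).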
Accuracy
Accuracy is the fraction of examples whose predicted class equals the true class.
The predicted class is $\hat{y} = \arg\max_k z_k$.
In PyTorch:
```python
preds = logits.argmax(dim=1)
correct = (preds == labels).sum().item()
total = labels.numel()
accuracy = correct / total
```

For a full validation loop:
```python
def evaluate(model, loader, device):
    model.eval()
    total_loss = 0.0
    total_correct = 0
    total_count = 0
    loss_fn = nn.CrossEntropyLoss()

    with torch.no_grad():
        for images, labels in loader:
            images = images.to(device)
            labels = labels.to(device)

            logits = model(images)
            loss = loss_fn(logits, labels)
            preds = logits.argmax(dim=1)

            total_loss += loss.item() * labels.size(0)
            total_correct += (preds == labels).sum().item()
            total_count += labels.size(0)

    avg_loss = total_loss / total_count
    avg_acc = total_correct / total_count
    return avg_loss, avg_acc
```

model.eval() changes the behavior of layers such as dropout and batch normalization. torch.no_grad() disables gradient tracking and reduces memory use.
Training Loop
A complete training loop includes training, validation, and checkpointing.
```python
import torch

def train_classifier(
    model,
    train_loader,
    val_loader,
    optimizer,
    loss_fn,
    device,
    epochs,
    checkpoint_path,
):
    best_val_acc = 0.0
    model.to(device)

    for epoch in range(epochs):
        model.train()
        total_loss = 0.0
        total_correct = 0
        total_count = 0

        for images, labels in train_loader:
            images = images.to(device)
            labels = labels.to(device)

            logits = model(images)
            loss = loss_fn(logits, labels)

            optimizer.zero_grad(set_to_none=True)
            loss.backward()
            optimizer.step()

            preds = logits.argmax(dim=1)
            total_loss += loss.item() * labels.size(0)
            total_correct += (preds == labels).sum().item()
            total_count += labels.size(0)

        train_loss = total_loss / total_count
        train_acc = total_correct / total_count
        val_loss, val_acc = evaluate(model, val_loader, device)

        print(
            f"epoch={epoch + 1} "
            f"train_loss={train_loss:.4f} "
            f"train_acc={train_acc:.4f} "
            f"val_loss={val_loss:.4f} "
            f"val_acc={val_acc:.4f}"
        )

        if val_acc > best_val_acc:
            best_val_acc = val_acc
            torch.save(
                {
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "epoch": epoch,
                    "val_acc": val_acc,
                },
                checkpoint_path,
            )
```

The validation accuracy is used to choose the best checkpoint. The training accuracy alone is insufficient, because a model may memorize the training set while performing poorly on unseen images.
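The best-checkpoint rule reduces to tracking the running maximum of a validation metric. A dependency-free sketch with a hypothetical accuracy history:

```python
# Hypothetical validation accuracies, one per epoch.
val_accs = [0.61, 0.68, 0.74, 0.72, 0.77, 0.75]

best_val_acc = 0.0
best_epoch = -1
for epoch, val_acc in enumerate(val_accs):
    # Same condition as the training loop: save only on improvement.
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        best_epoch = epoch  # in the real loop, torch.save(...) runs here

print(best_epoch, best_val_acc)  # 4 0.77
```

Note that the saved checkpoint corresponds to epoch 4, not the final epoch: accuracy dipped afterward, so the last model is not the best model.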
Optimizer and Scheduler
A basic optimizer is stochastic gradient descent with momentum:
```python
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,
    momentum=0.9,
    weight_decay=1e-4,
)
```

A common alternative is AdamW:
```python
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    weight_decay=1e-4,
)
```

A scheduler changes the learning rate during training:
```python
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=epochs,
)
```

The training loop then calls:

```python
scheduler.step()
```

usually once per epoch, depending on the scheduler.
The optimizer determines how gradients become parameter updates. The scheduler determines how the learning rate, and therefore the update size, changes over time. Large learning rates often help early exploration; smaller learning rates often help late-stage convergence.
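Cosine annealing follows a closed-form schedule: the rate traces half a cosine wave from the base rate down to a floor. A sketch of the formula (the base rate 0.1, horizon 20, and eta_min of 0 are illustrative values):

```python
import math

def cosine_lr(base_lr: float, t: int, t_max: int, eta_min: float = 0.0) -> float:
    """Learning rate at epoch t under cosine annealing."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t / t_max))

base_lr, t_max = 0.1, 20
print(cosine_lr(base_lr, 0, t_max))             # 0.1: start at the full rate
print(round(cosine_lr(base_lr, 10, t_max), 4))  # 0.05: halfway at the midpoint
print(round(cosine_lr(base_lr, 20, t_max), 4))  # 0.0: end at eta_min
```

The curve decays slowly at first, fastest in the middle, and slowly again near the end, which matches the intuition of exploring early and converging late.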
Inference
Inference runs a trained model on new images.
```python
from PIL import Image

def predict_image(model, image_path, transform, class_names, device):
    model.eval()
    image = Image.open(image_path).convert("RGB")
    x = transform(image)
    x = x.unsqueeze(0).to(device)  # [1, C, H, W]

    with torch.no_grad():
        logits = model(x)
        probs = torch.softmax(logits, dim=1)
        confidence, pred = probs.max(dim=1)

    return {
        "class": class_names[pred.item()],
        "confidence": confidence.item(),
    }
```

The call to unsqueeze(0) adds the batch dimension. A single image has shape [C, H, W]; the model expects [B, C, H, W].
During inference, the transform should match validation preprocessing. Random training augmentations should not be used for ordinary prediction.
Common Failure Modes
Classification pipelines often fail for mundane reasons.
| Problem | Typical cause |
|---|---|
| Training loss does not decrease | Wrong labels, learning rate too high, frozen parameters |
| Training accuracy high, validation accuracy low | Overfitting, data leakage, weak augmentation |
| Validation accuracy unstable | Small validation set, random validation transforms |
| Runtime shape error | Missing batch axis, wrong image layout |
| Poor transfer learning result | Wrong normalization, bad learning rate |
| Predictions mapped to wrong names | Class index mapping changed |
| GPU underused | Slow data loading, small batch size |
| Loss is NaN | Learning rate too high, bad input values, unstable model |
The most useful debugging habit is to inspect one batch before training:
```python
images, labels = next(iter(train_loader))
print(images.shape)
print(images.dtype)
print(images.min().item(), images.max().item())
print(labels.shape)
print(labels[:10])
```

This confirms that the data has the expected shape, type, scale, and labels.
Minimal End-to-End Example
The following example puts the pieces together.
```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

device = "cuda" if torch.cuda.is_available() else "cpu"

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])

train_set = datasets.ImageFolder("dataset/train", transform=train_transform)
val_set = datasets.ImageFolder("dataset/val", transform=val_transform)

train_loader = DataLoader(
    train_set,
    batch_size=64,
    shuffle=True,
    num_workers=4,
    pin_memory=True,
)
val_loader = DataLoader(
    val_set,
    batch_size=64,
    shuffle=False,
    num_workers=4,
    pin_memory=True,
)

num_classes = len(train_set.classes)
model = models.resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, num_classes)
model = model.to(device)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    weight_decay=1e-4,
)

train_classifier(
    model=model,
    train_loader=train_loader,
    val_loader=val_loader,
    optimizer=optimizer,
    loss_fn=loss_fn,
    device=device,
    epochs=20,
    checkpoint_path="classifier.pt",
)
```

This is a complete supervised classification pipeline. It loads images, prepares batches, defines a model, trains with cross-entropy, evaluates on validation data, and saves the best checkpoint.
Design Principles
A robust classification pipeline follows a few rules.
Keep training transforms and validation transforms separate. Use randomness during training, but keep validation deterministic.
Store and reuse the class-index mapping. A trained model only outputs integer class indices. The mapping gives those indices semantic meaning.
Inspect tensor shapes at every boundary. The most common expected image batch shape in PyTorch is [B, C, H, W].
Use logits for training. Apply softmax only when probabilities are needed for reporting or inference.
Evaluate on held-out data. Training metrics measure optimization progress. Validation metrics measure generalization.
Save checkpoints with model state, optimizer state, epoch number, and validation metric. A checkpoint should allow training to resume and allow the best model to be recovered.
A classification pipeline is a controlled experiment. The code should make the data, transforms, model, objective, optimizer, and evaluation protocol visible. This is what makes the result interpretable and reproducible.