CNN Architectures

A convolutional neural network architecture defines how convolutional layers, activation functions, normalization layers, pooling layers, residual paths, and classifier heads are arranged. The architecture determines the flow of tensors through the model.

A CNN usually follows a staged design. Early layers operate on large spatial maps with few channels. Later layers operate on smaller spatial maps with more channels. This design gradually trades spatial resolution for semantic abstraction.

The Basic CNN Pattern

A simple CNN block has the form

\text{convolution} \rightarrow \text{activation} \rightarrow \text{pooling}.

In PyTorch:

import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
)

x = torch.randn(8, 3, 64, 64)
y = block(x)

print(y.shape)  # torch.Size([8, 32, 32, 32])

The convolution extracts local features. The activation adds nonlinearity. The pooling layer reduces spatial size.

A full classifier stacks several blocks:

model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),

    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),

    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d((1, 1)),

    nn.Flatten(),
    nn.Linear(128, 10),
)

For an input of shape

[B, 3, 64, 64],

the shape flow is

[B, 3, 64, 64] \rightarrow [B, 32, 32, 32] \rightarrow [B, 64, 16, 16] \rightarrow [B, 128, 1, 1] \rightarrow [B, 128] \rightarrow [B, 10].

The final tensor contains class logits.
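
The logits are unnormalized scores. A softmax over the class dimension turns them into probabilities, and an argmax gives a hard prediction:

logits = model(torch.randn(8, 3, 64, 64))  # [8, 10]
probs = logits.softmax(dim=1)              # per-image class probabilities
preds = probs.argmax(dim=1)                # predicted class indices, shape [8]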

LeNet-Style Networks

LeNet is one of the earliest CNN architectures. It was designed for digit recognition. Its structure is simple:

\text{conv} \rightarrow \text{pool} \rightarrow \text{conv} \rightarrow \text{pool} \rightarrow \text{linear}.

A PyTorch version for grayscale digit images:

class LeNetLike(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()

        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),
            nn.Tanh(),
            nn.AvgPool2d(kernel_size=2, stride=2),

            nn.Conv2d(6, 16, kernel_size=5),
            nn.Tanh(),
            nn.AvgPool2d(kernel_size=2, stride=2),
        )

        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 4 * 4, 120),
            nn.Tanh(),
            nn.Linear(120, 84),
            nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x)
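
The flatten size of 16 * 4 * 4 assumes 28×28 inputs: two 5×5 convolutions and two 2×2 poolings reduce 28 → 24 → 12 → 8 → 4. A quick shape check under that assumption:

lenet = LeNetLike()
x = torch.randn(4, 1, 28, 28)  # batch of grayscale digit images
print(lenet(x).shape)          # torch.Size([4, 10])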

LeNet-style models show the core CNN idea: local filters, spatial downsampling, and a classifier head. Modern CNNs use the same principle but with deeper networks, ReLU-like activations, normalization, residual connections, and larger datasets.

AlexNet-Style Networks

AlexNet showed that deep CNNs could scale to large image classification tasks. It used large early kernels, ReLU activations, dropout, and GPU training.

A simplified AlexNet-style structure is:

\text{large conv} \rightarrow \text{pool} \rightarrow \text{conv blocks} \rightarrow \text{pool} \rightarrow \text{large linear head}.

Key design ideas:

| Component | Role |
| --- | --- |
| Large early kernel | Capture broad low-level structure |
| ReLU | Faster optimization than saturating activations |
| Max pooling | Downsample feature maps |
| Dropout | Regularize large linear layers |
| Data augmentation | Reduce overfitting |

A simplified PyTorch block:

alex_block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
)
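
A quick check of the downsampling, assuming a 224×224 input (the resolution AlexNet-style models are typically applied to):

x = torch.randn(1, 3, 224, 224)
print(alex_block(x).shape)  # torch.Size([1, 64, 27, 27])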

Large early kernels are less common in many later CNNs, which often prefer stacks of smaller 3×3 kernels.

VGG-Style Networks

VGG networks use a simple rule: stack many 3×3 convolutions, then downsample with pooling. The architecture is regular and easy to understand.

A VGG-style block:

def vgg_block(in_channels, out_channels, num_convs):
    layers = []

    for i in range(num_convs):
        layers.append(nn.Conv2d(
            in_channels if i == 0 else out_channels,
            out_channels,
            kernel_size=3,
            padding=1,
        ))
        layers.append(nn.ReLU())

    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

A small VGG-like model:

model = nn.Sequential(
    vgg_block(3, 64, 2),
    vgg_block(64, 128, 2),
    vgg_block(128, 256, 3),
    nn.AdaptiveAvgPool2d((1, 1)),
    nn.Flatten(),
    nn.Linear(256, 10),
)

The main idea is that several small kernels can replace one large kernel. Two 3×3 convolutions have an effective receptive field of 5×5. Three have an effective receptive field of 7×7. This gives more nonlinearities and fewer parameters than a single large convolution.
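
The parameter savings are easy to verify. With C input and output channels, one 7×7 convolution uses 49·C² weights, while three 3×3 convolutions use 27·C². A quick check with an illustrative C = 64:

c = 64
single = nn.Conv2d(c, c, kernel_size=7, padding=3, bias=False)
stacked = nn.Sequential(
    nn.Conv2d(c, c, kernel_size=3, padding=1, bias=False),
    nn.Conv2d(c, c, kernel_size=3, padding=1, bias=False),
    nn.Conv2d(c, c, kernel_size=3, padding=1, bias=False),
)

print(sum(p.numel() for p in single.parameters()))   # 200704 = 49 * 64 * 64
print(sum(p.numel() for p in stacked.parameters()))  # 110592 = 27 * 64 * 64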

Network Stages

Most CNNs are divided into stages. A stage is a sequence of blocks operating at the same spatial resolution.

A common pattern is:

| Stage | Example shape | Role |
| --- | --- | --- |
| Stem | [B, 64, 112, 112] | Early feature extraction |
| Stage 1 | [B, 64, 56, 56] | Low-level features |
| Stage 2 | [B, 128, 28, 28] | Mid-level features |
| Stage 3 | [B, 256, 14, 14] | Object parts |
| Stage 4 | [B, 512, 7, 7] | High-level semantics |
| Head | [B, K] | Class logits |

When entering a new stage, the model usually reduces spatial size and increases channel count.

This design keeps compute manageable. As H and W shrink, C can grow.
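
A minimal sketch of this staging pattern, using stride-2 convolutions at stage boundaries (the channel counts and block counts here are illustrative, not those of any specific architecture):

def stage(in_channels, out_channels, num_blocks):
    # Downsample and widen on entry, then keep the resolution fixed.
    layers = [nn.Conv2d(in_channels, out_channels, 3, stride=2, padding=1), nn.ReLU()]
    for _ in range(num_blocks - 1):
        layers += [nn.Conv2d(out_channels, out_channels, 3, padding=1), nn.ReLU()]
    return nn.Sequential(*layers)

backbone = nn.Sequential(
    stage(3, 64, 2),     # [B, 64, H/2, W/2]
    stage(64, 128, 2),   # [B, 128, H/4, W/4]
    stage(128, 256, 2),  # [B, 256, H/8, W/8]
)

print(backbone(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 256, 28, 28])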

Residual Networks

Very deep networks are hard to optimize if each layer must directly learn a full transformation. Residual networks add skip connections, allowing a block to learn a residual correction.

A residual block computes

y = x + F(x),

where F(x) is a small neural network, usually made of convolutions, normalization, and activations.

A basic residual block:

class BasicResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()

        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(),

            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

        self.activation = nn.ReLU()

    def forward(self, x):
        return self.activation(x + self.block(x))

The skip connection provides a direct path for activations and gradients. This improves optimization and allows much deeper CNNs.

Projection Shortcuts

A residual addition requires matching shapes. If x and F(x) have different channel counts or spatial sizes, the model uses a projection shortcut.

For example:

class ProjectionResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()

        self.block = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(),

            nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )

        self.shortcut = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
            nn.BatchNorm2d(out_channels),
        )

        self.activation = nn.ReLU()

    def forward(self, x):
        return self.activation(self.block(x) + self.shortcut(x))

The 1×1 convolution changes the channel dimension. The stride changes the spatial size.
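
A quick shape check with a block that doubles the channel count and halves the spatial size:

block = ProjectionResidualBlock(64, 128, stride=2)
x = torch.randn(1, 64, 32, 32)
print(block(x).shape)  # torch.Size([1, 128, 16, 16])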

Bottleneck Blocks

A bottleneck block reduces computation by using 1×1 convolutions before and after a 3×3 convolution.

The pattern is:

1 \times 1 \rightarrow 3 \times 3 \rightarrow 1 \times 1.

The first 1×1 layer reduces channels. The 3×3 layer processes spatial structure. The final 1×1 layer expands channels.

class BottleneckBlock(nn.Module):
    def __init__(self, in_channels, mid_channels, out_channels, stride=1):
        super().__init__()

        self.block = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(),

            nn.Conv2d(mid_channels, mid_channels, kernel_size=3,
                      stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(),

            nn.Conv2d(mid_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )

        self.shortcut = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
            nn.BatchNorm2d(out_channels),
        )

        self.activation = nn.ReLU()

    def forward(self, x):
        return self.activation(self.block(x) + self.shortcut(x))

Bottleneck blocks are used in deeper residual networks because they reduce cost while preserving representational power.
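
A rough comparison under illustrative widths (256 channels, with the bottleneck narrowing to 64) shows the saving. Note that the BottleneckBlock sketch above always uses a projection shortcut; a real network would use an identity shortcut when the shapes already match:

basic = BasicResidualBlock(256)
bottleneck = BottleneckBlock(256, 64, 256)

print(sum(p.numel() for p in basic.parameters()))       # about 1.18M parameters
print(sum(p.numel() for p in bottleneck.parameters()))  # about 0.14M parameters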

Inception-Style Multi-Branch Networks

Inception-style networks process the same input with multiple branches, then concatenate the outputs along the channel axis.

A simplified block might contain:

1 \times 1, \quad 3 \times 3, \quad 5 \times 5, \quad \text{pooling}.

Each branch captures a different scale of local structure.

class InceptionLikeBlock(nn.Module):
    def __init__(self, in_channels):
        super().__init__()

        self.branch1 = nn.Conv2d(in_channels, 32, kernel_size=1)

        self.branch2 = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
        )

        self.branch3 = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, padding=2),
        )

        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_channels, 32, kernel_size=1),
        )

    def forward(self, x):
        outputs = [
            self.branch1(x),
            self.branch2(x),
            self.branch3(x),
            self.branch4(x),
        ]
        return torch.cat(outputs, dim=1)

If all branches preserve height and width, concatenation is valid along the channel dimension.
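
For example, with 64 input channels the four branches above produce 32 + 64 + 64 + 32 = 192 output channels at an unchanged spatial size:

inception = InceptionLikeBlock(64)
x = torch.randn(1, 64, 32, 32)
print(inception(x).shape)  # torch.Size([1, 192, 32, 32])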

Depthwise Separable Convolution

A standard convolution mixes spatial and channel information in one operation. A depthwise separable convolution separates them into two steps.

First, a depthwise convolution applies one spatial filter per input channel. Then a pointwise 1×1 convolution mixes channels.

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()

        self.net = nn.Sequential(
            nn.Conv2d(
                in_channels,
                in_channels,
                kernel_size=3,
                stride=stride,
                padding=1,
                groups=in_channels,
                bias=False,
            ),
            nn.BatchNorm2d(in_channels),
            nn.ReLU(),

            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

This reduces parameter count and computation.

For a standard 3×3 convolution:

\text{params} = C_{\text{in}} \cdot C_{\text{out}} \cdot 3 \cdot 3.

For a depthwise separable version:

\text{params} = C_{\text{in}} \cdot 3 \cdot 3 + C_{\text{in}} \cdot C_{\text{out}}.

This design appears in efficient CNN families such as MobileNet-style architectures.
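
A concrete check of the two formulas above, using 64 input channels and 128 output channels and counting only convolution weights (not normalization parameters):

standard = nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False)
depthwise = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64, bias=False)
pointwise = nn.Conv2d(64, 128, kernel_size=1, bias=False)

print(standard.weight.numel())                              # 73728 = 64 * 128 * 3 * 3
print(depthwise.weight.numel() + pointwise.weight.numel())  # 8768 = 64 * 3 * 3 + 64 * 128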

Dense Connections

Dense networks connect each layer to all later layers inside a block. Instead of adding features as in residual networks, they concatenate them.

If a block has intermediate outputs

x_0, x_1, x_2, \ldots,

then a later layer receives

[x_0, x_1, x_2, \ldots, x_k]

as input.

This encourages feature reuse. Early features remain directly available to later layers. The cost is that the channel count grows quickly, so dense architectures often use bottleneck and compression layers.
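
A minimal sketch of this wiring, with illustrative widths rather than those of any particular dense architecture:

class DenseLikeBlock(nn.Module):
    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            # Each layer sees the concatenation of all earlier outputs.
            self.layers.append(nn.Sequential(
                nn.Conv2d(in_channels + i * growth_rate, growth_rate, 3, padding=1),
                nn.ReLU(),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)

block = DenseLikeBlock(in_channels=64, growth_rate=32, num_layers=4)
print(block(torch.randn(1, 64, 16, 16)).shape)  # torch.Size([1, 192, 16, 16])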

Efficient CNN Families

Efficient CNN architectures aim to improve accuracy per unit of compute. Common techniques include:

| Technique | Purpose |
| --- | --- |
| Depthwise separable convolution | Reduce convolution cost |
| Inverted residual blocks | Improve efficiency in narrow networks |
| Squeeze-and-excitation | Reweight channels adaptively |
| Compound scaling | Scale depth, width, and resolution together |
| Stochastic depth | Regularize very deep models |

An inverted residual block expands channels, applies a depthwise convolution, then projects back to fewer channels. This is common in mobile CNNs.
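
A minimal sketch of an inverted residual block under those assumptions; the expansion factor and activation choice are illustrative:

class InvertedResidual(nn.Module):
    def __init__(self, channels, expansion=4):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1, bias=False),   # expand
            nn.BatchNorm2d(hidden),
            nn.ReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1,
                      groups=hidden, bias=False),                     # depthwise
            nn.BatchNorm2d(hidden),
            nn.ReLU(),
            nn.Conv2d(hidden, channels, kernel_size=1, bias=False),   # project
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)

print(InvertedResidual(32)(torch.randn(1, 32, 28, 28)).shape)  # torch.Size([1, 32, 28, 28])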

CNN Classifier Head

A modern CNN classifier often ends with:

\text{global average pooling} \rightarrow \text{linear classifier}.

In PyTorch:

head = nn.Sequential(
    nn.AdaptiveAvgPool2d((1, 1)),
    nn.Flatten(),
    nn.Linear(512, 1000),
)

This avoids a large fully connected layer over all spatial positions. It also allows the network to accept different input sizes, as long as the convolutional body can process them.
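
Because adaptive average pooling always reduces the feature map to 1×1, the same head handles feature maps of different spatial sizes:

features_small = torch.randn(2, 512, 7, 7)    # e.g. from a 224×224 input
features_large = torch.randn(2, 512, 10, 10)  # e.g. from a larger input

print(head(features_small).shape)  # torch.Size([2, 1000])
print(head(features_large).shape)  # torch.Size([2, 1000])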

Designing a Small CNN

A practical small CNN for 32×32 images:

class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()

        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),

            nn.Conv2d(32, 32, 3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),

            nn.Conv2d(32, 64, 3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),

            nn.Conv2d(64, 64, 3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),

            nn.Conv2d(64, 128, 3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),

            nn.AdaptiveAvgPool2d((1, 1)),
        )

        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = x.flatten(1)
        return self.classifier(x)

Shape flow for input [B, 3, 32, 32]:

| Step | Shape |
| --- | --- |
| Input | [B, 3, 32, 32] |
| First block | [B, 32, 16, 16] |
| Second block | [B, 64, 8, 8] |
| Final conv | [B, 128, 8, 8] |
| Global average pool | [B, 128, 1, 1] |
| Flatten | [B, 128] |
| Linear | [B, 10] |

Architecture Choice

Architecture choice depends on the task.

| Task | Architecture preference |
| --- | --- |
| Small image classification | Simple CNN, ResNet-like CNN |
| Large image classification | ResNet, ConvNeXt, EfficientNet-style CNN |
| Mobile inference | MobileNet-style CNN |
| Segmentation | Encoder-decoder CNN, U-Net, DeepLab-style models |
| Detection | CNN backbone with feature pyramid |
| Image restoration | Residual CNN, U-Net |
| Image generation | U-Net, diffusion backbone, hybrid CNN-transformer |

The best architecture is shaped by input size, compute budget, latency requirement, dataset size, and output structure.

Summary

CNN architectures arrange convolutional blocks into stages. Early stages preserve spatial detail. Later stages use smaller feature maps and richer channels. Classifier heads usually summarize the final feature maps into logits.

Classical architectures introduced core design patterns: LeNet introduced local convolution and pooling, AlexNet showed large-scale deep CNN training, VGG emphasized simple stacked 3×3 convolutions, Inception used multi-branch scale processing, ResNet introduced skip connections, and efficient CNNs reduced compute with separable convolutions and careful scaling.

Modern CNN design is built from a small set of reusable ideas: local filters, channel growth, spatial downsampling, normalization, nonlinear activation, residual paths, and efficient classifier heads.