# CNN Architectures

A convolutional neural network architecture defines how convolutional layers, activation functions, normalization layers, pooling layers, residual paths, and classifier heads are arranged. The architecture determines the flow of tensors through the model.

A CNN usually follows a staged design. Early layers operate on large spatial maps with few channels. Later layers operate on smaller spatial maps with more channels. This design gradually trades spatial resolution for semantic abstraction.

### The Basic CNN Pattern

A simple CNN block has the form

$$
\text{convolution} \rightarrow \text{activation} \rightarrow \text{pooling}.
$$

In PyTorch:

```python
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
)

x = torch.randn(8, 3, 64, 64)
y = block(x)

print(y.shape)  # torch.Size([8, 32, 32, 32])
```

The convolution extracts local features. The activation adds nonlinearity. The pooling layer reduces spatial size.

A full classifier stacks several blocks:

```python
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),

    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),

    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d((1, 1)),

    nn.Flatten(),
    nn.Linear(128, 10),
)
```

For an input of shape

$$
[B, 3, 64, 64],
$$

the shape flow is

$$
[B, 3, 64, 64]
\rightarrow
[B, 32, 32, 32]
\rightarrow
[B, 64, 16, 16]
\rightarrow
[B, 128, 1, 1]
\rightarrow
[B, 128]
\rightarrow
[B, 10].
$$

The final tensor contains class logits.
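The shape flow can be traced empirically by passing a dummy batch through the model layer by layer (the model is rebuilt here so the snippet runs on its own):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d((1, 1)),
    nn.Flatten(),
    nn.Linear(128, 10),
)

x = torch.randn(4, 3, 64, 64)
for layer in model:
    x = layer(x)
    print(type(layer).__name__, tuple(x.shape))
# The last line printed is: Linear (4, 10)
```

Each `MaxPool2d` halves the spatial size, each convolution changes only the channel count, and the adaptive pooling collapses whatever spatial size remains to $1 \times 1$.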

### LeNet-Style Networks

LeNet is one of the earliest CNN architectures. It was designed for digit recognition. Its structure is simple:

$$
\text{conv} \rightarrow \text{pool} \rightarrow \text{conv} \rightarrow \text{pool} \rightarrow \text{linear}.
$$

A PyTorch version for grayscale digit images:

```python
class LeNetLike(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()

        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),
            nn.Tanh(),
            nn.AvgPool2d(kernel_size=2, stride=2),

            nn.Conv2d(6, 16, kernel_size=5),
            nn.Tanh(),
            nn.AvgPool2d(kernel_size=2, stride=2),
        )

        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 4 * 4, 120),  # 4x4 spatial size assumes 28x28 inputs
            nn.Tanh(),
            nn.Linear(120, 84),
            nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x)
```

LeNet-style models show the core CNN idea: local filters, spatial downsampling, and a classifier head. Modern CNNs use the same principle but with deeper networks, ReLU-like activations, normalization, residual connections, and larger datasets.
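Assuming $28 \times 28$ grayscale inputs (as in MNIST), a quick forward pass through the feature extractor confirms the flattened size of $16 \cdot 4 \cdot 4$ used by the classifier:

```python
import torch
import torch.nn as nn

# Feature extractor from the LeNet-style model above.
features = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),
    nn.AvgPool2d(kernel_size=2, stride=2),
    nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),
    nn.AvgPool2d(kernel_size=2, stride=2),
)

x = torch.randn(1, 1, 28, 28)  # one MNIST-sized grayscale image
y = features(x)

# 28 -> conv5 -> 24 -> pool -> 12 -> conv5 -> 8 -> pool -> 4
print(y.shape)  # torch.Size([1, 16, 4, 4]), flattened: 16 * 4 * 4 = 256
```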

### AlexNet-Style Networks

AlexNet showed that deep CNNs could scale to large image classification tasks. It used larger convolutional layers, ReLU activations, dropout, and GPU training.

A simplified AlexNet-style structure is:

$$
\text{large conv} \rightarrow \text{pool}
\rightarrow \text{conv blocks}
\rightarrow \text{pool}
\rightarrow \text{large linear head}.
$$

Key design ideas:

| Component | Role |
|---|---|
| Large early kernel | Capture broad low-level structure |
| ReLU | Faster optimization than saturating activations |
| Max pooling | Downsample feature maps |
| Dropout | Regularize large linear layers |
| Data augmentation | Reduce overfitting |

A simplified PyTorch block:

```python
alex_block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
)
```
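For a $224 \times 224$ RGB input (the resolution AlexNet was trained at), this block downsamples aggressively; a quick shape check, with the size arithmetic in comments:

```python
import torch
import torch.nn as nn

alex_block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
)

x = torch.randn(2, 3, 224, 224)
y = alex_block(x)

# Conv: floor((224 + 2*2 - 11) / 4) + 1 = 55
# Pool: floor((55 - 3) / 2) + 1 = 27
print(y.shape)  # torch.Size([2, 64, 27, 27])
```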

Large early kernels became less common in later CNNs, which typically prefer stacks of smaller $3 \times 3$ kernels.

### VGG-Style Networks

VGG networks use a simple rule: stack many $3 \times 3$ convolutions, then downsample with pooling. The architecture is regular and easy to understand.

A VGG-style block:

```python
def vgg_block(in_channels, out_channels, num_convs):
    layers = []

    for i in range(num_convs):
        layers.append(nn.Conv2d(
            in_channels if i == 0 else out_channels,
            out_channels,
            kernel_size=3,
            padding=1,
        ))
        layers.append(nn.ReLU())

    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)
```

A small VGG-like model:

```python
model = nn.Sequential(
    vgg_block(3, 64, 2),
    vgg_block(64, 128, 2),
    vgg_block(128, 256, 3),
    nn.AdaptiveAvgPool2d((1, 1)),
    nn.Flatten(),
    nn.Linear(256, 10),
)
```

The main idea is that several small kernels can replace one large kernel. Two $3 \times 3$ convolutions have an effective receptive field of $5 \times 5$. Three have an effective receptive field of $7 \times 7$. This gives more nonlinearities and fewer parameters than a single large convolution.
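The parameter claim can be checked directly. Three stacked $3 \times 3$ convolutions at $C$ channels use $3 \cdot 9C^2$ weights, while a single $7 \times 7$ convolution uses $49C^2$ (a sketch; `count_params` is a small helper defined here):

```python
import torch.nn as nn

def count_params(module):
    return sum(p.numel() for p in module.parameters())

C = 64

stacked = nn.Sequential(
    nn.Conv2d(C, C, 3, padding=1, bias=False),
    nn.Conv2d(C, C, 3, padding=1, bias=False),
    nn.Conv2d(C, C, 3, padding=1, bias=False),
)
single = nn.Conv2d(C, C, 7, padding=3, bias=False)

print(count_params(stacked))  # 3 * 9 * 64 * 64 = 110592
print(count_params(single))   # 49 * 64 * 64   = 200704
```

The stacked version is nearly half the size and inserts two extra nonlinearities along the way.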

### Network Stages

Most CNNs are divided into stages. A stage is a sequence of blocks operating at the same spatial resolution.

A common pattern is:

| Stage | Example shape | Role |
|---|---|---|
| Stem | $[B, 64, 112, 112]$ | Early feature extraction |
| Stage 1 | $[B, 64, 56, 56]$ | Low-level features |
| Stage 2 | $[B, 128, 28, 28]$ | Mid-level features |
| Stage 3 | $[B, 256, 14, 14]$ | Object parts |
| Stage 4 | $[B, 512, 7, 7]$ | High-level semantics |
| Head | $[B, K]$ | Class logits |

When entering a new stage, the model usually reduces spatial size and increases channel count.

This design keeps compute manageable. As $H$ and $W$ shrink, $C$ can grow.
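The trade-off can be made concrete. A $3 \times 3$ convolution costs roughly $9 \, C_{\text{in}} C_{\text{out}} H W$ multiply-accumulates, so halving $H$ and $W$ while doubling $C$ keeps the per-layer cost roughly constant (stage shapes taken from the table above):

```python
def conv3x3_macs(c_in, c_out, h, w):
    # Multiply-accumulate count of a 3x3 convolution with padding 1.
    return 9 * c_in * c_out * h * w

# Stage 1: 64 channels at 56x56; Stage 2: 128 channels at 28x28.
stage1 = conv3x3_macs(64, 64, 56, 56)
stage2 = conv3x3_macs(128, 128, 28, 28)

print(stage1 == stage2)  # True: 4x the channel product cancels 1/4 the area
```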

### Residual Networks

Very deep networks are hard to optimize if each layer must directly learn a full transformation. Residual networks add skip connections, allowing a block to learn a residual correction.

A residual block computes

$$
y = x + F(x),
$$

where $F(x)$ is a small neural network, usually made of convolutions, normalization, and activations.

A basic residual block:

```python
class BasicResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()

        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(),

            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

        self.activation = nn.ReLU()

    def forward(self, x):
        return self.activation(x + self.block(x))
```

The skip connection provides a direct path for activations and gradients. This improves optimization and allows much deeper CNNs.

### Projection Shortcuts

A residual addition requires matching shapes. If $x$ and $F(x)$ have different channel counts or spatial sizes, the model uses a projection shortcut.

For example:

```python
class ProjectionResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()

        self.block = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(),

            nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )

        self.shortcut = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
            nn.BatchNorm2d(out_channels),
        )

        self.activation = nn.ReLU()

    def forward(self, x):
        return self.activation(self.block(x) + self.shortcut(x))
```

The $1 \times 1$ convolution changes the channel dimension. The stride changes the spatial size.
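For example, a block that doubles channels while halving resolution changes both the main path and the shortcut in the same way, so the addition remains valid (the class above is repeated here so the snippet runs on its own):

```python
import torch
import torch.nn as nn

class ProjectionResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(),
            nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        self.activation = nn.ReLU()

    def forward(self, x):
        return self.activation(self.block(x) + self.shortcut(x))

# 64 -> 128 channels, 56x56 -> 28x28 spatial size.
block = ProjectionResidualBlock(64, 128, stride=2)
y = block(torch.randn(2, 64, 56, 56))
print(y.shape)  # torch.Size([2, 128, 28, 28])
```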

### Bottleneck Blocks

A bottleneck block reduces computation by using $1 \times 1$ convolutions before and after a $3 \times 3$ convolution.

The pattern is:

$$
1 \times 1 \rightarrow 3 \times 3 \rightarrow 1 \times 1.
$$

The first $1 \times 1$ layer reduces channels. The $3 \times 3$ layer processes spatial structure. The final $1 \times 1$ layer expands channels.

```python
class BottleneckBlock(nn.Module):
    def __init__(self, in_channels, mid_channels, out_channels, stride=1):
        super().__init__()

        self.block = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(),

            nn.Conv2d(mid_channels, mid_channels, kernel_size=3,
                      stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(),

            nn.Conv2d(mid_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )

        self.shortcut = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
            nn.BatchNorm2d(out_channels),
        )

        self.activation = nn.ReLU()

    def forward(self, x):
        return self.activation(self.block(x) + self.shortcut(x))
```

Bottleneck blocks are used in deeper residual networks because they reduce cost while preserving representational power.
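The savings are easy to quantify. A sketch comparing a bottleneck path (256 channels squeezed to 64, illustrative ResNet-50-style widths) against two plain $3 \times 3$ convolutions at full width:

```python
import torch.nn as nn

def count_params(module):
    return sum(p.numel() for p in module.parameters())

# Bottleneck path: 1x1 reduce, 3x3, 1x1 expand.
bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, 1, bias=False),
    nn.Conv2d(64, 64, 3, padding=1, bias=False),
    nn.Conv2d(64, 256, 1, bias=False),
)

# Plain path: two 3x3 convolutions at full width.
plain = nn.Sequential(
    nn.Conv2d(256, 256, 3, padding=1, bias=False),
    nn.Conv2d(256, 256, 3, padding=1, bias=False),
)

print(count_params(bottleneck))  # 16384 + 36864 + 16384 = 69632
print(count_params(plain))       # 2 * 589824 = 1179648
```

At these widths the bottleneck uses roughly 6% of the parameters of the plain pair.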

### Inception-Style Multi-Branch Networks

Inception-style networks process the same input with multiple branches, then concatenate the outputs along the channel axis.

A simplified block might contain:

$$
1 \times 1,\quad 3 \times 3,\quad 5 \times 5,\quad \text{pooling}.
$$

Each branch captures a different scale of local structure.

```python
class InceptionLikeBlock(nn.Module):
    def __init__(self, in_channels):
        super().__init__()

        self.branch1 = nn.Conv2d(in_channels, 32, kernel_size=1)

        self.branch2 = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
        )

        self.branch3 = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, padding=2),
        )

        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_channels, 32, kernel_size=1),
        )

    def forward(self, x):
        outputs = [
            self.branch1(x),
            self.branch2(x),
            self.branch3(x),
            self.branch4(x),
        ]
        return torch.cat(outputs, dim=1)
```

If all branches preserve height and width, concatenation is valid along the channel dimension.
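Since every branch preserves height and width, the only shape change is in channels: the block above produces $32 + 64 + 64 + 32 = 192$ output channels. A minimal sketch of the concatenation, with random tensors standing in for the branch outputs:

```python
import torch

# Four branch outputs with identical spatial size but different channel counts.
b1 = torch.randn(2, 32, 28, 28)
b2 = torch.randn(2, 64, 28, 28)
b3 = torch.randn(2, 64, 28, 28)
b4 = torch.randn(2, 32, 28, 28)

out = torch.cat([b1, b2, b3, b4], dim=1)
print(out.shape)  # torch.Size([2, 192, 28, 28])
```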

### Depthwise Separable Convolution

A standard convolution mixes spatial and channel information in one operation. A depthwise separable convolution separates them into two steps.

First, a depthwise convolution applies one spatial filter per input channel. Then a pointwise $1 \times 1$ convolution mixes channels.

```python
class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()

        self.net = nn.Sequential(
            nn.Conv2d(
                in_channels,
                in_channels,
                kernel_size=3,
                stride=stride,
                padding=1,
                groups=in_channels,
                bias=False,
            ),
            nn.BatchNorm2d(in_channels),
            nn.ReLU(),

            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)
```

This reduces parameter count and computation.

For a standard $3 \times 3$ convolution:

$$
\text{params} = C_{\text{in}} \cdot C_{\text{out}} \cdot 3 \cdot 3.
$$

For a depthwise separable version:

$$
\text{params} = C_{\text{in}} \cdot 3 \cdot 3 + C_{\text{in}} \cdot C_{\text{out}}.
$$
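These formulas match what PyTorch reports. For example, with $C_{\text{in}} = 64$ and $C_{\text{out}} = 128$ (illustrative values):

```python
import torch.nn as nn

def count_params(module):
    return sum(p.numel() for p in module.parameters())

c_in, c_out = 64, 128

standard = nn.Conv2d(c_in, c_out, 3, padding=1, bias=False)

separable = nn.Sequential(
    nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in, bias=False),  # depthwise
    nn.Conv2d(c_in, c_out, 1, bias=False),                         # pointwise
)

print(count_params(standard))   # 64 * 128 * 9 = 73728
print(count_params(separable))  # 64 * 9 + 64 * 128 = 8768
```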

This design appears in efficient CNN families such as MobileNet-style architectures.

### Dense Connections

Dense networks connect each layer to all later layers inside a block. Instead of adding features as in residual networks, they concatenate them.

If a block has intermediate outputs

$$
x_0, x_1, x_2,\ldots,
$$

then a later layer receives

$$
[x_0, x_1, x_2,\ldots,x_k]
$$

as input.

This encourages feature reuse. Early features remain directly available to later layers. The cost is increased channel growth, so dense architectures often use bottleneck and compression layers.
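A minimal sketch of the pattern (the layer widths and a growth rate of 32 channels per layer are illustrative choices, not part of any specific dense architecture):

```python
import torch
import torch.nn as nn

class TinyDenseBlock(nn.Module):
    """Each layer receives the concatenation of all earlier outputs."""
    def __init__(self, in_channels, growth_rate=32, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(channels),
                nn.ReLU(),
                nn.Conv2d(channels, growth_rate, 3, padding=1, bias=False),
            ))
            channels += growth_rate  # input width grows by the growth rate
        self.out_channels = channels

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)

block = TinyDenseBlock(64)
y = block(torch.randn(2, 64, 16, 16))
print(y.shape)  # torch.Size([2, 160, 16, 16]): 64 + 3 * 32 channels
```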

### Efficient CNN Families

Efficient CNN architectures aim to improve accuracy per unit of compute. Common techniques include:

| Technique | Purpose |
|---|---|
| Depthwise separable convolution | Reduce convolution cost |
| Inverted residual blocks | Improve efficiency in narrow networks |
| Squeeze-and-excitation | Reweight channels adaptively |
| Compound scaling | Scale depth, width, and resolution together |
| Stochastic depth | Regularize very deep models |

An inverted residual block expands channels, applies a depthwise convolution, then projects back to fewer channels. This is common in mobile CNNs.
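A minimal sketch of an inverted residual block in the MobileNetV2 style (the expansion factor of 4 is illustrative; real variants tune it per stage):

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, channels, expansion=4):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            # 1x1 expand to a wider hidden width
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(),
            # 3x3 depthwise convolution at the wide width
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(),
            # 1x1 project back down (no activation after the projection)
            nn.Conv2d(hidden, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)

block = InvertedResidual(32)
y = block(torch.randn(2, 32, 28, 28))
print(y.shape)  # torch.Size([2, 32, 28, 28])
```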

### CNN Classifier Head

A modern CNN classifier often ends with:

$$
\text{global average pooling} \rightarrow \text{linear classifier}.
$$

In PyTorch:

```python
head = nn.Sequential(
    nn.AdaptiveAvgPool2d((1, 1)),
    nn.Flatten(),
    nn.Linear(512, 1000),
)
```

This avoids a large fully connected layer over all spatial positions. It also allows the network to accept different input sizes, as long as the convolutional body can process them.
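Because global average pooling collapses any spatial size to $1 \times 1$, the same head accepts feature maps of different resolutions:

```python
import torch
import torch.nn as nn

head = nn.Sequential(
    nn.AdaptiveAvgPool2d((1, 1)),
    nn.Flatten(),
    nn.Linear(512, 1000),
)

# Feature maps from two different input resolutions, same channel count.
for h, w in [(7, 7), (10, 10)]:
    logits = head(torch.randn(2, 512, h, w))
    print(logits.shape)  # torch.Size([2, 1000]) in both cases
```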

### Designing a Small CNN

A practical small CNN for $32 \times 32$ images:

```python
class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()

        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),

            nn.Conv2d(32, 32, 3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),

            nn.Conv2d(32, 64, 3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),

            nn.Conv2d(64, 64, 3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),

            nn.Conv2d(64, 128, 3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),

            nn.AdaptiveAvgPool2d((1, 1)),
        )

        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = x.flatten(1)
        return self.classifier(x)
```

Shape flow for input $[B,3,32,32]$:

| Step | Shape |
|---|---|
| Input | $[B, 3, 32, 32]$ |
| First block | $[B, 32, 16, 16]$ |
| Second block | $[B, 64, 8, 8]$ |
| Final conv | $[B, 128, 8, 8]$ |
| Global average pool | $[B, 128, 1, 1]$ |
| Flatten | $[B, 128]$ |
| Linear | $[B, 10]$ |

### Architecture Choice

Architecture choice depends on the task.

| Task | Architecture preference |
|---|---|
| Small image classification | Simple CNN, ResNet-like CNN |
| Large image classification | ResNet, ConvNeXt, EfficientNet-style CNN |
| Mobile inference | MobileNet-style CNN |
| Segmentation | Encoder-decoder CNN, U-Net, DeepLab-style models |
| Detection | CNN backbone with feature pyramid |
| Image restoration | Residual CNN, U-Net |
| Image generation | U-Net, diffusion backbone, hybrid CNN-transformer |

The best architecture is shaped by input size, compute budget, latency requirements, dataset size, and output structure.

### Summary

CNN architectures arrange convolutional blocks into stages. Early stages preserve spatial detail. Later stages use smaller feature maps and richer channels. Classifier heads usually summarize the final feature maps into logits.

Classical architectures introduced core design patterns: LeNet introduced local convolution and pooling, AlexNet showed large-scale deep CNN training, VGG emphasized simple stacked $3 \times 3$ convolutions, Inception used multi-branch scale processing, ResNet introduced skip connections, and efficient CNNs reduced compute with separable convolutions and careful scaling.

Modern CNN design is built from a small set of reusable ideas: local filters, channel growth, spatial downsampling, normalization, nonlinear activation, residual paths, and efficient classifier heads.

