Residual networks are convolutional networks built from blocks with skip connections. A skip connection passes the input of a block directly to its output, usually by addition. This gives the network a direct path for information and gradients.
The central residual form is

$$y = x + F(x)$$

Here $x$ is the block input, $F(x)$ is the learned residual function, and $y$ is the block output. The block does not need to learn the entire transformation from scratch. It only needs to learn a correction to the input.
## The Optimization Problem in Deep CNNs
Adding more layers should, in principle, make a network more expressive. A deeper network can represent everything a shallower network can represent by making extra layers act like identity functions.
In practice, very deep plain CNNs can be hard to optimize. Training error may get worse as depth increases. This failure is not just overfitting, because the deeper model can perform worse on the training set itself.
Residual connections address this optimization problem. They make it easier for a deep network to preserve useful representations while adding new transformations.
## Residual Learning
Instead of asking a block to learn a full mapping

$$H(x)$$

a residual block learns

$$F(x) = H(x) - x$$

Then the final output is

$$y = F(x) + x$$
If the best transformation is close to the identity function, the block only needs to learn a small residual. If a layer is unnecessary, the network can push $F(x)$ toward zero and pass the input through unchanged.
This makes very deep networks easier to train.
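As a toy numeric sketch (the mapping and numbers below are made-up illustrations, not from the text): when the target mapping is close to the identity, the residual the block must learn is small.

```python
# Toy illustration: a target mapping H that is nearly the identity,
# and the small residual F = H - identity that the block would learn.
def H(x):
    return 1.0625 * x  # hypothetical target, close to the identity

def F(x):
    return H(x) - x    # the residual correction

x = 16.0
y = x + F(x)  # the residual formulation reconstructs H(x) exactly
print(y)      # 17.0
print(F(x))   # 1.0, a small correction relative to the input
```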
## A Basic Residual Block

A basic residual block usually contains two $3 \times 3$ convolutions:

Conv → BatchNorm → ReLU → Conv → BatchNorm
In PyTorch:

```python
import torch
import torch.nn as nn


class BasicBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.activation = nn.ReLU()

    def forward(self, x):
        return self.activation(x + self.residual(x))
```

This block preserves shape. If the input has shape $(N, C, H, W)$, then the output also has shape $(N, C, H, W)$. The addition is valid because both tensors have the same shape.
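As a quick shape check (a self-contained restatement of the basic block, assuming PyTorch is installed; the input sizes are arbitrary):

```python
import torch
import torch.nn as nn


class BasicBlock(nn.Module):
    """Two 3x3 convs with an identity shortcut, restated for a standalone check."""

    def __init__(self, channels):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.activation = nn.ReLU()

    def forward(self, x):
        return self.activation(x + self.residual(x))


x = torch.randn(2, 32, 16, 16)
y = BasicBlock(32)(x)
print(y.shape)  # torch.Size([2, 32, 16, 16]) -- same shape as the input
```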
## Shape Matching

Residual addition requires identical tensor shapes: the residual path output and the shortcut path output must both have the same $(N, C, H, W)$ shape. If the number of channels or spatial size changes, the shortcut path must be adjusted.

A common solution is a projection shortcut using a $1 \times 1$ convolution:
```python
class ProjectionBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(
                in_channels,
                out_channels,
                kernel_size=3,
                stride=stride,
                padding=1,
                bias=False,
            ),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(),
            nn.Conv2d(
                out_channels,
                out_channels,
                kernel_size=3,
                padding=1,
                bias=False,
            ),
            nn.BatchNorm2d(out_channels),
        )
        self.shortcut = nn.Sequential(
            nn.Conv2d(
                in_channels,
                out_channels,
                kernel_size=1,
                stride=stride,
                bias=False,
            ),
            nn.BatchNorm2d(out_channels),
        )
        self.activation = nn.ReLU()

    def forward(self, x):
        return self.activation(self.residual(x) + self.shortcut(x))
```

If `stride=2`, both the residual path and the shortcut path downsample the spatial dimensions. If `out_channels` differs from `in_channels`, the shortcut also changes the channel count.
## Identity Shortcuts

When input and output shapes match, the shortcut can be the identity function:

$$\text{shortcut}(x) = x$$

Then the block computes

$$y = \operatorname{ReLU}(x + F(x))$$
Identity shortcuts add no parameters and no substantial computation. They are one reason residual networks scale well.
Identity shortcuts also preserve gradient flow. During backpropagation, the gradient can pass directly through the addition operation to earlier layers.
## Gradient Flow Through a Residual Block

Let

$$y = x + F(x)$$

If the loss is $L$, then by the chain rule,

$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y}\left(1 + \frac{\partial F}{\partial x}\right)$$

The identity term gives a direct gradient path. Even if the derivative of $F$ becomes small, the gradient can still flow through the shortcut path.
This does not eliminate all optimization problems, but it makes deep networks much more stable than plain stacks of convolutions.
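The chain-rule claim can be checked numerically with autograd, using a hypothetical scalar residual $F(x) = x^2$ and loss $L = y^2$:

```python
import torch

# Scalar toy: y = x + F(x) with F(x) = x**2, and loss L = y**2.
x = torch.tensor(3.0, requires_grad=True)
y = x + x**2
loss = y**2
loss.backward()

# Chain rule: dL/dx = dL/dy * (1 + F'(x)) = 2*y * (1 + 2*x) = 24 * 7 = 168
print(x.grad)  # tensor(168.)
```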
## Residual Stages
A ResNet is organized into stages. Each stage contains several residual blocks at the same spatial resolution.
A typical shape progression (for example, in a ResNet-18-style network on a $224 \times 224$ input) is:

| Stage | Output shape example |
|---|---|
| Stem | (N, 64, 112, 112) |
| Stage 1 | (N, 64, 56, 56) |
| Stage 2 | (N, 128, 28, 28) |
| Stage 3 | (N, 256, 14, 14) |
| Stage 4 | (N, 512, 7, 7) |
| Head | (N, num_classes) |
At the beginning of a new stage, a projection block usually changes the channel count and downsamples spatial resolution. The remaining blocks in that stage use identity shortcuts.
## Building a Small ResNet
A small ResNet can be built from residual stages.
```python
class SmallResNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(32),
            nn.ReLU(),
        )
        self.stage1 = nn.Sequential(
            BasicBlock(32),
            BasicBlock(32),
        )
        self.stage2 = nn.Sequential(
            ProjectionBlock(32, 64, stride=2),
            BasicBlock(64),
        )
        self.stage3 = nn.Sequential(
            ProjectionBlock(64, 128, stride=2),
            BasicBlock(128),
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d((1, 1)),
            nn.Flatten(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        x = self.stem(x)
        x = self.stage1(x)
        x = self.stage2(x)
        x = self.stage3(x)
        return self.head(x)
```

For example, for a CIFAR-sized input of shape (1, 3, 32, 32), the shape flow is:

| Component | Shape |
|---|---|
| Input | (1, 3, 32, 32) |
| Stem | (1, 32, 32, 32) |
| Stage 1 | (1, 32, 32, 32) |
| Stage 2 | (1, 64, 16, 16) |
| Stage 3 | (1, 128, 8, 8) |
| Pool | (1, 128, 1, 1) |
| Logits | (1, 10) |
This is much smaller than standard ResNet variants, but it uses the same principles.
## Bottleneck Residual Blocks

For deeper networks, a bottleneck block reduces computation. It uses three convolutions:

$$1 \times 1 \rightarrow 3 \times 3 \rightarrow 1 \times 1$$

The first $1 \times 1$ convolution reduces the channel dimension. The $3 \times 3$ convolution processes spatial information. The final $1 \times 1$ convolution expands the channel dimension.
```python
class BottleneckBlock(nn.Module):
    def __init__(self, in_channels, bottleneck_channels, out_channels, stride=1):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(in_channels, bottleneck_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(bottleneck_channels),
            nn.ReLU(),
            nn.Conv2d(
                bottleneck_channels,
                bottleneck_channels,
                kernel_size=3,
                stride=stride,
                padding=1,
                bias=False,
            ),
            nn.BatchNorm2d(bottleneck_channels),
            nn.ReLU(),
            nn.Conv2d(bottleneck_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        else:
            self.shortcut = nn.Identity()
        self.activation = nn.ReLU()

    def forward(self, x):
        return self.activation(self.residual(x) + self.shortcut(x))
```

Bottleneck blocks are used in deeper ResNets, such as ResNet-50 and larger variants.
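A rough weight count shows why the bottleneck design is cheaper. The channel sizes below (256 in/out, a 64-channel bottleneck) are illustrative assumptions, not values from the text:

```python
# Conv weight count = in_channels * out_channels * kH * kW (biases omitted,
# BatchNorm parameters ignored).
channels, bottleneck = 256, 64

# A basic block at 256 channels: two 3x3 convolutions.
basic_params = 2 * (channels * channels * 3 * 3)

# A bottleneck block: 1x1 reduce, 3x3 at reduced width, 1x1 expand.
bottleneck_params = (
    channels * bottleneck * 1 * 1
    + bottleneck * bottleneck * 3 * 3
    + bottleneck * channels * 1 * 1
)

print(basic_params)       # 1179648
print(bottleneck_params)  # 69632
```

At these widths the bottleneck path uses roughly 6% of the weights of a basic block, which is what makes very deep stacks affordable.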
## Pre-Activation Residual Blocks

The original residual block applies the activation after the addition. A pre-activation residual block moves normalization and activation before the convolutions.

A simplified pre-activation block has the form:

$$y = x + \mathrm{Conv}\bigl(\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}(\mathrm{ReLU}(\mathrm{BN}(x)))))\bigr)$$
In PyTorch:

```python
class PreActBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.residual = nn.Sequential(
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        return x + self.residual(x)
```

Pre-activation makes the shortcut path cleaner because the identity connection can remain closer to a true identity map. This can improve optimization in very deep residual networks.
## Residual Blocks and Normalization

Residual networks are commonly paired with batch normalization. A typical convolutional residual path uses:

Conv → BatchNorm → ReLU → Conv → BatchNorm
Batch normalization stabilizes activation distributions and helps optimization. The convolution often uses bias=False because batch normalization has learnable affine parameters.
```python
nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False)
nn.BatchNorm2d(64)
```

The batch normalization layer can shift and scale the output, making the convolutional bias redundant in many standard blocks.
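The redundancy can be demonstrated directly: in training mode, batch normalization subtracts the per-channel mean, so a constant per-channel shift (what a conv bias would add) cancels out. A small sketch, with arbitrary tensor sizes:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(4)           # training mode: normalizes with batch statistics
x = torch.randn(8, 4, 5, 5)
bias = torch.randn(1, 4, 1, 1)   # per-channel shift, like a conv bias would add

# Adding a constant per channel shifts the mean by that constant and leaves
# the variance unchanged, so the normalized outputs are (numerically) equal.
print(torch.allclose(bn(x), bn(x + bias), atol=1e-4))  # True
```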
## Residual Connections Beyond CNNs
Residual connections are now used across deep learning, not only in CNNs. Transformers use residual connections around attention and feedforward sublayers. Diffusion U-Nets use residual blocks throughout the denoising network. Multimodal models also rely on residual paths.
The general idea is architecture-independent:

$$\text{output} = \text{input} + \text{sublayer}(\text{input})$$
This is useful whenever very deep networks must be optimized reliably.
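The pattern can be written as a generic wrapper around any shape-preserving sublayer. This is a minimal sketch of the transformer-style usage; the wrapped feedforward network below is an illustrative assumption:

```python
import torch
import torch.nn as nn


class ResidualSublayer(nn.Module):
    """Wrap any shape-preserving sublayer as x + sublayer(x)."""

    def __init__(self, sublayer):
        super().__init__()
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.sublayer(x)


# A feedforward sublayer like the one in a transformer block (hypothetical sizes).
ff = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))
block = ResidualSublayer(ff)

x = torch.randn(2, 10, 16)       # (batch, sequence, features)
print(block(x).shape)            # torch.Size([2, 10, 16])
```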
## Common Implementation Errors
The most common residual network error is a shape mismatch during addition. Both tensors must have the same shape:

```python
out = residual + shortcut
```

If `residual.shape != shortcut.shape`, PyTorch raises an error unless the shapes happen to be broadcastable. Projection shortcuts solve this.
Another common error is using in-place activation (for example, `nn.ReLU(inplace=True)`) in a way that interferes with gradient computation. PyTorch often handles in-place ReLU correctly, but the out-of-place `nn.ReLU()` is safer while learning.
A third error is forgetting to downsample the shortcut path when the residual path uses stride=2. Both paths must agree on spatial dimensions.
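A minimal reproduction of the mismatch error (the 64-vs-32 channel counts are arbitrary):

```python
import torch

residual = torch.randn(1, 64, 16, 16)
shortcut = torch.randn(1, 32, 16, 16)  # wrong channel count

try:
    out = residual + shortcut
except RuntimeError:
    # 64 and 32 channels are not broadcastable, so the addition fails.
    print("shape mismatch:", residual.shape, "vs", shortcut.shape)
```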
## Summary
Residual networks make deep CNNs easier to optimize by adding skip connections. A residual block computes a learned correction and adds it to a shortcut path.
When shapes match, the shortcut can be an identity. When shapes differ, a projection is used. Basic blocks use two $3 \times 3$ convolutions. Bottleneck blocks use $1 \times 1$, $3 \times 3$, and $1 \times 1$ convolutions for efficiency.
Residual connections improve gradient flow, support very deep architectures, and have become a standard design pattern across CNNs, transformers, diffusion models, and large-scale neural networks.