Residual Connections

Residual connections allow a layer or block to add its input directly to its output. Instead of forcing a block to learn a complete transformation from scratch, the block learns a correction to the input.

A residual block has the form

y = x + F(x),

where x is the input, F(x) is a learned transformation, and y is the output.

The function F may be a stack of convolutional layers, a feedforward network, an attention block, or another differentiable module. The addition requires x and F(x) to have the same shape.

Motivation

As networks become deeper, training becomes harder. Even when the architecture has enough capacity, optimization may fail because gradients must pass through many layers. Residual connections create shorter paths for both activations and gradients.

Without a residual connection, a block computes

y = F(x).

With a residual connection, it computes

y = x + F(x).

This makes the identity function easy to represent. If the best transformation for a block is close to doing nothing, the network can set F(x) close to zero. Then

y ≈ x.

This is easier than asking a stack of nonlinear layers to learn the identity function directly.

Gradient Flow Through a Residual Block

Residual connections help because the derivative has an identity term.

Given

y = x + F(x),

the derivative of y with respect to x is

∂y/∂x = I + ∂F(x)/∂x.

During backpropagation, the gradient can flow through the identity term I, even if the learned branch F has small or poorly conditioned derivatives.

This does not eliminate vanishing or exploding gradients in all cases, but it gives optimization a much better path through deep networks.
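The identity term can be checked numerically. In this minimal sketch, the residual branch is a single linear layer (an illustrative choice), so ∂F(x)/∂x is just the layer's weight matrix and the full Jacobian should equal the identity plus that matrix:

```python
import torch
from torch import nn

torch.manual_seed(0)

# A tiny residual block y = x + f(x), where f is a single linear layer,
# so dF/dx is exactly the weight matrix of f.
f = nn.Linear(4, 4)
x = torch.randn(4)

# Jacobian of the residual block with respect to its input.
jac = torch.autograd.functional.jacobian(lambda v: v + f(v), x)

# It equals I + dF/dx: the identity plus f's weight matrix.
print(torch.allclose(jac, torch.eye(4) + f.weight))  # True
```

Even if f's weights were scaled toward zero, the Jacobian would stay close to the identity, which is exactly the path gradients use.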

A Basic Residual MLP Block

A residual block for vector inputs can be written directly in PyTorch:

import torch
from torch import nn

class ResidualMLPBlock(nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x):
        return x + self.f(x)

The first linear layer expands the representation. The second linear layer projects it back to dim, so the output of self.f(x) has the same shape as x.

Example:

x = torch.randn(32, 128)

block = ResidualMLPBlock(dim=128, hidden_dim=512)
y = block(x)

print(y.shape)  # torch.Size([32, 128])

The shape is preserved.

Residual CNN Blocks

Residual connections were especially important in deep convolutional networks. A convolutional residual block often has the structure

Conv2d -> BatchNorm2d -> ReLU -> Conv2d -> BatchNorm2d -> Add -> ReLU

Example:

class BasicResidualCNNBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()

        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(),

            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

        self.activation = nn.ReLU()

    def forward(self, x):
        return self.activation(x + self.f(x))

For an input tensor

x ∈ ℝ^{B×C×H×W},

the residual branch must return the same shape:

F(x) ∈ ℝ^{B×C×H×W}.

In PyTorch:

x = torch.randn(8, 64, 32, 32)

block = BasicResidualCNNBlock(channels=64)
y = block(x)

print(y.shape)  # torch.Size([8, 64, 32, 32])

Projection Residual Connections

Sometimes a block changes the number of channels, spatial resolution, or feature dimension. In that case, x and F(x) do not have the same shape. We need a projection on the skip path.

A projection residual block has the form

y = P(x) + F(x),

where P maps the input into the correct output shape.

For CNNs, P is often a 1×1 convolution:

class ProjectionResidualCNNBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()

        self.f = nn.Sequential(
            nn.Conv2d(
                in_channels,
                out_channels,
                kernel_size=3,
                stride=stride,
                padding=1,
                bias=False,
            ),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(),

            nn.Conv2d(
                out_channels,
                out_channels,
                kernel_size=3,
                padding=1,
                bias=False,
            ),
            nn.BatchNorm2d(out_channels),
        )

        self.proj = nn.Sequential(
            nn.Conv2d(
                in_channels,
                out_channels,
                kernel_size=1,
                stride=stride,
                bias=False,
            ),
            nn.BatchNorm2d(out_channels),
        )

        self.activation = nn.ReLU()

    def forward(self, x):
        return self.activation(self.proj(x) + self.f(x))

Example:

x = torch.randn(8, 64, 32, 32)

block = ProjectionResidualCNNBlock(
    in_channels=64,
    out_channels=128,
    stride=2,
)

y = block(x)

print(y.shape)  # torch.Size([8, 128, 16, 16])

The residual branch changes the channel count from 64 to 128 and downsamples the spatial dimensions from 32×32 to 16×16. The projection branch performs the same shape change, making the addition valid.

Residual Connections in Transformers

Transformers use residual connections around attention and feedforward sublayers.

A common pre-normalization transformer block is

x' = x + Attention(LayerNorm(x)),
y = x' + FFN(LayerNorm(x')).

In PyTorch:

class TransformerResidualBlock(nn.Module):
    def __init__(self, dim, num_heads, hidden_dim):
        super().__init__()

        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(
            embed_dim=dim,
            num_heads=num_heads,
            batch_first=True,
        )

        self.ln2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x, attn_mask=None):
        h = self.ln1(x)
        h, _ = self.attn(h, h, h, attn_mask=attn_mask)
        x = x + h

        h = self.ln2(x)
        h = self.ffn(h)
        x = x + h

        return x

For sequence input

x ∈ ℝ^{B×T×D},

both the attention sublayer and feedforward sublayer return tensors of the same shape. This makes residual addition possible.

Pre-Norm and Post-Norm Blocks

Residual blocks with normalization can be arranged in different ways.

In post-normalization, normalization is applied after the residual addition:

y = Norm(x + F(x)).

In pre-normalization, normalization is applied before the learned branch:

y = x + F(Norm(x)).

Post-normalization was used in early transformer designs. Pre-normalization is common in deeper transformer models because it usually improves gradient flow.

A practical distinction:

Layout      Formula                 Typical behavior
Post-norm   y = Norm(x + F(x))      Can work well, but deep models may be harder to optimize
Pre-norm    y = x + F(Norm(x))      Usually more stable for deep transformers

Pre-normalization leaves the residual stream more direct. This gives gradients a cleaner identity path through the model.
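The two layouts can be written side by side. A minimal sketch for vector inputs (the GELU feedforward branch is an illustrative choice; only the position of nn.LayerNorm relative to the addition differs):

```python
import torch
from torch import nn

class PostNormResidualBlock(nn.Module):
    """Post-norm layout: y = Norm(x + F(x))."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # Normalization sits after the residual addition.
        return self.norm(x + self.f(x))

class PreNormResidualBlock(nn.Module):
    """Pre-norm layout: y = x + F(Norm(x))."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.f = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x):
        # The skip path bypasses normalization entirely.
        return x + self.f(self.norm(x))
```

In the pre-norm version the skip path is a pure addition, which is what keeps the residual stream direct.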

Residual Scaling

In very deep networks, repeated residual additions can increase activation scale. Residual scaling reduces the magnitude of the residual branch:

y = x + αF(x),

where α is usually a small constant or a learned scalar.

Example:

class ScaledResidualMLPBlock(nn.Module):
    def __init__(self, dim, hidden_dim, scale=0.1):
        super().__init__()
        self.scale = scale
        self.f = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x):
        return x + self.scale * self.f(x)

Residual scaling can help stabilize very deep networks, especially when normalization alone is insufficient.

Some architectures initialize the final layer of a residual branch near zero. This makes the block initially behave close to the identity function.

class ZeroInitResidualMLPBlock(nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()

        self.fc1 = nn.Linear(dim, hidden_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)

        nn.init.zeros_(self.fc2.weight)
        nn.init.zeros_(self.fc2.bias)

    def forward(self, x):
        return x + self.fc2(self.act(self.fc1(x)))

At initialization, the residual branch contributes almost nothing, so the block begins close to

y = x.

Shape Requirements

Residual addition requires equal shapes, or at least shapes that broadcast intentionally. In most residual blocks, exact shape equality is preferred.

For vectors:

x.shape      # [B, D]
f_x.shape    # [B, D]

For image feature maps:

x.shape      # [B, C, H, W]
f_x.shape    # [B, C, H, W]

For token sequences:

x.shape      # [B, T, D]
f_x.shape    # [B, T, D]

A common bug is changing the feature dimension in the residual branch without projecting the skip path.

class BadResidualBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.f = nn.Linear(128, 256)

    def forward(self, x):
        return x + self.f(x)  # Shape error

Here x has shape [B, 128], while self.f(x) has shape [B, 256]. These cannot be added.

A corrected version adds a projection:

class GoodResidualBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.f = nn.Linear(128, 256)
        self.proj = nn.Linear(128, 256)

    def forward(self, x):
        return self.proj(x) + self.f(x)

Residual Connections and Model Depth

Residual connections make it possible to train much deeper networks than plain stacked layers. They do this by making each block learn an incremental update rather than an entirely new representation.

We can think of a deep residual network as a sequence of state updates:

x_{l+1} = x_l + F_l(x_l).

Each block slightly modifies the current representation. This view is useful for both CNNs and transformers. A representation is refined layer by layer.

For language models, the residual stream can be viewed as a shared workspace. Attention and feedforward layers read from it, write updates into it, and pass it forward to later layers.

Practical Rules

Use residual connections when building networks deeper than a few layers.

Keep the residual branch output shape identical to the skip path unless you deliberately use a projection.

Use pre-normalization for deep transformer-style blocks.

Use projection shortcuts when changing channel count, feature dimension, or spatial resolution.

Consider residual scaling or zero-initialized residual branches for very deep networks.

Avoid accidental broadcasting in residual additions. Exact shape matching is usually safer.
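The last rule is easy to trip over because broadcasting hides the mistake rather than raising an error. A sketch of the failure mode:

```python
import torch

x = torch.randn(32, 128)   # [B, D]
f_x = torch.randn(128)     # [D] -- batch dimension accidentally dropped

y = x + f_x                # broadcasts silently; no error is raised
print(y.shape)             # torch.Size([32, 128])
```

The output shape looks correct, which is why an explicit shape check on the residual branch is safer than relying on the addition to fail.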

Summary

Residual connections add a block’s input to its learned transformation:

y = x + F(x).

They improve gradient flow, make identity mappings easy to represent, and allow networks to grow much deeper. CNNs use residual connections to build deep visual models. Transformers use residual connections around attention and feedforward sublayers.

The main implementation rule is shape compatibility. If the learned branch changes shape, the skip path must be projected to the same shape before addition.