Residual connections allow a layer or block to add its input directly to its output. Instead of forcing a block to learn a complete transformation from scratch, the block learns a correction to the input.
A residual block has the form

$$y = x + F(x)$$

where $x$ is the input, $F$ is a learned transformation, and $y$ is the output.

The function $F$ may be a stack of convolutional layers, a feedforward network, an attention block, or another differentiable module. The addition requires $x$ and $F(x)$ to have the same shape.
Motivation
As networks become deeper, training becomes harder. Even when the architecture has enough capacity, optimization may fail because gradients must pass through many layers. Residual connections create shorter paths for both activations and gradients.
Without a residual connection, a block computes

$$y = F(x)$$

With a residual connection, it computes

$$y = x + F(x)$$

This makes the identity function easy to represent. If the best transformation for a block is close to doing nothing, the network can drive $F(x)$ close to zero. Then

$$y \approx x$$
This is easier than asking a stack of nonlinear layers to learn the identity function directly.
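As a minimal sketch (not from the original text), a residual block whose branch weights are all zero simply passes its input through:

```python
import torch
from torch import nn

# Toy check: if the learned branch F outputs zero, the residual block is the identity.
dim = 8
branch = nn.Linear(dim, dim)
nn.init.zeros_(branch.weight)
nn.init.zeros_(branch.bias)

x = torch.randn(2, dim)
y = x + branch(x)          # residual block with F(x) == 0
print(torch.equal(y, x))   # True: the block leaves the input unchanged
```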
Gradient Flow Through a Residual Block
Residual connections help because the derivative of the block's output with respect to its input has an identity term.

Given

$$y = x + F(x)$$

the derivative of $y$ with respect to $x$ is

$$\frac{\partial y}{\partial x} = I + \frac{\partial F(x)}{\partial x}$$

During backpropagation, the gradient can flow through the identity term $I$, even if the learned branch has small or poorly conditioned derivatives.
This does not eliminate vanishing or exploding gradients in all cases, but it gives optimization a much better path through deep networks.
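To make the identity term concrete, here is a small check (not part of the original text) that the Jacobian of a residual block equals the identity plus the Jacobian of the learned branch; the linear layer stands in for $F$:

```python
import torch
from torch import nn

torch.manual_seed(0)
dim = 4
f = nn.Linear(dim, dim)   # stands in for the learned branch F
x = torch.randn(dim)

# Jacobian of the residual block y = x + f(x), and of the branch alone.
jac_residual = torch.autograd.functional.jacobian(lambda v: v + f(v), x)
jac_branch = torch.autograd.functional.jacobian(f, x)

# The residual Jacobian is the branch Jacobian plus the identity matrix.
print(torch.allclose(jac_residual, torch.eye(dim) + jac_branch))  # True
```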
A Basic Residual MLP Block
A residual block for vector inputs can be written directly in PyTorch:
```python
import torch
from torch import nn

class ResidualMLPBlock(nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x):
        return x + self.f(x)
```

The first linear layer expands the representation. The second linear layer projects it back to `dim`, so the output of `self.f(x)` has the same shape as `x`.
Example:
```python
x = torch.randn(32, 128)
block = ResidualMLPBlock(dim=128, hidden_dim=512)
y = block(x)
print(y.shape)  # torch.Size([32, 128])
```

The shape is preserved.
Residual CNN Blocks
Residual connections were especially important in deep convolutional networks. A convolutional residual block often has the structure
```
Conv2d -> BatchNorm2d -> ReLU -> Conv2d -> BatchNorm2d -> Add -> ReLU
```

Example:
```python
class BasicResidualCNNBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.activation = nn.ReLU()

    def forward(self, x):
        return self.activation(x + self.f(x))
```

For an input tensor `x` of shape `[B, C, H, W]`, the residual branch `self.f(x)` must return a tensor of the same shape.

In PyTorch:
```python
x = torch.randn(8, 64, 32, 32)
block = BasicResidualCNNBlock(channels=64)
y = block(x)
print(y.shape)  # torch.Size([8, 64, 32, 32])
```

Projection Residual Connections
Sometimes a block changes the number of channels, spatial resolution, or feature dimension. In that case, $x$ and $F(x)$ do not have the same shape. We need a projection on the skip path.

A projection residual block has the form

$$y = P(x) + F(x)$$

where $P$ maps the input into the correct output shape.

For CNNs, $P$ is often a $1 \times 1$ convolution:
```python
class ProjectionResidualCNNBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(
                in_channels,
                out_channels,
                kernel_size=3,
                stride=stride,
                padding=1,
                bias=False,
            ),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(),
            nn.Conv2d(
                out_channels,
                out_channels,
                kernel_size=3,
                padding=1,
                bias=False,
            ),
            nn.BatchNorm2d(out_channels),
        )
        self.proj = nn.Sequential(
            nn.Conv2d(
                in_channels,
                out_channels,
                kernel_size=1,
                stride=stride,
                bias=False,
            ),
            nn.BatchNorm2d(out_channels),
        )
        self.activation = nn.ReLU()

    def forward(self, x):
        return self.activation(self.proj(x) + self.f(x))
```

Example:
```python
x = torch.randn(8, 64, 32, 32)
block = ProjectionResidualCNNBlock(
    in_channels=64,
    out_channels=128,
    stride=2,
)
y = block(x)
print(y.shape)  # torch.Size([8, 128, 16, 16])
```

The residual branch changes the channel count from 64 to 128 and downsamples the spatial dimensions from $32 \times 32$ to $16 \times 16$. The projection branch performs the same shape change, making the addition valid.
Residual Connections in Transformers
Transformers use residual connections around attention and feedforward sublayers.
A common pre-normalization transformer block is

$$h = x + \mathrm{Attention}(\mathrm{LayerNorm}(x))$$
$$y = h + \mathrm{FFN}(\mathrm{LayerNorm}(h))$$
In PyTorch:
```python
class TransformerResidualBlock(nn.Module):
    def __init__(self, dim, num_heads, hidden_dim):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(
            embed_dim=dim,
            num_heads=num_heads,
            batch_first=True,
        )
        self.ln2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x, attn_mask=None):
        h = self.ln1(x)
        h, _ = self.attn(h, h, h, attn_mask=attn_mask)
        x = x + h
        h = self.ln2(x)
        h = self.ffn(h)
        x = x + h
        return x
```

For sequence input `x` of shape `[B, T, D]`, both the attention sublayer and the feedforward sublayer return tensors of the same shape. This makes residual addition possible.
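A quick usage sketch, with illustrative dimensions rather than values from the original text:

```python
x = torch.randn(8, 16, 256)   # [B, T, D]
block = TransformerResidualBlock(dim=256, num_heads=8, hidden_dim=1024)
y = block(x)
print(y.shape)                # torch.Size([8, 16, 256])
```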
Pre-Norm and Post-Norm Blocks
Residual blocks with normalization can be arranged in different ways.
In post-normalization, normalization is applied after the residual addition:

$$y = \mathrm{LayerNorm}(x + F(x))$$

In pre-normalization, normalization is applied before the learned branch:

$$y = x + F(\mathrm{LayerNorm}(x))$$
Post-normalization was used in early transformer designs. Pre-normalization is common in deeper transformer models because it usually improves gradient flow.
A practical distinction:
| Layout | Formula | Typical behavior |
|---|---|---|
| Post-norm | $y = \mathrm{LayerNorm}(x + F(x))$ | Can work well, but deep models may be harder to optimize |
| Pre-norm | $y = x + F(\mathrm{LayerNorm}(x))$ | Usually more stable for deep transformers |
Pre-normalization leaves the residual stream more direct. This gives gradients a cleaner identity path through the model.
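As an illustration of the two layouts, here is a minimal sketch (class names and structure are illustrative, not from the original text) of a feedforward sublayer in post-norm and pre-norm form:

```python
import torch
from torch import nn

class PostNormBlock(nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # Normalize after the residual addition.
        return self.norm(x + self.f(x))

class PreNormBlock(nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # Normalize only the input of the learned branch; the skip path is untouched.
        return x + self.f(self.norm(x))
```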
Residual Scaling
In very deep networks, repeated residual additions can increase activation scale. Residual scaling reduces the magnitude of the residual branch:

$$y = x + \alpha \, F(x)$$

where $\alpha$ is usually a small constant or a learned scalar.
Example:
```python
class ScaledResidualMLPBlock(nn.Module):
    def __init__(self, dim, hidden_dim, scale=0.1):
        super().__init__()
        self.scale = scale
        self.f = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x):
        return x + self.scale * self.f(x)
```

Residual scaling can help stabilize very deep networks, especially when normalization alone is insufficient.
Some architectures initialize the final layer of a residual branch near zero. This makes the block initially behave close to the identity function.
```python
class ZeroInitResidualMLPBlock(nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)
        nn.init.zeros_(self.fc2.weight)
        nn.init.zeros_(self.fc2.bias)

    def forward(self, x):
        return x + self.fc2(self.act(self.fc1(x)))
```

At initialization, the residual branch contributes almost nothing, so the block begins close to the identity mapping $y = x$.
Shape Requirements
Residual addition requires equal shapes, or at least shapes that broadcast intentionally. In most residual blocks, exact shape equality is preferred.
For vectors:
```python
x.shape    # [B, D]
f_x.shape  # [B, D]
```

For image feature maps:
```python
x.shape    # [B, C, H, W]
f_x.shape  # [B, C, H, W]
```

For token sequences:
```python
x.shape    # [B, T, D]
f_x.shape  # [B, T, D]
```

A common bug is changing the feature dimension in the residual branch without projecting the skip path.
```python
class BadResidualBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.f = nn.Linear(128, 256)

    def forward(self, x):
        return x + self.f(x)  # Shape error
```

Here `x` has shape `[B, 128]`, while `self.f(x)` has shape `[B, 256]`. These cannot be added.
A corrected version adds a projection:
```python
class GoodResidualBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.f = nn.Linear(128, 256)
        self.proj = nn.Linear(128, 256)

    def forward(self, x):
        return self.proj(x) + self.f(x)
```

Residual Connections and Model Depth
Residual connections make it possible to train much deeper networks than plain stacked layers. They do this by making each block learn an incremental update rather than an entirely new representation.
We can think of a deep residual network as a sequence of state updates:

$$x_{l+1} = x_l + F_l(x_l)$$
Each block slightly modifies the current representation. This view is useful for both CNNs and transformers. A representation is refined layer by layer.
For language models, the residual stream can be viewed as a shared workspace. Attention and feedforward layers read from it, write updates into it, and pass it forward to later layers.
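To illustrate this view, here is a short sketch (reusing the ResidualMLPBlock defined earlier; depth and widths are illustrative) of a deep residual network as repeated updates to a single stream:

```python
depth, dim, hidden_dim = 12, 128, 512
blocks = nn.ModuleList([ResidualMLPBlock(dim, hidden_dim) for _ in range(depth)])

x = torch.randn(32, dim)  # the residual stream
for block in blocks:
    x = block(x)          # each block writes a small update into the stream
print(x.shape)            # torch.Size([32, 128])
```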
Practical Rules
- Use residual connections when building networks deeper than a few layers.
- Keep the residual branch output shape identical to the skip path unless you deliberately use a projection.
- Use pre-normalization for deep transformer-style blocks.
- Use projection shortcuts when changing channel count, feature dimension, or spatial resolution.
- Consider residual scaling or zero-initialized residual branches for very deep networks.
- Avoid accidental broadcasting in residual additions. Exact shape matching is usually safer; see the sketch after this list.
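For the last point, a hypothetical example of accidental broadcasting that fails silently:

```python
import torch

x = torch.randn(32, 128)    # [B, D]
f_x = torch.randn(1, 128)   # [1, D]: wrong shape, but it broadcasts
y = x + f_x
print(y.shape)              # torch.Size([32, 128]) -- no error, but likely a bug
```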
Summary
Residual connections add a block's input to its learned transformation:

$$y = x + F(x)$$
They improve gradient flow, make identity mappings easy to represent, and allow networks to grow much deeper. CNNs use residual connections to build deep visual models. Transformers use residual connections around attention and feedforward sublayers.
The main implementation rule is shape compatibility. If the learned branch changes shape, the skip path must be projected to the same shape before addition.