Padding and Stride

Padding and stride control the spatial size of convolutional feature maps. Kernel size controls how large a local window the layer sees. Padding controls what happens near the boundary. Stride controls how far the kernel moves between neighboring output positions.

These parameters determine the mapping

[B, C_{\text{in}}, H, W] \rightarrow [B, C_{\text{out}}, H_{\text{out}}, W_{\text{out}}].

A correct CNN implementation requires careful tracking of these shapes.

Why Padding Is Needed

A convolutional kernel needs a local patch of input values. Near the image boundary, a kernel may extend beyond the available pixels. Padding solves this by adding extra values around the input.

The most common choice is zero padding. For a 2D image, padding p = 1 adds one row above, one row below, one column on the left, and one column on the right.

If the original input has size

H \times W,

then padding changes it to

(H + 2p) \times (W + 2p).

For example, a 32 × 32 image with padding 1 becomes 34 × 34 before the convolution is applied.
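This padding step can be checked directly with `torch.nn.functional.pad` before any convolution is involved (a minimal sketch with an arbitrary random input):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 32, 32)   # a single-channel 32 x 32 image

# Pad order for the last two dims: (left, right, top, bottom)
x_pad = F.pad(x, (1, 1, 1, 1))  # zero padding with p = 1 on every side

print(x_pad.shape)  # torch.Size([1, 1, 34, 34])
```

`nn.Conv2d` performs this padding internally when its `padding` argument is nonzero, so an explicit `F.pad` call is only needed for nonstandard padding.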

Valid and Same Convolutions

A valid convolution uses no padding. The kernel is applied only where it fully fits inside the input. This shrinks the spatial size.

For a 3 × 3 kernel and stride 1, valid convolution maps

32 \times 32 \rightarrow 30 \times 30.

A same convolution chooses padding so that the output has the same spatial size as the input, usually when stride is 1. For a 3 × 3 kernel, padding 1 gives

32 \times 32 \rightarrow 32 \times 32.

For a 5 × 5 kernel, padding 2 preserves size. More generally, for odd kernel size k, stride 1, and equal padding on both sides,

p = \frac{k - 1}{2}.

Examples:

Kernel size    Padding for same size, stride 1
1              0
3              1
5              2
7              3
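The same-padding rule in this table can be verified for each kernel size with a short loop (illustrative channel counts):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)
out_sizes = {}
for k in (1, 3, 5, 7):
    p = (k - 1) // 2                          # same-padding rule: p = (k - 1) / 2
    conv = nn.Conv2d(3, 8, kernel_size=k, padding=p)
    out_sizes[k] = tuple(conv(x).shape[-2:])  # spatial size of the output

print(out_sizes)  # every entry is (32, 32)
```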

Output Shape Formula

For a 2D convolution with input height H, input width W, kernel size k_h \times k_w, padding p_h, p_w, and stride s_h, s_w, the output size is

H_{\text{out}} = \left\lfloor \frac{H + 2p_h - k_h}{s_h} \right\rfloor + 1,

W_{\text{out}} = \left\lfloor \frac{W + 2p_w - k_w}{s_w} \right\rfloor + 1.

The floor appears because the kernel can only be placed at integer positions. If the final step would go beyond the padded input, it is discarded.

For example, let

H = W = 32,\quad k_h = k_w = 3,\quad p_h = p_w = 1,\quad s_h = s_w = 1.

Then

H_{\text{out}} = \left\lfloor \frac{32 + 2 - 3}{1} \right\rfloor + 1 = 32.

So the spatial size is preserved.
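The formula is small enough to write as a helper function and check against an actual layer. The function name `conv_out_size` and the example values are illustrative, not part of any PyTorch API:

```python
import torch
import torch.nn as nn

def conv_out_size(n, k, p, s):
    """Output size along one spatial axis: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

# Check the formula against an actual layer.
H = W = 32
conv = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)
y = conv(torch.randn(1, 3, H, W))

predicted = conv_out_size(H, k=3, p=1, s=2)
print(predicted, y.shape[-1])  # 16 16
```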

Stride

Stride is the step size of the sliding kernel. With stride 1, the kernel moves one pixel at a time. With stride 2, it moves two pixels at a time.

Stride reduces spatial resolution. A stride-2 convolution usually halves height and width, subject to the exact shape formula.

For example:

import torch
import torch.nn as nn

conv = nn.Conv2d(
    in_channels=3,
    out_channels=16,
    kernel_size=3,
    stride=2,
    padding=1,
)

x = torch.randn(8, 3, 32, 32)
y = conv(x)

print(y.shape)  # torch.Size([8, 16, 16, 16])

The spatial dimensions are reduced from 32 × 32 to 16 × 16.

Padding Examples in PyTorch

PyTorch uses the NCHW convention for nn.Conv2d:

[B, C, H, W].

No padding:

conv = nn.Conv2d(3, 16, kernel_size=3, padding=0)

x = torch.randn(8, 3, 32, 32)
y = conv(x)

print(y.shape)  # torch.Size([8, 16, 30, 30])

Padding one pixel:

conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)

x = torch.randn(8, 3, 32, 32)
y = conv(x)

print(y.shape)  # torch.Size([8, 16, 32, 32])

Padding two pixels with a 5 × 5 kernel:

conv = nn.Conv2d(3, 16, kernel_size=5, padding=2)

x = torch.randn(8, 3, 32, 32)
y = conv(x)

print(y.shape)  # torch.Size([8, 16, 32, 32])

Padding keeps feature maps from shrinking too quickly as layers are stacked.

Asymmetric Padding

Padding can differ across axes. A layer may use more padding along width than height, or padding may differ on the left and right sides.

PyTorch nn.Conv2d supports per-axis padding directly, with the same amount added on both sides of each axis:

conv = nn.Conv2d(
    in_channels=3,
    out_channels=16,
    kernel_size=(3, 5),
    padding=(1, 2),
)

This uses padding 1 along height and 2 along width. It preserves size for a 3 × 5 kernel with stride 1.

For fully asymmetric padding, use torch.nn.functional.pad before the convolution.

import torch.nn.functional as F

x = torch.randn(8, 3, 32, 32)

# Padding order: left, right, top, bottom
x_pad = F.pad(x, (1, 2, 3, 4))

conv = nn.Conv2d(3, 16, kernel_size=3, padding=0)
y = conv(x_pad)

print(y.shape)  # torch.Size([8, 16, 37, 33])

Asymmetric padding appears in some architectures that need exact alignment between feature maps.

Padding Modes

Zero padding is the default. Other padding modes are sometimes useful.

Padding mode   Description
zeros          Pads with zeros
reflect        Mirrors values near the boundary
replicate      Repeats boundary values
circular       Wraps values around from the opposite side

Example:

conv = nn.Conv2d(
    in_channels=3,
    out_channels=16,
    kernel_size=3,
    padding=1,
    padding_mode="reflect",
)

Zero padding is standard for most CNNs. Reflect padding can reduce boundary artifacts in image restoration tasks. Circular padding is useful only when the data has circular structure, such as periodic signals.
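The difference between these modes is easiest to see on a tiny tensor using `torch.nn.functional.pad`, which supports the same mode names (a small illustration, not tied to any particular network):

```python
import torch
import torch.nn.functional as F

row = torch.tensor([[[[1., 2., 3., 4.]]]])  # shape [1, 1, 1, 4]

# Pad 2 values on the left and right of the width axis.
zeros     = F.pad(row, (2, 2, 0, 0), mode="constant", value=0.0)
reflect   = F.pad(row, (2, 2, 0, 0), mode="reflect")
replicate = F.pad(row, (2, 2, 0, 0), mode="replicate")
circular  = F.pad(row, (2, 2, 0, 0), mode="circular")

print(zeros.flatten().tolist())      # [0.0, 0.0, 1.0, 2.0, 3.0, 4.0, 0.0, 0.0]
print(reflect.flatten().tolist())    # [3.0, 2.0, 1.0, 2.0, 3.0, 4.0, 3.0, 2.0]
print(replicate.flatten().tolist())  # [1.0, 1.0, 1.0, 2.0, 3.0, 4.0, 4.0, 4.0]
print(circular.flatten().tolist())   # [3.0, 4.0, 1.0, 2.0, 3.0, 4.0, 1.0, 2.0]
```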

Stride Examples

Stride changes how densely the kernel samples the input.

Stride 1:

conv = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)

x = torch.randn(8, 3, 32, 32)
y = conv(x)

print(y.shape)  # torch.Size([8, 16, 32, 32])

Stride 2:

conv = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)

x = torch.randn(8, 3, 32, 32)
y = conv(x)

print(y.shape)  # torch.Size([8, 16, 16, 16])

Stride 4:

conv = nn.Conv2d(3, 16, kernel_size=3, stride=4, padding=1)

x = torch.randn(8, 3, 32, 32)
y = conv(x)

print(y.shape)  # torch.Size([8, 16, 8, 8])

A large stride reduces computation, but it also discards spatial detail.

Stride as Downsampling

A stride-2 convolution performs feature extraction and downsampling in one operation. This differs from pooling, which downsamples using a fixed rule.

A typical convolutional block may use stride 2 when entering a new stage:

downsample = nn.Conv2d(
    in_channels=64,
    out_channels=128,
    kernel_size=3,
    stride=2,
    padding=1,
)

This maps

[B, 64, 56, 56] \rightarrow [B, 128, 28, 28].

The number of channels increases while spatial size decreases. This is a common design pattern in CNNs.
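Stacking several such stages shows the pattern directly. The channel counts below are illustrative; real architectures interleave these downsampling layers with full residual or convolutional blocks:

```python
import torch
import torch.nn as nn

# Three downsampling stages: channels double, spatial size halves.
stages = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
    nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1),
    nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1),
)

x = torch.randn(1, 64, 56, 56)
for stage in stages:
    x = stage(x)
    print(tuple(x.shape))  # 56 -> 28 -> 14 -> 7 spatially
```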

Stride and Aliasing

Downsampling can create aliasing. Aliasing happens when high-frequency information is sampled too coarsely and appears as distorted low-frequency information.

In signal processing, downsampling is usually preceded by low-pass filtering. In CNNs, stride performs downsampling directly. The learned convolution may partially compensate, but aliasing can still occur.

Some architectures use blur pooling or anti-aliased downsampling. These methods smooth the feature map before reducing its resolution.

The main practical rule is simple: avoid aggressive downsampling before the network has learned enough local structure.
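One way to sketch anti-aliased downsampling is to blur with a fixed averaging kernel before subsampling. The function `blur_pool` below is an illustrative sketch of the idea, not a library API, and real blur pooling typically uses a binomial kernel rather than a uniform average:

```python
import torch
import torch.nn.functional as F

def blur_pool(x, stride=2):
    """Illustrative anti-aliased downsampling: low-pass filter with a
    fixed 3x3 averaging kernel, then subsample with the given stride."""
    c = x.shape[1]
    # One 3x3 averaging kernel per channel, applied via grouped convolution.
    kernel = torch.full((c, 1, 3, 3), 1.0 / 9.0)
    x = F.conv2d(x, kernel, padding=1, groups=c)  # blur (low-pass)
    return x[:, :, ::stride, ::stride]            # subsample

x = torch.randn(2, 8, 32, 32)
y = blur_pool(x)
print(y.shape)  # torch.Size([2, 8, 16, 16])
```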

Odd and Even Kernel Sizes

Odd kernel sizes are common because they have a natural center. A 3 × 3 kernel has one central position. A 5 × 5 kernel also has one central position.

Even kernel sizes, such as 2 × 2 or 4 × 4, can be useful but may create alignment issues. There is no single center pixel. This matters when preserving spatial alignment across layers, especially in segmentation, detection, and image generation.

For standard CNN blocks, 3 × 3 kernels with padding 1 are often the default choice.

Shape Tracking Through Blocks

Consider a simple CNN block:

block = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
)

For input

[B, 3, 64, 64],

the first convolution gives

[B, 32, 64, 64].

The second convolution keeps

[B, 32, 64, 64].

The pooling layer gives

[B, 32, 32, 32].

The spatial size is reduced only at the pooling layer. The convolutions preserve size because they use padding 1 with 3 × 3 kernels.
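This kind of shape tracking can be automated by iterating over the layers of the block and printing the shape after each one:

```python
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
)

x = torch.randn(1, 3, 64, 64)
for layer in block:
    x = layer(x)
    print(layer.__class__.__name__, tuple(x.shape))
```

Only the final MaxPool2d line shows a change in spatial size; every other layer preserves (64, 64).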

Padding and Boundary Effects

Padding creates artificial boundary values. With zero padding, pixels near the image border are convolved with some zeros. This can make boundary activations different from interior activations.

In classification, this often causes little trouble. In image restoration, segmentation, and generation, boundary artifacts can matter more.

Possible responses include:

Method                      Use case
Larger input crops          Reduce boundary influence
Reflection padding          Image restoration
Valid convolutions          Avoid artificial boundaries
Cropping skip connections   Align encoder-decoder features

No padding rule is best for every task. The choice depends on the architecture and output requirements.

Shape Mismatch Errors

Many CNN bugs are shape bugs. A common error occurs when a feature map is flattened before a linear layer.

features = torch.randn(8, 64, 7, 7)
flat = features.flatten(1)

print(flat.shape)  # torch.Size([8, 3136])

The following linear layer must expect 3136 input features:

classifier = nn.Linear(64 * 7 * 7, 10)

If padding or stride changes earlier in the network, the value 64 · 7 · 7 = 3136 may change. Adaptive pooling avoids this problem:

head = nn.Sequential(
    nn.AdaptiveAvgPool2d((1, 1)),
    nn.Flatten(),
    nn.Linear(64, 10),
)

This head works for many input spatial sizes because adaptive pooling always produces 1 × 1 spatial maps.
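The flexibility is easy to demonstrate by feeding the same head two different spatial sizes:

```python
import torch
import torch.nn as nn

head = nn.Sequential(
    nn.AdaptiveAvgPool2d((1, 1)),  # any H x W -> 1 x 1
    nn.Flatten(),
    nn.Linear(64, 10),
)

# The same head accepts different spatial sizes without modification.
small = head(torch.randn(8, 64, 7, 7))
large = head(torch.randn(8, 64, 14, 14))

print(small.shape, large.shape)  # both torch.Size([8, 10])
```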

Summary

Padding extends the input boundary before convolution. It controls how much spatial size is preserved and how boundary pixels are handled. Stride controls how far the kernel moves. It controls downsampling and computation.

For a convolutional layer, the output shape is determined by input size, kernel size, padding, stride, and dilation. In standard CNNs, 3 × 3 kernels with padding 1 and stride 1 preserve spatial size. Stride 2 is commonly used to downsample.

Good CNN design requires disciplined shape tracking. Most implementation errors can be found by writing down the tensor shape after each layer.