Padding and Stride

Padding and stride control the spatial size of convolutional feature maps. Kernel size controls how large a local window the layer sees. Padding controls what happens near the boundary. Stride controls how far the kernel moves between neighboring output positions.

These parameters determine the mapping

[B, C_{\text{in}}, H, W] \rightarrow [B, C_{\text{out}}, H_{\text{out}}, W_{\text{out}}].

A correct CNN implementation requires careful tracking of these shapes.

Why Padding Is Needed

A convolutional kernel needs a local patch of input values. Near the image boundary, a kernel may extend beyond the available pixels. Padding solves this by adding extra values around the input.

The most common choice is zero padding. For a 2D image, padding p = 1 adds one row above, one row below, one column on the left, and one column on the right.

If the original input has size

H \times W,

then padding changes it to

(H + 2p) \times (W + 2p).

For example, a 32 × 32 image with padding 1 becomes 34 × 34 before the convolution is applied.
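This padding step can be checked directly with `torch.nn.functional.pad` before any convolution is involved (a minimal sketch with an arbitrary random input):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 32, 32)   # a single-channel 32 x 32 image

# Pad order for the last two dims: (left, right, top, bottom)
x_pad = F.pad(x, (1, 1, 1, 1))  # zero padding with p = 1 on every side

print(x_pad.shape)  # torch.Size([1, 1, 34, 34])
```

`nn.Conv2d` performs this padding internally when its `padding` argument is nonzero, so an explicit `F.pad` call is only needed for nonstandard padding.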

Valid and Same Convolutions

A valid convolution uses no padding. The kernel is applied only where it fully fits inside the input. This shrinks the spatial size.

For a 3 × 3 kernel and stride 1, valid convolution maps

32 \times 32 \rightarrow 30 \times 30.

A same convolution chooses padding so that the output has the same spatial size as the input, usually when stride is 1. For a 3 × 3 kernel, padding 1 gives

32 \times 32 \rightarrow 32 \times 32.

For a 5 × 5 kernel, padding 2 preserves size. More generally, for odd kernel size k, stride 1, and equal padding on both sides,

p = \frac{k - 1}{2}.

Examples:

Kernel size    Padding for same size, stride 1
1              0
3              1
5              2
7              3
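The same-padding rule in this table can be verified for each kernel size with a short loop (illustrative channel counts):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)
out_sizes = {}
for k in (1, 3, 5, 7):
    p = (k - 1) // 2                          # same-padding rule: p = (k - 1) / 2
    conv = nn.Conv2d(3, 8, kernel_size=k, padding=p)
    out_sizes[k] = tuple(conv(x).shape[-2:])  # spatial size of the output

print(out_sizes)  # every entry is (32, 32)
```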

Output Shape Formula

For a 2D convolution with input height H, input width W, kernel size k_h \times k_w, padding p_h, p_w, and stride s_h, s_w, the output size is

H_{\text{out}} = \left\lfloor \frac{H + 2p_h - k_h}{s_h} \right\rfloor + 1,

W_{\text{out}} = \left\lfloor \frac{W + 2p_w - k_w}{s_w} \right\rfloor + 1.

The floor appears because the kernel can only be placed at integer positions. If the final step would go beyond the padded input, it is discarded.

For example, let

H = W = 32,\quad k_h = k_w = 3,\quad p_h = p_w = 1,\quad s_h = s_w = 1.

Then

H_{\text{out}} = \left\lfloor \frac{32 + 2 - 3}{1} \right\rfloor + 1 = 32.

So the spatial size is preserved.
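The formula is small enough to write as a helper function and check against an actual layer. The function name `conv_out_size` and the example values are illustrative, not part of any PyTorch API:

```python
import torch
import torch.nn as nn

def conv_out_size(n, k, p, s):
    """Output size along one spatial axis: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

# Check the formula against an actual layer.
H = W = 32
conv = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)
y = conv(torch.randn(1, 3, H, W))

predicted = conv_out_size(H, k=3, p=1, s=2)
print(predicted, y.shape[-1])  # 16 16
```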

Stride

Stride is the step size of the sliding kernel. With stride 1, the kernel moves one pixel at a time. With stride 2, it moves two pixels at a time.

Stride reduces spatial resolution. A stride-2 convolution usually halves height and width, subject to the exact shape formula.

For example:

import torch
import torch.nn as nn

conv = nn.Conv2d(
    in_channels=3,
    out_channels=16,
    kernel_size=3,
    stride=2,
    padding=1,
)

x = torch.randn(8, 3, 32, 32)
y = conv(x)

print(y.shape)  # torch.Size([8, 16, 16, 16])

The spatial dimensions are reduced from 32 × 32 to 16 × 16.

Padding Examples in PyTorch

PyTorch uses the NCHW convention for nn.Conv2d:

[B, C, H, W].

No padding:

conv = nn.Conv2d(3, 16, kernel_size=3, padding=0)

x = torch.randn(8, 3, 32, 32)
y = conv(x)

print(y.shape)  # torch.Size([8, 16, 30, 30])

Padding one pixel:

conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)

x = torch.randn(8, 3, 32, 32)
y = conv(x)

print(y.shape)  # torch.Size([8, 16, 32, 32])

Padding two pixels with a 5 × 5 kernel:

conv = nn.Conv2d(3, 16, kernel_size=5, padding=2)

x = torch.randn(8, 3, 32, 32)
y = conv(x)

print(y.shape)  # torch.Size([8, 16, 32, 32])

Padding keeps feature maps from shrinking too quickly as layers are stacked.

Asymmetric Padding

Padding can differ across axes. A layer may use more padding along width than height, or padding may differ on the left and right sides.

PyTorch nn.Conv2d supports per-axis padding directly, with the same amount added on both sides of each axis:

conv = nn.Conv2d(
    in_channels=3,
    out_channels=16,
    kernel_size=(3, 5),
    padding=(1, 2),
)

This uses padding 1 along height and 2 along width. It preserves size for a 3 × 5 kernel with stride 1.

For fully asymmetric padding, use torch.nn.functional.pad before the convolution.

import torch.nn.functional as F

x = torch.randn(8, 3, 32, 32)

# Padding order: left, right, top, bottom
x_pad = F.pad(x, (1, 2, 3, 4))

conv = nn.Conv2d(3, 16, kernel_size=3, padding=0)
y = conv(x_pad)

print(y.shape)  # torch.Size([8, 16, 37, 33])

Asymmetric padding appears in some architectures that need exact alignment between feature maps.

Padding Modes

Zero padding is the default. Other padding modes are sometimes useful.

Padding mode   Description
zeros          Pads with zeros
reflect        Mirrors values near the boundary
replicate      Repeats boundary values
circular       Wraps values around from the opposite side

Example:

conv = nn.Conv2d(
    in_channels=3,
    out_channels=16,
    kernel_size=3,
    padding=1,
    padding_mode="reflect",
)

Zero padding is standard for most CNNs. Reflect padding can reduce boundary artifacts in image restoration tasks. Circular padding is useful only when the data has circular structure, such as periodic signals.
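The difference between these modes is easiest to see on a tiny tensor using `torch.nn.functional.pad`, which supports the same mode names (a small illustration, not tied to any particular network):

```python
import torch
import torch.nn.functional as F

row = torch.tensor([[[[1., 2., 3., 4.]]]])  # shape [1, 1, 1, 4]

# Pad 2 values on the left and right of the width axis.
zeros     = F.pad(row, (2, 2, 0, 0), mode="constant", value=0.0)
reflect   = F.pad(row, (2, 2, 0, 0), mode="reflect")
replicate = F.pad(row, (2, 2, 0, 0), mode="replicate")
circular  = F.pad(row, (2, 2, 0, 0), mode="circular")

print(zeros.flatten().tolist())      # [0.0, 0.0, 1.0, 2.0, 3.0, 4.0, 0.0, 0.0]
print(reflect.flatten().tolist())    # [3.0, 2.0, 1.0, 2.0, 3.0, 4.0, 3.0, 2.0]
print(replicate.flatten().tolist())  # [1.0, 1.0, 1.0, 2.0, 3.0, 4.0, 4.0, 4.0]
print(circular.flatten().tolist())   # [3.0, 4.0, 1.0, 2.0, 3.0, 4.0, 1.0, 2.0]
```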

Stride Examples

Stride changes how densely the kernel samples the input.

Stride 1:

conv = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)

x = torch.randn(8, 3, 32, 32)
y = conv(x)

print(y.shape)  # torch.Size([8, 16, 32, 32])

Stride 2:

conv = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)

x = torch.randn(8, 3, 32, 32)
y = conv(x)

print(y.shape)  # torch.Size([8, 16, 16, 16])

Stride 4:

conv = nn.Conv2d(3, 16, kernel_size=3, stride=4, padding=1)

x = torch.randn(8, 3, 32, 32)
y = conv(x)

print(y.shape)  # torch.Size([8, 16, 8, 8])

A large stride reduces computation, but it also discards spatial detail.

Stride as Downsampling

A stride-2 convolution performs feature extraction and downsampling in one operation. This differs from pooling, which downsamples using a fixed rule.

A typical convolutional block may use stride 2 when entering a new stage:

downsample = nn.Conv2d(
    in_channels=64,
    out_channels=128,
    kernel_size=3,
    stride=2,
    padding=1,
)

This maps

[B, 64, 56, 56] \rightarrow [B, 128, 28, 28].

The number of channels increases while spatial size decreases. This is a common design pattern in CNNs.
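Stacking several such stages shows the pattern directly. The channel counts below are illustrative; real architectures interleave these downsampling layers with full residual or convolutional blocks:

```python
import torch
import torch.nn as nn

# Three downsampling stages: channels double, spatial size halves.
stages = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
    nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1),
    nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1),
)

x = torch.randn(1, 64, 56, 56)
for stage in stages:
    x = stage(x)
    print(tuple(x.shape))  # 56 -> 28 -> 14 -> 7 spatially
```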

Stride and Aliasing

Downsampling can create aliasing. Aliasing happens when high-frequency information is sampled too coarsely and appears as distorted low-frequency information.

In signal processing, downsampling is usually preceded by low-pass filtering. In CNNs, stride performs downsampling directly. The learned convolution may partially compensate, but aliasing can still occur.

Some architectures use blur pooling or anti-aliased downsampling. These methods smooth the feature map before reducing its resolution.

The main practical rule is simple: avoid aggressive downsampling before the network has learned enough local structure.
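One way to sketch anti-aliased downsampling is to blur with a fixed averaging kernel before subsampling. The function `blur_pool` below is an illustrative sketch of the idea, not a library API, and real blur pooling typically uses a binomial kernel rather than a uniform average:

```python
import torch
import torch.nn.functional as F

def blur_pool(x, stride=2):
    """Illustrative anti-aliased downsampling: low-pass filter with a
    fixed 3x3 averaging kernel, then subsample with the given stride."""
    c = x.shape[1]
    # One 3x3 averaging kernel per channel, applied via grouped convolution.
    kernel = torch.full((c, 1, 3, 3), 1.0 / 9.0)
    x = F.conv2d(x, kernel, padding=1, groups=c)  # blur (low-pass)
    return x[:, :, ::stride, ::stride]            # subsample

x = torch.randn(2, 8, 32, 32)
y = blur_pool(x)
print(y.shape)  # torch.Size([2, 8, 16, 16])
```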

Odd and Even Kernel Sizes

Odd kernel sizes are common because they have a natural center. A 3 × 3 kernel has one central position. A 5 × 5 kernel also has one central position.

Even kernel sizes, such as 2 × 2 or 4 × 4, can be useful but may create alignment issues. There is no single center pixel. This matters when preserving spatial alignment across layers, especially in segmentation, detection, and image generation.

For standard CNN blocks, 3 × 3 kernels with padding 1 are often the default choice.

Shape Tracking Through Blocks

Consider a simple CNN block:

block = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
)

For input

[B, 3, 64, 64],

the first convolution gives

[B, 32, 64, 64].

The second convolution keeps

[B, 32, 64, 64].

The pooling layer gives

[B, 32, 32, 32].

The spatial size is reduced only at the pooling layer. The convolutions preserve size because they use padding 1 with 3 × 3 kernels.
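This kind of shape tracking can be automated by iterating over the layers of the block and printing the shape after each one:

```python
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
)

x = torch.randn(1, 3, 64, 64)
for layer in block:
    x = layer(x)
    print(layer.__class__.__name__, tuple(x.shape))
```

Only the final MaxPool2d line shows a change in spatial size; every other layer preserves (64, 64).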

Padding and Boundary Effects

Padding creates artificial boundary values. With zero padding, pixels near the image border are convolved with some zeros. This can make boundary activations different from interior activations.

In classification, this often causes little trouble. In image restoration, segmentation, and generation, boundary artifacts can matter more.

Possible responses include:

Method                      Use case
Larger input crops          Reduce boundary influence
Reflection padding          Image restoration
Valid convolutions          Avoid artificial boundaries
Cropping skip connections   Align encoder-decoder features

No padding rule is best for every task. The choice depends on the architecture and output requirements.

Shape Mismatch Errors

Many CNN bugs are shape bugs. A common error occurs when a feature map is flattened before a linear layer.

features = torch.randn(8, 64, 7, 7)
flat = features.flatten(1)

print(flat.shape)  # torch.Size([8, 3136])

The following linear layer must expect 3136 input features:

classifier = nn.Linear(64 * 7 * 7, 10)

If padding or stride changes earlier in the network, the value 64 · 7 · 7 = 3136 may change. Adaptive pooling avoids this problem:

head = nn.Sequential(
    nn.AdaptiveAvgPool2d((1, 1)),
    nn.Flatten(),
    nn.Linear(64, 10),
)

This head works for many input spatial sizes because adaptive pooling always produces 1 × 1 spatial maps.
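The flexibility is easy to demonstrate by feeding the same head two different spatial sizes:

```python
import torch
import torch.nn as nn

head = nn.Sequential(
    nn.AdaptiveAvgPool2d((1, 1)),  # any H x W -> 1 x 1
    nn.Flatten(),
    nn.Linear(64, 10),
)

# The same head accepts different spatial sizes without modification.
small = head(torch.randn(8, 64, 7, 7))
large = head(torch.randn(8, 64, 14, 14))

print(small.shape, large.shape)  # both torch.Size([8, 10])
```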

Summary

Padding extends the input boundary before convolution. It controls how much spatial size is preserved and how boundary pixels are handled. Stride controls how far the kernel moves. It controls downsampling and computation.

For a convolutional layer, the output shape is determined by input size, kernel size, padding, stride, and dilation. In standard CNNs, 3 × 3 kernels with padding 1 and stride 1 preserve spatial size. Stride 2 is commonly used to downsample.

Good CNN design requires disciplined shape tracking. Most implementation errors can be found by writing down the tensor shape after each layer.