Padding and stride control the spatial size of convolutional feature maps. Kernel size controls how large a local window the layer sees. Padding controls what happens near the boundary. Stride controls how far the kernel moves between neighboring output positions.
Together, these parameters determine the mapping from input shape to output shape. A correct CNN implementation requires careful tracking of these shapes.
Why Padding Is Needed
A convolutional kernel needs a local patch of input values. Near the image boundary, a kernel may extend beyond the available pixels. Padding solves this by adding extra values around the input.
The most common choice is zero padding. For a 2D image with padding $p = 1$, padding adds one row of zeros above, one row below, one column on the left, and one column on the right.

If the original input has size $H \times W$, then padding with $p$ pixels on each side changes it to

$$ (H + 2p) \times (W + 2p). $$

For example, a $32 \times 32$ image with padding $p = 1$ becomes $34 \times 34$ before the convolution is applied.
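The padding arithmetic can be checked directly with `torch.nn.functional.pad` (a small sketch, assuming PyTorch is available):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 32, 32)

# Zero-pad one pixel on every side: the tuple order is (left, right, top, bottom).
x_pad = F.pad(x, (1, 1, 1, 1))

print(x.shape)      # torch.Size([1, 3, 32, 32])
print(x_pad.shape)  # torch.Size([1, 3, 34, 34])
```

The interior of the padded tensor is the original image; only the one-pixel border is new.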
Valid and Same Convolutions
A valid convolution uses no padding. The kernel is applied only where it fully fits inside the input. This shrinks the spatial size.
For a $k \times k$ kernel and stride $1$, valid convolution maps

$$ H \times W \;\to\; (H - k + 1) \times (W - k + 1). $$

A same convolution chooses padding so that the output has the same spatial size as the input, usually when stride is $1$. For a $3 \times 3$ kernel, padding $p = 1$ gives

$$ H \times W \;\to\; H \times W. $$

For a $5 \times 5$ kernel, padding $p = 2$ preserves size. More generally, for odd kernel size $k$, stride $1$, and equal padding $p = (k - 1)/2$ on both sides, the output size equals the input size.
Examples:
| Kernel size | Padding for same size, stride 1 |
|---|---|
| 1 | 0 |
| 3 | 1 |
| 5 | 2 |
| 7 | 3 |
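The rule in the table can be verified directly (a small sketch, assuming PyTorch is available):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)

# For odd kernel size k, padding (k - 1) // 2 preserves spatial size at stride 1.
for k in (1, 3, 5, 7):
    conv = nn.Conv2d(3, 8, kernel_size=k, padding=(k - 1) // 2)
    print(k, conv(x).shape)  # spatial size stays 32 x 32 for every k
```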
Output Shape Formula
For a 2D convolution with input height $H$, input width $W$, kernel size $(k_h, k_w)$, padding $(p_h, p_w)$, and stride $(s_h, s_w)$, the output size is

$$ H_{\text{out}} = \left\lfloor \frac{H + 2p_h - k_h}{s_h} \right\rfloor + 1, $$

$$ W_{\text{out}} = \left\lfloor \frac{W + 2p_w - k_w}{s_w} \right\rfloor + 1. $$
The floor appears because the kernel can only be placed at integer positions. If the final step would go beyond the padded input, it is discarded.
For example, let $H = 32$, $k_h = 3$, $p_h = 1$, and $s_h = 1$. Then

$$ H_{\text{out}} = \left\lfloor \frac{32 + 2 - 3}{1} \right\rfloor + 1 = 31 + 1 = 32. $$
So the spatial size is preserved.
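The formula translates directly into a one-line helper, verified here against PyTorch (the name `conv_out` is just an illustration):

```python
import torch
import torch.nn as nn

def conv_out(size: int, k: int, p: int, s: int) -> int:
    """Output spatial size of a convolution along one axis."""
    return (size + 2 * p - k) // s + 1

# Check the helper against PyTorch for a few configurations.
for k, p, s in [(3, 1, 1), (3, 1, 2), (5, 2, 1), (7, 0, 3)]:
    conv = nn.Conv2d(3, 8, kernel_size=k, padding=p, stride=s)
    y = conv(torch.randn(1, 3, 32, 32))
    assert y.shape[-1] == conv_out(32, k, p, s)

print(conv_out(32, 3, 1, 1))  # 32: a same convolution preserves size
```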
Stride
Stride is the step size of the sliding kernel. With stride $1$, the kernel moves one pixel at a time. With stride $2$, it moves two pixels at a time.
Stride reduces spatial resolution. A stride-2 convolution usually halves height and width, subject to the exact shape formula.
For example:
```python
import torch
import torch.nn as nn

conv = nn.Conv2d(
    in_channels=3,
    out_channels=16,
    kernel_size=3,
    stride=2,
    padding=1,
)

x = torch.randn(8, 3, 32, 32)
y = conv(x)
print(y.shape)  # torch.Size([8, 16, 16, 16])
```

The spatial dimensions are reduced from $32 \times 32$ to $16 \times 16$.
Padding Examples in PyTorch
PyTorch uses the NCHW convention for `nn.Conv2d`: inputs have shape `(batch, channels, height, width)`.

No padding:

```python
conv = nn.Conv2d(3, 16, kernel_size=3, padding=0)
x = torch.randn(8, 3, 32, 32)
y = conv(x)
print(y.shape)  # torch.Size([8, 16, 30, 30])
```

Padding one pixel:

```python
conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
x = torch.randn(8, 3, 32, 32)
y = conv(x)
print(y.shape)  # torch.Size([8, 16, 32, 32])
```

Padding two pixels with a $5 \times 5$ kernel:

```python
conv = nn.Conv2d(3, 16, kernel_size=5, padding=2)
x = torch.randn(8, 3, 32, 32)
y = conv(x)
print(y.shape)  # torch.Size([8, 16, 32, 32])
```

Padding keeps feature maps from shrinking too quickly as layers are stacked.
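To see how quickly unpadded layers shrink the map, consider stacking valid $3 \times 3$ convolutions (a small sketch):

```python
import torch
import torch.nn as nn

# Ten stacked 3x3 convolutions with no padding: each removes 2 pixels per axis.
layers = []
for i in range(10):
    layers.append(nn.Conv2d(8 if i else 3, 8, kernel_size=3, padding=0))
stack = nn.Sequential(*layers)

x = torch.randn(1, 3, 32, 32)
print(stack(x).shape)  # torch.Size([1, 8, 12, 12]): 32 - 10 * 2 = 12
```

Ten such layers already cost 20 pixels per axis, which is why same padding is the usual default in deep stacks.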
Asymmetric Padding
Padding can differ across axes. A layer may use more padding along width than height, or padding may differ on the left and right sides.
PyTorch `nn.Conv2d` supports per-axis (but still symmetric left/right) padding directly:

```python
conv = nn.Conv2d(
    in_channels=3,
    out_channels=16,
    kernel_size=(3, 5),
    padding=(1, 2),
)
```

This uses padding $1$ along height and $2$ along width. It preserves size for a $3 \times 5$ kernel with stride $1$.
For fully asymmetric padding, use `torch.nn.functional.pad` before the convolution.

```python
import torch.nn.functional as F

x = torch.randn(8, 3, 32, 32)

# Padding order: left, right, top, bottom
x_pad = F.pad(x, (1, 2, 3, 4))  # shape (8, 3, 39, 35)

conv = nn.Conv2d(3, 16, kernel_size=3, padding=0)
y = conv(x_pad)
print(y.shape)  # torch.Size([8, 16, 37, 33])
```

Asymmetric padding appears in some architectures that need exact alignment between feature maps.
Padding Modes
Zero padding is the default. Other padding modes are sometimes useful.
| Padding mode | Description |
|---|---|
| `zeros` | Pads with zeros |
| `reflect` | Mirrors values near the boundary |
| `replicate` | Repeats boundary values |
| `circular` | Wraps values around from the opposite side |
Example:
```python
conv = nn.Conv2d(
    in_channels=3,
    out_channels=16,
    kernel_size=3,
    padding=1,
    padding_mode="reflect",
)
```

Zero padding is standard for most CNNs. Reflect padding can reduce boundary artifacts in image restoration tasks. Circular padding is useful only when the data has circular structure, such as periodic signals.
Stride Examples
Stride changes how densely the kernel samples the input.
Stride $1$:

```python
conv = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)
x = torch.randn(8, 3, 32, 32)
y = conv(x)
print(y.shape)  # torch.Size([8, 16, 32, 32])
```

Stride $2$:

```python
conv = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)
x = torch.randn(8, 3, 32, 32)
y = conv(x)
print(y.shape)  # torch.Size([8, 16, 16, 16])
```

Stride $4$:

```python
conv = nn.Conv2d(3, 16, kernel_size=3, stride=4, padding=1)
x = torch.randn(8, 3, 32, 32)
y = conv(x)
print(y.shape)  # torch.Size([8, 16, 8, 8])
```

Large stride reduces computation, but it also discards spatial detail.
Stride as Downsampling
A stride-2 convolution performs feature extraction and downsampling in one operation. This differs from pooling, which downsamples using a fixed rule.
A typical convolutional block may use stride $2$ when entering a new stage:

```python
downsample = nn.Conv2d(
    in_channels=64,
    out_channels=128,
    kernel_size=3,
    stride=2,
    padding=1,
)
```

This maps an input of shape $(N, 64, H, W)$ to an output of shape $(N, 128, \lceil H/2 \rceil, \lceil W/2 \rceil)$.
The number of channels increases while spatial size decreases. This is a common design pattern in CNNs.
Stride and Aliasing
Downsampling can create aliasing. Aliasing happens when high-frequency information is sampled too coarsely and appears as distorted low-frequency information.
In signal processing, downsampling is usually preceded by low-pass filtering. In CNNs, stride performs downsampling directly. The learned convolution may partially compensate, but aliasing can still occur.
Some architectures use blur pooling or anti-aliased downsampling. These methods smooth the feature map before reducing its resolution.
The main practical rule is simple: avoid aggressive downsampling before the network has learned enough local structure.
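Anti-aliased downsampling can be sketched as a fixed low-pass blur followed by stride-2 subsampling. The snippet below is an illustration in the spirit of blur pooling, not the exact published method; the binomial kernel and the function name `blur_downsample` are assumptions for this sketch:

```python
import torch
import torch.nn.functional as F

def blur_downsample(x: torch.Tensor) -> torch.Tensor:
    """Depthwise 3x3 binomial blur fused with stride-2 subsampling."""
    c = x.shape[1]
    # 3x3 binomial (low-pass) kernel, normalized to sum to 1.
    k = torch.tensor([[1., 2., 1.],
                      [2., 4., 2.],
                      [1., 2., 1.]]) / 16.0
    weight = k.repeat(c, 1, 1, 1)  # one copy of the kernel per channel
    # groups=c applies the blur to each channel independently (depthwise).
    return F.conv2d(x, weight, stride=2, padding=1, groups=c)

x = torch.randn(8, 64, 32, 32)
print(blur_downsample(x).shape)  # torch.Size([8, 64, 16, 16])
```

Because the kernel sums to one, constant regions pass through unchanged in the interior; only high frequencies are attenuated before subsampling.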
Odd and Even Kernel Sizes
Odd kernel sizes are common because they have a natural center. A $3 \times 3$ kernel has one central position. A $5 \times 5$ kernel also has one central position.

Even kernel sizes, such as $2 \times 2$ or $4 \times 4$, can be useful but may create alignment issues. There is no single center pixel, and no equal padding on both sides preserves size at stride $1$. This matters when preserving spatial alignment across layers, especially in segmentation, detection, and image generation.

For standard CNN blocks, $3 \times 3$ kernels with padding $1$ are often the default choice.
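The alignment problem with even kernels shows up directly in the shape arithmetic. With a $2 \times 2$ kernel at stride $1$, no symmetric padding value reproduces the input size (a quick check):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)

# 2x2 kernel at stride 1: padding 0 shrinks the map, padding 1 grows it.
print(nn.Conv2d(3, 8, kernel_size=2, padding=0)(x).shape)  # torch.Size([1, 8, 31, 31])
print(nn.Conv2d(3, 8, kernel_size=2, padding=1)(x).shape)  # torch.Size([1, 8, 33, 33])
```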
Shape Tracking Through Blocks
Consider a simple CNN block:

```python
block = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
)
```

For input of shape $(8, 3, 32, 32)$, the first convolution gives $(8, 32, 32, 32)$. The second convolution keeps $(8, 32, 32, 32)$. The pooling layer gives $(8, 32, 16, 16)$.

The spatial size is reduced only at the pooling layer. The convolutions preserve size because they use padding $1$ with $3 \times 3$ kernels.
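Shape tracking through such a block can be automated by pushing a dummy tensor through one module at a time (a small debugging sketch):

```python
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
)

x = torch.randn(8, 3, 32, 32)
for layer in block:
    x = layer(x)
    print(f"{layer.__class__.__name__:>10}: {tuple(x.shape)}")
# Final shape: (8, 32, 16, 16)
```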
Padding and Boundary Effects
Padding creates artificial boundary values. With zero padding, pixels near the image border are convolved with some zeros. This can make boundary activations different from interior activations.
In classification, this often causes little trouble. In image restoration, segmentation, and generation, boundary artifacts can matter more.
Possible responses include:
| Method | Use case |
|---|---|
| Larger input crops | Reduce boundary influence |
| Reflection padding | Image restoration |
| Valid convolutions | Avoid artificial boundaries |
| Cropping skip connections | Align encoder-decoder features |
No padding rule is best for every task. The choice depends on the architecture and output requirements.
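The boundary effect is easy to make visible with a fixed all-ones kernel on a constant input (an illustration using `F.conv2d` with an explicit weight):

```python
import torch
import torch.nn.functional as F

x = torch.ones(1, 1, 5, 5)
weight = torch.ones(1, 1, 3, 3)  # sums the 3x3 neighborhood

# With zero padding, border outputs see padded zeros and come out smaller.
y = F.conv2d(x, weight, padding=1)
print(y[0, 0])
# Interior values are 9, edge values are 6, corner values are 4.
```

Even though the input is perfectly uniform, the output is not: the border rows and columns are systematically attenuated by the zeros in the padding.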
Shape Mismatch Errors
Many CNN bugs are shape bugs. A common error occurs when a feature map is flattened before a linear layer.
```python
features = torch.randn(8, 64, 7, 7)
flat = features.flatten(1)
print(flat.shape)  # torch.Size([8, 3136])
```

The following linear layer must expect 3136 input features:

```python
classifier = nn.Linear(64 * 7 * 7, 10)
```

If padding or stride changes earlier in the network, the $7 \times 7$ spatial size may change, and this value changes with it. Adaptive pooling avoids this problem:

```python
head = nn.Sequential(
    nn.AdaptiveAvgPool2d((1, 1)),
    nn.Flatten(),
    nn.Linear(64, 10),
)
```

This head works for many input spatial sizes because adaptive pooling always produces $1 \times 1$ spatial maps.
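A quick check that such an adaptive head really is size-agnostic (a small sketch):

```python
import torch
import torch.nn as nn

head = nn.Sequential(
    nn.AdaptiveAvgPool2d((1, 1)),
    nn.Flatten(),
    nn.Linear(64, 10),
)

# The same head accepts feature maps of different spatial sizes.
for size in (7, 8, 14):
    features = torch.randn(8, 64, size, size)
    print(head(features).shape)  # torch.Size([8, 10]) every time
```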
Summary
Padding extends the input boundary before convolution. It controls how much spatial size is preserved and how boundary pixels are handled. Stride controls how far the kernel moves. It controls downsampling and computation.
For a convolutional layer, the output shape is determined by input size, kernel size, padding, stride, and dilation. In standard CNNs, $3 \times 3$ kernels with padding $1$ and stride $1$ preserve spatial size. Stride $2$ is commonly used to downsample.
Good CNN design requires disciplined shape tracking. Most implementation errors can be found by writing down the tensor shape after each layer.