Convolution is the central operation in convolutional neural networks. It gives a neural network a way to process spatial data, such as images, by applying the same small pattern detector across many locations. Instead of connecting every input value to every output value, a convolutional layer uses local connections and shared parameters.
This makes convolutional networks efficient and well suited to data with spatial structure. An edge in the top-left corner of an image and an edge in the bottom-right corner are different in position, but similar in local pattern. A convolutional layer can detect both using the same learned weights.
Local Structure in Images
An image can be represented as a tensor. A grayscale image of height $H$ and width $W$ is often written as
$$ x \in \mathbb{R}^{H \times W}. $$
A color image has channels. For an RGB image,
$$ x \in \mathbb{R}^{C \times H \times W}, $$
where $C = 3$. The three channels correspond to red, green, and blue.
In PyTorch, a batch of images is usually stored as
$$ x \in \mathbb{R}^{N \times C \times H \times W}, $$
where $N$ is the batch size.
The key observation is that nearby pixels are strongly related. A small group of neighboring pixels may form an edge, a corner, a texture, or a small part of an object. A convolutional layer exploits this locality.
The Kernel
A convolution uses a small array of learnable weights called a kernel or filter. For a two-dimensional grayscale image, a kernel may have shape
$$ k \in \mathbb{R}^{k_h \times k_w}, $$
where $k_h$ is the kernel height and $k_w$ is the kernel width.
For example, a $3 \times 3$ kernel has nine parameters:
$$ k = \begin{bmatrix} k_{00} & k_{01} & k_{02} \\ k_{10} & k_{11} & k_{12} \\ k_{20} & k_{21} & k_{22} \end{bmatrix}. $$
The kernel is placed over a local region of the image. The layer multiplies corresponding entries and sums the result. Then the kernel moves to another location and repeats the same computation.
This sliding operation produces an output image-like array called a feature map.
Two-Dimensional Cross-Correlation
In deep learning libraries, the operation called convolution is usually cross-correlation. The distinction is minor for implementation, but important mathematically.
For an input image $x$ and a kernel $k$, the two-dimensional cross-correlation output is defined by
$$ y[i, j] = \sum_{u=0}^{k_h - 1} \sum_{v=0}^{k_w - 1} x[i + u,\, j + v]\, k[u, v]. $$
The kernel is applied directly, without flipping.
In mathematical convolution, the kernel is flipped before applying it:
$$ y[i, j] = \sum_{u=0}^{k_h - 1} \sum_{v=0}^{k_w - 1} x[i + u,\, j + v]\, k[k_h - 1 - u,\, k_w - 1 - v]. $$
Most neural network texts and libraries still use the word convolution because the kernel weights are learned. Whether the learned pattern is stored flipped or unflipped does not change the expressive power of the layer.
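The flip relationship is easy to verify: PyTorch's F.conv2d computes cross-correlation, and flipping the kernel along both spatial axes turns it into mathematical convolution. A small sketch:

```python
import torch
import torch.nn.functional as F

x = torch.arange(16.).reshape(1, 1, 4, 4)               # one image, one channel
k = torch.tensor([[1., 2.], [3., 4.]]).reshape(1, 1, 2, 2)

# F.conv2d applies the kernel unflipped: cross-correlation.
xcorr = F.conv2d(x, k)

# Mathematical convolution is cross-correlation with a flipped kernel.
conv = F.conv2d(x, k.flip(dims=[2, 3]))

print(xcorr.shape)               # torch.Size([1, 1, 3, 3])
print(torch.equal(xcorr, conv))  # False for an asymmetric kernel
```

For a symmetric kernel the two operations coincide, which is why the distinction rarely matters in practice.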
A Simple Example
Consider a $3 \times 3$ input and a $2 \times 2$ kernel:
$$ x = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}, \qquad k = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}. $$
The top-left output entry is computed from the top-left $2 \times 2$ patch:
$$ y[0, 0] = 1 \cdot 1 + 2 \cdot 0 + 4 \cdot 0 + 5 \cdot (-1) = -4. $$
The next output entry shifts the kernel one step to the right:
$$ y[0, 1] = 2 \cdot 1 + 3 \cdot 0 + 5 \cdot 0 + 6 \cdot (-1) = -4. $$
Continuing over all valid positions gives
$$ y = \begin{bmatrix} -4 & -4 \\ -4 & -4 \end{bmatrix}. $$
This output has smaller spatial size because the kernel cannot be centered near the boundary without leaving the image.
Valid Convolution
When the kernel is applied only where it fully fits inside the input, the operation is called valid convolution.
If
$$ x \in \mathbb{R}^{H \times W} $$
and
$$ k \in \mathbb{R}^{k_h \times k_w}, $$
then the valid output has shape
$$ (H - k_h + 1) \times (W - k_w + 1). $$
For example, a $32 \times 32$ image convolved with a $3 \times 3$ kernel gives an output of size
$$ (32 - 3 + 1) \times (32 - 3 + 1) = 30 \times 30. $$
Valid convolution discards boundary positions where the kernel would extend beyond the input.
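In PyTorch, valid convolution corresponds to padding=0 (the default), and the shrinkage is easy to confirm:

```python
import torch
import torch.nn as nn

# No padding: a 3x3 kernel shrinks each spatial dimension by 2.
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, padding=0)
x = torch.randn(1, 1, 32, 32)
y = conv(x)
print(y.shape)  # torch.Size([1, 1, 30, 30])
```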
Padding
Padding adds extra values around the boundary of the input. The most common padding value is zero.
For a two-dimensional input, padding $(p_h, p_w)$ means adding $p_h$ rows at the top, $p_h$ rows at the bottom, $p_w$ columns on the left, and $p_w$ columns on the right.
If a $3 \times 3$ kernel is used with padding $p_h = p_w = 1$, the spatial size can be preserved:
$$ H_{\text{out}} = H + 2 \cdot 1 - 3 + 1 = H. $$
This is often called same padding.
Padding allows boundary pixels to influence the output. It also makes it easier to build deep networks because repeated convolutions do not shrink the feature maps too quickly.
In PyTorch:
```python
import torch
import torch.nn as nn

conv = nn.Conv2d(
    in_channels=3,
    out_channels=16,
    kernel_size=3,
    padding=1,
)

x = torch.randn(8, 3, 32, 32)
y = conv(x)
print(y.shape)  # torch.Size([8, 16, 32, 32])
```

The input has 3 channels and spatial size $32 \times 32$. The output has 16 channels and the same spatial size.
Stride
Stride controls how far the kernel moves between neighboring positions. A stride of $1$ moves the kernel one pixel at a time. A stride of $2$ moves it two pixels at a time.
Stride reduces spatial resolution. It is often used to downsample feature maps.
For a one-dimensional intuition, a stride of $1$ visits positions
$$ 0, 1, 2, 3, \dots, $$
while a stride of $2$ visits
$$ 0, 2, 4, 6, \dots $$
In two dimensions, stride is applied along height and width.
In PyTorch:
```python
conv = nn.Conv2d(
    in_channels=3,
    out_channels=16,
    kernel_size=3,
    stride=2,
    padding=1,
)

x = torch.randn(8, 3, 32, 32)
y = conv(x)
print(y.shape)  # torch.Size([8, 16, 16, 16])
```

The stride-2 convolution reduces the $32 \times 32$ feature maps to $16 \times 16$.
Output Shape Formula
For a two-dimensional convolution with input height $H$, input width $W$, kernel size $k_h \times k_w$, padding $(p_h, p_w)$, and stride $(s_h, s_w)$, the output shape is
$$ H_{\text{out}} = \left\lfloor \frac{H + 2p_h - k_h}{s_h} \right\rfloor + 1, $$
$$ W_{\text{out}} = \left\lfloor \frac{W + 2p_w - k_w}{s_w} \right\rfloor + 1. $$
For example, let
$$ H = W = 32, \quad k_h = k_w = 3, \quad p_h = p_w = 1, \quad s_h = s_w = 2. $$
Then
$$ H_{\text{out}} = \left\lfloor \frac{32 + 2 - 3}{2} \right\rfloor + 1 = \left\lfloor \frac{31}{2} \right\rfloor + 1 = 15 + 1 = 16. $$
Similarly, $W_{\text{out}} = 16$.
Thus the output spatial shape is $16 \times 16$.
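The formula is simple enough to wrap in a helper and cross-check against nn.Conv2d (the function name is ours):

```python
import torch
import torch.nn as nn

def conv_out_size(n, k, p, s):
    """Output size along one spatial dimension: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

# The values from the worked example: 32x32 input, 3x3 kernel, padding 1, stride 2.
print(conv_out_size(32, k=3, p=1, s=2))  # 16

# Cross-check against an actual layer.
conv = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)
y = conv(torch.randn(1, 3, 32, 32))
print(y.shape)  # torch.Size([1, 16, 16, 16])
```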
Multiple Input Channels
Real convolutional layers usually have multiple input channels. For an RGB image, the input has three channels:
$$ x \in \mathbb{R}^{3 \times H \times W}. $$
A single convolutional filter must look across all input channels. Therefore, the kernel for one output channel has shape
$$ C_{\text{in}} \times k_h \times k_w. $$
The output at one spatial location is
$$ y[i, j] = \sum_{c=0}^{C_{\text{in}} - 1} \sum_{u=0}^{k_h - 1} \sum_{v=0}^{k_w - 1} x[c,\, i + u,\, j + v]\, k[c, u, v]. $$
The filter combines spatial information and channel information at the same time.
For an RGB input and a $3 \times 3$ kernel, one filter contains
$$ 3 \times 3 \times 3 = 27 $$
weights, plus usually one bias scalar.
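To see the sum over channels concretely, one filter applied with F.conv2d can be reproduced by cross-correlating each input channel separately and adding the results; a sketch:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 3, 8, 8)   # one image, 3 input channels
k = torch.randn(1, 3, 3, 3)   # one filter: (out=1, in=3, 3, 3)

y = F.conv2d(x, k)

# Same computation, channel by channel, summed at the end.
manual = sum(
    F.conv2d(x[:, c:c + 1], k[:, c:c + 1])
    for c in range(3)
)
print(torch.allclose(y, manual, atol=1e-5))  # True
```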
Multiple Output Channels
A convolutional layer usually learns many filters. Each filter produces one output channel.
If a layer has $C_{\text{in}}$ input channels and $C_{\text{out}}$ output channels, its weight tensor has shape
$$ C_{\text{out}} \times C_{\text{in}} \times k_h \times k_w. $$
The output has shape
$$ C_{\text{out}} \times H_{\text{out}} \times W_{\text{out}}. $$
For a batch, the full input and output shapes are
$$ N \times C_{\text{in}} \times H \times W \quad \longrightarrow \quad N \times C_{\text{out}} \times H_{\text{out}} \times W_{\text{out}}. $$
In PyTorch:
```python
conv = nn.Conv2d(
    in_channels=3,
    out_channels=64,
    kernel_size=3,
    padding=1,
)

print(conv.weight.shape)  # torch.Size([64, 3, 3, 3])
print(conv.bias.shape)    # torch.Size([64])
```

This layer has 64 filters. Each filter sees all 3 input channels and has spatial size $3 \times 3$.
The number of weight parameters is
$$ 64 \times 3 \times 3 \times 3 = 1728. $$
With bias, the total number of parameters is
$$ 1728 + 64 = 1792. $$
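These counts can be read off the layer directly; a quick check:

```python
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)

# Weight tensor: (out_channels, in_channels, kh, kw) = (64, 3, 3, 3).
n_weights = conv.weight.numel()
n_bias = conv.bias.numel()
print(n_weights)           # 1728
print(n_weights + n_bias)  # 1792
```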
Convolution as a Linear Operation
A convolutional layer is a linear operation before the activation function. If we fix the kernel weights, the output is a linear function of the input. If we fix the input, the output is also a linear function of the kernel weights.
This matters because convolutional neural networks are built from simple components: linear operations such as convolution, and pointwise nonlinearities such as ReLU.
Without nonlinear activation functions, stacking convolutional layers would still produce an overall linear map. The nonlinearity gives the network the ability to model more complex functions.
Parameter Sharing
A fully connected layer gives each input-output pair its own parameter. For images, this quickly becomes expensive.
Suppose an image has shape
$$ 3 \times 32 \times 32. $$
Flattening it gives
$$ 3 \times 32 \times 32 = 3072 $$
input values. A fully connected layer with 1000 output units would require
$$ 3072 \times 1000 = 3{,}072{,}000 $$
weights.
A convolutional layer with 64 output channels and $3 \times 3$ kernels requires only
$$ 64 \times 3 \times 3 \times 3 = 1728 $$
weights.
This reduction comes from parameter sharing. The same kernel is used at every spatial location. This encodes the assumption that useful local patterns may appear anywhere in the image.
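To make the comparison concrete, here is a quick count for a 3-channel $32 \times 32$ input, a 1000-unit fully connected layer, and a 64-filter $3 \times 3$ convolution (the specific sizes are just illustrative):

```python
import torch.nn as nn

# Fully connected: every flattened pixel connects to every output unit.
fc = nn.Linear(3 * 32 * 32, 1000)

# Convolutional: one shared 3x3 kernel per (input, output) channel pair.
conv = nn.Conv2d(3, 64, kernel_size=3)

fc_weights = fc.weight.numel()      # 3,072,000
conv_weights = conv.weight.numel()  # 1,728
print(fc_weights // conv_weights)   # 1777
```

The convolutional layer uses roughly 1777 times fewer weights, independent of how large the image grows.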
Sparse Connectivity
A convolutional layer also uses sparse connectivity. Each output value depends only on a small local region of the input.
For a $3 \times 3$ kernel, each output location depends on only nine spatial positions per input channel. A fully connected layer would connect each output value to every input value.
Sparse connectivity makes convolution efficient and gives it an inductive bias toward local pattern recognition.
The local region that influences one output value is called its receptive field.
Receptive Field
The receptive field of an output unit is the region of the input that can affect it.
In a single $3 \times 3$ convolution, each output value sees a $3 \times 3$ input patch. After stacking two $3 \times 3$ convolutions with stride $1$, an output value can depend on a $5 \times 5$ region of the original input. After three such layers, it can depend on a $7 \times 7$ region.
Thus deep convolutional networks build large receptive fields from small kernels.
This is one reason $3 \times 3$ kernels are common. They are parameter efficient, but when stacked they can still capture large spatial context.
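The growth pattern above can be captured in a tiny helper: for stride-1 layers, each $k \times k$ convolution adds $k - 1$ to the receptive field (the function name is ours):

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field of a stack of stride-1 convolutions:
    each k x k layer adds k - 1 to the field."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

print(stacked_receptive_field([3]))        # 3
print(stacked_receptive_field([3, 3]))     # 5
print(stacked_receptive_field([3, 3, 3]))  # 7
```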
Convolution and Translation Equivariance
Convolution is translation equivariant. Informally, if the input shifts, the output shifts in the same way.
Let $T$ be a spatial shift operator and let $C$ be a convolution operation. Then convolution satisfies
$$ C(T x) = T(C x). $$
This property means the same feature detector works across locations. If an edge moves from the left side of an image to the right side, the corresponding feature activation also moves.
Translation equivariance is different from translation invariance. Equivariance preserves location in a transformed form. Invariance removes sensitivity to location. Pooling and global averaging can help build invariance from equivariant feature maps.
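Equivariance can be checked numerically. With circular padding the identity holds exactly even at the boundary (with zero padding it holds only away from the edges); a small sketch using torch.roll as the shift operator:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Circular padding makes the boundary wrap around, so shifting the input
# and shifting the output are exactly the same operation.
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, padding_mode="circular")
x = torch.randn(1, 1, 16, 16)

def shift(t):
    return torch.roll(t, shifts=(2, 3), dims=(2, 3))

print(torch.allclose(conv(shift(x)), shift(conv(x)), atol=1e-6))  # True
```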
Convolution in PyTorch
The main PyTorch class for two-dimensional convolution is torch.nn.Conv2d.
A typical layer is written as:
```python
import torch
import torch.nn as nn

conv = nn.Conv2d(
    in_channels=3,
    out_channels=32,
    kernel_size=3,
    stride=1,
    padding=1,
)

x = torch.randn(16, 3, 64, 64)
y = conv(x)
print(y.shape)  # torch.Size([16, 32, 64, 64])
```

The arguments mean:
| Argument | Meaning |
|---|---|
| in_channels | Number of input channels |
| out_channels | Number of learned filters |
| kernel_size | Spatial size of each filter |
| stride | Step size of the sliding filter |
| padding | Boundary padding added to the input |
| bias | Whether to include a bias per output channel |
The layer stores its learnable weights in conv.weight and its bias in conv.bias.
```python
print(conv.weight.shape)  # torch.Size([32, 3, 3, 3])
print(conv.bias.shape)    # torch.Size([32])
```

Manual Convolution in PyTorch
A small convolution can be implemented directly to show the mechanics.
```python
import torch

def conv2d_single_channel(x, k):
    h, w = x.shape
    kh, kw = k.shape
    out_h = h - kh + 1
    out_w = w - kw + 1
    y = torch.empty(out_h, out_w)
    for i in range(out_h):
        for j in range(out_w):
            # Elementwise product of the current patch with the kernel, summed.
            patch = x[i:i + kh, j:j + kw]
            y[i, j] = (patch * k).sum()
    return y

x = torch.tensor([
    [1., 2., 3.],
    [4., 5., 6.],
    [7., 8., 9.],
])
k = torch.tensor([
    [1., 0.],
    [0., -1.],
])

y = conv2d_single_channel(x, k)
print(y)
```

The result is

```
tensor([[-4., -4.],
        [-4., -4.]])
```

This code is slow because it uses Python loops. PyTorch convolution kernels use optimized implementations that run efficiently on CPUs and GPUs.
Convolution, Bias, and Activation
A convolutional layer usually computes an affine operation. With stride $1$ and no padding,
$$ z[n, o, i, j] = \sum_{c} \sum_{u} \sum_{v} w[o, c, u, v]\, x[n, c, i + u, j + v] + b[o]. $$
Here:

| Symbol | Meaning |
|---|---|
| $n$ | Batch index |
| $o$ | Output channel index |
| $c$ | Input channel index |
| $(i, j)$ | Output spatial position |
| $(u, v)$ | Kernel spatial position |
| $w$ | Convolution weight tensor |
| $b[o]$ | Bias for output channel $o$ |
After this affine operation, a nonlinear activation $\sigma$ is commonly applied:
$$ a = \sigma(z), \qquad \text{for example } \sigma(z) = \max(0, z) \ \text{(ReLU)}. $$
For example:
```python
layer = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.ReLU(),
)
```

The convolution produces feature maps. The activation introduces nonlinearity.
Common Kernel Sizes
Common convolution kernel sizes include $1 \times 1$, $3 \times 3$, $5 \times 5$, and $7 \times 7$.

| Kernel size | Common role |
|---|---|
| $1 \times 1$ | Mix channels without spatial context |
| $3 \times 3$ | Standard local pattern extraction |
| $5 \times 5$ | Larger local context |
| $7 \times 7$ | Early-stage large receptive field |
A $1 \times 1$ convolution may look strange at first because it has no spatial extent beyond one pixel. Its role is to mix channel information at each location. If the input has $C_{\text{in}}$ channels and the output has $C_{\text{out}}$ channels, a $1 \times 1$ convolution applies a learned linear transformation from $\mathbb{R}^{C_{\text{in}}}$ to $\mathbb{R}^{C_{\text{out}}}$ at every spatial location.
This is useful for changing channel dimension, reducing computation, and combining features.
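One way to see this is to reproduce a $1 \times 1$ convolution as a plain matrix multiply applied at every pixel; a sketch:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv1x1 = nn.Conv2d(8, 4, kernel_size=1, bias=False)
x = torch.randn(2, 8, 5, 5)

y = conv1x1(x)

# The kernel has no spatial extent, so it is just a (4, 8) matrix
# applied independently at each of the 5x5 spatial locations.
w = conv1x1.weight.squeeze(-1).squeeze(-1)   # (4, 8)
y_manual = torch.einsum("oc,nchw->nohw", w, x)

print(torch.allclose(y, y_manual, atol=1e-5))  # True
```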
What a Convolution Learns
Early convolutional layers often learn simple local features such as edges, corners, color contrasts, and texture patterns. Deeper layers combine these into larger structures such as object parts and object-level representations.
The kernel values are learned by gradient descent. We do not hand-code the filters. During training, the network adjusts them to reduce the loss.
For a classification model, this means the convolutional filters become useful for separating classes. For a segmentation model, they become useful for predicting pixel-level labels. For a generative model, they become useful for constructing or denoising spatial structure.
Summary
A convolution applies a small learned kernel across spatial locations. The same kernel is reused at every location, producing parameter sharing. Each output value depends on a local input patch, producing sparse connectivity. These two properties make convolution efficient and effective for image-like data.
A two-dimensional convolution layer maps
$$ x \in \mathbb{R}^{N \times C_{\text{in}} \times H \times W} $$
to
$$ y \in \mathbb{R}^{N \times C_{\text{out}} \times H_{\text{out}} \times W_{\text{out}}}. $$
The output size depends on kernel size, padding, and stride. The number of parameters depends on input channels, output channels, and kernel size.
Convolution is the basic operation behind classical convolutional neural networks, many vision models, segmentation systems, image generation models, and parts of modern multimodal architectures.