
Convolution Operations

Convolution is the central operation in convolutional neural networks. It gives a neural network a way to process spatial data, such as images, by applying the same small pattern detector across many locations. Instead of connecting every input value to every output value, a convolutional layer uses local connections and shared parameters.

This makes convolutional networks efficient and well suited to data with spatial structure. An edge in the top-left corner of an image and an edge in the bottom-right corner are different in position, but similar in local pattern. A convolutional layer can detect both using the same learned weights.

Local Structure in Images

An image can be represented as a tensor. A grayscale image of height H and width W is often written as

X \in \mathbb{R}^{H \times W}.

A color image has channels. For an RGB image,

X \in \mathbb{R}^{C \times H \times W},

where C = 3. The three channels correspond to red, green, and blue.

In PyTorch, a batch of images is usually stored as

X \in \mathbb{R}^{B \times C \times H \times W},

where B is the batch size.

The key observation is that nearby pixels are strongly related. A small group of neighboring pixels may form an edge, a corner, a texture, or a small part of an object. A convolutional layer exploits this locality.

The Kernel

A convolution uses a small array of learnable weights called a kernel or filter. For a two-dimensional grayscale image, a kernel may have shape

K \in \mathbb{R}^{k_h \times k_w},

where k_h is the kernel height and k_w is the kernel width.

For example, a 3×3 kernel has nine parameters:

K = \begin{bmatrix} k_{11} & k_{12} & k_{13} \\ k_{21} & k_{22} & k_{23} \\ k_{31} & k_{32} & k_{33} \end{bmatrix}.

The kernel is placed over a local region of the image. The layer multiplies corresponding entries and sums the result. Then the kernel moves to another location and repeats the same computation.

This sliding operation produces an output image-like array called a feature map.

Two-Dimensional Cross-Correlation

In deep learning libraries, the operation called convolution is usually cross-correlation. The distinction is minor for implementation, but important mathematically.

For an input image X and a kernel K, the two-dimensional cross-correlation output Y is defined by

Y_{i,j} = \sum_{u=0}^{k_h-1} \sum_{v=0}^{k_w-1} K_{u,v} X_{i+u,j+v}.

The kernel is applied directly, without flipping.

In mathematical convolution, the kernel is flipped before applying it:

Y_{i,j} = \sum_{u=0}^{k_h-1} \sum_{v=0}^{k_w-1} K_{u,v} X_{i-u,j-v}.

Most neural network texts and libraries still use the word convolution because the kernel weights are learned. Whether the learned pattern is stored flipped or unflipped does not change the expressive power of the layer.
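
The difference is easy to see numerically. Below is a minimal sketch using torch.nn.functional.conv2d, which, like nn.Conv2d, computes cross-correlation. Flipping the kernel along both spatial axes before applying it gives the mathematically convolved output (up to an index shift):

import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 5, 5)  # batch of one single-channel image
k = torch.randn(1, 1, 3, 3)  # one filter

# What the library calls convolution: cross-correlation with k.
cross_corr = F.conv2d(x, k)

# Mathematical convolution: cross-correlation with the flipped kernel.
flipped = torch.flip(k, dims=(2, 3))
math_conv = F.conv2d(x, flipped)

# The two differ unless the kernel happens to be symmetric.
print(torch.equal(cross_corr, math_conv))  # False for a random kernel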

A Simple Example

Consider a 3×3 input and a 2×2 kernel:

X = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}, \quad K = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}.

The top-left output entry is computed from the top-left 2×2 patch:

Y_{0,0} = 1\cdot1 + 0\cdot2 + 0\cdot4 + (-1)\cdot5 = -4.

The next output entry shifts the kernel one step to the right:

Y_{0,1} = 1\cdot2 + 0\cdot3 + 0\cdot5 + (-1)\cdot6 = -4.

Continuing over all valid positions gives

Y = \begin{bmatrix} -4 & -4 \\ -4 & -4 \end{bmatrix}.

This output is spatially smaller because the kernel cannot be placed near the boundary without extending past the image.

Valid Convolution

When the kernel is applied only where it fully fits inside the input, the operation is called valid convolution.

If

X \in \mathbb{R}^{H \times W}

and

K \in \mathbb{R}^{k_h \times k_w},

then the valid output has shape

(H - k_h + 1) \times (W - k_w + 1).

For example, a 32×32 image convolved with a 3×3 kernel gives an output of size

30 \times 30.

Valid convolution discards boundary positions where the kernel would extend beyond the input.
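
In PyTorch, valid convolution corresponds to padding=0, which is the default:

import torch
import torch.nn as nn

# No padding, so the kernel is applied only where it fully fits.
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3)

x = torch.randn(1, 1, 32, 32)
y = conv(x)

print(y.shape)  # torch.Size([1, 1, 30, 30])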

Padding

Padding adds extra values around the boundary of the input. The most common padding value is zero.

For a two-dimensional input, padding p means adding p rows at the top, p rows at the bottom, p columns on the left, and p columns on the right.

If a 3×3 kernel is used with padding 1, the spatial size can be preserved:

H_{\text{out}} = H, \quad W_{\text{out}} = W.

This is often called same padding.

Padding allows boundary pixels to influence the output. It also makes it easier to build deep networks because repeated convolutions do not shrink the feature maps too quickly.

In PyTorch:

import torch
import torch.nn as nn

conv = nn.Conv2d(
    in_channels=3,
    out_channels=16,
    kernel_size=3,
    padding=1,
)

x = torch.randn(8, 3, 32, 32)
y = conv(x)

print(y.shape)  # torch.Size([8, 16, 32, 32])

The input has 3 channels and spatial size 32×32. The output has 16 channels and the same spatial size.

Stride

Stride controls how far the kernel moves between neighboring positions. A stride of 1 moves the kernel one pixel at a time. A stride of 2 moves it two pixels at a time.

Stride reduces spatial resolution. It is often used to downsample feature maps.

For a one-dimensional intuition, a stride of 1 visits positions

0, 1, 2, 3, \ldots

while a stride of 2 visits

0, 2, 4, 6, \ldots

In two dimensions, stride is applied along height and width.

In PyTorch:

conv = nn.Conv2d(
    in_channels=3,
    out_channels=16,
    kernel_size=3,
    stride=2,
    padding=1,
)

x = torch.randn(8, 3, 32, 32)
y = conv(x)

print(y.shape)  # torch.Size([8, 16, 16, 16])

The convolution reduces 32×32 feature maps to 16×16.

Output Shape Formula

For a two-dimensional convolution with input height H, input width W, kernel size k_h × k_w, padding p_h, p_w, and stride s_h, s_w, the output shape is

H_{\text{out}} = \left\lfloor \frac{H + 2p_h - k_h}{s_h} \right\rfloor + 1,

W_{\text{out}} = \left\lfloor \frac{W + 2p_w - k_w}{s_w} \right\rfloor + 1.

For example, let

H = W = 32, \quad k_h = k_w = 3, \quad p_h = p_w = 1, \quad s_h = s_w = 2.

Then

H_{\text{out}} = \left\lfloor \frac{32 + 2 - 3}{2} \right\rfloor + 1 = \left\lfloor \frac{31}{2} \right\rfloor + 1 = 16.

Similarly,

W_{\text{out}} = 16.

Thus the output spatial shape is 16×16.
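
The formula is easy to wrap in a small helper. The function below (a hypothetical conv_output_size, applied to one spatial dimension at a time) reproduces the example:

def conv_output_size(size, kernel, padding, stride):
    # Floor division implements the floor in the formula.
    return (size + 2 * padding - kernel) // stride + 1

print(conv_output_size(32, kernel=3, padding=1, stride=2))  # 16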

Multiple Input Channels

Real convolutional layers usually have multiple input channels. For an RGB image, the input has three channels:

X \in \mathbb{R}^{3 \times H \times W}.

A single convolutional filter must look across all input channels. Therefore, the kernel for one output channel has shape

K \in \mathbb{R}^{C_{\text{in}} \times k_h \times k_w}.

The output at one spatial location is

Y_{i,j} = \sum_{c=0}^{C_{\text{in}}-1} \sum_{u=0}^{k_h-1} \sum_{v=0}^{k_w-1} K_{c,u,v} X_{c,i+u,j+v}.

The filter combines spatial information and channel information at the same time.

For an RGB input and a 3×3 kernel, one filter contains

3 \cdot 3 \cdot 3 = 27

weights, plus usually one bias scalar.
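
A quick check, using a layer with a single output channel so that the weight tensor holds exactly one filter:

import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=3)

print(conv.weight.shape)    # torch.Size([1, 3, 3, 3])
print(conv.weight.numel())  # 27
print(conv.bias.numel())    # 1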

Multiple Output Channels

A convolutional layer usually learns many filters. Each filter produces one output channel.

If a layer has C_in input channels and C_out output channels, its weight tensor has shape

C_{\text{out}} \times C_{\text{in}} \times k_h \times k_w.

The output has shape

C_{\text{out}} \times H_{\text{out}} \times W_{\text{out}}.

For a batch, the full input and output shapes are

X \in \mathbb{R}^{B \times C_{\text{in}} \times H \times W}, \quad Y \in \mathbb{R}^{B \times C_{\text{out}} \times H_{\text{out}} \times W_{\text{out}}}.

In PyTorch:

conv = nn.Conv2d(
    in_channels=3,
    out_channels=64,
    kernel_size=3,
    padding=1,
)

print(conv.weight.shape)  # torch.Size([64, 3, 3, 3])
print(conv.bias.shape)    # torch.Size([64])

This layer has 64 filters. Each filter sees all 3 input channels and has spatial size 3×3.

The number of weight parameters is

64 \cdot 3 \cdot 3 \cdot 3 = 1728.

With bias, the total number of parameters is

1728 + 64 = 1792.
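
The count can be confirmed by summing over the layer's parameters:

conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)

n_params = sum(p.numel() for p in conv.parameters())
print(n_params)  # 1792 = 1728 weights + 64 biases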

Convolution as a Linear Operation

A convolutional layer is a linear operation before the activation function. If we fix the kernel weights, the output is a linear function of the input. If we fix the input, the output is also a linear function of the kernel weights.

This matters because convolutional neural networks are built from simple components:

\text{convolution} \rightarrow \text{nonlinearity} \rightarrow \text{normalization or pooling}.

Without nonlinear activation functions, stacking convolutional layers would still produce an overall linear map. The nonlinearity gives the network the ability to model more complex functions.
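
The linearity claim can be checked numerically. The sketch below stacks two bias-free convolutions with no activation in between and verifies that the composite map still satisfies f(a x1 + b x2) = a f(x1) + b f(x2):

import torch
import torch.nn as nn

# Two convolutions, no bias, no activation: still a linear map.
net = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=False),
    nn.Conv2d(8, 8, kernel_size=3, padding=1, bias=False),
)

x1 = torch.randn(1, 3, 16, 16)
x2 = torch.randn(1, 3, 16, 16)

lhs = net(2.0 * x1 + 3.0 * x2)
rhs = 2.0 * net(x1) + 3.0 * net(x2)
print(torch.allclose(lhs, rhs, atol=1e-4))  # True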

Parameter Sharing

A fully connected layer gives each input-output pair its own parameter. For images, this quickly becomes expensive.

Suppose an image has shape

3 \times 224 \times 224.

Flattening it gives

150528

input values. A fully connected layer with 1000 output units would require

150528 \cdot 1000 = 150{,}528{,}000

weights.

A convolutional layer with 3 input channels, 64 output channels, and 3×3 kernels requires only

64 \cdot 3 \cdot 3 \cdot 3 = 1728

weights.

This reduction comes from parameter sharing. The same kernel is used at every spatial location. This encodes the assumption that useful local patterns may appear anywhere in the image.
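
The comparison takes a few lines. The fully connected count is computed arithmetically here, since actually allocating a 150528-to-1000 linear layer would hold roughly 150 million floats:

import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, bias=False)
print(conv.weight.numel())   # 1728

print(3 * 224 * 224 * 1000)  # 150528000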

Sparse Connectivity

A convolutional layer also uses sparse connectivity. Each output value depends only on a small local region of the input.

For a 3×3 kernel, each output location depends on only nine spatial positions per input channel. A fully connected layer would connect each output value to every input value.

Sparse connectivity makes convolution efficient and gives it an inductive bias toward local pattern recognition.

The local region that influences one output value is called its receptive field.

Receptive Field

The receptive field of an output unit is the region of the input that can affect it.

In a single 3×3 convolution, each output value sees a 3×3 input patch. After stacking two 3×3 convolutions with stride 1, an output value can depend on a 5×5 region of the original input. After three such layers, it can depend on a 7×7 region.

Thus deep convolutional networks build large receptive fields from small kernels.

This is one reason 3×3 kernels are common. They are parameter efficient, but when stacked they can still capture large spatial context.
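
The growth can be made explicit with a small helper (a hypothetical receptive_field, assuming a stack of identical convolutions) that applies the standard recurrence: each layer adds (k - 1) times the product of the strides of the layers before it.

def receptive_field(num_layers, kernel_size=3, stride=1):
    rf, jump = 1, 1
    for _ in range(num_layers):
        rf += (kernel_size - 1) * jump  # each layer widens the field
        jump *= stride                  # stride compounds across layers
    return rf

for n in (1, 2, 3):
    print(n, receptive_field(n))  # prints 1 3, then 2 5, then 3 7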

Convolution and Translation Equivariance

Convolution is translation equivariant. Informally, if the input shifts, the output shifts in the same way.

Let T be a spatial shift operator and let f be a convolution operation. Then convolution satisfies

f(TX) = T f(X).

This property means the same feature detector works across locations. If an edge moves from the left side of an image to the right side, the corresponding feature activation also moves.

Translation equivariance is different from translation invariance. Equivariance preserves location in a transformed form. Invariance removes sensitivity to location. Pooling and global averaging can help build invariance from equivariant feature maps.
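
Equivariance can be demonstrated exactly when the shift wraps around, which circular padding provides; with ordinary zero padding the identity holds only away from the boundary. A minimal check:

import torch
import torch.nn as nn

conv = nn.Conv2d(1, 1, kernel_size=3, padding=1,
                 padding_mode="circular", bias=False)

x = torch.randn(1, 1, 8, 8)

# Shift the input, then convolve; convolve, then shift the output.
lhs = conv(torch.roll(x, shifts=(2, 3), dims=(2, 3)))
rhs = torch.roll(conv(x), shifts=(2, 3), dims=(2, 3))

print(torch.allclose(lhs, rhs, atol=1e-5))  # True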

Convolution in PyTorch

The main PyTorch class for two-dimensional convolution is torch.nn.Conv2d.

A typical layer is written as:

import torch
import torch.nn as nn

conv = nn.Conv2d(
    in_channels=3,
    out_channels=32,
    kernel_size=3,
    stride=1,
    padding=1,
)

x = torch.randn(16, 3, 64, 64)
y = conv(x)

print(y.shape)  # torch.Size([16, 32, 64, 64])

The arguments mean:

Argument       Meaning
in_channels    Number of input channels
out_channels   Number of learned filters
kernel_size    Spatial size of each filter
stride         Step size of the sliding filter
padding        Boundary padding added to the input
bias           Whether to include a bias per output channel

The layer stores its learnable weights in conv.weight and its bias in conv.bias.

print(conv.weight.shape)  # torch.Size([32, 3, 3, 3])
print(conv.bias.shape)    # torch.Size([32])

Manual Convolution in PyTorch

A small convolution can be implemented directly to show the mechanics.

import torch

def conv2d_single_channel(x, k):
    h, w = x.shape
    kh, kw = k.shape

    # Valid convolution: the kernel must fit entirely inside the input.
    out_h = h - kh + 1
    out_w = w - kw + 1

    y = torch.empty(out_h, out_w)

    for i in range(out_h):
        for j in range(out_w):
            # Multiply the kernel with the current patch elementwise
            # and sum to a single output value (cross-correlation).
            patch = x[i:i + kh, j:j + kw]
            y[i, j] = (patch * k).sum()

    return y

x = torch.tensor([
    [1., 2., 3.],
    [4., 5., 6.],
    [7., 8., 9.],
])

k = torch.tensor([
    [1., 0.],
    [0., -1.],
])

y = conv2d_single_channel(x, k)
print(y)

The result is

tensor([[-4., -4.],
        [-4., -4.]])

This code is slow because it uses Python loops. PyTorch convolution kernels use optimized implementations that run efficiently on CPUs and GPUs.
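
Continuing with the x and k defined above, the result can be checked against the built-in operation, which expects batch and channel dimensions:

import torch.nn.functional as F

# F.conv2d takes inputs of shape (B, C, H, W) and weights of
# shape (C_out, C_in, kh, kw), so add singleton dimensions.
y_ref = F.conv2d(x.view(1, 1, 3, 3), k.view(1, 1, 2, 2))
print(y_ref.view(2, 2))  # same values as the loop implementation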

Convolution, Bias, and Activation

A convolutional layer usually computes an affine operation:

Z_{b,o,i,j} = b_o + \sum_{c=0}^{C_{\text{in}}-1} \sum_{u=0}^{k_h-1} \sum_{v=0}^{k_w-1} W_{o,c,u,v} X_{b,c,i+u,j+v}.

Here:

Symbol   Meaning
b        Batch index
o        Output channel index
c        Input channel index
i, j     Output spatial position
u, v     Kernel spatial position
W        Convolution weight tensor
b_o      Bias for output channel o

After this affine operation, a nonlinear activation is commonly applied:

Y = \sigma(Z).

For example:

layer = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.ReLU(),
)

The convolution produces feature maps. The activation introduces nonlinearity.

Common Kernel Sizes

Common convolution kernel sizes include 1×1, 3×3, 5×5, and 7×7.

Kernel size   Common role
1×1           Mix channels without spatial context
3×3           Standard local pattern extraction
5×5           Larger local context
7×7           Early-stage large receptive field

A 1×1 convolution may look strange at first because it has no spatial extent beyond one pixel. Its role is to mix channel information at each location. If the input has C_in channels and the output has C_out channels, a 1×1 convolution applies a learned linear transformation from \mathbb{R}^{C_{\text{in}}} to \mathbb{R}^{C_{\text{out}}} at every spatial location.

This is useful for changing channel dimension, reducing computation, and combining features.
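
The per-location linear view can be verified directly. The sketch below compares a 1×1 convolution against the equivalent matrix multiplication applied to every pixel's channel vector (the einsum subscripts are arbitrary labels):

import torch
import torch.nn as nn

conv1x1 = nn.Conv2d(in_channels=64, out_channels=16, kernel_size=1, bias=False)

x = torch.randn(1, 64, 8, 8)
y = conv1x1(x)

# Drop the 1x1 spatial dims to get a (16, 64) matrix, then apply it
# to the channel vector at every spatial location.
w = conv1x1.weight.view(16, 64)
y_linear = torch.einsum("oc,bchw->bohw", w, x)

print(torch.allclose(y, y_linear, atol=1e-6))  # True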

What a Convolution Learns

Early convolutional layers often learn simple local features such as edges, corners, color contrasts, and texture patterns. Deeper layers combine these into larger structures such as object parts and object-level representations.

The kernel values are learned by gradient descent. We do not hand-code the filters. During training, the network adjusts them to reduce the loss.

For a classification model, this means the convolutional filters become useful for separating classes. For a segmentation model, they become useful for predicting pixel-level labels. For a generative model, they become useful for constructing or denoising spatial structure.

Summary

A convolution applies a small learned kernel across spatial locations. The same kernel is reused at every location, producing parameter sharing. Each output value depends on a local input patch, producing sparse connectivity. These two properties make convolution efficient and effective for image-like data.

A two-dimensional convolution layer maps

[B, C_{\text{in}}, H, W]

to

[B, C_{\text{out}}, H_{\text{out}}, W_{\text{out}}].

The output size depends on kernel size, padding, and stride. The number of parameters depends on input channels, output channels, and kernel size.

Convolution is the basic operation behind classical convolutional neural networks, many vision models, segmentation systems, image generation models, and parts of modern multimodal architectures.