
Convolution Operations

Convolution is the central operation in convolutional neural networks. It gives a neural network a way to process spatial data, such as images, by applying the same small pattern detector across many locations. Instead of connecting every input value to every output value, a convolutional layer uses local connections and shared parameters.

This makes convolutional networks efficient and well suited to data with spatial structure. An edge in the top-left corner of an image and an edge in the bottom-right corner are different in position, but similar in local pattern. A convolutional layer can detect both using the same learned weights.

Local Structure in Images

An image can be represented as a tensor. A grayscale image of height H and width W is often written as

X \in \mathbb{R}^{H \times W}.

A color image has channels. For an RGB image,

X \in \mathbb{R}^{C \times H \times W},

where C = 3. The three channels correspond to red, green, and blue.

In PyTorch, a batch of images is usually stored as

X \in \mathbb{R}^{B \times C \times H \times W},

where B is the batch size.

The key observation is that nearby pixels are strongly related. A small group of neighboring pixels may form an edge, a corner, a texture, or a small part of an object. A convolutional layer exploits this locality.

The Kernel

A convolution uses a small array of learnable weights called a kernel or filter. For a two-dimensional grayscale image, a kernel may have shape

K \in \mathbb{R}^{k_h \times k_w},

where k_h is the kernel height and k_w is the kernel width.

For example, a 3×3 kernel has nine parameters:

K = \begin{bmatrix} k_{11} & k_{12} & k_{13} \\ k_{21} & k_{22} & k_{23} \\ k_{31} & k_{32} & k_{33} \end{bmatrix}.

The kernel is placed over a local region of the image. The layer multiplies corresponding entries and sums the result. Then the kernel moves to another location and repeats the same computation.

This sliding operation produces an output image-like array called a feature map.

Two-Dimensional Cross-Correlation

In deep learning libraries, the operation called convolution is usually cross-correlation. The distinction is minor for implementation, but important mathematically.

For an input image X and a kernel K, the two-dimensional cross-correlation output Y is defined by

Y_{i,j} = \sum_{u=0}^{k_h-1} \sum_{v=0}^{k_w-1} K_{u,v} X_{i+u,j+v}.

The kernel is applied directly, without flipping.

In mathematical convolution, the kernel is flipped before applying it:

Y_{i,j} = \sum_{u=0}^{k_h-1} \sum_{v=0}^{k_w-1} K_{u,v} X_{i-u,j-v}.

Most neural network texts and libraries still use the word convolution because the kernel weights are learned. Whether the learned pattern is stored flipped or unflipped does not change the expressive power of the layer.
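
The difference is easy to see numerically. Below is a minimal sketch using torch.nn.functional.conv2d, which, like nn.Conv2d, computes cross-correlation. Flipping the kernel along both spatial axes before applying it gives the mathematically convolved output (up to an index shift):

import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 5, 5)  # batch of one single-channel image
k = torch.randn(1, 1, 3, 3)  # one filter

# What the library calls convolution: cross-correlation with k.
cross_corr = F.conv2d(x, k)

# Mathematical convolution: cross-correlation with the flipped kernel.
flipped = torch.flip(k, dims=(2, 3))
math_conv = F.conv2d(x, flipped)

# The two differ unless the kernel happens to be symmetric.
print(torch.equal(cross_corr, math_conv))  # False for a random kernel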

A Simple Example

Consider a 3×3 input and a 2×2 kernel:

X = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}, \quad K = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}.

The top-left output entry is computed from the top-left 2×2 patch:

Y_{0,0} = 1\cdot1 + 0\cdot2 + 0\cdot4 + (-1)\cdot5 = -4.

The next output entry shifts the kernel one step to the right:

Y_{0,1} = 1\cdot2 + 0\cdot3 + 0\cdot5 + (-1)\cdot6 = -4.

Continuing over all valid positions gives

Y = \begin{bmatrix} -4 & -4 \\ -4 & -4 \end{bmatrix}.

This output is spatially smaller because the kernel cannot be placed near the boundary without extending past the image.

Valid Convolution

When the kernel is applied only where it fully fits inside the input, the operation is called valid convolution.

If

X \in \mathbb{R}^{H \times W}

and

K \in \mathbb{R}^{k_h \times k_w},

then the valid output has shape

(H - k_h + 1) \times (W - k_w + 1).

For example, a 32×32 image convolved with a 3×3 kernel gives an output of size

30 \times 30.

Valid convolution discards boundary positions where the kernel would extend beyond the input.
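
In PyTorch, valid convolution corresponds to padding=0, which is the default:

import torch
import torch.nn as nn

# No padding, so the kernel is applied only where it fully fits.
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3)

x = torch.randn(1, 1, 32, 32)
y = conv(x)

print(y.shape)  # torch.Size([1, 1, 30, 30])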

Padding

Padding adds extra values around the boundary of the input. The most common padding value is zero.

For a two-dimensional input, padding p means adding p rows at the top, p rows at the bottom, p columns on the left, and p columns on the right.

If a 3×3 kernel is used with padding 1, the spatial size can be preserved:

H_{\text{out}} = H, \quad W_{\text{out}} = W.

This is often called same padding.

Padding allows boundary pixels to influence the output. It also makes it easier to build deep networks because repeated convolutions do not shrink the feature maps too quickly.

In PyTorch:

import torch
import torch.nn as nn

conv = nn.Conv2d(
    in_channels=3,
    out_channels=16,
    kernel_size=3,
    padding=1,
)

x = torch.randn(8, 3, 32, 32)
y = conv(x)

print(y.shape)  # torch.Size([8, 16, 32, 32])

The input has 3 channels and spatial size 32×32. The output has 16 channels and the same spatial size.

Stride

Stride controls how far the kernel moves between neighboring positions. A stride of 1 moves the kernel one pixel at a time. A stride of 2 moves it two pixels at a time.

Stride reduces spatial resolution. It is often used to downsample feature maps.

For a one-dimensional intuition, a stride of 1 visits positions

0, 1, 2, 3, \ldots

while a stride of 2 visits

0, 2, 4, 6, \ldots

In two dimensions, stride is applied along height and width.

In PyTorch:

conv = nn.Conv2d(
    in_channels=3,
    out_channels=16,
    kernel_size=3,
    stride=2,
    padding=1,
)

x = torch.randn(8, 3, 32, 32)
y = conv(x)

print(y.shape)  # torch.Size([8, 16, 16, 16])

The convolution reduces 32×32 feature maps to 16×16.

Output Shape Formula

For a two-dimensional convolution with input height H, input width W, kernel size k_h × k_w, padding p_h, p_w, and stride s_h, s_w, the output shape is

H_{\text{out}} = \left\lfloor \frac{H + 2p_h - k_h}{s_h} \right\rfloor + 1,

W_{\text{out}} = \left\lfloor \frac{W + 2p_w - k_w}{s_w} \right\rfloor + 1.

For example, let

H = W = 32, \quad k_h = k_w = 3, \quad p_h = p_w = 1, \quad s_h = s_w = 2.

Then

H_{\text{out}} = \left\lfloor \frac{32 + 2 - 3}{2} \right\rfloor + 1 = \left\lfloor \frac{31}{2} \right\rfloor + 1 = 16.

Similarly,

W_{\text{out}} = 16.

Thus the output spatial shape is 16×16.
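
The formula is easy to wrap in a small helper. The function below (a hypothetical conv_output_size, applied to one spatial dimension at a time) reproduces the example:

def conv_output_size(size, kernel, padding, stride):
    # Floor division implements the floor in the formula.
    return (size + 2 * padding - kernel) // stride + 1

print(conv_output_size(32, kernel=3, padding=1, stride=2))  # 16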

Multiple Input Channels

Real convolutional layers usually have multiple input channels. For an RGB image, the input has three channels:

X \in \mathbb{R}^{3 \times H \times W}.

A single convolutional filter must look across all input channels. Therefore, the kernel for one output channel has shape

K \in \mathbb{R}^{C_{\text{in}} \times k_h \times k_w}.

The output at one spatial location is

Y_{i,j} = \sum_{c=0}^{C_{\text{in}}-1} \sum_{u=0}^{k_h-1} \sum_{v=0}^{k_w-1} K_{c,u,v} X_{c,i+u,j+v}.

The filter combines spatial information and channel information at the same time.

For an RGB input and a 3×3 kernel, one filter contains

3 \cdot 3 \cdot 3 = 27

weights, plus usually one bias scalar.
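
A quick check, using a layer with a single output channel so that the weight tensor holds exactly one filter:

import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=3)

print(conv.weight.shape)    # torch.Size([1, 3, 3, 3])
print(conv.weight.numel())  # 27
print(conv.bias.numel())    # 1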

Multiple Output Channels

A convolutional layer usually learns many filters. Each filter produces one output channel.

If a layer has C_in input channels and C_out output channels, its weight tensor has shape

C_{\text{out}} \times C_{\text{in}} \times k_h \times k_w.

The output has shape

C_{\text{out}} \times H_{\text{out}} \times W_{\text{out}}.

For a batch, the full input and output shapes are

X \in \mathbb{R}^{B \times C_{\text{in}} \times H \times W}, \quad Y \in \mathbb{R}^{B \times C_{\text{out}} \times H_{\text{out}} \times W_{\text{out}}}.

In PyTorch:

conv = nn.Conv2d(
    in_channels=3,
    out_channels=64,
    kernel_size=3,
    padding=1,
)

print(conv.weight.shape)  # torch.Size([64, 3, 3, 3])
print(conv.bias.shape)    # torch.Size([64])

This layer has 64 filters. Each filter sees all 3 input channels and has spatial size 3×3.

The number of weight parameters is

64 \cdot 3 \cdot 3 \cdot 3 = 1728.

With bias, the total number of parameters is

1728 + 64 = 1792.
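
The count can be confirmed by summing over the layer's parameters:

conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)

n_params = sum(p.numel() for p in conv.parameters())
print(n_params)  # 1792 = 1728 weights + 64 biases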

Convolution as a Linear Operation

A convolutional layer is a linear operation before the activation function. If we fix the kernel weights, the output is a linear function of the input. If we fix the input, the output is also a linear function of the kernel weights.

This matters because convolutional neural networks are built from simple components:

\text{convolution} \rightarrow \text{nonlinearity} \rightarrow \text{normalization or pooling}.

Without nonlinear activation functions, stacking convolutional layers would still produce an overall linear map. The nonlinearity gives the network the ability to model more complex functions.
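
The linearity claim can be checked numerically. The sketch below stacks two bias-free convolutions with no activation in between and verifies that the composite map still satisfies f(a x1 + b x2) = a f(x1) + b f(x2):

import torch
import torch.nn as nn

# Two convolutions, no bias, no activation: still a linear map.
net = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=False),
    nn.Conv2d(8, 8, kernel_size=3, padding=1, bias=False),
)

x1 = torch.randn(1, 3, 16, 16)
x2 = torch.randn(1, 3, 16, 16)

lhs = net(2.0 * x1 + 3.0 * x2)
rhs = 2.0 * net(x1) + 3.0 * net(x2)
print(torch.allclose(lhs, rhs, atol=1e-4))  # True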

Parameter Sharing

A fully connected layer gives each input-output pair its own parameter. For images, this quickly becomes expensive.

Suppose an image has shape

3 \times 224 \times 224.

Flattening it gives

150528

input values. A fully connected layer with 1000 output units would require

150528 \cdot 1000 = 150{,}528{,}000

weights.

A convolutional layer with 3 input channels, 64 output channels, and 3×3 kernels requires only

64 \cdot 3 \cdot 3 \cdot 3 = 1728

weights.

This reduction comes from parameter sharing. The same kernel is used at every spatial location. This encodes the assumption that useful local patterns may appear anywhere in the image.
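
The comparison takes a few lines. The fully connected count is computed arithmetically here, since actually allocating a 150528-to-1000 linear layer would hold roughly 150 million floats:

import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, bias=False)
print(conv.weight.numel())   # 1728

print(3 * 224 * 224 * 1000)  # 150528000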

Sparse Connectivity

A convolutional layer also uses sparse connectivity. Each output value depends only on a small local region of the input.

For a 3×3 kernel, each output location depends on only nine spatial positions per input channel. A fully connected layer would connect each output value to every input value.

Sparse connectivity makes convolution efficient and gives it an inductive bias toward local pattern recognition.

The local region that influences one output value is called its receptive field.

Receptive Field

The receptive field of an output unit is the region of the input that can affect it.

In a single 3×3 convolution, each output value sees a 3×3 input patch. After stacking two 3×3 convolutions with stride 1, an output value can depend on a 5×5 region of the original input. After three such layers, it can depend on a 7×7 region.

Thus deep convolutional networks build large receptive fields from small kernels.

This is one reason 3×3 kernels are common. They are parameter efficient, but when stacked they can still capture large spatial context.
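
The growth can be made explicit with a small helper (a hypothetical receptive_field, assuming a stack of identical convolutions) that applies the standard recurrence: each layer adds (k - 1) times the product of the strides of the layers before it.

def receptive_field(num_layers, kernel_size=3, stride=1):
    rf, jump = 1, 1
    for _ in range(num_layers):
        rf += (kernel_size - 1) * jump  # each layer widens the field
        jump *= stride                  # stride compounds across layers
    return rf

for n in (1, 2, 3):
    print(n, receptive_field(n))  # prints 1 3, then 2 5, then 3 7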

Convolution and Translation Equivariance

Convolution is translation equivariant. Informally, if the input shifts, the output shifts in the same way.

Let T be a spatial shift operator and let f be a convolution operation. Then convolution satisfies

f(TX) = T f(X).

This property means the same feature detector works across locations. If an edge moves from the left side of an image to the right side, the corresponding feature activation also moves.

Translation equivariance is different from translation invariance. Equivariance preserves location in a transformed form. Invariance removes sensitivity to location. Pooling and global averaging can help build invariance from equivariant feature maps.
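
Equivariance can be demonstrated exactly when the shift wraps around, which circular padding provides; with ordinary zero padding the identity holds only away from the boundary. A minimal check:

import torch
import torch.nn as nn

conv = nn.Conv2d(1, 1, kernel_size=3, padding=1,
                 padding_mode="circular", bias=False)

x = torch.randn(1, 1, 8, 8)

# Shift the input, then convolve; convolve, then shift the output.
lhs = conv(torch.roll(x, shifts=(2, 3), dims=(2, 3)))
rhs = torch.roll(conv(x), shifts=(2, 3), dims=(2, 3))

print(torch.allclose(lhs, rhs, atol=1e-5))  # True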

Convolution in PyTorch

The main PyTorch class for two-dimensional convolution is torch.nn.Conv2d.

A typical layer is written as:

import torch
import torch.nn as nn

conv = nn.Conv2d(
    in_channels=3,
    out_channels=32,
    kernel_size=3,
    stride=1,
    padding=1,
)

x = torch.randn(16, 3, 64, 64)
y = conv(x)

print(y.shape)  # torch.Size([16, 32, 64, 64])

The arguments mean:

Argument       Meaning
in_channels    Number of input channels
out_channels   Number of learned filters
kernel_size    Spatial size of each filter
stride         Step size of the sliding filter
padding        Boundary padding added to the input
bias           Whether to include a bias per output channel

The layer stores its learnable weights in conv.weight and its bias in conv.bias.

print(conv.weight.shape)  # torch.Size([32, 3, 3, 3])
print(conv.bias.shape)    # torch.Size([32])

Manual Convolution in PyTorch

A small convolution can be implemented directly to show the mechanics.

import torch

def conv2d_single_channel(x, k):
    h, w = x.shape
    kh, kw = k.shape

    # Valid convolution: the kernel must fit entirely inside the input.
    out_h = h - kh + 1
    out_w = w - kw + 1

    y = torch.empty(out_h, out_w)

    for i in range(out_h):
        for j in range(out_w):
            # Multiply the kernel with the current patch elementwise
            # and sum to a single output value (cross-correlation).
            patch = x[i:i + kh, j:j + kw]
            y[i, j] = (patch * k).sum()

    return y

x = torch.tensor([
    [1., 2., 3.],
    [4., 5., 6.],
    [7., 8., 9.],
])

k = torch.tensor([
    [1., 0.],
    [0., -1.],
])

y = conv2d_single_channel(x, k)
print(y)

The result is

tensor([[-4., -4.],
        [-4., -4.]])

This code is slow because it uses Python loops. PyTorch convolution kernels use optimized implementations that run efficiently on CPUs and GPUs.
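
Continuing with the x and k defined above, the result can be checked against the built-in operation, which expects batch and channel dimensions:

import torch.nn.functional as F

# F.conv2d takes inputs of shape (B, C, H, W) and weights of
# shape (C_out, C_in, kh, kw), so add singleton dimensions.
y_ref = F.conv2d(x.view(1, 1, 3, 3), k.view(1, 1, 2, 2))
print(y_ref.view(2, 2))  # same values as the loop implementation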

Convolution, Bias, and Activation

A convolutional layer usually computes an affine operation:

Z_{b,o,i,j} = b_o + \sum_{c=0}^{C_{\text{in}}-1} \sum_{u=0}^{k_h-1} \sum_{v=0}^{k_w-1} W_{o,c,u,v} X_{b,c,i+u,j+v}.

Here:

Symbol   Meaning
b        Batch index
o        Output channel index
c        Input channel index
i, j     Output spatial position
u, v     Kernel spatial position
W        Convolution weight tensor
b_o      Bias for output channel o

After this affine operation, a nonlinear activation is commonly applied:

Y = \sigma(Z).

For example:

layer = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.ReLU(),
)

The convolution produces feature maps. The activation introduces nonlinearity.

Common Kernel Sizes

Common convolution kernel sizes include 1×1, 3×3, 5×5, and 7×7.

Kernel size   Common role
1×1           Mix channels without spatial context
3×3           Standard local pattern extraction
5×5           Larger local context
7×7           Early-stage large receptive field

A 1×1 convolution may look strange at first because it has no spatial extent beyond one pixel. Its role is to mix channel information at each location. If the input has C_in channels and the output has C_out channels, a 1×1 convolution applies a learned linear transformation from \mathbb{R}^{C_{\text{in}}} to \mathbb{R}^{C_{\text{out}}} at every spatial location.

This is useful for changing channel dimension, reducing computation, and combining features.
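
The per-location linear view can be verified directly. The sketch below compares a 1×1 convolution against the equivalent matrix multiplication applied to every pixel's channel vector (the einsum subscripts are arbitrary labels):

import torch
import torch.nn as nn

conv1x1 = nn.Conv2d(in_channels=64, out_channels=16, kernel_size=1, bias=False)

x = torch.randn(1, 64, 8, 8)
y = conv1x1(x)

# Drop the 1x1 spatial dims to get a (16, 64) matrix, then apply it
# to the channel vector at every spatial location.
w = conv1x1.weight.view(16, 64)
y_linear = torch.einsum("oc,bchw->bohw", w, x)

print(torch.allclose(y, y_linear, atol=1e-6))  # True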

What a Convolution Learns

Early convolutional layers often learn simple local features such as edges, corners, color contrasts, and texture patterns. Deeper layers combine these into larger structures such as object parts and object-level representations.

The kernel values are learned by gradient descent. We do not hand-code the filters. During training, the network adjusts them to reduce the loss.

For a classification model, this means the convolutional filters become useful for separating classes. For a segmentation model, they become useful for predicting pixel-level labels. For a generative model, they become useful for constructing or denoising spatial structure.

Summary

A convolution applies a small learned kernel across spatial locations. The same kernel is reused at every location, producing parameter sharing. Each output value depends on a local input patch, producing sparse connectivity. These two properties make convolution efficient and effective for image-like data.

A two-dimensional convolution layer maps

[B, C_{\text{in}}, H, W]

to

[B, C_{\text{out}}, H_{\text{out}}, W_{\text{out}}].

The output size depends on kernel size, padding, and stride. The number of parameters depends on input channels, output channels, and kernel size.

Convolution is the basic operation behind classical convolutional neural networks, many vision models, segmentation systems, image generation models, and parts of modern multimodal architectures.