Pooling Layers

Pooling is a downsampling operation used in convolutional neural networks. It reduces the spatial size of a feature map while keeping the most important local information. A pooling layer has no learned weights. It applies a fixed rule, such as taking the maximum or average value inside a local window.

Pooling is commonly used after convolution and activation:

$$\text{convolution} \rightarrow \text{activation} \rightarrow \text{pooling}.$$

The purpose is to reduce computation, increase the receptive field of later layers, and make the representation less sensitive to small spatial shifts.

Why Pooling Is Used

A convolutional layer preserves spatial structure. If the input has shape

$$[B, C, H, W],$$

then a convolution may produce another tensor with similar spatial size:

$$[B, C_{\text{out}}, H, W].$$

If every layer keeps the same height and width, computation remains expensive. Deeper layers also need large receptive fields to understand larger objects. Pooling reduces height and width, so later layers operate on smaller feature maps.

For example, a $2 \times 2$ pooling layer with stride $2$ maps

$$[B, C, 32, 32]$$

to

$$[B, C, 16, 16].$$

The number of spatial positions is reduced by a factor of four.

Max Pooling

Max pooling takes the largest value inside each local window.

For a $2 \times 2$ window,

$$\begin{bmatrix} 1 & 3 \\ 2 & 0 \end{bmatrix},$$

max pooling returns

$$3.$$

The operation keeps the strongest activation in each region. This is useful because many convolutional filters act like feature detectors. A high activation means the feature is present. Max pooling keeps the strongest evidence for that feature.

For an input $X$, max pooling computes

$$Y_{c,i,j} = \max_{0 \leq u < k_h,\; 0 \leq v < k_w} X_{c,\, i s_h + u,\, j s_w + v}.$$

Here $k_h$ and $k_w$ are the pooling window height and width, and $s_h$, $s_w$ are the strides.
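To make the indexing concrete, here is a direct, unoptimized sketch of this formula. The helper max_pool2d_naive is written here for illustration only; it assumes a single [C, H, W] tensor and no padding.

import torch

def max_pool2d_naive(x, kh, kw, sh, sw):
    # x: tensor of shape [C, H, W]; no padding is applied.
    C, H, W = x.shape
    H_out = (H - kh) // sh + 1
    W_out = (W - kw) // sw + 1
    y = torch.empty(C, H_out, W_out)
    for c in range(C):
        for i in range(H_out):
            for j in range(W_out):
                # Window starting at (i*sh, j*sw), of size kh x kw.
                window = x[c, i * sh : i * sh + kh, j * sw : j * sw + kw]
                y[c, i, j] = window.max()
    return y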

In PyTorch:

import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)

x = torch.randn(8, 16, 32, 32)
y = pool(x)

print(y.shape)  # torch.Size([8, 16, 16, 16])

The channel count stays the same. Pooling changes the spatial dimensions, not the number of channels.

A Small Max Pooling Example

Consider one feature map:

$$X = \begin{bmatrix} 1 & 3 & 2 & 4 \\ 5 & 6 & 1 & 2 \\ 0 & 1 & 7 & 8 \\ 3 & 2 & 4 & 9 \end{bmatrix}.$$

Apply $2 \times 2$ max pooling with stride $2$. The four windows are:

$$\begin{bmatrix} 1 & 3 \\ 5 & 6 \end{bmatrix}, \quad \begin{bmatrix} 2 & 4 \\ 1 & 2 \end{bmatrix}, \quad \begin{bmatrix} 0 & 1 \\ 3 & 2 \end{bmatrix}, \quad \begin{bmatrix} 7 & 8 \\ 4 & 9 \end{bmatrix}.$$

Taking the maximum from each window gives

$$Y = \begin{bmatrix} 6 & 4 \\ 3 & 9 \end{bmatrix}.$$

The spatial size is reduced from $4 \times 4$ to $2 \times 2$.
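The result can be checked directly in PyTorch:

import torch
import torch.nn as nn

x = torch.tensor([[[[1., 3., 2., 4.],
                    [5., 6., 1., 2.],
                    [0., 1., 7., 8.],
                    [3., 2., 4., 9.]]]])

pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(x))
# tensor([[[[6., 4.],
#           [3., 9.]]]])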

Average Pooling

Average pooling returns the mean value inside each local window.

For a $2 \times 2$ window,

$$\begin{bmatrix} 1 & 3 \\ 2 & 0 \end{bmatrix},$$

average pooling returns

$$\frac{1 + 3 + 2 + 0}{4} = 1.5.$$

Average pooling smooths local information. It keeps the average response rather than the strongest response.

In PyTorch:

pool = nn.AvgPool2d(kernel_size=2, stride=2)

x = torch.randn(8, 16, 32, 32)
y = pool(x)

print(y.shape)  # torch.Size([8, 16, 16, 16])

Max pooling is often used when we want to detect whether a feature is present. Average pooling is often used when we want to summarize a region.

Pooling Output Shape

Pooling uses the same basic output-size rule as convolution. For input height $H$, input width $W$, window size $k_h \times k_w$, padding $p_h, p_w$, and stride $s_h, s_w$,

$$H_{\text{out}} = \left\lfloor \frac{H + 2p_h - k_h}{s_h} \right\rfloor + 1,$$

$$W_{\text{out}} = \left\lfloor \frac{W + 2p_w - k_w}{s_w} \right\rfloor + 1.$$

For a $32 \times 32$ input, $2 \times 2$ pooling, stride $2$, and no padding:

$$H_{\text{out}} = \left\lfloor \frac{32 - 2}{2} \right\rfloor + 1 = 16.$$

The output is $16 \times 16$.
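The same rule is easy to compute in code. The helper pool_output_size below is written here for illustration and is not part of PyTorch:

import math

def pool_output_size(size, kernel, stride, padding=0):
    # Output size along one spatial dimension.
    return math.floor((size + 2 * padding - kernel) / stride) + 1

print(pool_output_size(32, kernel=2, stride=2))  # 16
print(pool_output_size(32, kernel=3, stride=2))  # 15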

Pooling Does Not Mix Channels

A standard pooling layer operates independently on each channel. It does not combine information across channels.

If the input has shape

$$[B, C, H, W],$$

then pooling produces

$$[B, C, H_{\text{out}}, W_{\text{out}}].$$

The channel count remains $C$.

This differs from convolution. A convolutional layer can combine information across channels because each output channel has weights over all input channels. Pooling only summarizes local spatial neighborhoods within each channel.
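This independence is easy to confirm: pooling one channel in isolation gives the same values as pooling the full tensor and then selecting that channel. A quick check, assuming the imports from earlier:

pool = nn.MaxPool2d(kernel_size=2, stride=2)
x = torch.randn(1, 3, 4, 4)

# Channel 1 pooled alone matches channel 1 of the pooled tensor.
print(torch.equal(pool(x)[:, 1:2], pool(x[:, 1:2])))  # True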

Translation Robustness

Pooling makes a representation less sensitive to small spatial shifts.

Suppose a convolutional filter detects an edge. If the edge moves by one pixel, the activation may move by one pixel. Max pooling over a nearby region can produce the same output as long as the strong activation remains inside the pooling window.

This gives a limited form of translation robustness. It helps classification models because the exact pixel location of a feature often matters less than whether the feature is present.

Pooling does not create complete translation invariance. Large shifts still change the representation. Pooling only reduces sensitivity over small local regions.
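As a small illustration, shifting a single strong activation by one pixel, while it stays inside the same pooling window, leaves the pooled output unchanged:

pool = nn.MaxPool2d(kernel_size=2, stride=2)

# One strong activation at position (0, 0).
x = torch.zeros(1, 1, 4, 4)
x[0, 0, 0, 0] = 1.0

# The same activation shifted one pixel right, still inside the
# first 2x2 pooling window.
x_shifted = torch.zeros(1, 1, 4, 4)
x_shifted[0, 0, 0, 1] = 1.0

print(torch.equal(pool(x), pool(x_shifted)))  # True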

Overlapping Pooling

Pooling windows can overlap. This happens when the stride is smaller than the window size.

For example:

pool = nn.MaxPool2d(kernel_size=3, stride=2)

Here each window is $3 \times 3$, but the window moves two pixels at a time. Neighboring windows overlap.

Overlapping pooling can reduce spatial size while keeping smoother transitions between neighboring outputs. It was used in some classical CNN architectures.
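Applying the output-size rule from above, a $3 \times 3$ window with stride $2$ on a $32 \times 32$ input gives $\lfloor (32 - 3)/2 \rfloor + 1 = 15$. Continuing with the pool layer just defined:

x = torch.randn(8, 16, 32, 32)
y = pool(x)

print(y.shape)  # torch.Size([8, 16, 15, 15])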

Global Average Pooling

Global average pooling averages each channel over the full spatial extent.

If

$$X \in \mathbb{R}^{B \times C \times H \times W},$$

then global average pooling produces

$$Y \in \mathbb{R}^{B \times C}.$$

For each batch item and channel,

$$Y_{b,c} = \frac{1}{HW} \sum_{i=0}^{H-1} \sum_{j=0}^{W-1} X_{b,c,i,j}.$$

This converts each feature map into one number.

Global average pooling is common near the end of a CNN classifier. It replaces large fully connected heads with a simple spatial summary.

In PyTorch:

pool = nn.AdaptiveAvgPool2d((1, 1))

x = torch.randn(8, 512, 7, 7)
y = pool(x)

print(y.shape)  # torch.Size([8, 512, 1, 1])

y = y.flatten(1)
print(y.shape)  # torch.Size([8, 512])

The result can be passed to a linear classifier.

classifier = nn.Linear(512, 1000)
logits = classifier(y)

print(logits.shape)  # torch.Size([8, 1000])
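The adaptive pooling call above computes exactly the mean in the formula, so the same summary can also be obtained with a plain spatial mean:

x = torch.randn(8, 512, 7, 7)
y1 = nn.AdaptiveAvgPool2d((1, 1))(x).flatten(1)
y2 = x.mean(dim=(2, 3))

print(torch.allclose(y1, y2))  # True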

Adaptive Pooling

Adaptive pooling lets the user specify the desired output size instead of the kernel size and stride.

For example:

pool = nn.AdaptiveAvgPool2d((7, 7))

x = torch.randn(8, 512, 14, 14)
y = pool(x)

print(y.shape)  # torch.Size([8, 512, 7, 7])

The same layer can also handle other input sizes:

x = torch.randn(8, 512, 20, 20)
y = pool(x)

print(y.shape)  # torch.Size([8, 512, 7, 7])

Adaptive pooling is useful when input images may have different sizes, but the classifier head expects a fixed spatial size.

Common PyTorch adaptive pooling layers are:

Layer                     Meaning
nn.AdaptiveAvgPool2d      Average pooling to a target output size
nn.AdaptiveMaxPool2d      Max pooling to a target output size

Pooling and Backpropagation

Pooling participates in backpropagation.

For average pooling, gradients are distributed evenly across the entries in the pooling window. If a $2 \times 2$ average pool receives gradient $g$, each input entry receives

$$\frac{g}{4}.$$

For max pooling, the gradient flows only to the input entry that achieved the maximum. The other entries receive zero gradient.

For example, if

$$\begin{bmatrix} 1 & 3 \\ 5 & 2 \end{bmatrix}$$

is max-pooled, the output is $5$. During backpropagation, the gradient flows to the position containing $5$.

This behavior is one reason max pooling emphasizes the strongest feature responses.
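Both behaviors can be observed directly with autograd. A minimal sketch using a single $2 \times 2$ input:

# Max pooling: the gradient flows only to the argmax position.
x = torch.tensor([[[[1., 3.],
                    [5., 2.]]]], requires_grad=True)
nn.MaxPool2d(kernel_size=2)(x).sum().backward()
print(x.grad)
# tensor([[[[0., 0.],
#           [1., 0.]]]])

# Average pooling: the gradient is split evenly across the window.
x = torch.tensor([[[[1., 3.],
                    [5., 2.]]]], requires_grad=True)
nn.AvgPool2d(kernel_size=2)(x).sum().backward()
print(x.grad)
# tensor([[[[0.2500, 0.2500],
#           [0.2500, 0.2500]]]])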

Strided Convolution Versus Pooling

Pooling is not the only way to downsample. A convolution with stride greater than $1$ can also reduce spatial size.

Example:

downsample = nn.Conv2d(
    in_channels=64,
    out_channels=128,
    kernel_size=3,
    stride=2,
    padding=1,
)

This maps

$$[B, 64, 32, 32]$$

to

$$[B, 128, 16, 16].$$

Unlike pooling, strided convolution has learned weights and can change the number of channels. Modern architectures often use strided convolutions instead of pooling in some stages.
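Running the layer defined above confirms the shape change:

x = torch.randn(8, 64, 32, 32)
y = downsample(x)

print(y.shape)  # torch.Size([8, 128, 16, 16])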

The distinction is:

Operation              Learned weights    Changes channels    Downsamples
Max pooling            No                 No                  Yes
Average pooling        No                 No                  Yes
Strided convolution    Yes                Yes                 Yes

Pooling gives a fixed downsampling rule. Strided convolution learns how to downsample.

Pooling in Modern CNNs

Early CNNs used pooling heavily. Architectures such as LeNet, AlexNet, and VGG use pooling to reduce spatial size between convolutional blocks.

Later architectures often reduce explicit pooling. ResNet uses pooling early, then relies heavily on strided convolutions inside residual stages. Many modern CNNs and vision transformers use learned downsampling layers.

Global average pooling remains common. It provides a clean way to convert spatial feature maps into vector representations without adding many parameters.

Common Mistakes

A common mistake is forgetting that pooling does not change the channel dimension. If an input has 64 channels, max pooling still returns 64 channels.

Another common mistake is using too much pooling too early. This destroys spatial information before the network has learned useful features. For dense prediction tasks such as segmentation, excessive pooling can harm localization.

A third mistake is assuming pooling creates full invariance. Pooling gives only local robustness. The model may still be sensitive to object position, scale, rotation, and background.

Summary

Pooling reduces the spatial size of feature maps using a fixed local rule. Max pooling keeps the strongest activation in each window. Average pooling computes a local mean. Global average pooling summarizes each entire feature map into one value.

Pooling reduces computation, increases effective receptive field, and gives limited robustness to small shifts. It preserves the number of channels and changes only the spatial dimensions. In modern CNNs, pooling is often used together with or replaced by strided convolution, depending on the architecture.