Efficient convolutions reduce computation, memory use, or latency while preserving useful spatial modeling. They are important when models must run on mobile devices, edge hardware, browsers, real-time systems, or large-scale training clusters.
A standard convolution is powerful, but expensive. If an input has $C_{in}$ channels and an output has $C_{out}$ channels, a $K \times K$ convolution uses $C_{out} \cdot C_{in} \cdot K^2$ weights. It also performs this many multiply-add operations at every output spatial location.
Efficient convolution methods reduce this cost by changing how spatial mixing and channel mixing are performed.
Cost of a Standard Convolution
For an input tensor $x \in \mathbb{R}^{C_{in} \times H \times W}$, a standard convolution with $C_{out}$ output channels and a $K \times K$ kernel produces an output $y \in \mathbb{R}^{C_{out} \times H' \times W'}$.

Ignoring bias, the number of parameters is

$$C_{out} \cdot C_{in} \cdot K^2.$$

For one output image, the approximate multiply-add count is

$$C_{out} \cdot C_{in} \cdot K^2 \cdot H' \cdot W'.$$
Example: $C_{in} = 256$, $C_{out} = 256$, $K = 3$.

The parameter count is $256 \cdot 256 \cdot 9 = 589{,}824$.

If the output feature map is $32 \times 32$, the convolution performs roughly $589{,}824 \cdot 1024 \approx 6 \times 10^8$ multiply-adds. This is large, and CNNs contain many such layers.
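As a quick sanity check, these counts can be read off a PyTorch module directly; a minimal sketch, assuming the shapes above:

```python
import torch.nn as nn

# Standard 3x3 convolution, 256 input and 256 output channels.
conv = nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False)

params = sum(p.numel() for p in conv.parameters())
print(params)            # 589824 = 256 * 256 * 3 * 3
print(params * 32 * 32)  # 603979776, roughly 6e8 multiply-adds for a 32x32 map
```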
1×1 Convolution
A $1 \times 1$ convolution mixes channels without mixing neighboring spatial positions.

For each spatial location $(i, j)$, the input channel vector is $x_{:, i, j} \in \mathbb{R}^{C_{in}}$.

A $1 \times 1$ convolution applies the same linear map at every location:

$$y_{:, i, j} = W \, x_{:, i, j}, \qquad W \in \mathbb{R}^{C_{out} \times C_{in}}.$$

The weight tensor has shape $(C_{out}, C_{in}, 1, 1)$, so the parameter count is $C_{out} \cdot C_{in}$.
In PyTorch:
```python
import torch
import torch.nn as nn

conv = nn.Conv2d(
    in_channels=256,
    out_channels=64,
    kernel_size=1,
)
x = torch.randn(8, 256, 32, 32)
y = conv(x)
print(y.shape)  # torch.Size([8, 64, 32, 32])
```

A $1 \times 1$ convolution is commonly used to reduce channels before an expensive spatial convolution.
Bottleneck Convolutions
A bottleneck block uses $1 \times 1$ convolutions to reduce and then restore the channel dimension.

The pattern is

$$1 \times 1 \text{ reduce} \;\rightarrow\; 3 \times 3 \text{ conv} \;\rightarrow\; 1 \times 1 \text{ expand}.$$
The first layer reduces channels. The middle layer performs spatial processing on fewer channels. The final layer expands channels.
Example:
```python
class BottleneckConv(nn.Module):
    def __init__(self, in_channels, mid_channels, out_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(),
            nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(),
            nn.Conv2d(mid_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x):
        return self.net(x)
```

Suppose $C_{in} = C_{out} = 256$ and $C_{mid} = 64$.
A direct $3 \times 3$ convolution has $256 \cdot 256 \cdot 9 = 589{,}824$ weights.

The bottleneck version has $256 \cdot 64 + 64 \cdot 64 \cdot 9 + 64 \cdot 256 = 69{,}632$ weights.
This is much cheaper while still allowing spatial processing.
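To confirm the arithmetic, the two options can be compared with `numel`; a sketch assuming the `BottleneckConv` class above (its BatchNorm layers add a few extra parameters):

```python
direct = nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False)
bottleneck = BottleneckConv(256, 64, 256)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(direct))      # 589824
print(count(bottleneck))  # 70400 = 69632 conv weights + 768 BatchNorm params
```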
Grouped Convolution
Grouped convolution splits the input channels into groups. Each group is convolved separately. The outputs are then concatenated along the channel axis.
If there are $g$ groups, each group sees only $C_{in} / g$ input channels and produces $C_{out} / g$ output channels.

The parameter count becomes

$$C_{out} \cdot \frac{C_{in}}{g} \cdot K^2.$$
In PyTorch:
```python
conv = nn.Conv2d(
    in_channels=64,
    out_channels=128,
    kernel_size=3,
    padding=1,
    groups=4,
)
x = torch.randn(8, 64, 32, 32)
y = conv(x)
print(y.shape)  # torch.Size([8, 128, 32, 32])
```

Both `in_channels` and `out_channels` must be divisible by `groups`.
Grouped convolution reduces computation, but it also limits communication between channel groups. Later layers, often $1 \times 1$ convolutions, can mix information across groups.
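The division by the group count is easy to verify empirically; a minimal sketch using the same 64-to-128 channel shapes as above:

```python
# Parameter count is C_out * (C_in / g) * K^2, so it shrinks linearly with g.
for g in [1, 2, 4, 8]:
    conv = nn.Conv2d(64, 128, kernel_size=3, padding=1, groups=g, bias=False)
    print(g, sum(p.numel() for p in conv.parameters()))
# 1 73728
# 2 36864
# 4 18432
# 8 9216
```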
Depthwise Convolution
Depthwise convolution is the extreme case of grouped convolution where the number of groups equals the number of channels, $g = C_{in}$.
Each input channel gets its own spatial filter. There is no channel mixing inside the depthwise convolution.
In PyTorch:
```python
depthwise = nn.Conv2d(
    in_channels=64,
    out_channels=64,
    kernel_size=3,
    padding=1,
    groups=64,
)
x = torch.randn(8, 64, 32, 32)
y = depthwise(x)
print(y.shape)  # torch.Size([8, 64, 32, 32])
```

For a depthwise convolution with $C$ channels, the parameter count is $C \cdot K^2$.

This is far smaller than $C^2 \cdot K^2$ for a standard convolution with the same input and output channel count.
Depthwise Separable Convolution
Depthwise separable convolution splits a standard convolution into two parts:
- Depthwise convolution for spatial mixing.
- Pointwise convolution for channel mixing.
The block is

$$\text{depthwise } 3 \times 3 \;\rightarrow\; \text{pointwise } 1 \times 1.$$
In PyTorch:
```python
class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(
                in_channels,
                in_channels,
                kernel_size=3,
                stride=stride,
                padding=1,
                groups=in_channels,
                bias=False,
            ),
            nn.BatchNorm2d(in_channels),
            nn.ReLU(),
            nn.Conv2d(
                in_channels,
                out_channels,
                kernel_size=1,
                bias=False,
            ),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)
```

A standard convolution has $C_{out} \cdot C_{in} \cdot K^2$ weights.

A depthwise separable convolution has $C_{in} \cdot K^2 + C_{in} \cdot C_{out}$ weights.
For $C_{in} = C_{out} = 256$ and $K = 3$, the standard convolution has $589{,}824$ weights, while the depthwise separable convolution has $256 \cdot 9 + 256 \cdot 256 = 67{,}840$ weights.

This is about $8.7$ times fewer parameters.
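The same comparison in code; a sketch using bare convolutions, without the BatchNorm and ReLU layers of the block above:

```python
standard = nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False)
separable = nn.Sequential(
    nn.Conv2d(256, 256, kernel_size=3, padding=1, groups=256, bias=False),  # depthwise
    nn.Conv2d(256, 256, kernel_size=1, bias=False),                         # pointwise
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard))   # 589824
print(count(separable))  # 67840 = 2304 + 65536
```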
Inverted Residual Blocks
An inverted residual block is common in mobile CNNs. It uses the opposite shape pattern from a classical bottleneck.
A classical bottleneck compresses channels, processes spatially, then expands:

$$\text{wide} \;\rightarrow\; \text{narrow} \;\rightarrow\; \text{wide}.$$

An inverted residual block expands channels, applies a depthwise convolution, then projects back:

$$\text{narrow} \;\rightarrow\; \text{wide} \;\rightarrow\; \text{narrow}.$$

The structure is usually:
```python
class InvertedResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, expansion=4, stride=1):
        super().__init__()
        hidden_channels = in_channels * expansion
        self.use_shortcut = stride == 1 and in_channels == out_channels
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, hidden_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden_channels),
            nn.ReLU(),
            nn.Conv2d(
                hidden_channels,
                hidden_channels,
                kernel_size=3,
                stride=stride,
                padding=1,
                groups=hidden_channels,
                bias=False,
            ),
            nn.BatchNorm2d(hidden_channels),
            nn.ReLU(),
            nn.Conv2d(hidden_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x):
        y = self.block(x)
        if self.use_shortcut:
            y = y + x
        return y
```

The expansion gives the block enough channel capacity. The depthwise layer handles spatial structure cheaply. The projection returns to a compact representation.
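A short shape check, assuming the class above, shows when the residual shortcut is active:

```python
block = InvertedResidualBlock(64, 64, expansion=4, stride=1)
x = torch.randn(8, 64, 32, 32)
print(block(x).shape)  # torch.Size([8, 64, 32, 32]); shortcut is used

down = InvertedResidualBlock(64, 128, expansion=4, stride=2)
print(down(x).shape)   # torch.Size([8, 128, 16, 16]); no shortcut
```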
Channel Shuffle
Grouped convolutions reduce communication between groups. Channel shuffle is a simple operation that mixes channels across groups by reshaping and permuting the channel axis.
Suppose a tensor has shape $(B, C, H, W)$ and $g$ groups. We can view the channels as $(B, g, C/g, H, W)$, swap the group and within-group axes, then flatten back to $(B, C, H, W)$:
```python
def channel_shuffle(x, groups):
    b, c, h, w = x.shape
    assert c % groups == 0
    x = x.reshape(b, groups, c // groups, h, w)
    x = x.transpose(1, 2)
    x = x.reshape(b, c, h, w)
    return x
```

Channel shuffle is useful when grouped convolutions are stacked. It helps later groups receive information from earlier groups.
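A small demonstration makes the permutation visible; an illustrative sketch with 6 channels in 2 groups, where each channel is filled with its own index so the new order can be read off:

```python
x = torch.arange(6.0).reshape(1, 6, 1, 1).expand(1, 6, 2, 2)
y = channel_shuffle(x, groups=2)
print(y[0, :, 0, 0])  # tensor([0., 3., 1., 4., 2., 5.])
```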
Dilated Convolution as Efficient Context
Dilated convolution increases the receptive field without increasing kernel size. A dilation rate $d$ inserts $d - 1$ gaps between kernel taps, so a $K \times K$ kernel spans $(K - 1)\,d + 1$ positions along each axis.

A $3 \times 3$ kernel with dilation $2$ covers the spatial span of a $5 \times 5$ area but uses only nine weights.
In PyTorch:
```python
conv = nn.Conv2d(
    in_channels=64,
    out_channels=64,
    kernel_size=3,
    padding=2,
    dilation=2,
)
x = torch.randn(8, 64, 32, 32)
y = conv(x)
print(y.shape)  # torch.Size([8, 64, 32, 32])
```

Dilated convolutions are useful in segmentation, detection, audio models, and any task that needs larger context without aggressive downsampling.
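The span formula above is easy to tabulate; a tiny illustrative helper (hypothetical, not from any library):

```python
# Effective spatial span of a KxK kernel with dilation d: (K - 1) * d + 1.
def effective_span(k, d):
    return (k - 1) * d + 1

print(effective_span(3, 1))  # 3
print(effective_span(3, 2))  # 5, the 5x5 span mentioned above
print(effective_span(3, 4))  # 9
```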
Factorized Convolutions
A large convolution can sometimes be factorized into smaller operations.
For example, a $5 \times 5$ convolution can be replaced by two stacked $3 \times 3$ convolutions. This adds an extra nonlinearity and often reduces parameters.

Another factorization replaces a $K \times K$ convolution with a $K \times 1$ convolution followed by a $1 \times K$ convolution.

For a $7 \times 7$ convolution, this changes the spatial parameter count per channel pair from $49$ to $14$.
Example:
```python
factorized = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=(7, 1), padding=(3, 0), bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=(1, 7), padding=(0, 3), bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(),
)
```

Factorized convolutions reduce computation while preserving a large directional receptive field.
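Comparing the factorized pair against a full $7 \times 7$ convolution with the same channel counts; a sketch assuming the `factorized` module above:

```python
full = nn.Conv2d(64, 64, kernel_size=7, padding=3, bias=False)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(full))        # 200704 = 64 * 64 * 49
print(count(factorized))  # 57600 = 57344 conv weights + 256 BatchNorm params
```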
Squeeze-and-Excitation
Squeeze-and-excitation blocks improve channel efficiency by letting the network reweight channels dynamically.
First, global average pooling summarizes each channel:

$$s_c = \frac{1}{H W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_{c, i, j}.$$
Then a small network predicts channel weights. These weights scale the original feature maps.
```python
class SqueezeExcitation(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        hidden = channels // reduction
        self.pool = nn.AdaptiveAvgPool2d((1, 1))
        self.gate = nn.Sequential(
            nn.Flatten(),
            nn.Linear(channels, hidden),
            nn.ReLU(),
            nn.Linear(hidden, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        weights = self.pool(x)
        weights = self.gate(weights)
        weights = weights.reshape(b, c, 1, 1)
        return x * weights
```

This block adds modest cost and can improve accuracy by emphasizing useful channels and suppressing less useful ones.
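A quick usage check, assuming the class above; the block preserves shape and adds only a small number of parameters:

```python
se = SqueezeExcitation(channels=64, reduction=4)
x = torch.randn(8, 64, 32, 32)
print(se(x).shape)  # torch.Size([8, 64, 32, 32]); same shape, channels rescaled
print(sum(p.numel() for p in se.parameters()))  # 2128
```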
Efficient Blocks in Practice
An efficient convolutional block often combines several ideas:
```python
class EfficientConvBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1, expansion=4):
        super().__init__()
        hidden = in_channels * expansion
        self.use_shortcut = stride == 1 and in_channels == out_channels
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU(),
            nn.Conv2d(
                hidden,
                hidden,
                kernel_size=3,
                stride=stride,
                padding=1,
                groups=hidden,
                bias=False,
            ),
            nn.BatchNorm2d(hidden),
            nn.ReLU(),
            SqueezeExcitation(hidden),
            nn.Conv2d(hidden, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x):
        y = self.block(x)
        if self.use_shortcut:
            y = y + x
        return y
```

This block uses pointwise expansion, depthwise spatial filtering, channel reweighting, pointwise projection, and an optional residual path.
Efficiency Is Hardware-Dependent
Fewer parameters do not always mean faster inference. Actual speed depends on hardware, memory bandwidth, kernel implementation, batch size, tensor layout, and compiler support.
Depthwise convolutions have few arithmetic operations, but on some hardware they may be memory-bound. A standard convolution may run faster than expected because it maps well to optimized matrix multiplication kernels.
When optimizing a CNN, measure real latency:
```python
import time

import torch
import torch.nn as nn

def benchmark(module, x, steps=100):
    module.eval()
    # Warmup
    with torch.no_grad():
        for _ in range(10):
            module(x)
    if x.is_cuda:
        torch.cuda.synchronize()
    start = time.time()
    with torch.no_grad():
        for _ in range(steps):
            module(x)
    if x.is_cuda:
        torch.cuda.synchronize()
    return (time.time() - start) / steps

x = torch.randn(1, 64, 128, 128)
module = nn.Conv2d(64, 128, kernel_size=3, padding=1)
print(benchmark(module, x))
```

For deployment, benchmark the target device, not only the development machine.
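The same harness can compare a standard convolution against a depthwise separable pair; a sketch whose timings are illustrative only, since results vary by machine and backend:

```python
standard = nn.Conv2d(64, 128, kernel_size=3, padding=1)
separable = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64),  # depthwise
    nn.Conv2d(64, 128, kernel_size=1),                       # pointwise
)

x = torch.randn(1, 64, 128, 128)
print(benchmark(standard, x))
print(benchmark(separable, x))
# The separable pair has roughly 8x fewer multiply-adds, but measured
# latency may not improve by 8x on every backend.
```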
Choosing an Efficient Convolution
The right efficient convolution depends on the constraint.
| Constraint | Useful method |
|---|---|
| Reduce parameters | Bottlenecks, depthwise separable convolution |
| Reduce FLOPs | Depthwise separable convolution, grouped convolution |
| Increase receptive field | Dilated convolution, factorized large kernels |
| Mobile inference | Inverted residual blocks |
| Preserve accuracy | Squeeze-and-excitation, residual connections |
| Reduce memory | Smaller channels, lower resolution, checkpointing |
| Improve latency | Benchmark hardware-specific kernels |
For small models, overhead may dominate. For large models, arithmetic cost may dominate. The best choice should be tested on the actual workload.
Summary
Efficient convolutions reduce the cost of CNNs by changing how channels and spatial information are processed. A $1 \times 1$ convolution mixes channels cheaply. Grouped and depthwise convolutions reduce channel coupling. Depthwise separable convolution separates spatial filtering from channel mixing. Bottleneck and inverted residual blocks use these operations to build efficient networks.
Efficient CNN design is a tradeoff among accuracy, parameter count, FLOPs, memory traffic, and hardware latency. The mathematical operation matters, but the final decision should be based on measured performance on the target device.