Efficient Convolutions

Efficient convolutions reduce computation, memory use, or latency while preserving useful spatial modeling. They are important when models must run on mobile devices, edge hardware, browsers, real-time systems, or large-scale training clusters.

A standard convolution is powerful but expensive. If an input has $C_{\text{in}}$ channels and an output has $C_{\text{out}}$ channels, a $k \times k$ convolution uses

$$C_{\text{out}} C_{\text{in}} k^2$$

weights. It also performs this many multiply-add operations at every output spatial location.

Efficient convolution methods reduce this cost by changing how spatial mixing and channel mixing are performed.

Cost of a Standard Convolution

For an input tensor

$$X \in \mathbb{R}^{B \times C_{\text{in}} \times H \times W},$$

a standard convolution with $C_{\text{out}}$ output channels and a $k \times k$ kernel produces

$$Y \in \mathbb{R}^{B \times C_{\text{out}} \times H_{\text{out}} \times W_{\text{out}}}.$$

Ignoring bias, the number of parameters is

$$C_{\text{out}} C_{\text{in}} k^2.$$

For one output image, the approximate multiply-add count is

$$H_{\text{out}} W_{\text{out}} C_{\text{out}} C_{\text{in}} k^2.$$

Example:

$$C_{\text{in}}=64,\quad C_{\text{out}}=128,\quad k=3.$$

The parameter count is

$$128 \cdot 64 \cdot 3^2 = 73{,}728.$$

If the output feature map is $56 \times 56$, the convolution performs roughly

$$56 \cdot 56 \cdot 128 \cdot 64 \cdot 9 \approx 2.3 \times 10^8$$

multiply-adds, about 231 million for this single layer, and CNNs contain many such layers.
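
These numbers are easy to verify in PyTorch. The snippet below reads the parameter count off the layer and applies the multiply-add formula above; it is a quick sanity check, not a profiler:

import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1, bias=False)

# Parameter count: C_out * C_in * k^2
print(conv.weight.numel())  # 73728

# Approximate multiply-adds for a 56x56 output feature map
h_out = w_out = 56
print(h_out * w_out * conv.weight.numel())  # 231211008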

1×1 Convolution

A $1 \times 1$ convolution mixes channels without mixing neighboring spatial positions.

For each spatial location $(i,j)$, the input channel vector is

$$x_{i,j} \in \mathbb{R}^{C_{\text{in}}}.$$

A $1 \times 1$ convolution applies the same linear map at every location:

$$y_{i,j} = W x_{i,j} + b.$$

The weight tensor has shape

$$[C_{\text{out}}, C_{\text{in}}, 1, 1].$$

So the parameter count is

$$C_{\text{out}} C_{\text{in}}.$$

In PyTorch:

import torch
import torch.nn as nn

conv = nn.Conv2d(
    in_channels=256,
    out_channels=64,
    kernel_size=1,
)

x = torch.randn(8, 256, 32, 32)
y = conv(x)

print(y.shape)  # torch.Size([8, 64, 32, 32])

A $1 \times 1$ convolution is commonly used to reduce channels before an expensive spatial convolution.

Bottleneck Convolutions

A bottleneck block uses $1 \times 1$ convolutions to reduce and then restore the channel dimension.

The pattern is

$$1 \times 1 \rightarrow 3 \times 3 \rightarrow 1 \times 1.$$

The first layer reduces channels. The middle layer performs spatial processing on fewer channels. The final layer expands channels.

Example:

class BottleneckConv(nn.Module):
    def __init__(self, in_channels, mid_channels, out_channels):
        super().__init__()

        self.net = nn.Sequential(
            # 1x1: reduce channels
            nn.Conv2d(in_channels, mid_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(),

            # 3x3: spatial processing on the reduced channels
            nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(),

            # 1x1: expand back to the output channel count
            nn.Conv2d(mid_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x):
        return self.net(x)

Suppose

$$C_{\text{in}} = C_{\text{out}} = 256, \quad C_{\text{mid}} = 64.$$

A direct $3 \times 3$ convolution has

$$256 \cdot 256 \cdot 9 = 589{,}824$$

weights.

The bottleneck version has

$$256 \cdot 64 + 64 \cdot 64 \cdot 9 + 64 \cdot 256 = 69{,}632$$

weights.

This is much cheaper while still allowing spatial processing.
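
A quick check using the BottleneckConv class above, counting only the convolution weights (the BatchNorm layers add a few extra parameters not included in the formula):

block = BottleneckConv(in_channels=256, mid_channels=64, out_channels=256)

# Sum only convolution weights to match the hand count above
conv_weights = sum(
    m.weight.numel() for m in block.modules() if isinstance(m, nn.Conv2d)
)
print(conv_weights)  # 69632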

Grouped Convolution

Grouped convolution splits the input channels into groups. Each group is convolved separately. The outputs are then concatenated along the channel axis.

If there are $g$ groups, each group sees only $C_{\text{in}} / g$ input channels and produces $C_{\text{out}} / g$ output channels.

The parameter count becomes

$$\frac{C_{\text{out}} C_{\text{in}} k^2}{g}.$$

In PyTorch:

conv = nn.Conv2d(
    in_channels=64,
    out_channels=128,
    kernel_size=3,
    padding=1,
    groups=4,
)

x = torch.randn(8, 64, 32, 32)
y = conv(x)

print(y.shape)  # torch.Size([8, 128, 32, 32])

Both in_channels and out_channels must be divisible by groups.
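
Continuing the snippet above, the grouping is visible in the weight shape: each of the 128 filters sees only $64 / 4 = 16$ input channels.

print(conv.weight.shape)    # torch.Size([128, 16, 3, 3])
print(conv.weight.numel())  # 18432 = (128 * 64 * 9) / 4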

Grouped convolution reduces computation, but it also limits communication between channel groups. Later layers, often $1 \times 1$ convolutions, can mix information across groups.

Depthwise Convolution

Depthwise convolution is the extreme case of grouped convolution where

$$g = C_{\text{in}}.$$

Each input channel gets its own spatial filter. There is no channel mixing inside the depthwise convolution.

In PyTorch:

depthwise = nn.Conv2d(
    in_channels=64,
    out_channels=64,
    kernel_size=3,
    padding=1,
    groups=64,
)

x = torch.randn(8, 64, 32, 32)
y = depthwise(x)

print(y.shape)  # torch.Size([8, 64, 32, 32])

For a $3 \times 3$ depthwise convolution with $C$ channels, the parameter count is

$$C \cdot 3^2.$$

This is far smaller than

$$C^2 \cdot 3^2$$

for a standard convolution with the same input and output channel count.
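
Continuing the depthwise example above, each of the 64 filters has a single input channel:

print(depthwise.weight.shape)    # torch.Size([64, 1, 3, 3])
print(depthwise.weight.numel())  # 576 = 64 * 9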

Depthwise Separable Convolution

Depthwise separable convolution splits standard convolution into two parts:

  1. Depthwise convolution for spatial mixing.
  2. Pointwise $1 \times 1$ convolution for channel mixing.

The block is

$$\text{depthwise } 3 \times 3 \rightarrow \text{pointwise } 1 \times 1.$$

In PyTorch:

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()

        self.net = nn.Sequential(
            # Depthwise 3x3: one spatial filter per input channel
            nn.Conv2d(
                in_channels,
                in_channels,
                kernel_size=3,
                stride=stride,
                padding=1,
                groups=in_channels,
                bias=False,
            ),
            nn.BatchNorm2d(in_channels),
            nn.ReLU(),

            # Pointwise 1x1: mix channels
            nn.Conv2d(
                in_channels,
                out_channels,
                kernel_size=1,
                bias=False,
            ),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

A standard $3 \times 3$ convolution has

$$C_{\text{in}} C_{\text{out}} \cdot 9$$

weights.

A depthwise separable convolution has

$$C_{\text{in}} \cdot 9 + C_{\text{in}} C_{\text{out}}$$

weights.

For

$$C_{\text{in}}=64,\quad C_{\text{out}}=128,$$

standard convolution has

$$64 \cdot 128 \cdot 9 = 73{,}728$$

weights.

Depthwise separable convolution has

$$64 \cdot 9 + 64 \cdot 128 = 8{,}768$$

weights.

This is roughly 8.4 times fewer parameters.
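
The ratio can be verified with the DepthwiseSeparableConv class above, counting only convolution weights since the block also contains BatchNorm layers:

standard = nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False)
separable = DepthwiseSeparableConv(64, 128)

standard_params = standard.weight.numel()
separable_params = sum(
    m.weight.numel() for m in separable.modules() if isinstance(m, nn.Conv2d)
)
print(standard_params, separable_params)   # 73728 8768
print(standard_params / separable_params)  # about 8.41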

Inverted Residual Blocks

An inverted residual block is common in mobile CNNs. It uses the opposite shape pattern from a classical bottleneck.

A classical bottleneck compresses channels, processes spatially, then expands:

$$\text{wide} \rightarrow \text{narrow} \rightarrow \text{wide}.$$

An inverted residual block expands channels, applies depthwise convolution, then projects back:

$$\text{narrow} \rightarrow \text{wide} \rightarrow \text{narrow}.$$

The structure is usually:

$$1 \times 1 \text{ expansion} \rightarrow 3 \times 3 \text{ depthwise} \rightarrow 1 \times 1 \text{ projection}.$$

In PyTorch:

class InvertedResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, expansion=4, stride=1):
        super().__init__()

        hidden_channels = in_channels * expansion
        # Residual shortcut only when input and output shapes match
        self.use_shortcut = stride == 1 and in_channels == out_channels

        self.block = nn.Sequential(
            # 1x1 expansion
            nn.Conv2d(in_channels, hidden_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden_channels),
            nn.ReLU(),

            # 3x3 depthwise on the expanded channels
            nn.Conv2d(
                hidden_channels,
                hidden_channels,
                kernel_size=3,
                stride=stride,
                padding=1,
                groups=hidden_channels,
                bias=False,
            ),
            nn.BatchNorm2d(hidden_channels),
            nn.ReLU(),

            # 1x1 projection, no activation afterward
            nn.Conv2d(hidden_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x):
        y = self.block(x)
        if self.use_shortcut:
            y = y + x
        return y

The expansion gives the block enough channel capacity. The depthwise layer handles spatial structure cheaply. The projection returns to a compact representation.
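
A quick shape check with the class above; the residual shortcut applies only when the stride is 1 and the channel counts match:

block = InvertedResidualBlock(in_channels=64, out_channels=64, stride=1)
x = torch.randn(8, 64, 32, 32)
print(block(x).shape)  # torch.Size([8, 64, 32, 32]), shortcut active

down = InvertedResidualBlock(in_channels=64, out_channels=96, stride=2)
print(down(x).shape)  # torch.Size([8, 96, 16, 16]), no shortcut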

Channel Shuffle

Grouped convolutions reduce communication between groups. Channel shuffle is a simple operation that mixes channels across groups by reshaping and permuting the channel axis.

Suppose a tensor has shape

$$[B, C, H, W]$$

and $g$ groups. We can view the channels as

$$[B, g, C/g, H, W],$$

swap the group and within-group axes, then flatten back to

$$[B, C, H, W].$$

In PyTorch:

def channel_shuffle(x, groups):
    b, c, h, w = x.shape
    assert c % groups == 0

    # [B, C, H, W] -> [B, g, C/g, H, W]
    x = x.reshape(b, groups, c // groups, h, w)
    # Swap the group and within-group axes: [B, C/g, g, H, W]
    x = x.transpose(1, 2)
    # Flatten back to [B, C, H, W] (reshape copies if non-contiguous)
    x = x.reshape(b, c, h, w)
    return x

Channel shuffle is useful when grouped convolutions are stacked. It helps later groups receive information from earlier groups.
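
Since channel shuffle is a fixed permutation, shuffling with $g$ groups is undone by shuffling with $C/g$ groups, which makes it easy to test:

x = torch.randn(2, 12, 4, 4)

y = channel_shuffle(x, groups=3)
z = channel_shuffle(y, groups=4)  # 12 / 3 = 4 groups inverts the shuffle

print(torch.equal(x, z))  # True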

Dilated Convolution as Efficient Context

Dilated convolution increases the receptive field without increasing kernel size. A dilation rate $d$ inserts gaps between kernel positions.

A $3 \times 3$ kernel with dilation $2$ covers the spatial span of a $5 \times 5$ area but uses only nine weights.

In PyTorch:

conv = nn.Conv2d(
    in_channels=64,
    out_channels=64,
    kernel_size=3,
    padding=2,
    dilation=2,
)

x = torch.randn(8, 64, 32, 32)
y = conv(x)

print(y.shape)  # torch.Size([8, 64, 32, 32])

Dilated convolutions are useful in segmentation, detection, audio models, and any task that needs larger context without aggressive downsampling.

Factorized Convolutions

A large convolution can sometimes be factorized into smaller operations.

For example, a $5 \times 5$ convolution can be replaced by two $3 \times 3$ convolutions. This adds an extra nonlinearity and often reduces parameters.
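
A minimal sketch of this: two stacked $3 \times 3$ convolutions cover the same $5 \times 5$ receptive field with $2 \cdot 9 \cdot C^2$ weights instead of $25 \cdot C^2$:

direct = nn.Conv2d(64, 64, kernel_size=5, padding=2, bias=False)

stacked = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),
    nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),
)

print(direct.weight.numel())  # 102400 = 64 * 64 * 25
print(sum(p.numel() for p in stacked.parameters()))  # 73728 = 2 * 64 * 64 * 9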

Another factorization replaces a $k \times k$ convolution with:

$$k \times 1 \rightarrow 1 \times k.$$

For a $7 \times 7$ convolution, this changes the spatial parameter count from $49$ to $14$.

Example:

factorized = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=(7, 1), padding=(3, 0), bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(),

    nn.Conv2d(64, 64, kernel_size=(1, 7), padding=(0, 3), bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(),
)

Factorized convolutions reduce computation while preserving a large directional receptive field.

Squeeze-and-Excitation

Squeeze-and-excitation blocks improve channel efficiency by letting the network reweight channels dynamically.

First, global average pooling summarizes each channel:

$$z_c = \frac{1}{HW} \sum_i \sum_j X_{c,i,j}.$$

Then a small network predicts channel weights. These weights scale the original feature maps.

class SqueezeExcitation(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()

        hidden = channels // reduction

        # Squeeze: global average pool to one value per channel
        self.pool = nn.AdaptiveAvgPool2d((1, 1))
        # Excitation: small gate producing per-channel weights in (0, 1)
        self.gate = nn.Sequential(
            nn.Flatten(),
            nn.Linear(channels, hidden),
            nn.ReLU(),
            nn.Linear(hidden, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, h, w = x.shape

        weights = self.pool(x)
        weights = self.gate(weights)
        weights = weights.reshape(b, c, 1, 1)

        # Rescale each channel of the input
        return x * weights

This block adds modest cost and can improve accuracy by emphasizing useful channels and suppressing less useful ones.
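
A short usage check with the class above; the output shape matches the input, only the per-channel scaling changes:

se = SqueezeExcitation(channels=64)
x = torch.randn(8, 64, 32, 32)
print(se(x).shape)  # torch.Size([8, 64, 32, 32])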

Efficient Blocks in Practice

An efficient convolutional block often combines several ideas:

class EfficientConvBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1, expansion=4):
        super().__init__()

        hidden = in_channels * expansion
        self.use_shortcut = stride == 1 and in_channels == out_channels

        self.block = nn.Sequential(
            # 1x1 pointwise expansion
            nn.Conv2d(in_channels, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU(),

            # 3x3 depthwise spatial filtering
            nn.Conv2d(
                hidden,
                hidden,
                kernel_size=3,
                stride=stride,
                padding=1,
                groups=hidden,
                bias=False,
            ),
            nn.BatchNorm2d(hidden),
            nn.ReLU(),

            # Channel reweighting
            SqueezeExcitation(hidden),

            # 1x1 pointwise projection
            nn.Conv2d(hidden, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x):
        y = self.block(x)

        if self.use_shortcut:
            y = y + x

        return y

This block uses pointwise expansion, depthwise spatial filtering, channel reweighting, pointwise projection, and an optional residual path.
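
A usage sketch with the class above, confirming the output shape and the total parameter count:

block = EfficientConvBlock(in_channels=64, out_channels=64)
x = torch.randn(8, 64, 32, 32)

print(block(x).shape)  # torch.Size([8, 64, 32, 32])
print(sum(p.numel() for p in block.parameters()))  # total, including BatchNorm and SE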

Efficiency Is Hardware-Dependent

Fewer parameters do not always mean faster inference. Actual speed depends on hardware, memory bandwidth, kernel implementation, batch size, tensor layout, and compiler support.

Depthwise convolutions have few arithmetic operations, but on some hardware they may be memory-bound. A standard convolution may run faster than expected because it maps well to optimized matrix multiplication kernels.

When optimizing a CNN, measure real latency:

import time
import torch
import torch.nn as nn

def benchmark(module, x, steps=100):
    module.eval()

    # Warmup so one-time setup costs do not skew the measurement
    with torch.no_grad():
        for _ in range(10):
            module(x)

    # GPU kernels run asynchronously; synchronize before and after timing
    if x.is_cuda:
        torch.cuda.synchronize()

    start = time.perf_counter()

    with torch.no_grad():
        for _ in range(steps):
            module(x)

    if x.is_cuda:
        torch.cuda.synchronize()

    return (time.perf_counter() - start) / steps

x = torch.randn(1, 64, 128, 128)
module = nn.Conv2d(64, 128, kernel_size=3, padding=1)

print(benchmark(module, x))

For deployment, benchmark the target device, not only the development machine.

Choosing an Efficient Convolution

The right efficient convolution depends on the constraint.

| Constraint | Useful method |
| --- | --- |
| Reduce parameters | Bottlenecks, depthwise separable convolution |
| Reduce FLOPs | Depthwise separable convolution, grouped convolution |
| Increase receptive field | Dilated convolution, factorized large kernels |
| Mobile inference | Inverted residual blocks |
| Preserve accuracy | Squeeze-and-excitation, residual connections |
| Reduce memory | Smaller channels, lower resolution, checkpointing |
| Improve latency | Benchmark hardware-specific kernels |

For small models, overhead may dominate. For large models, arithmetic cost may dominate. The best choice should be tested on the actual workload.

Summary

Efficient convolutions reduce the cost of CNNs by changing how channels and spatial information are processed. A $1 \times 1$ convolution mixes channels cheaply. Grouped and depthwise convolutions reduce channel coupling. Depthwise separable convolution separates spatial filtering from channel mixing. Bottleneck and inverted residual blocks use these operations to build efficient networks.

Efficient CNN design is a tradeoff among accuracy, parameter count, FLOPs, memory traffic, and hardware latency. The mathematical operation matters, but the final decision should be based on measured performance on the target device.