Efficient Convolutions

Efficient convolutions reduce computation, memory use, or latency while preserving useful spatial modeling. They are important when models must run on mobile devices, edge hardware, browsers, real-time systems, or large-scale training clusters.

A standard convolution is powerful but expensive. If an input has $C_{\text{in}}$ channels and an output has $C_{\text{out}}$ channels, a $k \times k$ convolution uses

$$C_{\text{out}} C_{\text{in}} k^2$$

weights. It also performs this many multiply-add operations at every output spatial location.

Efficient convolution methods reduce this cost by changing how spatial mixing and channel mixing are performed.

Cost of a Standard Convolution

For an input tensor

$$X \in \mathbb{R}^{B \times C_{\text{in}} \times H \times W},$$

a standard convolution with $C_{\text{out}}$ output channels and a $k \times k$ kernel produces

$$Y \in \mathbb{R}^{B \times C_{\text{out}} \times H_{\text{out}} \times W_{\text{out}}}.$$

Ignoring bias, the number of parameters is

$$C_{\text{out}} C_{\text{in}} k^2.$$

For one output image, the approximate multiply-add count is

$$H_{\text{out}} W_{\text{out}} C_{\text{out}} C_{\text{in}} k^2.$$

Example:

$$C_{\text{in}}=64,\quad C_{\text{out}}=128,\quad k=3.$$

The parameter count is

$$128 \cdot 64 \cdot 3^2 = 73{,}728.$$

If the output feature map is $56 \times 56$, the convolution performs roughly

$$56 \cdot 56 \cdot 128 \cdot 64 \cdot 9 \approx 2.3 \times 10^8$$

multiply-adds, about 231 million for this single layer, and CNNs contain many such layers.
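
These numbers are easy to verify in PyTorch. The snippet below reads the parameter count off the layer and applies the multiply-add formula above; it is a quick sanity check, not a profiler:

import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1, bias=False)

# Parameter count: C_out * C_in * k^2
print(conv.weight.numel())  # 73728

# Approximate multiply-adds for a 56x56 output feature map
h_out = w_out = 56
print(h_out * w_out * conv.weight.numel())  # 231211008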

1×1 Convolution

A $1 \times 1$ convolution mixes channels without mixing neighboring spatial positions.

For each spatial location $(i,j)$, the input channel vector is

$$x_{i,j} \in \mathbb{R}^{C_{\text{in}}}.$$

A $1 \times 1$ convolution applies the same linear map at every location:

$$y_{i,j} = W x_{i,j} + b.$$

The weight tensor has shape

$$[C_{\text{out}}, C_{\text{in}}, 1, 1].$$

So the parameter count is

$$C_{\text{out}} C_{\text{in}}.$$

In PyTorch:

import torch
import torch.nn as nn

conv = nn.Conv2d(
    in_channels=256,
    out_channels=64,
    kernel_size=1,
)

x = torch.randn(8, 256, 32, 32)
y = conv(x)

print(y.shape)  # torch.Size([8, 64, 32, 32])

A $1 \times 1$ convolution is commonly used to reduce channels before an expensive spatial convolution.

Bottleneck Convolutions

A bottleneck block uses $1 \times 1$ convolutions to reduce and then restore the channel dimension.

The pattern is

$$1 \times 1 \rightarrow 3 \times 3 \rightarrow 1 \times 1.$$

The first layer reduces channels. The middle layer performs spatial processing on fewer channels. The final layer expands channels.

Example:

class BottleneckConv(nn.Module):
    def __init__(self, in_channels, mid_channels, out_channels):
        super().__init__()

        self.net = nn.Sequential(
            # 1x1: reduce channels
            nn.Conv2d(in_channels, mid_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(),

            # 3x3: spatial processing on the reduced channels
            nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(),

            # 1x1: expand back to the output channel count
            nn.Conv2d(mid_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x):
        return self.net(x)

Suppose

$$C_{\text{in}} = C_{\text{out}} = 256, \quad C_{\text{mid}} = 64.$$

A direct $3 \times 3$ convolution has

$$256 \cdot 256 \cdot 9 = 589{,}824$$

weights.

The bottleneck version has

$$256 \cdot 64 + 64 \cdot 64 \cdot 9 + 64 \cdot 256 = 69{,}632$$

weights.

This is much cheaper while still allowing spatial processing.
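
A quick check using the BottleneckConv class above, counting only the convolution weights (the BatchNorm layers add a few extra parameters not included in the formula):

block = BottleneckConv(in_channels=256, mid_channels=64, out_channels=256)

# Sum only convolution weights to match the hand count above
conv_weights = sum(
    m.weight.numel() for m in block.modules() if isinstance(m, nn.Conv2d)
)
print(conv_weights)  # 69632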

Grouped Convolution

Grouped convolution splits the input channels into groups. Each group is convolved separately. The outputs are then concatenated along the channel axis.

If there are $g$ groups, each group sees only $C_{\text{in}} / g$ input channels and produces $C_{\text{out}} / g$ output channels.

The parameter count becomes

$$\frac{C_{\text{out}} C_{\text{in}} k^2}{g}.$$

In PyTorch:

conv = nn.Conv2d(
    in_channels=64,
    out_channels=128,
    kernel_size=3,
    padding=1,
    groups=4,
)

x = torch.randn(8, 64, 32, 32)
y = conv(x)

print(y.shape)  # torch.Size([8, 128, 32, 32])

Both in_channels and out_channels must be divisible by groups.
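
Continuing the snippet above, the grouping is visible in the weight shape: each of the 128 filters sees only $64 / 4 = 16$ input channels.

print(conv.weight.shape)    # torch.Size([128, 16, 3, 3])
print(conv.weight.numel())  # 18432 = (128 * 64 * 9) / 4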

Grouped convolution reduces computation, but it also limits communication between channel groups. Later layers, often $1 \times 1$ convolutions, can mix information across groups.

Depthwise Convolution

Depthwise convolution is the extreme case of grouped convolution where

$$g = C_{\text{in}}.$$

Each input channel gets its own spatial filter. There is no channel mixing inside the depthwise convolution.

In PyTorch:

depthwise = nn.Conv2d(
    in_channels=64,
    out_channels=64,
    kernel_size=3,
    padding=1,
    groups=64,
)

x = torch.randn(8, 64, 32, 32)
y = depthwise(x)

print(y.shape)  # torch.Size([8, 64, 32, 32])

For a $3 \times 3$ depthwise convolution with $C$ channels, the parameter count is

$$C \cdot 3^2.$$

This is far smaller than

$$C^2 \cdot 3^2$$

for a standard convolution with the same input and output channel count.
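
Continuing the depthwise example above, each of the 64 filters has a single input channel:

print(depthwise.weight.shape)    # torch.Size([64, 1, 3, 3])
print(depthwise.weight.numel())  # 576 = 64 * 9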

Depthwise Separable Convolution

Depthwise separable convolution splits standard convolution into two parts:

  1. Depthwise convolution for spatial mixing.
  2. Pointwise $1 \times 1$ convolution for channel mixing.

The block is

$$\text{depthwise } 3 \times 3 \rightarrow \text{pointwise } 1 \times 1.$$

In PyTorch:

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()

        self.net = nn.Sequential(
            # Depthwise 3x3: one spatial filter per input channel
            nn.Conv2d(
                in_channels,
                in_channels,
                kernel_size=3,
                stride=stride,
                padding=1,
                groups=in_channels,
                bias=False,
            ),
            nn.BatchNorm2d(in_channels),
            nn.ReLU(),

            # Pointwise 1x1: mix channels
            nn.Conv2d(
                in_channels,
                out_channels,
                kernel_size=1,
                bias=False,
            ),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

A standard $3 \times 3$ convolution has

$$C_{\text{in}} C_{\text{out}} \cdot 9$$

weights.

A depthwise separable convolution has

$$C_{\text{in}} \cdot 9 + C_{\text{in}} C_{\text{out}}$$

weights.

For

$$C_{\text{in}}=64,\quad C_{\text{out}}=128,$$

standard convolution has

$$64 \cdot 128 \cdot 9 = 73{,}728$$

weights.

Depthwise separable convolution has

$$64 \cdot 9 + 64 \cdot 128 = 8{,}768$$

weights.

This is roughly 8.4 times fewer parameters.
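
The ratio can be verified with the DepthwiseSeparableConv class above, counting only convolution weights since the block also contains BatchNorm layers:

standard = nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False)
separable = DepthwiseSeparableConv(64, 128)

standard_params = standard.weight.numel()
separable_params = sum(
    m.weight.numel() for m in separable.modules() if isinstance(m, nn.Conv2d)
)
print(standard_params, separable_params)   # 73728 8768
print(standard_params / separable_params)  # about 8.41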

Inverted Residual Blocks

An inverted residual block is common in mobile CNNs. It uses the opposite shape pattern from a classical bottleneck.

A classical bottleneck compresses channels, processes spatially, then expands:

$$\text{wide} \rightarrow \text{narrow} \rightarrow \text{wide}.$$

An inverted residual block expands channels, applies depthwise convolution, then projects back:

$$\text{narrow} \rightarrow \text{wide} \rightarrow \text{narrow}.$$

The structure is usually:

$$1 \times 1 \text{ expansion} \rightarrow 3 \times 3 \text{ depthwise} \rightarrow 1 \times 1 \text{ projection}.$$

In PyTorch:

class InvertedResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, expansion=4, stride=1):
        super().__init__()

        hidden_channels = in_channels * expansion
        # Residual shortcut only when input and output shapes match
        self.use_shortcut = stride == 1 and in_channels == out_channels

        self.block = nn.Sequential(
            # 1x1 expansion
            nn.Conv2d(in_channels, hidden_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden_channels),
            nn.ReLU(),

            # 3x3 depthwise on the expanded channels
            nn.Conv2d(
                hidden_channels,
                hidden_channels,
                kernel_size=3,
                stride=stride,
                padding=1,
                groups=hidden_channels,
                bias=False,
            ),
            nn.BatchNorm2d(hidden_channels),
            nn.ReLU(),

            # 1x1 projection, no activation afterward
            nn.Conv2d(hidden_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x):
        y = self.block(x)
        if self.use_shortcut:
            y = y + x
        return y

The expansion gives the block enough channel capacity. The depthwise layer handles spatial structure cheaply. The projection returns to a compact representation.
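
A quick shape check with the class above; the residual shortcut applies only when the stride is 1 and the channel counts match:

block = InvertedResidualBlock(in_channels=64, out_channels=64, stride=1)
x = torch.randn(8, 64, 32, 32)
print(block(x).shape)  # torch.Size([8, 64, 32, 32]), shortcut active

down = InvertedResidualBlock(in_channels=64, out_channels=96, stride=2)
print(down(x).shape)  # torch.Size([8, 96, 16, 16]), no shortcut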

Channel Shuffle

Grouped convolutions reduce communication between groups. Channel shuffle is a simple operation that mixes channels across groups by reshaping and permuting the channel axis.

Suppose a tensor has shape

$$[B, C, H, W]$$

and $g$ groups. We can view the channels as

$$[B, g, C/g, H, W],$$

swap the group and within-group axes, then flatten back to

$$[B, C, H, W].$$

In PyTorch:

def channel_shuffle(x, groups):
    b, c, h, w = x.shape
    assert c % groups == 0

    # [B, C, H, W] -> [B, g, C/g, H, W]
    x = x.reshape(b, groups, c // groups, h, w)
    # Swap the group and within-group axes: [B, C/g, g, H, W]
    x = x.transpose(1, 2)
    # Flatten back to [B, C, H, W] (reshape copies if non-contiguous)
    x = x.reshape(b, c, h, w)
    return x

Channel shuffle is useful when grouped convolutions are stacked. It helps later groups receive information from earlier groups.
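
Since channel shuffle is a fixed permutation, shuffling with $g$ groups is undone by shuffling with $C/g$ groups, which makes it easy to test:

x = torch.randn(2, 12, 4, 4)

y = channel_shuffle(x, groups=3)
z = channel_shuffle(y, groups=4)  # 12 / 3 = 4 groups inverts the shuffle

print(torch.equal(x, z))  # True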

Dilated Convolution as Efficient Context

Dilated convolution increases the receptive field without increasing kernel size. A dilation rate $d$ inserts gaps between kernel positions.

A $3 \times 3$ kernel with dilation $2$ covers the spatial span of a $5 \times 5$ area but uses only nine weights.

In PyTorch:

conv = nn.Conv2d(
    in_channels=64,
    out_channels=64,
    kernel_size=3,
    padding=2,
    dilation=2,
)

x = torch.randn(8, 64, 32, 32)
y = conv(x)

print(y.shape)  # torch.Size([8, 64, 32, 32])

Dilated convolutions are useful in segmentation, detection, audio models, and any task that needs larger context without aggressive downsampling.

Factorized Convolutions

A large convolution can sometimes be factorized into smaller operations.

For example, a $5 \times 5$ convolution can be replaced by two $3 \times 3$ convolutions. This adds an extra nonlinearity and often reduces parameters.
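
A minimal sketch of this: two stacked $3 \times 3$ convolutions cover the same $5 \times 5$ receptive field with $2 \cdot 9 \cdot C^2$ weights instead of $25 \cdot C^2$:

direct = nn.Conv2d(64, 64, kernel_size=5, padding=2, bias=False)

stacked = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),
    nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),
)

print(direct.weight.numel())  # 102400 = 64 * 64 * 25
print(sum(p.numel() for p in stacked.parameters()))  # 73728 = 2 * 64 * 64 * 9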

Another factorization replaces a $k \times k$ convolution with:

$$k \times 1 \rightarrow 1 \times k.$$

For a $7 \times 7$ convolution, this changes the spatial parameter count from $49$ to $14$.

Example:

factorized = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=(7, 1), padding=(3, 0), bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(),

    nn.Conv2d(64, 64, kernel_size=(1, 7), padding=(0, 3), bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(),
)

Factorized convolutions reduce computation while preserving a large directional receptive field.

Squeeze-and-Excitation

Squeeze-and-excitation blocks improve channel efficiency by letting the network reweight channels dynamically.

First, global average pooling summarizes each channel:

$$z_c = \frac{1}{HW} \sum_i \sum_j X_{c,i,j}.$$

Then a small network predicts channel weights. These weights scale the original feature maps.

class SqueezeExcitation(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()

        hidden = channels // reduction

        # Squeeze: global average pool to one value per channel
        self.pool = nn.AdaptiveAvgPool2d((1, 1))
        # Excitation: small gate producing per-channel weights in (0, 1)
        self.gate = nn.Sequential(
            nn.Flatten(),
            nn.Linear(channels, hidden),
            nn.ReLU(),
            nn.Linear(hidden, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, h, w = x.shape

        weights = self.pool(x)
        weights = self.gate(weights)
        weights = weights.reshape(b, c, 1, 1)

        # Rescale each channel of the input
        return x * weights

This block adds modest cost and can improve accuracy by emphasizing useful channels and suppressing less useful ones.
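
A short usage check with the class above; the output shape matches the input, only the per-channel scaling changes:

se = SqueezeExcitation(channels=64)
x = torch.randn(8, 64, 32, 32)
print(se(x).shape)  # torch.Size([8, 64, 32, 32])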

Efficient Blocks in Practice

An efficient convolutional block often combines several ideas:

class EfficientConvBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1, expansion=4):
        super().__init__()

        hidden = in_channels * expansion
        self.use_shortcut = stride == 1 and in_channels == out_channels

        self.block = nn.Sequential(
            # 1x1 pointwise expansion
            nn.Conv2d(in_channels, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU(),

            # 3x3 depthwise spatial filtering
            nn.Conv2d(
                hidden,
                hidden,
                kernel_size=3,
                stride=stride,
                padding=1,
                groups=hidden,
                bias=False,
            ),
            nn.BatchNorm2d(hidden),
            nn.ReLU(),

            # Channel reweighting
            SqueezeExcitation(hidden),

            # 1x1 pointwise projection
            nn.Conv2d(hidden, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x):
        y = self.block(x)

        if self.use_shortcut:
            y = y + x

        return y

This block uses pointwise expansion, depthwise spatial filtering, channel reweighting, pointwise projection, and an optional residual path.
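
A usage sketch with the class above, confirming the output shape and the total parameter count:

block = EfficientConvBlock(in_channels=64, out_channels=64)
x = torch.randn(8, 64, 32, 32)

print(block(x).shape)  # torch.Size([8, 64, 32, 32])
print(sum(p.numel() for p in block.parameters()))  # total, including BatchNorm and SE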

Efficiency Is Hardware-Dependent

Fewer parameters do not always mean faster inference. Actual speed depends on hardware, memory bandwidth, kernel implementation, batch size, tensor layout, and compiler support.

Depthwise convolutions have few arithmetic operations, but on some hardware they may be memory-bound. A standard convolution may run faster than expected because it maps well to optimized matrix multiplication kernels.

When optimizing a CNN, measure real latency:

import time
import torch
import torch.nn as nn

def benchmark(module, x, steps=100):
    module.eval()

    # Warmup so one-time setup costs do not skew the measurement
    with torch.no_grad():
        for _ in range(10):
            module(x)

    # GPU kernels run asynchronously; synchronize before and after timing
    if x.is_cuda:
        torch.cuda.synchronize()

    start = time.perf_counter()

    with torch.no_grad():
        for _ in range(steps):
            module(x)

    if x.is_cuda:
        torch.cuda.synchronize()

    return (time.perf_counter() - start) / steps

x = torch.randn(1, 64, 128, 128)
module = nn.Conv2d(64, 128, kernel_size=3, padding=1)

print(benchmark(module, x))

For deployment, benchmark the target device, not only the development machine.

Choosing an Efficient Convolution

The right efficient convolution depends on the constraint.

| Constraint | Useful method |
| --- | --- |
| Reduce parameters | Bottlenecks, depthwise separable convolution |
| Reduce FLOPs | Depthwise separable convolution, grouped convolution |
| Increase receptive field | Dilated convolution, factorized large kernels |
| Mobile inference | Inverted residual blocks |
| Preserve accuracy | Squeeze-and-excitation, residual connections |
| Reduce memory | Smaller channels, lower resolution, checkpointing |
| Improve latency | Benchmark hardware-specific kernels |

For small models, overhead may dominate. For large models, arithmetic cost may dominate. The best choice should be tested on the actual workload.

Summary

Efficient convolutions reduce the cost of CNNs by changing how channels and spatial information are processed. A $1 \times 1$ convolution mixes channels cheaply. Grouped and depthwise convolutions reduce channel coupling. Depthwise separable convolution separates spatial filtering from channel mixing. Bottleneck and inverted residual blocks use these operations to build efficient networks.

Efficient CNN design is a tradeoff among accuracy, parameter count, FLOPs, memory traffic, and hardware latency. The mathematical operation matters, but the final decision should be based on measured performance on the target device.