# Efficient Convolutions

Efficient convolutions reduce computation, memory use, or latency while preserving useful spatial modeling. They are important when models must run on mobile devices, edge hardware, browsers, real-time systems, or large-scale training clusters.

A standard convolution is powerful, but expensive. If an input has $C_{\text{in}}$ channels and an output has $C_{\text{out}}$ channels, a $k \times k$ convolution uses

$$
C_{\text{out}} C_{\text{in}} k^2
$$

weights. It also performs this many multiply-add operations at every output spatial location.

Efficient convolution methods reduce this cost by changing how spatial mixing and channel mixing are performed.

### Cost of a Standard Convolution

For an input tensor

$$
X \in \mathbb{R}^{B \times C_{\text{in}} \times H \times W},
$$

a standard convolution with $C_{\text{out}}$ output channels and a $k \times k$ kernel produces

$$
Y \in \mathbb{R}^{B \times C_{\text{out}} \times H_{\text{out}} \times W_{\text{out}}}.
$$

Ignoring bias, the number of parameters is

$$
C_{\text{out}} C_{\text{in}} k^2.
$$

For a single example (batch size 1), the approximate multiply-add count is

$$
H_{\text{out}} W_{\text{out}} C_{\text{out}} C_{\text{in}} k^2.
$$

Example:

$$
C_{\text{in}}=64,\quad C_{\text{out}}=128,\quad k=3.
$$

The parameter count is

$$
128 \cdot 64 \cdot 3^2 = 73{,}728.
$$

If the output feature map is $56 \times 56$, the convolution performs roughly

$$
56 \cdot 56 \cdot 128 \cdot 64 \cdot 9
$$

multiply-adds, roughly 231 million. This is a large cost for one layer, and CNNs contain many such layers.
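
To make these numbers concrete, here is a short check of the arithmetic above; it is a minimal sketch that only counts weights and multiply-adds, not a full profiler:

```python
import torch.nn as nn

c_in, c_out, k = 64, 128, 3
h_out, w_out = 56, 56

# Parameters: C_out * C_in * k^2 (ignoring bias)
params = c_out * c_in * k * k
print(params)  # 73728

# Approximate multiply-adds for one example
print(h_out * w_out * params)  # 231211008

# Cross-check against the layer's actual weight tensor
conv = nn.Conv2d(c_in, c_out, kernel_size=k, padding=1, bias=False)
print(conv.weight.numel())  # 73728
```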

### One by One Convolution

A $1 \times 1$ convolution mixes channels without mixing neighboring spatial positions.

For each spatial location $(i,j)$, the input channel vector is

$$
x_{i,j} \in \mathbb{R}^{C_{\text{in}}}.
$$

A $1 \times 1$ convolution applies the same linear map at every location:

$$
y_{i,j} = W x_{i,j} + b.
$$

The weight tensor has shape

$$
[C_{\text{out}}, C_{\text{in}}, 1, 1].
$$

So the parameter count is

$$
C_{\text{out}} C_{\text{in}}.
$$

In PyTorch:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(
    in_channels=256,
    out_channels=64,
    kernel_size=1,
)

x = torch.randn(8, 256, 32, 32)
y = conv(x)

print(y.shape)  # torch.Size([8, 64, 32, 32])
```

A $1 \times 1$ convolution is commonly used to reduce channels before an expensive spatial convolution.

### Bottleneck Convolutions

A bottleneck block uses $1 \times 1$ convolutions to reduce and then restore channel dimension.

The pattern is

$$
1 \times 1 \rightarrow 3 \times 3 \rightarrow 1 \times 1.
$$

The first layer reduces channels. The middle layer performs spatial processing on fewer channels. The final layer expands channels.

Example:

```python
class BottleneckConv(nn.Module):
    def __init__(self, in_channels, mid_channels, out_channels):
        super().__init__()

        self.net = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(),

            nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(),

            nn.Conv2d(mid_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x):
        return self.net(x)
```

Suppose

$$
C_{\text{in}} = C_{\text{out}} = 256,
\quad
C_{\text{mid}} = 64.
$$

A direct $3 \times 3$ convolution has

$$
256 \cdot 256 \cdot 9 = 589{,}824
$$

weights.

The bottleneck version has

$$
256 \cdot 64 + 64 \cdot 64 \cdot 9 + 64 \cdot 256 =
69{,}632
$$

weights.

This is much cheaper while still allowing spatial processing.
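
These counts can be confirmed directly, reusing the `BottleneckConv` class above and counting only convolution weights (the BatchNorm layers add a small number of extra parameters):

```python
direct = nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False)
bottleneck = BottleneckConv(in_channels=256, mid_channels=64, out_channels=256)

def conv_weight_count(module):
    # Sum the weights of every Conv2d inside the module
    return sum(
        m.weight.numel() for m in module.modules() if isinstance(m, nn.Conv2d)
    )

print(direct.weight.numel())          # 589824
print(conv_weight_count(bottleneck))  # 69632
```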

### Grouped Convolution

Grouped convolution splits the input channels into groups. Each group is convolved separately. The outputs are then concatenated along the channel axis.

If there are $g$ groups, each group sees only

$$
\frac{C_{\text{in}}}{g}
$$

input channels and produces

$$
\frac{C_{\text{out}}}{g}
$$

output channels.

The parameter count becomes

$$
\frac{C_{\text{out}} C_{\text{in}} k^2}{g}.
$$

In PyTorch:

```python
conv = nn.Conv2d(
    in_channels=64,
    out_channels=128,
    kernel_size=3,
    padding=1,
    groups=4,
)

x = torch.randn(8, 64, 32, 32)
y = conv(x)

print(y.shape)  # torch.Size([8, 128, 32, 32])
```

Both `in_channels` and `out_channels` must be divisible by `groups`.

Grouped convolution reduces computation, but it also limits communication between channel groups. Later layers, often $1 \times 1$ convolutions, can mix information across groups.
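
A quick check of the $1/g$ parameter reduction, using the layer sizes from the example above:

```python
standard = nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False)
grouped = nn.Conv2d(64, 128, kernel_size=3, padding=1, groups=4, bias=False)

print(standard.weight.numel())  # 73728
print(grouped.weight.numel())   # 18432, i.e. 73728 / 4
```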

### Depthwise Convolution

Depthwise convolution is the extreme case of grouped convolution where

$$
g = C_{\text{in}}.
$$

Each input channel gets its own spatial filter. There is no channel mixing inside the depthwise convolution.

In PyTorch:

```python
depthwise = nn.Conv2d(
    in_channels=64,
    out_channels=64,
    kernel_size=3,
    padding=1,
    groups=64,
)

x = torch.randn(8, 64, 32, 32)
y = depthwise(x)

print(y.shape)  # torch.Size([8, 64, 32, 32])
```

For a $3 \times 3$ depthwise convolution with $C$ channels, the parameter count is

$$
C \cdot 3^2.
$$

This is far smaller than

$$
C^2 \cdot 3^2
$$

for a standard convolution with the same input and output channel count.

### Depthwise Separable Convolution

Depthwise separable convolution splits standard convolution into two parts:

1. Depthwise convolution for spatial mixing.
2. Pointwise $1 \times 1$ convolution for channel mixing.

The block is

$$
\text{depthwise } 3 \times 3
\rightarrow
\text{pointwise } 1 \times 1.
$$

In PyTorch:

```python
class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()

        self.net = nn.Sequential(
            nn.Conv2d(
                in_channels,
                in_channels,
                kernel_size=3,
                stride=stride,
                padding=1,
                groups=in_channels,
                bias=False,
            ),
            nn.BatchNorm2d(in_channels),
            nn.ReLU(),

            nn.Conv2d(
                in_channels,
                out_channels,
                kernel_size=1,
                bias=False,
            ),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)
```

A standard $3 \times 3$ convolution has

$$
C_{\text{in}} C_{\text{out}} 9
$$

weights.

A depthwise separable convolution has

$$
C_{\text{in}} 9 + C_{\text{in}} C_{\text{out}}
$$

weights.

For

$$
C_{\text{in}}=64,\quad C_{\text{out}}=128,
$$

standard convolution has

$$
64 \cdot 128 \cdot 9 = 73{,}728
$$

weights.

Depthwise separable convolution has

$$
64 \cdot 9 + 64 \cdot 128 = 8{,}768
$$

weights.

This is about $8.4$ times fewer parameters.
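
The same comparison in code, counting only convolution weights:

```python
standard = nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False)

depthwise = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64, bias=False)
pointwise = nn.Conv2d(64, 128, kernel_size=1, bias=False)

print(standard.weight.numel())                              # 73728
print(depthwise.weight.numel() + pointwise.weight.numel())  # 8768
```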

### Inverted Residual Blocks

An inverted residual block is common in mobile CNNs. It uses the opposite shape pattern from a classical bottleneck.

A classical bottleneck compresses channels, processes spatially, then expands:

$$
\text{wide} \rightarrow \text{narrow} \rightarrow \text{wide}.
$$

An inverted residual block expands channels, applies depthwise convolution, then projects back:

$$
\text{narrow} \rightarrow \text{wide} \rightarrow \text{narrow}.
$$

The structure is usually:

$$
1 \times 1 \text{ expansion}
\rightarrow
3 \times 3 \text{ depthwise}
\rightarrow
1 \times 1 \text{ projection}.
$$

```python
class InvertedResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, expansion=4, stride=1):
        super().__init__()

        hidden_channels = in_channels * expansion
        self.use_shortcut = stride == 1 and in_channels == out_channels

        self.block = nn.Sequential(
            nn.Conv2d(in_channels, hidden_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden_channels),
            nn.ReLU(),

            nn.Conv2d(
                hidden_channels,
                hidden_channels,
                kernel_size=3,
                stride=stride,
                padding=1,
                groups=hidden_channels,
                bias=False,
            ),
            nn.BatchNorm2d(hidden_channels),
            nn.ReLU(),

            nn.Conv2d(hidden_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x):
        y = self.block(x)
        if self.use_shortcut:
            y = y + x
        return y
```

The expansion gives the block enough channel capacity. The depthwise layer handles spatial structure cheaply. The projection returns to a compact representation.
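
A short usage example for the block above, with illustrative channel and spatial sizes:

```python
block = InvertedResidualBlock(in_channels=32, out_channels=32, expansion=4, stride=1)

x = torch.randn(8, 32, 56, 56)
print(block(x).shape)  # torch.Size([8, 32, 56, 56]), shortcut applied

# With a stride of 2 or a channel change, the shortcut is skipped
down = InvertedResidualBlock(in_channels=32, out_channels=64, expansion=4, stride=2)
print(down(x).shape)  # torch.Size([8, 64, 28, 28])
```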

### Channel Shuffle

Grouped convolutions reduce communication between groups. Channel shuffle is a simple operation that mixes channels across groups by reshaping and permuting the channel axis.

Suppose a tensor has shape

$$
[B, C, H, W]
$$

and $g$ groups. We can view the channels as

$$
[B, g, C/g, H, W],
$$

swap the group and within-group axes, then flatten back to

$$
[B, C, H, W].
$$

```python
def channel_shuffle(x, groups):
    b, c, h, w = x.shape
    assert c % groups == 0

    x = x.reshape(b, groups, c // groups, h, w)
    x = x.transpose(1, 2)
    x = x.reshape(b, c, h, w)
    return x
```

Channel shuffle is useful when grouped convolutions are stacked. It lets each group in a later layer receive information from every group in the earlier layer.
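
A small demonstration of the reordering, using a tensor whose channel values equal their indices:

```python
x = torch.arange(8.0).reshape(1, 8, 1, 1)  # channel values 0..7

shuffled = channel_shuffle(x, groups=2)
print(shuffled.flatten().tolist())
# [0.0, 4.0, 1.0, 5.0, 2.0, 6.0, 3.0, 7.0]
```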

### Dilated Convolution as Efficient Context

Dilated convolution increases receptive field without increasing kernel size. A dilation rate $d$ inserts gaps between kernel positions.

A $3 \times 3$ kernel with dilation $2$ covers the spatial span of a $5 \times 5$ area but uses only nine weights.

In PyTorch:

```python
conv = nn.Conv2d(
    in_channels=64,
    out_channels=64,
    kernel_size=3,
    padding=2,
    dilation=2,
)

x = torch.randn(8, 64, 32, 32)
y = conv(x)

print(y.shape)  # torch.Size([8, 64, 32, 32])
```

Dilated convolutions are useful in segmentation, detection, audio models, and any task that needs larger context without aggressive downsampling.

### Factorized Convolutions

A large convolution can sometimes be factorized into smaller operations.

For example, a $5 \times 5$ convolution can be replaced by two stacked $3 \times 3$ convolutions, which together cover the same receptive field. With an activation between them, this adds an extra nonlinearity, and it reduces the spatial weights per filter from $25$ to $18$.
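
A minimal sketch of that replacement, with illustrative channel counts:

```python
# One 5x5 convolution: 64 * 64 * 25 = 102400 weights
direct_5x5 = nn.Conv2d(64, 64, kernel_size=5, padding=2, bias=False)

# Two stacked 3x3 convolutions: 2 * 64 * 64 * 9 = 73728 weights
stacked_3x3 = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),
    nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),
)

print(direct_5x5.weight.numel())  # 102400
print(sum(m.weight.numel() for m in stacked_3x3 if isinstance(m, nn.Conv2d)))  # 73728
```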

Another factorization replaces a $k \times k$ convolution with:

$$
k \times 1
\rightarrow
1 \times k.
$$

For a $7 \times 7$ convolution, this changes the spatial parameter count from

$$
49
$$

to

$$
14.
$$

Example:

```python
factorized = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=(7, 1), padding=(3, 0), bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(),

    nn.Conv2d(64, 64, kernel_size=(1, 7), padding=(0, 3), bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(),
)
```

Factorized convolutions reduce computation while still covering the same large receptive field, at the cost of some expressiveness compared with a full $k \times k$ kernel.

### Squeeze-and-Excitation

Squeeze-and-excitation blocks improve channel efficiency by letting the network reweight channels dynamically.

First, global average pooling summarizes each channel:

$$
z_c =
\frac{1}{HW}
\sum_i
\sum_j
X_{c,i,j}.
$$

Then a small network predicts channel weights. These weights scale the original feature maps.

```python
class SqueezeExcitation(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()

        hidden = channels // reduction

        self.pool = nn.AdaptiveAvgPool2d((1, 1))
        self.gate = nn.Sequential(
            nn.Flatten(),
            nn.Linear(channels, hidden),
            nn.ReLU(),
            nn.Linear(hidden, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, h, w = x.shape

        weights = self.pool(x)
        weights = self.gate(weights)
        weights = weights.reshape(b, c, 1, 1)

        return x * weights
```

This block adds modest cost and can improve accuracy by emphasizing useful channels and suppressing less useful ones.
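
A brief usage check for the block above; the channel count is illustrative:

```python
se = SqueezeExcitation(channels=64, reduction=4)

x = torch.randn(8, 64, 32, 32)
y = se(x)

print(y.shape)  # torch.Size([8, 64, 32, 32]); same shape, channels rescaled
```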

### Efficient Blocks in Practice

An efficient convolutional block often combines several ideas:

```python
class EfficientConvBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1, expansion=4):
        super().__init__()

        hidden = in_channels * expansion
        self.use_shortcut = stride == 1 and in_channels == out_channels

        self.block = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU(),

            nn.Conv2d(
                hidden,
                hidden,
                kernel_size=3,
                stride=stride,
                padding=1,
                groups=hidden,
                bias=False,
            ),
            nn.BatchNorm2d(hidden),
            nn.ReLU(),

            SqueezeExcitation(hidden),

            nn.Conv2d(hidden, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x):
        y = self.block(x)

        if self.use_shortcut:
            y = y + x

        return y
```

This block uses pointwise expansion, depthwise spatial filtering, channel reweighting, pointwise projection, and an optional residual path.
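
A short shape check of the combined block, with illustrative sizes:

```python
block = EfficientConvBlock(in_channels=64, out_channels=64, stride=1, expansion=4)

x = torch.randn(8, 64, 32, 32)
print(block(x).shape)  # torch.Size([8, 64, 32, 32]), shortcut applied
```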

### Efficiency Is Hardware-Dependent

Fewer parameters do not always mean faster inference. Actual speed depends on hardware, memory bandwidth, kernel implementation, batch size, tensor layout, and compiler support.

Depthwise convolutions have few arithmetic operations, but on some hardware they may be memory-bound. A standard convolution may run faster than expected because it maps well to optimized matrix multiplication kernels.

When optimizing a CNN, measure real latency:

```python
import time
import torch
import torch.nn as nn

def benchmark(module, x, steps=100):
    module.eval()

    # Warmup
    with torch.no_grad():
        for _ in range(10):
            module(x)

    if x.is_cuda:
        torch.cuda.synchronize()

    start = time.time()

    with torch.no_grad():
        for _ in range(steps):
            module(x)

    if x.is_cuda:
        torch.cuda.synchronize()

    return (time.time() - start) / steps

x = torch.randn(1, 64, 128, 128)
module = nn.Conv2d(64, 128, kernel_size=3, padding=1)

print(benchmark(module, x))
```

For deployment, benchmark the target device, not only the development machine.
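
For example, the helper above can compare a standard convolution against a depthwise separable pair on the same input. The relative timings depend on the machine, so no representative numbers are shown here:

```python
standard = nn.Conv2d(64, 128, kernel_size=3, padding=1)

separable = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64),
    nn.Conv2d(64, 128, kernel_size=1),
)

x = torch.randn(1, 64, 128, 128)

print(benchmark(standard, x))
print(benchmark(separable, x))
```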

### Choosing an Efficient Convolution

The right efficient convolution depends on the constraint.

| Constraint | Useful method |
|---|---|
| Reduce parameters | Bottlenecks, depthwise separable convolution |
| Reduce FLOPs | Depthwise separable convolution, grouped convolution |
| Increase receptive field | Dilated convolution, factorized large kernels |
| Mobile inference | Inverted residual blocks |
| Preserve accuracy | Squeeze-and-excitation, residual connections |
| Reduce memory | Smaller channels, lower resolution, checkpointing |
| Improve latency | Benchmark hardware-specific kernels |

For small models, overhead may dominate. For large models, arithmetic cost may dominate. The best choice should be tested on the actual workload.

### Summary

Efficient convolutions reduce the cost of CNNs by changing how channels and spatial information are processed. A $1 \times 1$ convolution mixes channels cheaply. Grouped and depthwise convolutions reduce channel coupling. Depthwise separable convolution separates spatial filtering from channel mixing. Bottleneck and inverted residual blocks use these operations to build efficient networks.

Efficient CNN design is a tradeoff among accuracy, parameter count, FLOPs, memory traffic, and hardware latency. The mathematical operation matters, but the final decision should be based on measured performance on the target device.

