# Feature Maps

A feature map is the spatial output produced by a convolutional filter. In a convolutional neural network, each output channel can be read as a map of where a learned feature appears in the input.

If a filter detects vertical edges, its feature map contains high values at locations where vertical edges are present. If a filter detects a texture, its feature map contains high values where that texture appears. The model learns these filters from data.
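This behavior is easy to see with a hand-crafted filter. The sketch below uses a Sobel-style vertical-edge kernel (chosen by hand for illustration, not learned) on a synthetic image whose left half is dark and right half is bright:

```python
import torch
import torch.nn.functional as F

# Hand-crafted Sobel-style vertical-edge kernel; a learned filter plays
# the same role, but its weights come from training.
kernel = torch.tensor([[-1., 0., 1.],
                       [-2., 0., 2.],
                       [-1., 0., 1.]]).reshape(1, 1, 3, 3)

# Synthetic grayscale image: dark left half, bright right half.
img = torch.zeros(1, 1, 8, 8)
img[..., 4:] = 1.0

fmap = F.conv2d(img, kernel, padding=1)

# The feature map peaks at the vertical edge (columns 3-4) and is zero in
# the flat interior regions; border values are zero-padding artifacts.
print(fmap[0, 0, 4])
```

The strong responses sit exactly where the edge is, which is the "map of where a learned feature appears" described above.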

### From Images to Feature Maps

An image tensor contains raw input values. For an RGB image, the tensor shape is

$$
[3, H, W].
$$

After a convolutional layer with $C_{\text{out}}$ filters, the output shape becomes

$$
[C_{\text{out}}, H_{\text{out}}, W_{\text{out}}].
$$

Each of the $C_{\text{out}}$ output channels is one feature map.

For a batch, PyTorch uses

$$
[B, C_{\text{out}}, H_{\text{out}}, W_{\text{out}}].
$$

Here $B$ is the batch size. The second axis indexes feature maps.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(
    in_channels=3,
    out_channels=16,
    kernel_size=3,
    padding=1,
)

x = torch.randn(8, 3, 32, 32)
y = conv(x)

print(y.shape)  # torch.Size([8, 16, 32, 32])
```

This layer produces 16 feature maps for each image.

### What a Feature Map Represents

A feature map is a grid of activations. Each location corresponds to a local region of the input. The value at that location measures how strongly the learned filter responds to that region.

For one output channel $o$, the feature map is

$$
Y_o \in \mathbb{R}^{H_{\text{out}} \times W_{\text{out}}}.
$$

A single entry is computed by applying filter $o$ to one input patch:

$$
Y_{o,i,j} =
b_o
+
\sum_c
\sum_u
\sum_v
W_{o,c,u,v}
X_{c,i+u,j+v}.
$$

Here $b_o$ is the filter's bias, $W_{o,c,u,v}$ its weights, $c$ indexes input channels, and $(u, v)$ ranges over the kernel offsets. Before activation, this value is often called a pre-activation. After a nonlinear function such as ReLU, it becomes an activation.
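A single entry of this sum can be checked against `F.conv2d` directly. A small numerical sketch, with no padding so the indexing stays simple:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 3, 5, 5)   # one image, 3 input channels
w = torch.randn(2, 3, 3, 3)   # 2 filters: [C_out, C_in, kH, kW]
b = torch.randn(2)

y = F.conv2d(x, w, b)         # no padding: output is [1, 2, 3, 3]

# Recompute Y[o, i, j] by hand: bias plus the triple sum over channels
# and kernel offsets, written as one patch-wise product.
o, i, j = 0, 1, 1
manual = b[o] + (w[o] * x[0, :, i:i+3, j:j+3]).sum()

print(torch.allclose(y[0, o, i, j], manual))  # True
```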

```python
layer = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
)

x = torch.randn(8, 3, 32, 32)
a = layer(x)

print(a.shape)  # torch.Size([8, 16, 32, 32])
```

The ReLU removes negative responses and keeps positive evidence for learned features.

### Channels as Learned Feature Detectors

In the input image, channels have fixed meanings. For RGB images, channel 0 may represent red, channel 1 green, and channel 2 blue.

In hidden layers, channels have learned meanings. A hidden channel may respond to edges, textures, colors, shapes, object parts, or task-specific patterns. The interpretation depends on the layer and the training data.

Early feature maps tend to detect low-level patterns. Deeper feature maps tend to detect more abstract patterns.

| Network stage | Typical feature map behavior |
|---|---|
| Early layers | Edges, color contrasts, corners, simple textures |
| Middle layers | Repeated textures, local shapes, object parts |
| Late layers | Class-specific parts, semantic regions, high-level concepts |

These descriptions are approximate. Neural networks do not assign clean human labels to every channel. A feature map is best understood by its effect on the model’s computation.

### Spatial Resolution and Semantic Depth

As a CNN goes deeper, two things usually happen.

First, spatial resolution decreases. Pooling or strided convolution reduces height and width.

Second, semantic depth increases. The number of channels often grows, and each channel represents more abstract information.

A common progression is:

$$
[3, 224, 224]
\rightarrow
[64, 112, 112]
\rightarrow
[128, 56, 56]
\rightarrow
[256, 28, 28]
\rightarrow
[512, 14, 14]
\rightarrow
[1024, 7, 7].
$$

This trades spatial detail for richer feature representations.

Large early feature maps preserve precise location. Small late feature maps summarize broader regions of the image. Classification benefits from this compression. Segmentation and detection must preserve or recover spatial detail.
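A stack of stride-2 convolutions reproduces this kind of progression. The channel counts below are chosen to mirror the example above, not taken from any particular architecture:

```python
import torch
import torch.nn as nn

# Each stage halves H and W (stride 2) and grows the channel count.
stages = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
    nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1),
    nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1),
    nn.Conv2d(512, 1024, kernel_size=3, stride=2, padding=1),
)

x = torch.randn(1, 3, 224, 224)
for stage in stages:
    x = stage(x)
    print(tuple(x.shape))  # ends at (1, 1024, 7, 7)
```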

### Feature Maps in Classification

In image classification, the final feature maps are usually converted into a vector and passed to a classifier.

One common method is global average pooling:

$$
X \in \mathbb{R}^{B \times C \times H \times W}
$$

is converted to

$$
Z \in \mathbb{R}^{B \times C}.
$$

Each channel is averaged over all spatial positions. The result is a feature vector for each image.

```python
model_head = nn.Sequential(
    nn.AdaptiveAvgPool2d((1, 1)),
    nn.Flatten(),
    nn.Linear(512, 1000),
)

x = torch.randn(8, 512, 7, 7)
logits = model_head(x)

print(logits.shape)  # torch.Size([8, 1000])
```

The classifier reads the final channel summaries and produces class logits.
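Global average pooling also has a direct tensor equivalent: averaging each channel over its two spatial axes. A quick check of that equivalence:

```python
import torch
import torch.nn as nn

x = torch.randn(8, 512, 7, 7)

pooled = nn.Flatten()(nn.AdaptiveAvgPool2d((1, 1))(x))  # [8, 512]
manual = x.mean(dim=(2, 3))                             # same values

print(torch.allclose(pooled, manual, atol=1e-6))  # True
```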

### Feature Maps in Detection and Segmentation

Detection and segmentation need spatial outputs. They cannot reduce everything to a single vector too early.

In object detection, feature maps provide spatial grids where the model predicts boxes, classes, or objectness scores.

In semantic segmentation, the model predicts a class for each pixel or each high-resolution spatial location. Feature maps must therefore preserve enough positional information.

A segmentation model may produce an output such as

$$
[B, K, H, W],
$$

where $K$ is the number of semantic classes. Each spatial position has a $K$-dimensional vector of class scores.

```python
seg_head = nn.Conv2d(
    in_channels=256,
    out_channels=21,
    kernel_size=1,
)

x = torch.randn(4, 256, 64, 64)
logits = seg_head(x)

print(logits.shape)  # torch.Size([4, 21, 64, 64])
```

Here each of the 21 output channels represents one class score map.
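To turn these score maps into a segmentation mask, a common step is an argmax over the class axis, yielding one class index per spatial position:

```python
import torch
import torch.nn as nn

# Same head as above: 21-class score maps over a 64x64 grid.
seg_head = nn.Conv2d(256, 21, kernel_size=1)
logits = seg_head(torch.randn(4, 256, 64, 64))

# Collapse the class axis: one predicted class index per pixel.
pred = logits.argmax(dim=1)
print(pred.shape)  # torch.Size([4, 64, 64])
```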

### One-by-One Convolutions on Feature Maps

A $1 \times 1$ convolution mixes channels at each spatial location. It does not look at neighboring pixels. Instead, it applies a learned linear transformation to the channel vector at every location.

If

$$
X_{:,i,j} \in \mathbb{R}^{C_{\text{in}}},
$$

then a $1 \times 1$ convolution maps it to

$$
Y_{:,i,j} \in \mathbb{R}^{C_{\text{out}}}.
$$

In PyTorch:

```python
conv1x1 = nn.Conv2d(
    in_channels=256,
    out_channels=64,
    kernel_size=1,
)

x = torch.randn(8, 256, 32, 32)
y = conv1x1(x)

print(y.shape)  # torch.Size([8, 64, 32, 32])
```

This is useful for reducing channel count, expanding channel count, and combining features before or after spatial convolutions.
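Because it acts on each location independently, a $1 \times 1$ convolution is equivalent to an `nn.Linear` applied to every channel vector. The sketch below copies the conv weights into a linear layer to confirm this:

```python
import torch
import torch.nn as nn

conv1x1 = nn.Conv2d(256, 64, kernel_size=1)
linear = nn.Linear(256, 64)

# Share weights so both layers compute the same linear map per location.
with torch.no_grad():
    linear.weight.copy_(conv1x1.weight.view(64, 256))
    linear.bias.copy_(conv1x1.bias)

x = torch.randn(2, 256, 4, 4)
y_conv = conv1x1(x)

# Move channels last, apply the linear map per location, move them back.
y_lin = linear(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

print(torch.allclose(y_conv, y_lin, atol=1e-5))  # True
```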

### Visualizing Feature Maps

Feature maps can be inspected by selecting channels from an activation tensor.

```python
x = torch.randn(1, 3, 32, 32)

conv = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
)

features = conv(x)

first_feature_map = features[0, 0]
print(first_feature_map.shape)  # torch.Size([32, 32])
```

The tensor `features[0, 0]` is the first feature map for the first image.

Visualizing feature maps can help debug models. For example, if all activations are nearly zero after a layer, the model may have poor initialization, an unsuitable learning rate, or saturated activations.

Feature map visualization should be interpreted carefully. Individual channels rarely have simple meanings across all inputs.

### Activation Magnitudes

Feature maps contain numerical activations, not direct probabilities. A high value means a strong response from the filter, but the meaning depends on the layer, normalization, activation function, and downstream computation.

Batch normalization, layer normalization, ReLU, GELU, pooling, and residual connections can all change the distribution of activations.

A useful debugging check is to inspect basic statistics:

```python
with torch.no_grad():
    features = conv(x)

print(features.mean())
print(features.std())
print(features.min())
print(features.max())
```

Extremely large values may indicate exploding activations. Values near zero everywhere may indicate dead filters or overly aggressive normalization.

### Feature Maps and Receptive Fields

Each location in a feature map corresponds to a receptive field in the original input. In early layers, the receptive field is small. In deeper layers, it grows.

For example, after one $3 \times 3$ convolution, one feature-map location depends on a $3 \times 3$ input patch. After two such layers, it depends on a $5 \times 5$ region. After several layers and downsampling operations, each location may depend on a large part of the input image.

Thus a deep feature map has two coordinates:

1. A spatial coordinate in the feature map.
2. A corresponding receptive field in the original image.

This connection is important for detection, segmentation, and attention over visual features.
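The growth described above follows a simple recurrence: each layer widens the receptive field by $(k - 1)$ times the product of all earlier strides. A small sketch of that bookkeeping:

```python
def receptive_field(layers):
    """Receptive field of a stack of (kernel_size, stride) layers."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump  # each layer widens the field by (k-1)*jump
        jump *= s             # strides compound the step between locations
    return rf

print(receptive_field([(3, 1)]))                  # 3
print(receptive_field([(3, 1), (3, 1)]))          # 5
print(receptive_field([(3, 2), (3, 2), (3, 2)]))  # 15
```

The last case shows why downsampling matters: with stride-2 layers, three $3 \times 3$ convolutions already cover a $15 \times 15$ input region.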

### Feature Maps and Memory Cost

Feature maps can consume more memory than model parameters during training. This is because backpropagation needs intermediate activations to compute gradients.

For a tensor of shape

$$
[B, C, H, W],
$$

the number of values is

$$
BCHW.
$$

For example:

$$
B=32,\quad C=128,\quad H=W=56.
$$

Then the feature map contains

$$
32 \cdot 128 \cdot 56 \cdot 56 = 12{,}845{,}056
$$

values.

With `float32` (4 bytes per value), this activation tensor alone uses

$$
12{,}845{,}056 \cdot 4 = 51{,}380{,}224 \text{ bytes} \approx 51.4 \text{ MB}
$$

of memory.

This is only one layer. During training, many such tensors may be stored. This is why batch size, image resolution, and channel count strongly affect GPU memory usage.
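The arithmetic above is easy to verify from a tensor's element count and element size:

```python
import torch

x = torch.randn(32, 128, 56, 56)

num_values = x.numel()                       # B * C * H * W
bytes_used = num_values * x.element_size()   # float32: 4 bytes per value

print(num_values)        # 12845056
print(bytes_used / 1e6)  # 51.380224, i.e. about 51.4 MB
```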

### Contiguous Layout and Channel Order

PyTorch convolution layers usually expect tensors in NCHW layout:

$$
[B, C, H, W].
$$

Some image libraries use NHWC layout:

$$
[B, H, W, C].
$$

Before passing such tensors to `nn.Conv2d`, the axes must be permuted.

```python
x = torch.randn(8, 224, 224, 3)  # NHWC
x = x.permute(0, 3, 1, 2)        # NCHW

print(x.shape)  # torch.Size([8, 3, 224, 224])
```

After `permute`, the tensor may have a non-contiguous memory layout. Some operations require or benefit from contiguous memory.

```python
x = x.contiguous()
```

Shape and layout are separate concepts. Two tensors can have the same shape but different memory strides.
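Strides make this concrete. Below, `a` is a contiguous NCHW tensor, while `b` has the same shape but is a permuted view over NHWC storage:

```python
import torch

a = torch.randn(8, 3, 224, 224)                      # contiguous NCHW
b = torch.randn(8, 224, 224, 3).permute(0, 3, 1, 2)  # NHWC viewed as NCHW

print(a.shape == b.shape)  # True: identical shapes
print(a.stride())          # (150528, 50176, 224, 1)
print(b.stride())          # (150528, 1, 672, 3)
print(b.is_contiguous())   # False: a view, not a copy
```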

### Feature Maps Versus Embeddings

A feature map preserves spatial structure. An embedding is usually a vector representation.

For images, a CNN may produce feature maps:

$$
[B, C, H, W].
$$

After global pooling, it produces embeddings:

$$
[B, C].
$$

The feature map says where features appear. The embedding summarizes what features are present.

This distinction matters. Classification often uses embeddings. Detection, segmentation, pose estimation, and image generation often need feature maps.

### Summary

A feature map is an output channel of a convolutional layer. It records the spatial response of a learned filter across an input or previous feature tensor.

Feature maps preserve spatial organization while increasing representational richness. Early feature maps capture simple patterns. Deeper feature maps capture broader and more task-specific structures.

In PyTorch, convolutional feature maps usually have shape

$$
[B, C, H, W].
$$

The batch axis selects examples, the channel axis selects feature maps, and the two spatial axes locate activations.

