# Transformer Encoders

A transformer encoder is a stack of layers that maps an input sequence to a contextual sequence representation. Each output position can contain information from every visible input position. Encoder models are useful when the full input is available before prediction.

Examples include text classification, named entity recognition, sentence embedding, document ranking, image classification with vision transformers, and many multimodal encoders.

Given an input tensor

$$
X \in \mathbb{R}^{B \times T \times D},
$$

a transformer encoder returns

$$
H \in \mathbb{R}^{B \times T \times D}.
$$

The sequence length and model dimension usually stay unchanged across encoder layers.
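This shape preservation is easy to check with PyTorch's built-in encoder layer (a quick sanity check, separate from the implementation developed later in this section):

```python
import torch
from torch import nn

# PyTorch's built-in encoder layer maps [B, T, D] to [B, T, D]
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)

x = torch.randn(2, 10, 64)  # B=2, T=10, D=64
h = layer(x)

print(h.shape)  # torch.Size([2, 10, 64])
```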

### Encoder Layer Structure

A standard transformer encoder layer has two main sublayers:

| Sublayer | Purpose |
|---|---|
| Multi-head self-attention | Allows tokens to exchange information |
| Feedforward network | Applies nonlinear transformation to each token |

Each sublayer is wrapped with a residual connection and layer normalization.

The common modern form is pre-normalization:

$$
X_1 = X + \operatorname{SelfAttention}(\operatorname{LayerNorm}(X)),
$$

$$
Y = X_1 + \operatorname{FFN}(\operatorname{LayerNorm}(X_1)).
$$

The output $Y$ has the same shape as $X$.

Pre-normalization is widely used because it tends to train deep transformers more stably than the original post-normalization layout.
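The pre-norm pattern can be expressed as a small generic wrapper. This is an illustrative sketch, where `sublayer` stands for any shape-preserving module such as self-attention or the feedforward network:

```python
import torch
from torch import nn

class PreNormResidual(nn.Module):
    """Sketch of the pre-norm pattern: x + sublayer(LayerNorm(x))."""

    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.sublayer(self.norm(x))

# any shape-preserving module can be wrapped, e.g. a linear map
block = PreNormResidual(16, nn.Linear(16, 16))
x = torch.randn(2, 5, 16)
print(block(x).shape)  # torch.Size([2, 5, 16])
```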

### Self-Attention in the Encoder

Encoder self-attention is usually bidirectional. Each token can attend to all non-padding tokens.

For a sequence:

```text
The cat sat on the mat
```

the token `cat` can attend to `The`, `sat`, `on`, `the`, `mat`, and itself. There is no causal restriction because the whole sequence is already known.

The only common mask is the padding mask. If the batch contains padded sequences, the attention layer should hide padding tokens:

```text
[The, cat, sat, <pad>, <pad>]
[The, dog, ran, outside, today]
```

The padding positions carry no meaningful content, so real tokens should not draw information from them.

### Feedforward Network

The feedforward network is applied independently to each position, using the same weights at every position.

A common form is:

$$
\operatorname{FFN}(x) =
W_2 \sigma(W_1x + b_1) + b_2.
$$

Here $x\in\mathbb{R}^{D}$, $W_1$ expands the dimension from $D$ to $D_{\text{ff}}$, $\sigma$ is an activation function such as GELU or ReLU, and $W_2$ projects back to $D$.

Usually,

$$
D_{\text{ff}} \approx 4D.
$$

For example, if $D=768$, the feedforward hidden dimension may be 3072.

In PyTorch:

```python
from torch import nn

class FeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        return self.net(x)
```

Although self-attention mixes information across tokens, the feedforward network transforms each token vector independently. Together, these two operations give the encoder both communication and computation.
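The per-position behavior is easy to verify: applying the feedforward network to a full `[B, T, D]` tensor gives the same result at each position as applying it to that position's vector alone. A small sketch with arbitrary dimensions:

```python
import torch
from torch import nn

torch.manual_seed(0)
ffn = nn.Sequential(nn.Linear(8, 32), nn.GELU(), nn.Linear(32, 8))

x = torch.randn(2, 5, 8)  # [B, T, D]
out = ffn(x)

# the FFN applied to one token vector matches the batched result
print(torch.allclose(out[1, 3], ffn(x[1, 3]), atol=1e-6))  # True
```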

### Residual Connections

Residual connections add the input of a sublayer to its output:

$$
Y = X + F(X).
$$

They help gradients flow through deep networks and allow each layer to learn an incremental update rather than a complete replacement.

In transformer encoders, residual connections preserve shape. The self-attention sublayer maps `[B, T, D]` to `[B, T, D]`, so it can be added back to the input. The feedforward sublayer also maps `[B, T, D]` to `[B, T, D]`.

### Layer Normalization

Layer normalization normalizes each token vector across its feature dimension. For a vector $x\in\mathbb{R}^{D}$, layer normalization computes a normalized vector using the mean and variance of that vector’s entries.

In PyTorch:

```python
norm = nn.LayerNorm(d_model)
```

Layer normalization is preferred over batch normalization in transformers because sequence lengths can vary, batch sizes can be small, and token-level normalization is more natural than batch-level statistics.
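The per-token statistics can be checked directly: after normalization, each token vector has near-zero mean and near-unit variance across its features (up to the small epsilon LayerNorm adds for numerical stability):

```python
import torch
from torch import nn

norm = nn.LayerNorm(16)
x = torch.randn(3, 7, 16)  # [B, T, D]
y = norm(x)

# statistics are computed per token vector, over the feature dimension
print(y.mean(dim=-1).abs().max().item() < 1e-5)                       # True
print((y.var(dim=-1, unbiased=False) - 1).abs().max().item() < 1e-2)  # True
```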

### A Minimal Encoder Layer in PyTorch

```python
import torch
from torch import nn

class TransformerEncoderLayer(nn.Module):
    def __init__(
        self,
        d_model: int,
        num_heads: int,
        d_ff: int,
        dropout: float = 0.1,
    ):
        super().__init__()

        self.self_attn = nn.MultiheadAttention(
            embed_dim=d_model,
            num_heads=num_heads,
            dropout=dropout,
            batch_first=True,
        )

        self.ffn = FeedForward(d_model, d_ff, dropout)

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # x: [B, T, D]
        # key_padding_mask: [B, T], True means masked/padding

        h = self.norm1(x)
        attn_out, _ = self.self_attn(
            query=h,
            key=h,
            value=h,
            key_padding_mask=key_padding_mask,
            need_weights=False,
        )
        x = x + self.dropout1(attn_out)

        h = self.norm2(x)
        ffn_out = self.ffn(h)
        x = x + self.dropout2(ffn_out)

        return x
```

Example:

```python
B = 4
T = 16
D = 128

x = torch.randn(B, T, D)

layer = TransformerEncoderLayer(
    d_model=128,
    num_heads=8,
    d_ff=512,
)

y = layer(x)

print(y.shape)  # torch.Size([4, 16, 128])
```

The encoder layer preserves the sequence shape.

### Padding Masks in PyTorch

PyTorch’s `nn.MultiheadAttention` uses `key_padding_mask` with shape `[B, T]`.

A value of `True` means the key position should be ignored.

```python
tokens = torch.tensor([
    [101, 2009, 2003, 102, 0, 0],
    [101, 2023, 2003, 1037, 3231, 102],
])

pad_token_id = 0
key_padding_mask = tokens.eq(pad_token_id)

print(key_padding_mask)
```

Output:

```text
tensor([[False, False, False, False,  True,  True],
        [False, False, False, False, False, False]])
```

The mask is then passed to the encoder layer:

```python
x = torch.randn(2, 6, 128)

y = layer(x, key_padding_mask=key_padding_mask)
```

The padding tokens are still present in the output tensor, but real tokens cannot attend to them.
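One way to confirm the mask is doing its job: overwrite the padded positions with different values and check that the outputs at the real positions do not move. A small sketch using `nn.MultiheadAttention` directly:

```python
import torch
from torch import nn

torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)

x = torch.randn(1, 4, 8)
mask = torch.tensor([[False, False, True, True]])  # last two positions are padding

out1, _ = attn(x, x, x, key_padding_mask=mask, need_weights=False)

# overwrite the padded positions with different values
x2 = x.clone()
x2[:, 2:] = torch.randn(1, 2, 8)
out2, _ = attn(x2, x2, x2, key_padding_mask=mask, need_weights=False)

# outputs at the real positions are unaffected by the padded content
print(torch.allclose(out1[:, :2], out2[:, :2], atol=1e-6))  # True
```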

### Stacking Encoder Layers

A transformer encoder stacks multiple encoder layers:

$$
H^{(0)} = X,
$$

$$
H^{(\ell+1)} =
\operatorname{EncoderLayer}^{(\ell)}(H^{(\ell)}).
$$

The depth controls how many rounds of token interaction and nonlinear transformation occur.

```python
class TransformerEncoder(nn.Module):
    def __init__(
        self,
        num_layers: int,
        d_model: int,
        num_heads: int,
        d_ff: int,
        dropout: float = 0.1,
    ):
        super().__init__()

        self.layers = nn.ModuleList([
            TransformerEncoderLayer(
                d_model=d_model,
                num_heads=num_heads,
                d_ff=d_ff,
                dropout=dropout,
            )
            for _ in range(num_layers)
        ])

        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x, key_padding_mask=None):
        for layer in self.layers:
            x = layer(x, key_padding_mask=key_padding_mask)

        return self.final_norm(x)
```

Example:

```python
encoder = TransformerEncoder(
    num_layers=6,
    d_model=256,
    num_heads=8,
    d_ff=1024,
)

x = torch.randn(2, 32, 256)
y = encoder(x)

print(y.shape)  # torch.Size([2, 32, 256])
```

### Token Embeddings and Position Embeddings

An encoder for text usually starts with token embeddings and position embeddings.

```python
class TextTransformerEncoder(nn.Module):
    def __init__(
        self,
        vocab_size: int,
        max_length: int,
        num_layers: int,
        d_model: int,
        num_heads: int,
        d_ff: int,
        dropout: float = 0.1,
        pad_token_id: int = 0,
    ):
        super().__init__()

        self.pad_token_id = pad_token_id

        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(max_length, d_model)

        self.dropout = nn.Dropout(dropout)

        self.encoder = TransformerEncoder(
            num_layers=num_layers,
            d_model=d_model,
            num_heads=num_heads,
            d_ff=d_ff,
            dropout=dropout,
        )

    def forward(self, token_ids):
        # token_ids: [B, T]
        B, T = token_ids.shape

        positions = torch.arange(T, device=token_ids.device)
        positions = positions.unsqueeze(0).expand(B, T)

        x = self.token_embedding(token_ids)
        p = self.position_embedding(positions)

        x = self.dropout(x + p)

        key_padding_mask = token_ids.eq(self.pad_token_id)

        return self.encoder(x, key_padding_mask=key_padding_mask)
```

This module returns contextual token representations:

```python
model = TextTransformerEncoder(
    vocab_size=30_000,
    max_length=512,
    num_layers=6,
    d_model=256,
    num_heads=8,
    d_ff=1024,
)

token_ids = torch.randint(0, 30_000, (4, 128))
hidden = model(token_ids)

print(hidden.shape)  # torch.Size([4, 128, 256])
```

### Encoder Outputs

The encoder output has one vector per input token:

$$
H \in \mathbb{R}^{B \times T \times D}.
$$

Different tasks use this output differently.

For token classification, such as named entity recognition, we classify each token vector:

```python
logits = classifier(hidden)
```

The output shape is:

```text
[B, T, num_labels]
```
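For a concrete sketch, `classifier` can be a single linear layer applied over the hidden dimension (the label count and dimensions here are illustrative):

```python
import torch
from torch import nn

num_labels = 9  # hypothetical tag set size, e.g. BIO tags for NER
classifier = nn.Linear(256, num_labels)

hidden = torch.randn(4, 128, 256)  # encoder output [B, T, D]
logits = classifier(hidden)        # applied independently per token

print(logits.shape)  # torch.Size([4, 128, 9])
```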

For sequence classification, we need one vector for the whole sequence. Common choices include using a special class token, mean pooling, or attention pooling.

With a class token (this assumes a special classification token, such as `[CLS]`, was prepended at position 0 during tokenization):

```python
cls_repr = hidden[:, 0, :]
logits = classifier(cls_repr)
```

With mean pooling over non-padding tokens:

```python
mask = token_ids.ne(pad_token_id).float()
mask = mask.unsqueeze(-1)

pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp_min(1.0)
logits = classifier(pooled)
```
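Attention pooling, the third option mentioned above, can be sketched as a small module that learns one scalar score per token and takes a softmax over the non-padding positions (an illustrative implementation, not a standard library API):

```python
import torch
from torch import nn

class AttentionPooling(nn.Module):
    """Learned pooling: score each token, softmax over non-padding tokens."""

    def __init__(self, d_model: int):
        super().__init__()
        self.score = nn.Linear(d_model, 1)

    def forward(self, hidden, key_padding_mask=None):
        # hidden: [B, T, D]; key_padding_mask: [B, T], True = padding
        scores = self.score(hidden).squeeze(-1)  # [B, T]
        if key_padding_mask is not None:
            scores = scores.masked_fill(key_padding_mask, float("-inf"))
        weights = scores.softmax(dim=-1).unsqueeze(-1)  # [B, T, 1]
        return (hidden * weights).sum(dim=1)  # [B, D]

pool = AttentionPooling(256)
hidden = torch.randn(4, 128, 256)
mask = torch.zeros(4, 128, dtype=torch.bool)

pooled = pool(hidden, key_padding_mask=mask)
print(pooled.shape)  # torch.Size([4, 256])
```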

### Encoder Models

Encoder-only transformers are common when the task requires understanding an input rather than generating text token by token.

Examples include:

| Model family | Main use |
|---|---|
| BERT-style encoders | Text understanding |
| RoBERTa-style encoders | Stronger masked-language pretraining |
| Sentence transformers | Embeddings and retrieval |
| Vision transformers | Image classification and visual representation |
| Audio encoders | Speech and audio representation |
| Multimodal encoders | Joint image-text or audio-text representation |

These models differ in data, pretraining objective, tokenizer, positional method, architecture size, and downstream task.

### Encoder Versus Decoder

An encoder sees the full input. A decoder generates outputs autoregressively.

| Property | Encoder | Decoder |
|---|---|---|
| Attention mask | Usually bidirectional | Usually causal |
| Main task | Understand input | Generate output |
| Common output | Contextual representations | Next-token logits |
| Example use | Classification, retrieval, tagging | Language modeling, text generation |

Encoder-decoder models combine both. The encoder processes the source sequence. The decoder generates the target sequence while attending to encoder outputs.
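The masking difference is concrete: a decoder typically builds a causal mask that blocks attention to future positions, while an encoder simply omits such a mask. A sketch of the causal mask construction:

```python
import torch

T = 5

# causal mask used by decoders: True marks positions that may NOT be attended to
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
print(causal_mask)
# tensor([[False,  True,  True,  True,  True],
#         [False, False,  True,  True,  True],
#         [False, False, False,  True,  True],
#         [False, False, False, False,  True],
#         [False, False, False, False, False]])

# an encoder passes no such mask, so every position can attend everywhere
```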

### Summary

A transformer encoder maps an input sequence to contextual representations. Each layer contains bidirectional multi-head self-attention, a position-wise feedforward network, residual connections, and layer normalization.

Encoder models are well suited to tasks where the full input is available: classification, tagging, retrieval, ranking, image recognition, and representation learning.

In PyTorch, a basic encoder can be built from `nn.MultiheadAttention`, `nn.LayerNorm`, feedforward layers, residual connections, and padding masks.

