# Transformer Encoders

A transformer encoder is a neural network block that maps a sequence of input vectors to a sequence of contextualized output vectors. It is used when the whole input sequence is available at once and each position may attend to every other position.

Transformer encoders are common in text understanding, image understanding, speech representation learning, retrieval, classification, tagging, and multimodal systems. BERT-style language models, Vision Transformers, and many embedding models are based on encoder architectures.

### The Encoder Problem

Suppose we have an input sequence of length $T$. Each element of the sequence is represented as a vector of dimension $D$. We write the input as

$$
X \in \mathbb{R}^{B \times T \times D},
$$

where $B$ is the batch size, $T$ is the sequence length, and $D$ is the model dimension.

For a sentence, each position may correspond to one token. For an image, each position may correspond to one patch. For audio, each position may correspond to one frame or segment.

The goal of the encoder is to produce another sequence

$$
H \in \mathbb{R}^{B \times T \times D}.
$$

The output has the same sequence length as the input, but each output vector contains information from the whole sequence.

For example, in the sentence

$$
\text{The bank approved the loan}
$$

the representation of the word “bank” should depend on nearby and distant words. The word “loan” helps determine that “bank” refers to a financial institution. A transformer encoder computes such contextual representations by allowing each token to attend to other tokens.

### Encoder Input Representation

A transformer encoder does not process raw tokens directly. Tokens are first mapped to vectors by an embedding layer.

If the vocabulary size is $V$, the embedding table is

$$
E \in \mathbb{R}^{V \times D}.
$$

A token ID $i$ selects row $E_i$, which becomes the input vector for that token.

For a batch of token IDs

$$
\text{tokens} \in \mathbb{N}^{B \times T},
$$

the embedding layer produces

$$
X_{\text{tok}} \in \mathbb{R}^{B \times T \times D}.
$$

In PyTorch:

```python
import torch
from torch import nn

B, T, V, D = 4, 16, 30_000, 768

tokens = torch.randint(0, V, (B, T))
embedding = nn.Embedding(V, D)

x_tok = embedding(tokens)
print(x_tok.shape)  # torch.Size([4, 16, 768])
```

Token embeddings alone do not tell the model where a token appears in the sequence. A transformer encoder therefore adds positional information.

A common form is learned positional embedding:

$$
X = X_{\text{tok}} + X_{\text{pos}},
$$

where

$$
X_{\text{pos}} \in \mathbb{R}^{1 \times T \times D}.
$$

In PyTorch:

```python
pos_embedding = nn.Embedding(T, D)

positions = torch.arange(T)
x_pos = pos_embedding(positions)[None, :, :]

x = x_tok + x_pos
print(x.shape)  # torch.Size([4, 16, 768])
```

The same positional vectors are broadcast across the batch.

### Encoder Layer Structure

A standard transformer encoder is built by stacking several encoder layers. Each layer contains two main sublayers:

1. Multi-head self-attention.
2. Feedforward network.

Each sublayer is wrapped with a residual connection and normalization.

The common pre-normalization encoder layer has the form

$$
Y = X + \text{SelfAttention}(\text{LayerNorm}(X)),
$$

$$
H = Y + \text{FeedForward}(\text{LayerNorm}(Y)).
$$

The output $H$ becomes the input to the next encoder layer.

The residual connections preserve information and improve gradient flow. Layer normalization stabilizes the scale of activations. Self-attention mixes information across sequence positions. The feedforward network applies a nonlinear transformation independently at each position.

### Self-Attention in the Encoder

Self-attention is the central operation in a transformer encoder. Each position creates three vectors: a query, a key, and a value.

Given input

$$
X \in \mathbb{R}^{B \times T \times D},
$$

we compute

$$
Q = XW_Q,\quad K = XW_K,\quad V = XW_V,
$$

where

$$
W_Q,W_K,W_V \in \mathbb{R}^{D \times d_k}.
$$

The attention scores are computed by comparing queries and keys:

$$
S = \frac{QK^\top}{\sqrt{d_k}}.
$$

The softmax converts scores into attention weights:

$$
A = \text{softmax}(S).
$$

The output is a weighted sum of values:

$$
O = AV.
$$

The complete scaled dot-product attention operation is

$$
\text{Attention}(Q,K,V) =
\text{softmax}
\left(
\frac{QK^\top}{\sqrt{d_k}}
\right)V.
$$

In an encoder, self-attention means that $Q$, $K$, and $V$ all come from the same sequence. Every token can compare itself with every other token.
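
A minimal single-head sketch of this computation, with random matrices standing in for the learned projections:

```python
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k, v: [B, T, d_k]
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k**0.5   # [B, T, T]
    weights = torch.softmax(scores, dim=-1)       # each row sums to 1
    return weights @ v                            # [B, T, d_k]

B, T, D, d_k = 4, 16, 768, 64
x = torch.randn(B, T, D)

# Random placeholders for the learned projection matrices W_Q, W_K, W_V.
w_q = torch.randn(D, d_k) / D**0.5
w_k = torch.randn(D, d_k) / D**0.5
w_v = torch.randn(D, d_k) / D**0.5

out = scaled_dot_product_attention(x @ w_q, x @ w_k, x @ w_v)
print(out.shape)  # torch.Size([4, 16, 64])
```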

### Bidirectional Context

A transformer encoder uses bidirectional attention. This means that token $t$ can attend to tokens before and after it.

For a sequence

$$
x_1,x_2,\ldots,x_T,
$$

the output representation at position $t$ may depend on

$$
x_1,\ldots,x_{t-1},x_t,x_{t+1},\ldots,x_T.
$$

This is different from an autoregressive transformer decoder, where position $t$ may only attend to positions $1,\ldots,t$.

Encoder attention is therefore well suited to understanding tasks. The model can use full context when computing each representation.
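
A quick way to see bidirectionality is to change a late token and observe that an early output changes too. A small sketch using PyTorch's built-in encoder layer (dropout is disabled via `eval()` so the comparison is deterministic):

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True)
layer.eval()  # disable dropout so outputs are deterministic

x = torch.randn(1, 6, 32)
y = x.clone()
y[0, 5] += 1.0  # perturb only the last position

with torch.no_grad():
    out_x, out_y = layer(x), layer(y)

# The output at position 0 changes: it attends to the (future) last position.
print(torch.allclose(out_x[0, 0], out_y[0, 0]))  # False
```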

Examples include:

| Task | Why an encoder fits |
|---|---|
| Text classification | The whole text is available before prediction |
| Named entity recognition | Each token should use left and right context |
| Sentence embedding | The model summarizes a complete input |
| Image classification | All image patches are visible together |
| Retrieval | The full query or document is encoded into a representation |

### Multi-Head Attention

Single-head attention computes one attention pattern. Multi-head attention computes several attention patterns in parallel.

The model dimension $D$ is split across $h$ heads, so each head usually has dimension

$$
d_h = \frac{D}{h}.
$$

Each head has its own query, key, and value projections:

$$
Q_i = XW_{Q_i},\quad K_i = XW_{K_i},\quad V_i = XW_{V_i}.
$$

Each head computes attention separately:

$$
O_i = \text{Attention}(Q_i,K_i,V_i).
$$

The outputs are concatenated and projected:

$$
O = \text{Concat}(O_1,\ldots,O_h)W_O.
$$

Multi-head attention allows the encoder to represent multiple relationships at once. One head may focus on local syntax. Another may focus on long-range dependencies. Another may focus on delimiter tokens, image regions, or semantic similarity. These interpretations are approximate, but they describe why multiple heads are useful.
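
A compact sketch of the head split, per-head attention, and output projection (real implementations often fuse the three projections and use `nn.MultiheadAttention` or `F.scaled_dot_product_attention` instead):

```python
import torch
from torch import nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, D = x.shape

        # [B, T, D] -> [B, n_heads, T, d_head]
        def split(t):
            return t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))

        scores = q @ k.transpose(-2, -1) / self.d_head**0.5  # [B, h, T, T]
        weights = torch.softmax(scores, dim=-1)
        out = weights @ v                                     # [B, h, T, d_head]

        # Concatenate heads and apply the output projection W_O.
        out = out.transpose(1, 2).reshape(B, T, D)
        return self.w_o(out)

mha = MultiHeadSelfAttention(d_model=768, n_heads=12)
x = torch.randn(4, 16, 768)
print(mha(x).shape)  # torch.Size([4, 16, 768])
```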

### Feedforward Network

After self-attention, each position passes through the same feedforward network. This network does not mix positions. It transforms each token representation independently.

A standard transformer feedforward network is

$$
\text{FFN}(x) = \phi(xW_1 + b_1)W_2 + b_2,
$$

where $\phi$ is a nonlinear activation function such as ReLU, GELU, or SiLU.

Usually,

$$
W_1 \in \mathbb{R}^{D \times D_{\text{ff}}},
\quad
W_2 \in \mathbb{R}^{D_{\text{ff}} \times D}.
$$

The hidden dimension $D_{\text{ff}}$ is often larger than $D$. A common choice is

$$
D_{\text{ff}} = 4D.
$$

The feedforward network increases the expressive power of the encoder. Self-attention mixes information across tokens. The feedforward network transforms the mixed information at each token position.
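
A short sketch, using GELU as one common choice of $\phi$ and $D_{\text{ff}} = 4D$:

```python
import torch
from torch import nn

D, D_ff = 768, 3072  # D_ff = 4 * D

ffn = nn.Sequential(
    nn.Linear(D, D_ff),   # W_1, b_1
    nn.GELU(),            # phi
    nn.Linear(D_ff, D),   # W_2, b_2
)

x = torch.randn(4, 16, D)   # [B, T, D]
out = ffn(x)                # applied independently at each of the B*T positions
print(out.shape)            # torch.Size([4, 16, 768])
```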

### Residual Connections

Residual connections add the input of a sublayer to its output. If a sublayer is $F$, the residual form is

$$
Y = X + F(X).
$$

Residual connections make optimization easier. They allow information and gradients to pass through many layers with less distortion.

In a transformer encoder, residual connections are used around both self-attention and the feedforward network:

$$
Y = X + \text{SelfAttention}(\cdot),
$$

$$
H = Y + \text{FeedForward}(\cdot).
$$

Without residual connections, deep transformer stacks are much harder to train.

### Layer Normalization

Layer normalization normalizes activations across the feature dimension. For a vector $x\in\mathbb{R}^D$, layer normalization computes

$$
\mu = \frac{1}{D}\sum_{i=1}^{D}x_i,
$$

$$
\sigma^2 = \frac{1}{D}\sum_{i=1}^{D}(x_i-\mu)^2,
$$

$$
\text{LayerNorm}(x)_i =
\gamma_i
\frac{x_i-\mu}{\sqrt{\sigma^2+\epsilon}}
+
\beta_i.
$$

The parameters $\gamma$ and $\beta$ are learned vectors. The small constant $\epsilon$ prevents division by zero.
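
A quick check that this matches `nn.LayerNorm` at initialization, where $\gamma = 1$ and $\beta = 0$:

```python
import torch
from torch import nn

D = 8
x = torch.randn(4, 16, D)

# Manual computation over the feature dimension.
mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
eps = 1e-5
manual = (x - mu) / torch.sqrt(var + eps)  # gamma = 1, beta = 0

ln = nn.LayerNorm(D)  # gamma initialized to 1, beta to 0
print(torch.allclose(ln(x), manual, atol=1e-5))  # True
```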

Layer normalization is preferred over batch normalization in transformers because sequence lengths and batch structures vary, and the computation should remain stable even with small batch sizes.

### Pre-Norm and Post-Norm Encoders

There are two common ways to place layer normalization.

In a post-norm transformer layer:

$$
Y = \text{LayerNorm}(X + \text{SelfAttention}(X)),
$$

$$
H = \text{LayerNorm}(Y + \text{FeedForward}(Y)).
$$

In a pre-norm transformer layer:

$$
Y = X + \text{SelfAttention}(\text{LayerNorm}(X)),
$$

$$
H = Y + \text{FeedForward}(\text{LayerNorm}(Y)).
$$

Post-norm was used in the original transformer architecture. Pre-norm is common in modern deep transformers because it tends to improve training stability.

The difference matters most when the number of layers is large. Pre-norm gives gradients a cleaner path through residual connections.
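
Schematically, the two orderings differ only in where normalization sits relative to the residual addition; here `attn`, `ffn`, `norm1`, and `norm2` stand in for the sublayers:

```python
# Post-norm ordering (original transformer):
def post_norm_layer(x, attn, ffn, norm1, norm2):
    x = norm1(x + attn(x))
    x = norm2(x + ffn(x))
    return x

# Pre-norm ordering (common in modern deep stacks):
def pre_norm_layer(x, attn, ffn, norm1, norm2):
    x = x + attn(norm1(x))
    x = x + ffn(norm2(x))
    return x
```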

### Attention Masks

Encoder attention often uses masks. A mask controls which positions can be attended to.

The most common encoder mask is a padding mask. In a batch, sequences may have different lengths. Shorter sequences are padded so all examples have the same length. The model should ignore padding positions.

Suppose a batch contains token IDs:

```python
tokens = torch.tensor([
    [101, 2009, 2003, 2204, 102],
    [101, 7592, 102,    0,   0],
])
```

Here `0` may represent padding. The attention mask is

```python
attention_mask = (tokens != 0)
print(attention_mask)
```

which gives

```python
tensor([
    [ True,  True,  True,  True,  True],
    [ True,  True,  True, False, False]
])
```

The mask prevents real tokens from attending to padding tokens. Outputs are still computed at padding positions, but they are ignored downstream, for example when pooling or computing the loss.

Unlike decoder self-attention, encoder self-attention usually does not use a causal mask. The encoder is allowed to see both past and future positions.
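
Inside the attention computation, the padding mask is usually applied by setting the scores of padded key positions to $-\infty$ before the softmax, so their weights become zero. A sketch using PyTorch's `key_padding_mask` convention, where `True` means "ignore" (the opposite of the `attention_mask` above):

```python
import torch

B, T, d_k = 2, 5, 8
q = torch.randn(B, T, d_k)
k = torch.randn(B, T, d_k)

scores = q @ k.transpose(-2, -1) / d_k**0.5  # [B, T, T]

# True where the key position is padding.
key_padding_mask = torch.tensor([
    [False, False, False, False, False],
    [False, False, False, True,  True],
])

# Padded keys get -inf, so their attention weight is 0 after the softmax.
scores = scores.masked_fill(key_padding_mask[:, None, :], float("-inf"))
weights = torch.softmax(scores, dim=-1)
print(weights[1, 0])  # the last two weights are 0
```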

### A PyTorch Encoder Layer

PyTorch provides `nn.TransformerEncoderLayer`, but implementing a small encoder layer helps clarify the structure.

```python
import torch
from torch import nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()

        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(
            embed_dim=d_model,
            num_heads=n_heads,
            dropout=dropout,
            batch_first=True,
        )
        self.drop1 = nn.Dropout(dropout)

        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )
        self.drop2 = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, key_padding_mask: torch.Tensor | None = None):
        # x: [B, T, D]
        # key_padding_mask: [B, T], True means "ignore this token"

        y = self.norm1(x)

        attn_out, _ = self.attn(
            y, y, y,
            key_padding_mask=key_padding_mask,
            need_weights=False,
        )

        x = x + self.drop1(attn_out)

        y = self.norm2(x)
        ffn_out = self.ffn(y)

        x = x + self.drop2(ffn_out)
        return x
```

This is a pre-norm encoder layer. It preserves shape:

```python
B, T, D = 8, 32, 256

layer = EncoderLayer(d_model=D, n_heads=8, d_ff=1024)
x = torch.randn(B, T, D)

out = layer(x)
print(out.shape)  # torch.Size([8, 32, 256])
```

The input and output shapes match. This makes it easy to stack layers.

### Stacking Encoder Layers

A transformer encoder usually stacks $L$ layers with identical structure, each with its own parameters.

$$
H^{(0)} = X,
$$

$$
H^{(\ell)} = \text{EncoderLayer}^{(\ell)}(H^{(\ell-1)}),
\quad \ell=1,\ldots,L.
$$

The final output is

$$
H = H^{(L)}.
$$

In PyTorch:

```python
class TransformerEncoder(nn.Module):
    def __init__(
        self,
        vocab_size: int,
        max_len: int,
        d_model: int,
        n_heads: int,
        d_ff: int,
        n_layers: int,
        dropout: float = 0.1,
        pad_id: int = 0,
    ):
        super().__init__()

        self.pad_id = pad_id
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)

        self.layers = nn.ModuleList([
            EncoderLayer(d_model, n_heads, d_ff, dropout)
            for _ in range(n_layers)
        ])

        self.norm = nn.LayerNorm(d_model)

    def forward(self, tokens: torch.Tensor):
        # tokens: [B, T]
        B, T = tokens.shape

        positions = torch.arange(T, device=tokens.device)
        positions = positions.unsqueeze(0).expand(B, T)

        x = self.token_emb(tokens) + self.pos_emb(positions)

        key_padding_mask = tokens.eq(self.pad_id)

        for layer in self.layers:
            x = layer(x, key_padding_mask=key_padding_mask)

        return self.norm(x)
```

Example:

```python
model = TransformerEncoder(
    vocab_size=30_000,
    max_len=512,
    d_model=256,
    n_heads=8,
    d_ff=1024,
    n_layers=6,
)

tokens = torch.randint(0, 30_000, (4, 128))
out = model(tokens)

print(out.shape)  # torch.Size([4, 128, 256])
```

The encoder returns one vector for each token position.

### Using Encoder Outputs

The output of an encoder can be used in different ways.

For token-level tasks, each output vector is used directly. Named entity recognition, part-of-speech tagging, and token classification use this pattern.

If

$$
H\in\mathbb{R}^{B\times T\times D},
$$

a token classifier maps each position to class logits:

$$
Z = HW_{\text{cls}} + b_{\text{cls}},
$$

where

$$
Z\in\mathbb{R}^{B\times T\times K}.
$$
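
A per-token classifier is a single linear layer applied at every position; the number of classes $K = 9$ here is a hypothetical tag-set size:

```python
import torch
from torch import nn

B, T, D, K = 4, 128, 256, 9      # K: hypothetical number of tag classes

hidden = torch.randn(B, T, D)    # encoder output H
classifier = nn.Linear(D, K)     # W_cls and b_cls

logits = classifier(hidden)      # [B, T, K], one prediction per position
print(logits.shape)  # torch.Size([4, 128, 9])
```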

For sequence-level tasks, the model needs one vector for the whole sequence. Common choices include:

| Method | Description |
|---|---|
| First-token pooling | Use the output at position 0 |
| Mean pooling | Average output vectors over valid tokens |
| Max pooling | Take the maximum over positions |
| Attention pooling | Learn a weighted average over positions |

BERT-style models often prepend a special classification token, written `[CLS]`, at position 0. The final representation of this token is passed to a classifier.

Mean pooling is common for sentence embedding models:

```python
def mean_pool(hidden, attention_mask):
    # hidden: [B, T, D]
    # attention_mask: [B, T], True for valid tokens

    mask = attention_mask.unsqueeze(-1).float()
    summed = (hidden * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)

    return summed / counts
```

### Encoder Complexity

The main computational cost of a transformer encoder comes from self-attention.

For sequence length $T$, attention constructs a $T \times T$ score matrix for each example and head. Therefore the attention cost grows quadratically with sequence length:

$$
O(T^2D).
$$

The feedforward network cost is roughly

$$
O(TDD_{\text{ff}}).
$$

For short and moderate sequences, feedforward layers may dominate compute. For very long sequences, attention memory and compute become major bottlenecks.
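
A back-of-the-envelope comparison (ignoring batch size, heads, and constant factors) shows how the balance shifts with sequence length:

```python
# Rough per-example cost estimates based on the proportionalities above,
# not exact FLOP counts.
D, D_ff = 768, 3072

for T in (128, 512, 2048, 8192):
    attn_cost = T * T * D        # ~ O(T^2 D): score matrix and weighted sum
    ffn_cost = 2 * T * D * D_ff  # ~ O(T D D_ff): two linear layers
    print(f"T={T:5d}  attention/ffn cost ratio ~ {attn_cost / ffn_cost:.2f}")
```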

This quadratic cost motivates efficient transformer variants, including sparse attention, linear attention, sliding-window attention, low-rank attention, and state-space alternatives.

### Encoder Versus Decoder

Transformer encoders and decoders share many components, but they serve different purposes.

| Component | Encoder | Decoder |
|---|---|---|
| Self-attention | Bidirectional | Causal |
| Future tokens visible | Yes | No |
| Main use | Understanding and representation | Generation |
| Typical outputs | Contextual embeddings | Next-token logits |
| Example models | BERT, ViT | GPT-style models |

An encoder is best when the full input is available before prediction. A decoder is best when the model must generate outputs step by step.
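
In code, the structural difference in self-attention is whether a causal mask is applied before the softmax. A sketch of the decoder-style mask that an encoder omits:

```python
import torch

T = 5
# Decoder-style causal mask: True above the diagonal means "cannot attend".
causal_mask = torch.triu(torch.ones(T, T), diagonal=1).bool()
print(causal_mask)

# An encoder simply omits this mask, so all T x T position pairs are visible.
```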

Encoder-decoder transformers combine both. The encoder reads the source sequence. The decoder generates the target sequence while attending to encoder outputs.

### Vision Transformer Encoders

A Vision Transformer uses an encoder over image patches.

An image is split into patches. Each patch is flattened and projected into a vector. The patch sequence is then passed into a transformer encoder.

For an image batch

$$
X\in\mathbb{R}^{B\times C\times H\times W},
$$

with patch size $P$, the number of patches is

$$
T = \frac{H}{P}\cdot\frac{W}{P}.
$$

Each patch becomes one token. The encoder processes the patch sequence just as it would process a text sequence.
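
One common way to build the patch sequence is a strided convolution with kernel size and stride equal to $P$; a sketch with ViT-Base-like dimensions:

```python
import torch
from torch import nn

B, C, H, W, P, D = 4, 3, 224, 224, 16, 768

# kernel_size = stride = P extracts and projects non-overlapping patches.
patch_embed = nn.Conv2d(C, D, kernel_size=P, stride=P)

images = torch.randn(B, C, H, W)
patches = patch_embed(images)                 # [B, D, H/P, W/P]
tokens = patches.flatten(2).transpose(1, 2)   # [B, T, D] with T = (H/P)*(W/P)

print(tokens.shape)  # torch.Size([4, 196, 768])
```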

This shows that transformer encoders are general sequence processors. The sequence elements may be words, image patches, audio frames, graph nodes, or retrieved documents.

### Practical Design Choices

Important encoder hyperparameters include:

| Hyperparameter | Meaning |
|---|---|
| $D$ | Model dimension |
| $L$ | Number of encoder layers |
| $h$ | Number of attention heads |
| $D_{\text{ff}}$ | Feedforward hidden dimension |
| $T$ | Maximum sequence length |
| Dropout rate | Regularization strength |
| Positional encoding type | How order information is represented |

Common constraints:

The model dimension $D$ should usually be divisible by the number of heads $h$. Larger $D$ increases representation capacity. Larger $L$ adds more attention and feedforward blocks. Larger $T$ increases memory use sharply because attention scales with $T^2$.

A small encoder may use:

$$
D=256,\quad L=6,\quad h=8,\quad D_{\text{ff}}=1024.
$$

A base-size encoder may use:

$$
D=768,\quad L=12,\quad h=12,\quad D_{\text{ff}}=3072.
$$

A larger encoder increases accuracy on many tasks but also increases training cost, inference latency, and memory use.
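
As a rough check of scale, counting only the main weight matrices of the base configuration (the vocabulary size and maximum length here are assumptions):

```python
# Approximate parameter count; biases and LayerNorm parameters are omitted.
V, T_max = 30_000, 512   # assumed vocabulary size and maximum sequence length
D, L, D_ff = 768, 12, 3072

attn_per_layer = 4 * D * D      # W_Q, W_K, W_V, W_O
ffn_per_layer = 2 * D * D_ff    # W_1, W_2
encoder = L * (attn_per_layer + ffn_per_layer)
embeddings = V * D + T_max * D

print(f"encoder ~ {encoder / 1e6:.0f}M, embeddings ~ {embeddings / 1e6:.0f}M")
# encoder ~ 85M, embeddings ~ 23M
```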

### Summary

A transformer encoder maps an input sequence to a contextualized output sequence. It uses bidirectional self-attention to mix information across positions and feedforward networks to transform each position independently.

The core encoder layer consists of multi-head self-attention, a feedforward network, residual connections, layer normalization, dropout, and masking. Stacking many such layers gives a deep sequence representation model.

Encoders are suited to understanding tasks because each output position can use the full input context. They are used in text encoders, image encoders, embedding models, retrieval systems, speech models, and multimodal representation learners.

