Transformer Encoders

A transformer encoder is a neural network block that maps a sequence of input vectors to a sequence of contextualized output vectors. It is used when the whole input sequence is available at once and each position may attend to every other position.

Transformer encoders are common in text understanding, image understanding, speech representation learning, retrieval, classification, tagging, and multimodal systems. BERT-style language models, Vision Transformers, and many embedding models are based on encoder architectures.

The Encoder Problem

Suppose we have an input sequence of length T. Each element of the sequence is represented as a vector of dimension D. We write the input as

X \in \mathbb{R}^{B \times T \times D},

where B is the batch size, T is the sequence length, and D is the model dimension.

For a sentence, each position may correspond to one token. For an image, each position may correspond to one patch. For audio, each position may correspond to one frame or segment.

The goal of the encoder is to produce another sequence

H \in \mathbb{R}^{B \times T \times D}.

The output has the same sequence length as the input, but each output vector contains information from the whole sequence.

For example, in the sentence

The bank approved the loan

the representation of the word “bank” should depend on nearby and distant words. The word “loan” helps determine that “bank” refers to a financial institution. A transformer encoder computes such contextual representations by allowing each token to attend to other tokens.

Encoder Input Representation

A transformer encoder does not process raw tokens directly. Tokens are first mapped to vectors by an embedding layer.

If the vocabulary size is V, the embedding table is

E \in \mathbb{R}^{V \times D}.

A token ID i selects row E_i, which becomes the input vector for that token.

For a batch of token IDs

\text{tokens} \in \mathbb{N}^{B \times T},

the embedding layer produces

X_{\text{tok}} \in \mathbb{R}^{B \times T \times D}.

In PyTorch:

import torch
from torch import nn

B, T, V, D = 4, 16, 30_000, 768

tokens = torch.randint(0, V, (B, T))
embedding = nn.Embedding(V, D)

x_tok = embedding(tokens)
print(x_tok.shape)  # torch.Size([4, 16, 768])

Token embeddings alone do not tell the model where a token appears in the sequence. A transformer encoder therefore adds positional information.

A common form is learned positional embedding:

X = X_{\text{tok}} + X_{\text{pos}},

where

X_{\text{pos}} \in \mathbb{R}^{1 \times T \times D}.

In PyTorch:

pos_embedding = nn.Embedding(T, D)

positions = torch.arange(T)
x_pos = pos_embedding(positions)[None, :, :]

x = x_tok + x_pos
print(x.shape)  # torch.Size([4, 16, 768])

The same positional vectors are broadcast across the batch.

Encoder Layer Structure

A standard transformer encoder is built by stacking several encoder layers. Each layer contains two main sublayers:

  1. Multi-head self-attention.
  2. Feedforward network.

Each sublayer is wrapped with a residual connection and normalization.

The common pre-normalization encoder layer has the form

Y = X + \text{SelfAttention}(\text{LayerNorm}(X)),
H = Y + \text{FeedForward}(\text{LayerNorm}(Y)).

The output H becomes the input to the next encoder layer.

The residual connections preserve information and improve gradient flow. Layer normalization stabilizes the scale of activations. Self-attention mixes information across sequence positions. The feedforward network applies a nonlinear transformation independently at each position.

Self-Attention in the Encoder

Self-attention is the central operation in a transformer encoder. Each position creates three vectors: a query, a key, and a value.

Given input

X \in \mathbb{R}^{B \times T \times D},

we compute

Q = XW_Q,\quad K = XW_K,\quad V = XW_V,

where

W_Q, W_K, W_V \in \mathbb{R}^{D \times d_k}.

The attention scores are computed by comparing queries and keys:

S = \frac{QK^\top}{\sqrt{d_k}}.

The softmax converts scores into attention weights:

A = \text{softmax}(S).

The output is a weighted sum of values:

O = AV.

The complete scaled dot-product attention operation is

\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V.

In an encoder, self-attention means that Q, K, and V all come from the same sequence. Every token can compare itself with every other token.
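These steps can be reproduced with a few tensor operations. The sketch below uses randomly initialized projection matrices purely for illustration, standing in for the learned weights:

```python
import torch
import torch.nn.functional as F

B, T, D, d_k = 2, 5, 8, 8

x = torch.randn(B, T, D)

# Random matrices stand in for the learned W_Q, W_K, W_V.
W_q = torch.randn(D, d_k)
W_k = torch.randn(D, d_k)
W_v = torch.randn(D, d_k)

Q, K, V = x @ W_q, x @ W_k, x @ W_v            # each [B, T, d_k]

scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # [B, T, T]
weights = F.softmax(scores, dim=-1)            # each row sums to 1
out = weights @ V                              # [B, T, d_k]

print(out.shape)  # torch.Size([2, 5, 8])
```

Each row of `weights` is a probability distribution over the T positions, so the output at each position is a convex combination of value vectors.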

Bidirectional Context

A transformer encoder uses bidirectional attention. This means that token t can attend to tokens before and after it.

For a sequence

x_1, x_2, \ldots, x_T,

the output representation at position t may depend on

x_1, \ldots, x_{t-1},\ x_t,\ x_{t+1}, \ldots, x_T.

This is different from an autoregressive transformer decoder, where position t may only attend to positions 1, \ldots, t.
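The difference is easy to state in mask form. The snippet below builds the causal mask a decoder would apply to its attention scores; an encoder simply omits it:

```python
import torch

T = 5

# A decoder's causal mask: position t may attend only to positions <= t.
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))

# An encoder uses no such mask: every position may attend everywhere.
bidirectional = torch.ones(T, T, dtype=torch.bool)

print(causal.int())
```

Row t of the causal mask is True only up to column t, while the encoder's implicit mask is all True.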

Encoder attention is therefore well suited to understanding tasks. The model can use full context when computing each representation.

Examples include:

| Task | Why an encoder fits |
| --- | --- |
| Text classification | The whole text is available before prediction |
| Named entity recognition | Each token should use left and right context |
| Sentence embedding | The model summarizes a complete input |
| Image classification | All image patches are visible together |
| Retrieval | The full query or document is encoded into a representation |

Multi-Head Attention

Single-head attention computes one attention pattern. Multi-head attention computes several attention patterns in parallel.

The model dimension D is split across the attention heads. With h heads, each head usually has dimension

d_h = \frac{D}{h}.

Each head has its own query, key, and value projections:

Q_i = XW_{Q_i},\quad K_i = XW_{K_i},\quad V_i = XW_{V_i}.

Each head computes attention separately:

O_i = \text{Attention}(Q_i, K_i, V_i).

The outputs are concatenated and projected:

O = \text{Concat}(O_1, \ldots, O_h)W_O.

Multi-head attention allows the encoder to represent multiple relationships at once. One head may focus on local syntax. Another may focus on long-range dependencies. Another may focus on delimiter tokens, image regions, or semantic similarity. These interpretations are approximate, but they describe why multiple heads are useful.
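In practice the head split is just a reshape of the projected tensors. The sketch below uses random, untrained projection matrices and a minimal `split_heads` helper (names chosen here for illustration):

```python
import torch
import torch.nn.functional as F

B, T, D, h = 2, 6, 16, 4
d_h = D // h  # per-head dimension

x = torch.randn(B, T, D)

# One combined projection per role; heads are then just a reshape.
W_q, W_k, W_v, W_o = (torch.randn(D, D) for _ in range(4))

def split_heads(t):
    return t.view(B, T, h, d_h).transpose(1, 2)   # [B, h, T, d_h]

Q, K, V = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)

scores = Q @ K.transpose(-2, -1) / d_h ** 0.5     # [B, h, T, T]
heads = F.softmax(scores, dim=-1) @ V             # [B, h, T, d_h]

# Concatenate the heads back to dimension D, then project.
out = heads.transpose(1, 2).reshape(B, T, D) @ W_o

print(out.shape)  # torch.Size([2, 6, 16])
```

Each head attends with its own [T, T] pattern, and the final projection mixes the heads back together.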

Feedforward Network

After self-attention, each position passes through the same feedforward network. This network does not mix positions. It transforms each token representation independently.

A standard transformer feedforward network is

\text{FFN}(x) = \phi(xW_1 + b_1)W_2 + b_2,

where \phi is a nonlinear activation function such as ReLU, GELU, or SiLU.

Usually,

W_1 \in \mathbb{R}^{D \times D_{\text{ff}}}, \quad W_2 \in \mathbb{R}^{D_{\text{ff}} \times D}.

The hidden dimension D_{\text{ff}} is often larger than D. A common choice is

D_{\text{ff}} = 4D.

The feedforward network increases the expressive power of the encoder. Self-attention mixes information across tokens. The feedforward network transforms the mixed information at each token position.
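The position independence is easy to verify: permuting the input positions permutes the outputs in exactly the same way. A minimal sketch:

```python
import torch
from torch import nn

D, D_ff = 64, 256   # hidden dimension 4x the model dimension

ffn = nn.Sequential(
    nn.Linear(D, D_ff),
    nn.GELU(),
    nn.Linear(D_ff, D),
)

x = torch.randn(2, 10, D)
out = ffn(x)   # the same network applied at each of the 10 positions

# Because the FFN does not mix positions, permuting positions
# permutes the outputs identically.
perm = torch.randperm(10)
assert torch.allclose(ffn(x[:, perm]), out[:, perm], atol=1e-5)
```

No such permutation equivariance holds for self-attention, which is exactly why positional information must be added to the input.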

Residual Connections

Residual connections add the input of a sublayer to its output. If a sublayer is F, the residual form is

Y = X + F(X).

Residual connections make optimization easier. They allow information and gradients to pass through many layers with less distortion.

In a transformer encoder, residual connections are used around both self-attention and the feedforward network:

Y = X + \text{SelfAttention}(\cdot),
H = Y + \text{FeedForward}(\cdot).

Without residual connections, deep transformer stacks are much harder to train.

Layer Normalization

Layer normalization normalizes activations across the feature dimension. For a vector x \in \mathbb{R}^D, layer normalization computes

\mu = \frac{1}{D}\sum_{i=1}^{D} x_i,
\sigma^2 = \frac{1}{D}\sum_{i=1}^{D}(x_i - \mu)^2,
\text{LayerNorm}(x)_i = \gamma_i\,\frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta_i.

The parameters \gamma and \beta are learned vectors. The small constant \epsilon prevents division by zero.

Layer normalization is preferred over batch normalization in transformers because sequence lengths and batch structures vary, and the computation should remain stable even with small batch sizes.
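The formulas above can be checked directly against nn.LayerNorm, whose scale and shift parameters are initialized to one and zero:

```python
import torch
from torch import nn

D = 16
x = torch.randn(4, D)

# Manual computation over the feature dimension.
mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
manual = (x - mu) / torch.sqrt(var + 1e-5)

# nn.LayerNorm initializes gamma to ones and beta to zeros,
# so the two results should match.
ln = nn.LayerNorm(D)
assert torch.allclose(ln(x), manual, atol=1e-5)
```

Note that the statistics are computed per vector, so the result does not depend on the batch size or on other sequence positions.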

Pre-Norm and Post-Norm Encoders

There are two common ways to place layer normalization.

In a post-norm transformer layer:

Y = \text{LayerNorm}(X + \text{SelfAttention}(X)),
H = \text{LayerNorm}(Y + \text{FeedForward}(Y)).

In a pre-norm transformer layer:

Y = X + \text{SelfAttention}(\text{LayerNorm}(X)),
H = Y + \text{FeedForward}(\text{LayerNorm}(Y)).

Post-norm was used in the original transformer architecture. Pre-norm is common in modern deep transformers because it tends to improve training stability.

The difference matters most when the number of layers is large. Pre-norm gives gradients a cleaner path through residual connections.

Attention Masks

Encoder attention often uses masks. A mask controls which positions can be attended to.

The most common encoder mask is a padding mask. In a batch, sequences may have different lengths. Shorter sequences are padded so all examples have the same length. The model should ignore padding positions.

Suppose a batch contains token IDs:

tokens = torch.tensor([
    [101, 2009, 2003, 2204, 102],
    [101, 7592, 102,    0,   0],
])

Here 0 may represent padding. The attention mask is

attention_mask = (tokens != 0)
print(attention_mask)

which gives

tensor([
    [ True,  True,  True,  True,  True],
    [ True,  True,  True, False, False]
])

The mask prevents real tokens from attending to padding tokens. Padding positions still produce output vectors, but these are ignored by downstream losses and pooling.

Unlike decoder self-attention, encoder self-attention usually does not use a causal mask. The encoder is allowed to see both past and future positions.
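A small sketch of how a padding mask enters the attention computation: scores toward padded key positions are set to negative infinity, so the softmax assigns them exactly zero weight:

```python
import torch
import torch.nn.functional as F

scores = torch.randn(1, 5, 5)   # [B, T, T] raw attention scores

# True marks padding positions that should never be attended to.
padding = torch.tensor([[False, False, False, True, True]])

# Fill scores toward padded keys with -inf before the softmax.
masked = scores.masked_fill(padding[:, None, :], float("-inf"))
weights = F.softmax(masked, dim=-1)

print(weights[0, 0])  # the last two entries are exactly 0
```

This is the same mechanism nn.MultiheadAttention applies internally when it is given a key_padding_mask.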

A PyTorch Encoder Layer

PyTorch provides nn.TransformerEncoderLayer, but implementing a small encoder layer helps clarify the structure.

import torch
from torch import nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()

        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(
            embed_dim=d_model,
            num_heads=n_heads,
            dropout=dropout,
            batch_first=True,
        )
        self.drop1 = nn.Dropout(dropout)

        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )
        self.drop2 = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, key_padding_mask: torch.Tensor | None = None):
        # x: [B, T, D]
        # key_padding_mask: [B, T], True means "ignore this token"

        y = self.norm1(x)

        attn_out, _ = self.attn(
            y, y, y,
            key_padding_mask=key_padding_mask,
            need_weights=False,
        )

        x = x + self.drop1(attn_out)

        y = self.norm2(x)
        ffn_out = self.ffn(y)

        x = x + self.drop2(ffn_out)
        return x

This is a pre-norm encoder layer. It preserves shape:

B, T, D = 8, 32, 256

layer = EncoderLayer(d_model=D, n_heads=8, d_ff=1024)
x = torch.randn(B, T, D)

out = layer(x)
print(out.shape)  # torch.Size([8, 32, 256])

The input and output shapes match. This makes it easy to stack layers.

Stacking Encoder Layers

A transformer encoder usually contains L layers with identical structure but separate parameters.

H^{(0)} = X,
H^{(\ell)} = \text{EncoderLayer}^{(\ell)}(H^{(\ell-1)}), \quad \ell = 1, \ldots, L.

The final output is

H = H^{(L)}.

In PyTorch:

class TransformerEncoder(nn.Module):
    def __init__(
        self,
        vocab_size: int,
        max_len: int,
        d_model: int,
        n_heads: int,
        d_ff: int,
        n_layers: int,
        dropout: float = 0.1,
        pad_id: int = 0,
    ):
        super().__init__()

        self.pad_id = pad_id
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)

        self.layers = nn.ModuleList([
            EncoderLayer(d_model, n_heads, d_ff, dropout)
            for _ in range(n_layers)
        ])

        self.norm = nn.LayerNorm(d_model)

    def forward(self, tokens: torch.Tensor):
        # tokens: [B, T]
        B, T = tokens.shape

        positions = torch.arange(T, device=tokens.device)
        positions = positions.unsqueeze(0).expand(B, T)

        x = self.token_emb(tokens) + self.pos_emb(positions)

        key_padding_mask = tokens.eq(self.pad_id)

        for layer in self.layers:
            x = layer(x, key_padding_mask=key_padding_mask)

        return self.norm(x)

Example:

model = TransformerEncoder(
    vocab_size=30_000,
    max_len=512,
    d_model=256,
    n_heads=8,
    d_ff=1024,
    n_layers=6,
)

tokens = torch.randint(0, 30_000, (4, 128))
out = model(tokens)

print(out.shape)  # torch.Size([4, 128, 256])

The encoder returns one vector for each token position.

Using Encoder Outputs

The output of an encoder can be used in different ways.

For token-level tasks, each output vector is used directly. Named entity recognition, part-of-speech tagging, and token classification use this pattern.

If

H \in \mathbb{R}^{B \times T \times D},

a token classifier maps each position to class logits:

Z = HW_{\text{cls}} + b_{\text{cls}},

where

Z \in \mathbb{R}^{B \times T \times K} and K is the number of classes.
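A minimal sketch of such a classifier head; the class count K is an assumed example value:

```python
import torch
from torch import nn

B, T, D, K = 4, 16, 256, 9   # K classes (e.g. a NER tag set; value assumed)

hidden = torch.randn(B, T, D)      # stands in for the encoder output H
classifier = nn.Linear(D, K)       # W_cls and b_cls in one module

logits = classifier(hidden)        # [B, T, K]: one score vector per token
print(logits.shape)  # torch.Size([4, 16, 9])
```

During training, these logits are typically fed into a cross-entropy loss computed only at non-padding positions.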

For sequence-level tasks, the model needs one vector for the whole sequence. Common choices include:

| Method | Description |
| --- | --- |
| First-token pooling | Use the output at position 0 |
| Mean pooling | Average output vectors over valid tokens |
| Max pooling | Take the maximum over positions |
| Attention pooling | Learn a weighted average over positions |

BERT-style models often use a special classification token. The final representation of this token is passed to a classifier.

Mean pooling is common for sentence embedding models:

def mean_pool(hidden, attention_mask):
    # hidden: [B, T, D]
    # attention_mask: [B, T], True for valid tokens

    mask = attention_mask.unsqueeze(-1).float()
    summed = (hidden * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)

    return summed / counts
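A quick check, with the same computation inlined, that padded positions do not affect the pooled vector:

```python
import torch

hidden = torch.randn(2, 5, 8)
attention_mask = torch.tensor([
    [True, True, True, True, True],
    [True, True, True, False, False],
])

# Same computation as mean_pool above, inlined.
mask = attention_mask.unsqueeze(-1).float()
pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)

# For the second example, only the first three positions contribute.
expected = hidden[1, :3].mean(dim=0)
assert torch.allclose(pooled[1], expected, atol=1e-5)
```

The clamp guards against division by zero if a sequence were entirely padding.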

Encoder Complexity

The main computational cost of a transformer encoder comes from self-attention.

For sequence length T, attention constructs a T \times T score matrix for each example and head. Therefore the attention cost grows quadratically with sequence length:

O(T^2 D).

The feedforward network cost is roughly

O(T D D_{\text{ff}}).

For short and moderate sequences, feedforward layers may dominate compute. For very long sequences, attention memory and compute become major bottlenecks.

This quadratic cost motivates efficient transformer variants, including sparse attention, linear attention, sliding-window attention, low-rank attention, and state-space alternatives.
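These scaling behaviors can be illustrated with rough operation counts; constant factors and the projection matrices are ignored, so the numbers are only indicative:

```python
def attention_flops(T: int, D: int) -> int:
    # Rough count: the QK^T score matrix plus the weighted sum over V,
    # ignoring projections and constant factors.
    return 2 * T * T * D

def ffn_flops(T: int, D: int, d_ff: int) -> int:
    # Two matrix multiplies per position.
    return 2 * T * D * d_ff

# Doubling T quadruples attention cost but only doubles FFN cost.
assert attention_flops(1024, 768) == 4 * attention_flops(512, 768)
assert ffn_flops(1024, 768, 3072) == 2 * ffn_flops(512, 768, 3072)
```

Under this rough model with d_ff = 4D, attention overtakes the feedforward cost once T exceeds roughly 4D.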

Encoder Versus Decoder

Transformer encoders and decoders share many components, but they serve different purposes.

| Component | Encoder | Decoder |
| --- | --- | --- |
| Self-attention | Bidirectional | Causal |
| Future tokens visible | Yes | No |
| Main use | Understanding and representation | Generation |
| Typical outputs | Contextual embeddings | Next-token logits |
| Example models | BERT, ViT | GPT-style models |

An encoder is best when the full input is available before prediction. A decoder is best when the model must generate outputs step by step.

Encoder-decoder transformers combine both. The encoder reads the source sequence. The decoder generates the target sequence while attending to encoder outputs.

Vision Transformer Encoders

A Vision Transformer uses an encoder over image patches.

An image is split into patches. Each patch is flattened and projected into a vector. The patch sequence is then passed into a transformer encoder.

For an image batch

X \in \mathbb{R}^{B \times C \times H \times W},

with patch size P, the number of patches is

T = \frac{H}{P}\cdot\frac{W}{P}.

Each patch becomes one token. The encoder processes the patch sequence just as it would process a text sequence.

This shows that transformer encoders are general sequence processors. The sequence elements may be words, image patches, audio frames, graph nodes, or retrieved documents.
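A common way to build the patch sequence is a convolution whose kernel size and stride both equal the patch size, which projects each patch in one step. A minimal sketch:

```python
import torch
from torch import nn

B, C, H, W, P, D = 2, 3, 32, 32, 8, 64

images = torch.randn(B, C, H, W)

# A Conv2d with kernel size = stride = P projects each non-overlapping
# P x P patch to a D-dimensional vector.
patchify = nn.Conv2d(C, D, kernel_size=P, stride=P)

grid = patchify(images)                    # [B, D, H/P, W/P]
tokens = grid.flatten(2).transpose(1, 2)   # [B, T, D] with T = (H/P)*(W/P)

print(tokens.shape)  # torch.Size([2, 16, 64])
```

From here the patch tokens, plus positional embeddings, flow through the encoder exactly as text tokens would.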

Practical Design Choices

Important encoder hyperparameters include:

| Hyperparameter | Meaning |
| --- | --- |
| D | Model dimension |
| L | Number of encoder layers |
| h | Number of attention heads |
| D_ff | Feedforward hidden dimension |
| T | Maximum sequence length |
| Dropout rate | Regularization strength |
| Positional encoding type | How order information is represented |

Common constraints:

The model dimension D should usually be divisible by the number of heads h. Larger D increases representation capacity. Larger L increases depth. Larger T increases memory use sharply because attention scales with T^2.

A small encoder may use:

D = 256,\quad L = 6,\quad h = 8,\quad D_{\text{ff}} = 1024.

A base-size encoder may use:

D = 768,\quad L = 12,\quad h = 12,\quad D_{\text{ff}} = 3072.

A larger encoder increases accuracy on many tasks but also increases training cost, inference latency, and memory use.
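A rough parameter count for these configurations can be sketched as follows; biases, normalization parameters, and any task head are ignored, so the result is only an estimate:

```python
def encoder_params(D: int, L: int, d_ff: int, vocab: int, max_len: int) -> int:
    # Per layer: the Q/K/V/output projections (4 * D^2)
    # plus the two FFN matrices (2 * D * d_ff).
    per_layer = 4 * D * D + 2 * D * d_ff
    # Add the token and positional embedding tables.
    return L * per_layer + vocab * D + max_len * D

# The base-size configuration above lands on the order of 10^8 parameters.
print(encoder_params(D=768, L=12, d_ff=3072, vocab=30_000, max_len=512))
```

This estimate is in line with the roughly 110 million parameters commonly quoted for BERT-base-scale encoders.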

Summary

A transformer encoder maps an input sequence to a contextualized output sequence. It uses bidirectional self-attention to mix information across positions and feedforward networks to transform each position independently.

The core encoder layer consists of multi-head self-attention, a feedforward network, residual connections, layer normalization, dropout, and masking. Stacking many such layers gives a deep sequence representation model.

Encoders are suited to understanding tasks because each output position can use the full input context. They are used in text encoders, image encoders, embedding models, retrieval systems, speech models, and multimodal representation learners.