A transformer encoder is a neural network block that maps a sequence of input vectors to a sequence of contextualized output vectors. It is used when the whole input sequence is available at once and each position may attend to every other position.
Transformer encoders are common in text understanding, image understanding, speech representation learning, retrieval, classification, tagging, and multimodal systems. BERT-style language models, Vision Transformers, and many embedding models are based on encoder architectures.
The Encoder Problem
Suppose we have an input sequence of length $T$. Each element of the sequence is represented as a vector of dimension $D$. We write the input as

$$X \in \mathbb{R}^{B \times T \times D}$$

where $B$ is the batch size, $T$ is the sequence length, and $D$ is the model dimension.
For a sentence, each position may correspond to one token. For an image, each position may correspond to one patch. For audio, each position may correspond to one frame or segment.
The goal of the encoder is to produce another sequence

$$Y \in \mathbb{R}^{B \times T \times D}$$

The output has the same sequence length as the input, but each output vector contains information from the whole sequence.
For example, in the sentence

*The bank approved the loan.*
the representation of the word “bank” should depend on nearby and distant words. The word “loan” helps determine that “bank” refers to a financial institution. A transformer encoder computes such contextual representations by allowing each token to attend to other tokens.
Encoder Input Representation
A transformer encoder does not process raw tokens directly. Tokens are first mapped to vectors by an embedding layer.
If the vocabulary size is $V$, the embedding table is

$$E \in \mathbb{R}^{V \times D}$$

A token ID $t$ selects row $E_t$, which becomes the input vector for that token.

For a batch of token IDs

$$\text{tokens} \in \{0, \dots, V-1\}^{B \times T}$$

the embedding layer produces

$$X_{\text{tok}} \in \mathbb{R}^{B \times T \times D}$$
In PyTorch:

```python
import torch
from torch import nn

B, T, V, D = 4, 16, 30_000, 768
tokens = torch.randint(0, V, (B, T))
embedding = nn.Embedding(V, D)
x_tok = embedding(tokens)
print(x_tok.shape)  # torch.Size([4, 16, 768])
```

Token embeddings alone do not tell the model where a token appears in the sequence. A transformer encoder therefore adds positional information.
A common form is learned positional embedding:

$$X = X_{\text{tok}} + P$$

where $P \in \mathbb{R}^{T \times D}$ is a learned table of positional vectors, one per position.
In PyTorch:

```python
pos_embedding = nn.Embedding(T, D)
positions = torch.arange(T)
x_pos = pos_embedding(positions)[None, :, :]  # [1, T, D]
x = x_tok + x_pos
print(x.shape)  # torch.Size([4, 16, 768])
```

The same positional vectors are broadcast across the batch.
Encoder Layer Structure
A standard transformer encoder is built by stacking several encoder layers. Each layer contains two main sublayers:
- Multi-head self-attention.
- Feedforward network.
Each sublayer is wrapped with a residual connection and normalization.
The common pre-normalization encoder layer has the form

$$h = x + \text{MultiHeadAttention}(\text{LayerNorm}(x))$$

$$y = h + \text{FFN}(\text{LayerNorm}(h))$$
The output becomes the input to the next encoder layer.
The residual connections preserve information and improve gradient flow. Layer normalization stabilizes the scale of activations. Self-attention mixes information across sequence positions. The feedforward network applies a nonlinear transformation independently at each position.
Self-Attention in the Encoder
Self-attention is the central operation in a transformer encoder. Each position creates three vectors: a query, a key, and a value.
Given input

$$X \in \mathbb{R}^{T \times D}$$

we compute

$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$

where

$$W_Q, W_K, W_V \in \mathbb{R}^{D \times d_k}$$

The attention scores are computed by comparing queries and keys:

$$S = \frac{QK^\top}{\sqrt{d_k}}$$

The softmax converts scores into attention weights:

$$A = \text{softmax}(S)$$

The output is a weighted sum of values:

$$O = AV$$

The complete scaled dot-product attention operation is

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

In an encoder, self-attention means that $Q$, $K$, and $V$ all come from the same sequence. Every token can compare itself with every other token.
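As a minimal sketch, the scaled dot-product operation above can be written directly in PyTorch (single head, with `q`, `k`, and `v` already projected; names and sizes here are illustrative):

```python
import math

import torch


def scaled_dot_product_attention(q, k, v):
    # q, k, v: [B, T, d_k] -> output [B, T, d_k]
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # [B, T, T]
    weights = scores.softmax(dim=-1)                   # each row sums to 1
    return weights @ v


B, T, d_k = 2, 5, 8
x = torch.randn(B, T, d_k)
out = scaled_dot_product_attention(x, x, x)  # self-attention: q, k, v from the same sequence
print(out.shape)  # torch.Size([2, 5, 8])
```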
Bidirectional Context
A transformer encoder uses bidirectional attention. This means that token $i$ can attend to tokens before and after it.

For a sequence

$$x_1, x_2, \dots, x_T$$

the output representation at position $i$ may depend on

$$x_1, x_2, \dots, x_T$$

This is different from an autoregressive transformer decoder, where position $i$ may only attend to positions $j \le i$.
Encoder attention is therefore well suited to understanding tasks. The model can use full context when computing each representation.
Examples include:
| Task | Why an encoder fits |
|---|---|
| Text classification | The whole text is available before prediction |
| Named entity recognition | Each token should use left and right context |
| Sentence embedding | The model summarizes a complete input |
| Image classification | All image patches are visible together |
| Retrieval | The full query or document is encoded into a representation |
Multi-Head Attention
Single-head attention computes one attention pattern. Multi-head attention computes several attention patterns in parallel.
The model dimension is split across heads. If there are $H$ heads, each head usually has dimension

$$d_h = \frac{D}{H}$$

Each head has its own query, key, and value projections:

$$Q_h = XW_Q^{(h)}, \quad K_h = XW_K^{(h)}, \quad V_h = XW_V^{(h)}$$

Each head computes attention separately:

$$\text{head}_h = \text{Attention}(Q_h, K_h, V_h)$$

The outputs are concatenated and projected:

$$\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_H)\,W_O$$
Multi-head attention allows the encoder to represent multiple relationships at once. One head may focus on local syntax. Another may focus on long-range dependencies. Another may focus on delimiter tokens, image regions, or semantic similarity. These interpretations are approximate, but they describe why multiple heads are useful.
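The split-and-concatenate mechanics can be sketched with plain tensor reshapes (illustrative sizes; a real implementation also applies the output projection $W_O$ afterwards):

```python
import math

import torch
from torch import nn

B, T, D, H = 2, 6, 32, 4
d_h = D // H  # per-head dimension

x = torch.randn(B, T, D)
w_q, w_k, w_v = (nn.Linear(D, D, bias=False) for _ in range(3))

# Project, then split D into H heads of size d_h
q = w_q(x).view(B, T, H, d_h).transpose(1, 2)  # [B, H, T, d_h]
k = w_k(x).view(B, T, H, d_h).transpose(1, 2)
v = w_v(x).view(B, T, H, d_h).transpose(1, 2)

weights = (q @ k.transpose(-2, -1) / math.sqrt(d_h)).softmax(dim=-1)  # [B, H, T, T]
out = (weights @ v).transpose(1, 2).reshape(B, T, D)  # concatenate heads back to D
print(out.shape)  # torch.Size([2, 6, 32])
```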
Feedforward Network
After self-attention, each position passes through the same feedforward network. This network does not mix positions. It transforms each token representation independently.
A standard transformer feedforward network is

$$\text{FFN}(x) = W_2\,\phi(W_1 x + b_1) + b_2$$

where $\phi$ is a nonlinear activation function such as ReLU, GELU, or SiLU.

Usually,

$$W_1 \in \mathbb{R}^{d_{\text{ff}} \times D}, \quad W_2 \in \mathbb{R}^{D \times d_{\text{ff}}}$$

The hidden dimension $d_{\text{ff}}$ is often larger than $D$. A common choice is

$$d_{\text{ff}} = 4D$$
The feedforward network increases the expressive power of the encoder. Self-attention mixes information across tokens. The feedforward network transforms the mixed information at each token position.
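As a sketch with illustrative sizes, the whole feedforward block is two linear layers applied at every position independently:

```python
import torch
from torch import nn

D, d_ff = 256, 1024  # d_ff = 4 * D

ffn = nn.Sequential(
    nn.Linear(D, d_ff),  # expand
    nn.GELU(),           # nonlinearity
    nn.Linear(d_ff, D),  # project back
)

x = torch.randn(4, 32, D)  # [B, T, D]
out = ffn(x)               # same shape; positions never interact
print(out.shape)  # torch.Size([4, 32, 256])
```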
Residual Connections
Residual connections add the input of a sublayer to its output. If a sublayer is $F$, the residual form is

$$y = x + F(x)$$
Residual connections make optimization easier. They allow information and gradients to pass through many layers with less distortion.
In a transformer encoder, residual connections are used around both self-attention and the feedforward network:

$$h = x + \text{Attention}(x), \qquad y = h + \text{FFN}(h)$$

(normalization omitted for clarity).
Without residual connections, deep transformer stacks are much harder to train.
Layer Normalization
Layer normalization normalizes activations across the feature dimension. For a vector $x \in \mathbb{R}^D$, layer normalization computes

$$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where $\mu$ and $\sigma^2$ are the mean and variance of $x$ over the feature dimension. The parameters $\gamma$ and $\beta$ are learned vectors. The small constant $\epsilon$ prevents division by zero.
Layer normalization is preferred over batch normalization in transformers because sequence lengths and batch structures vary, and the computation should remain stable even with small batch sizes.
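The formula can be checked against `nn.LayerNorm` (at initialization $\gamma = 1$ and $\beta = 0$, so the affine part drops out):

```python
import torch
from torch import nn

x = torch.randn(4, 16, 64)
ln = nn.LayerNorm(64)  # eps defaults to 1e-5

# Manual normalization over the feature dimension
mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
manual = (x - mu) / torch.sqrt(var + ln.eps)

print(torch.allclose(manual, ln(x), atol=1e-5))  # True
```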
Pre-Norm and Post-Norm Encoders
There are two common ways to place layer normalization.
In a post-norm transformer layer:

$$y = \text{LayerNorm}(x + \text{Sublayer}(x))$$

In a pre-norm transformer layer:

$$y = x + \text{Sublayer}(\text{LayerNorm}(x))$$
Post-norm was used in the original transformer architecture. Pre-norm is common in modern deep transformers because it tends to improve training stability.
The difference matters most when the number of layers is large. Pre-norm gives gradients a cleaner path through residual connections.
Attention Masks
Encoder attention often uses masks. A mask controls which positions can be attended to.
The most common encoder mask is a padding mask. In a batch, sequences may have different lengths. Shorter sequences are padded so all examples have the same length. The model should ignore padding positions.
Suppose a batch contains token IDs:

```python
tokens = torch.tensor([
    [101, 2009, 2003, 2204, 102],
    [101, 7592,  102,    0,   0],
])
```

Here 0 may represent padding. The attention mask is

```python
attention_mask = (tokens != 0)
print(attention_mask)
```

which gives

```
tensor([[ True,  True,  True,  True,  True],
        [ True,  True,  True, False, False]])
```

The mask prevents real tokens from attending to padding tokens. It also prevents padding positions from contributing meaningful outputs.
Unlike decoder self-attention, encoder self-attention usually does not use a causal mask. The encoder is allowed to see both past and future positions.
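Mechanically, a padding mask is usually applied by setting masked score entries to $-\infty$ before the softmax, so their attention weights become exactly zero (a sketch with illustrative shapes):

```python
import torch

scores = torch.randn(2, 5, 5)  # [B, T, T] raw attention scores
valid = torch.tensor([
    [True, True, True, True, True],
    [True, True, True, False, False],  # last two positions are padding
])

# Broadcast the key mask over query positions: [B, 1, T]
masked = scores.masked_fill(~valid[:, None, :], float("-inf"))
weights = masked.softmax(dim=-1)
print(weights[1, 0])  # padding columns get weight 0
```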
A PyTorch Encoder Layer
PyTorch provides nn.TransformerEncoderLayer, but implementing a small encoder layer helps clarify the structure.
```python
import torch
from torch import nn


class EncoderLayer(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(
            embed_dim=d_model,
            num_heads=n_heads,
            dropout=dropout,
            batch_first=True,
        )
        self.drop1 = nn.Dropout(dropout)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )
        self.drop2 = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, key_padding_mask: torch.Tensor | None = None):
        # x: [B, T, D]
        # key_padding_mask: [B, T], True means "ignore this token"
        y = self.norm1(x)
        attn_out, _ = self.attn(
            y, y, y,
            key_padding_mask=key_padding_mask,
            need_weights=False,
        )
        x = x + self.drop1(attn_out)
        y = self.norm2(x)
        ffn_out = self.ffn(y)
        x = x + self.drop2(ffn_out)
        return x
```

This is a pre-norm encoder layer. It preserves shape:
```python
B, T, D = 8, 32, 256
layer = EncoderLayer(d_model=D, n_heads=8, d_ff=1024)
x = torch.randn(B, T, D)
out = layer(x)
print(out.shape)  # torch.Size([8, 32, 256])
```

The input and output shapes match. This makes it easy to stack layers.
Stacking Encoder Layers
A transformer encoder usually contains $L$ identical layers, each with separate parameters.

The final output is

$$Y = \text{LayerNorm}(x^{(L)})$$

where $x^{(l)}$ denotes the output of layer $l$ and $x^{(0)}$ is the embedded input.
In PyTorch:
```python
class TransformerEncoder(nn.Module):
    def __init__(
        self,
        vocab_size: int,
        max_len: int,
        d_model: int,
        n_heads: int,
        d_ff: int,
        n_layers: int,
        dropout: float = 0.1,
        pad_id: int = 0,
    ):
        super().__init__()
        self.pad_id = pad_id
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.layers = nn.ModuleList([
            EncoderLayer(d_model, n_heads, d_ff, dropout)
            for _ in range(n_layers)
        ])
        self.norm = nn.LayerNorm(d_model)

    def forward(self, tokens: torch.Tensor):
        # tokens: [B, T]
        B, T = tokens.shape
        positions = torch.arange(T, device=tokens.device)
        positions = positions.unsqueeze(0).expand(B, T)
        x = self.token_emb(tokens) + self.pos_emb(positions)
        key_padding_mask = tokens.eq(self.pad_id)
        for layer in self.layers:
            x = layer(x, key_padding_mask=key_padding_mask)
        return self.norm(x)
```

Example:
```python
model = TransformerEncoder(
    vocab_size=30_000,
    max_len=512,
    d_model=256,
    n_heads=8,
    d_ff=1024,
    n_layers=6,
)
tokens = torch.randint(0, 30_000, (4, 128))
out = model(tokens)
print(out.shape)  # torch.Size([4, 128, 256])
```

The encoder returns one vector for each token position.
Using Encoder Outputs
The output of an encoder can be used in different ways.
For token-level tasks, each output vector is used directly. Named entity recognition, part-of-speech tagging, and token classification use this pattern.
If

$$Y \in \mathbb{R}^{B \times T \times D}$$

is the encoder output, a token classifier maps each position to class logits:

$$Z = YW + b$$

where

$$W \in \mathbb{R}^{D \times C}$$

and $C$ is the number of classes.
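A sketch of such a head, with a hypothetical $C = 9$ label set:

```python
import torch
from torch import nn

B, T, D, C = 4, 128, 256, 9
hidden = torch.randn(B, T, D)  # encoder output
classifier = nn.Linear(D, C)   # the same weights are applied at every position

logits = classifier(hidden)    # [B, T, C], one score vector per token
print(logits.shape)  # torch.Size([4, 128, 9])
```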
For sequence-level tasks, the model needs one vector for the whole sequence. Common choices include:
| Method | Description |
|---|---|
| First-token pooling | Use the output at position 0 |
| Mean pooling | Average output vectors over valid tokens |
| Max pooling | Take the maximum over positions |
| Attention pooling | Learn a weighted average over positions |
BERT-style models often use a special classification token. The final representation of this token is passed to a classifier.
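First-token pooling, for example, is a single slice of the encoder output (a sketch with illustrative sizes):

```python
import torch

hidden = torch.randn(4, 128, 256)  # [B, T, D] encoder output
cls_vec = hidden[:, 0]             # representation of the first token, e.g. [CLS]
print(cls_vec.shape)  # torch.Size([4, 256])
```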
Mean pooling is common for sentence embedding models:
```python
def mean_pool(hidden, attention_mask):
    # hidden: [B, T, D]
    # attention_mask: [B, T], True for valid tokens
    mask = attention_mask.unsqueeze(-1).float()
    summed = (hidden * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts
```

Encoder Complexity
The main computational cost of a transformer encoder comes from self-attention.
For sequence length $T$, attention constructs a $T \times T$ score matrix for each example and head. Therefore the attention cost grows quadratically with sequence length:

$$O(T^2 \cdot D)$$

The feedforward network cost is roughly

$$O(T \cdot D \cdot d_{\text{ff}})$$
For short and moderate sequences, feedforward layers may dominate compute. For very long sequences, attention memory and compute become major bottlenecks.
This quadratic cost motivates efficient transformer variants, including sparse attention, linear attention, sliding-window attention, low-rank attention, and state-space alternatives.
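A rough back-of-the-envelope comparison of the two terms per layer (multiply counts only, with illustrative BERT-base-like sizes):

```python
D, d_ff = 768, 3072  # d_ff = 4 * D

ratios = []
for T in (128, 512, 4096):
    attn_scores = T * T * D  # building the T x T score matrix
    ffn = 2 * T * D * d_ff   # two linear layers per position
    ratios.append(attn_scores / ffn)
    print(T, round(attn_scores / ffn, 3))
```

Under this crude count the ratio grows linearly with $T$: the score matrix alone overtakes the feedforward cost once $T$ exceeds roughly $2 d_{\text{ff}}$.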
Encoder Versus Decoder
Transformer encoders and decoders share many components, but they serve different purposes.
| Component | Encoder | Decoder |
|---|---|---|
| Self-attention | Bidirectional | Causal |
| Future tokens visible | Yes | No |
| Main use | Understanding and representation | Generation |
| Typical outputs | Contextual embeddings | Next-token logits |
| Example models | BERT, ViT | GPT-style models |
An encoder is best when the full input is available before prediction. A decoder is best when the model must generate outputs step by step.
Encoder-decoder transformers combine both. The encoder reads the source sequence. The decoder generates the target sequence while attending to encoder outputs.
Vision Transformer Encoders
A Vision Transformer uses an encoder over image patches.
An image is split into patches. Each patch is flattened and projected into a vector. The patch sequence is then passed into a transformer encoder.
For an image batch

$$x \in \mathbb{R}^{B \times C \times H \times W}$$

with patch size $P$, the number of patches is

$$N = \frac{H}{P} \cdot \frac{W}{P}$$
Each patch becomes one token. The encoder processes the patch sequence just as it would process a text sequence.
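Patch embedding is commonly implemented as a strided convolution whose kernel and stride both equal the patch size (a sketch with illustrative sizes):

```python
import torch
from torch import nn

B, C, H, W, P, D = 2, 3, 224, 224, 16, 768
images = torch.randn(B, C, H, W)

# kernel_size = stride = P: each patch is flattened and projected in one step
patch_embed = nn.Conv2d(C, D, kernel_size=P, stride=P)

x = patch_embed(images)           # [B, D, H/P, W/P]
x = x.flatten(2).transpose(1, 2)  # [B, N, D] with N = (H/P) * (W/P) = 196
print(x.shape)  # torch.Size([2, 196, 768])
```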
This shows that transformer encoders are general sequence processors. The sequence elements may be words, image patches, audio frames, graph nodes, or retrieved documents.
Practical Design Choices
Important encoder hyperparameters include:
| Hyperparameter | Meaning |
|---|---|
| Model dimension $D$ | Width of each token representation |
| Number of encoder layers $L$ | Depth of the stack |
| Number of attention heads $H$ | Parallel attention patterns per layer |
| Feedforward hidden dimension $d_{\text{ff}}$ | Width of the position-wise FFN |
| Maximum sequence length $T_{\max}$ | Longest input the positional encoding supports |
| Dropout rate | Regularization strength |
| Positional encoding type | How order information is represented |
Common constraints:
The model dimension $D$ should usually be divisible by the number of heads $H$. Larger $D$ increases representation capacity. Larger $L$ increases depth. Larger $T$ increases memory use sharply because attention scales with $T^2$.

A small encoder may use, for example, $D = 256$, $L = 4$, $H = 4$, $d_{\text{ff}} = 1024$.

A base-size encoder may use $D = 768$, $L = 12$, $H = 12$, $d_{\text{ff}} = 3072$, as in BERT-base.
A larger encoder increases accuracy on many tasks but also increases training cost, inference latency, and memory use.
Summary
A transformer encoder maps an input sequence to a contextualized output sequence. It uses bidirectional self-attention to mix information across positions and feedforward networks to transform each position independently.
The core encoder layer consists of multi-head self-attention, a feedforward network, residual connections, layer normalization, dropout, and masking. Stacking many such layers gives a deep sequence representation model.
Encoders are suited to understanding tasks because each output position can use the full input context. They are used in text encoders, image encoders, embedding models, retrieval systems, speech models, and multimodal representation learners.