Transformer Encoders

A transformer encoder is a neural network block that maps a sequence of input vectors to a sequence of contextualized output vectors. It is used when the whole input sequence is available at once and each position may attend to every other position.

Transformer encoders are common in text understanding, image understanding, speech representation learning, retrieval, classification, tagging, and multimodal systems. BERT-style language models, Vision Transformers, and many embedding models are based on encoder architectures.

The Encoder Problem

Suppose we have an input sequence of length T. Each element of the sequence is represented as a vector of dimension D. We write the input as

X \in \mathbb{R}^{B \times T \times D},

where B is the batch size, T is the sequence length, and D is the model dimension.

For a sentence, each position may correspond to one token. For an image, each position may correspond to one patch. For audio, each position may correspond to one frame or segment.

The goal of the encoder is to produce another sequence

H \in \mathbb{R}^{B \times T \times D}.

The output has the same sequence length as the input, but each output vector contains information from the whole sequence.

For example, in the sentence

The bank approved the loan

the representation of the word “bank” should depend on nearby and distant words. The word “loan” helps determine that “bank” refers to a financial institution. A transformer encoder computes such contextual representations by allowing each token to attend to other tokens.

Encoder Input Representation

A transformer encoder does not process raw tokens directly. Tokens are first mapped to vectors by an embedding layer.

If the vocabulary size is V, the embedding table is

E \in \mathbb{R}^{V \times D}.

A token ID i selects row E_i, which becomes the input vector for that token.

For a batch of token IDs

\text{tokens} \in \mathbb{N}^{B \times T},

the embedding layer produces

X_{\text{tok}} \in \mathbb{R}^{B \times T \times D}.

In PyTorch:

import torch
from torch import nn

B, T, V, D = 4, 16, 30_000, 768

tokens = torch.randint(0, V, (B, T))
embedding = nn.Embedding(V, D)

x_tok = embedding(tokens)
print(x_tok.shape)  # torch.Size([4, 16, 768])

Token embeddings alone do not tell the model where a token appears in the sequence. A transformer encoder therefore adds positional information.

A common form is learned positional embedding:

X = X_{\text{tok}} + X_{\text{pos}},

where

X_{\text{pos}} \in \mathbb{R}^{1 \times T \times D}.

In PyTorch:

pos_embedding = nn.Embedding(T, D)

positions = torch.arange(T)
x_pos = pos_embedding(positions)[None, :, :]

x = x_tok + x_pos
print(x.shape)  # torch.Size([4, 16, 768])

The same positional vectors are broadcast across the batch.

Encoder Layer Structure

A standard transformer encoder is built by stacking several encoder layers. Each layer contains two main sublayers:

  1. Multi-head self-attention.
  2. Feedforward network.

Each sublayer is wrapped with a residual connection and normalization.

The common pre-normalization encoder layer has the form

Y = X + \text{SelfAttention}(\text{LayerNorm}(X)),
H = Y + \text{FeedForward}(\text{LayerNorm}(Y)).

The output H becomes the input to the next encoder layer.

The residual connections preserve information and improve gradient flow. Layer normalization stabilizes the scale of activations. Self-attention mixes information across sequence positions. The feedforward network applies a nonlinear transformation independently at each position.

Self-Attention in the Encoder

Self-attention is the central operation in a transformer encoder. Each position creates three vectors: a query, a key, and a value.

Given input

X \in \mathbb{R}^{B \times T \times D},

we compute

Q = XW_Q,\quad K = XW_K,\quad V = XW_V,

where

W_Q, W_K, W_V \in \mathbb{R}^{D \times d_k}.

The attention scores are computed by comparing queries and keys:

S = \frac{QK^\top}{\sqrt{d_k}}.

The softmax converts scores into attention weights:

A = \text{softmax}(S).

The output is a weighted sum of values:

O = AV.

The complete scaled dot-product attention operation is

\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V.

In an encoder, self-attention means that Q, K, and V all come from the same sequence. Every token can compare itself with every other token.
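These steps can be reproduced with a few tensor operations. The sketch below uses randomly initialized projection matrices purely for illustration, standing in for the learned weights:

```python
import torch
import torch.nn.functional as F

B, T, D, d_k = 2, 5, 8, 8

x = torch.randn(B, T, D)

# Random matrices stand in for the learned W_Q, W_K, W_V.
W_q = torch.randn(D, d_k)
W_k = torch.randn(D, d_k)
W_v = torch.randn(D, d_k)

Q, K, V = x @ W_q, x @ W_k, x @ W_v            # each [B, T, d_k]

scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # [B, T, T]
weights = F.softmax(scores, dim=-1)            # each row sums to 1
out = weights @ V                              # [B, T, d_k]

print(out.shape)  # torch.Size([2, 5, 8])
```

Each row of `weights` is a probability distribution over the T positions, so the output at each position is a convex combination of value vectors.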

Bidirectional Context

A transformer encoder uses bidirectional attention. This means that token t can attend to tokens before and after it.

For a sequence

x_1, x_2, \ldots, x_T,

the output representation at position t may depend on

x_1, \ldots, x_{t-1},\ x_t,\ x_{t+1}, \ldots, x_T.

This is different from an autoregressive transformer decoder, where position t may only attend to positions 1, \ldots, t.
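The difference is easy to state in mask form. The snippet below builds the causal mask a decoder would apply to its attention scores; an encoder simply omits it:

```python
import torch

T = 5

# A decoder's causal mask: position t may attend only to positions <= t.
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))

# An encoder uses no such mask: every position may attend everywhere.
bidirectional = torch.ones(T, T, dtype=torch.bool)

print(causal.int())
```

Row t of the causal mask is True only up to column t, while the encoder's implicit mask is all True.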

Encoder attention is therefore well suited to understanding tasks. The model can use full context when computing each representation.

Examples include:

| Task | Why an encoder fits |
| --- | --- |
| Text classification | The whole text is available before prediction |
| Named entity recognition | Each token should use left and right context |
| Sentence embedding | The model summarizes a complete input |
| Image classification | All image patches are visible together |
| Retrieval | The full query or document is encoded into a representation |

Multi-Head Attention

Single-head attention computes one attention pattern. Multi-head attention computes several attention patterns in parallel.

The model dimension D is split across the attention heads. With h heads, each head usually has dimension

d_h = \frac{D}{h}.

Each head has its own query, key, and value projections:

Q_i = XW_{Q_i},\quad K_i = XW_{K_i},\quad V_i = XW_{V_i}.

Each head computes attention separately:

O_i = \text{Attention}(Q_i, K_i, V_i).

The outputs are concatenated and projected:

O = \text{Concat}(O_1, \ldots, O_h)W_O.

Multi-head attention allows the encoder to represent multiple relationships at once. One head may focus on local syntax. Another may focus on long-range dependencies. Another may focus on delimiter tokens, image regions, or semantic similarity. These interpretations are approximate, but they describe why multiple heads are useful.
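In practice the head split is just a reshape of the projected tensors. The sketch below uses random, untrained projection matrices and a minimal `split_heads` helper (names chosen here for illustration):

```python
import torch
import torch.nn.functional as F

B, T, D, h = 2, 6, 16, 4
d_h = D // h  # per-head dimension

x = torch.randn(B, T, D)

# One combined projection per role; heads are then just a reshape.
W_q, W_k, W_v, W_o = (torch.randn(D, D) for _ in range(4))

def split_heads(t):
    return t.view(B, T, h, d_h).transpose(1, 2)   # [B, h, T, d_h]

Q, K, V = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)

scores = Q @ K.transpose(-2, -1) / d_h ** 0.5     # [B, h, T, T]
heads = F.softmax(scores, dim=-1) @ V             # [B, h, T, d_h]

# Concatenate the heads back to dimension D, then project.
out = heads.transpose(1, 2).reshape(B, T, D) @ W_o

print(out.shape)  # torch.Size([2, 6, 16])
```

Each head attends with its own [T, T] pattern, and the final projection mixes the heads back together.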

Feedforward Network

After self-attention, each position passes through the same feedforward network. This network does not mix positions. It transforms each token representation independently.

A standard transformer feedforward network is

\text{FFN}(x) = \phi(xW_1 + b_1)W_2 + b_2,

where \phi is a nonlinear activation function such as ReLU, GELU, or SiLU.

Usually,

W_1 \in \mathbb{R}^{D \times D_{\text{ff}}}, \quad W_2 \in \mathbb{R}^{D_{\text{ff}} \times D}.

The hidden dimension D_{\text{ff}} is often larger than D. A common choice is

D_{\text{ff}} = 4D.

The feedforward network increases the expressive power of the encoder. Self-attention mixes information across tokens. The feedforward network transforms the mixed information at each token position.
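The position independence is easy to verify: permuting the input positions permutes the outputs in exactly the same way. A minimal sketch:

```python
import torch
from torch import nn

D, D_ff = 64, 256   # hidden dimension 4x the model dimension

ffn = nn.Sequential(
    nn.Linear(D, D_ff),
    nn.GELU(),
    nn.Linear(D_ff, D),
)

x = torch.randn(2, 10, D)
out = ffn(x)   # the same network applied at each of the 10 positions

# Because the FFN does not mix positions, permuting positions
# permutes the outputs identically.
perm = torch.randperm(10)
assert torch.allclose(ffn(x[:, perm]), out[:, perm], atol=1e-5)
```

No such permutation equivariance holds for self-attention, which is exactly why positional information must be added to the input.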

Residual Connections

Residual connections add the input of a sublayer to its output. If a sublayer is F, the residual form is

Y = X + F(X).

Residual connections make optimization easier. They allow information and gradients to pass through many layers with less distortion.

In a transformer encoder, residual connections are used around both self-attention and the feedforward network:

Y = X + \text{SelfAttention}(\cdot),
H = Y + \text{FeedForward}(\cdot).

Without residual connections, deep transformer stacks are much harder to train.

Layer Normalization

Layer normalization normalizes activations across the feature dimension. For a vector x \in \mathbb{R}^D, layer normalization computes

\mu = \frac{1}{D}\sum_{i=1}^{D} x_i,
\sigma^2 = \frac{1}{D}\sum_{i=1}^{D}(x_i - \mu)^2,
\text{LayerNorm}(x)_i = \gamma_i\,\frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta_i.

The parameters \gamma and \beta are learned vectors. The small constant \epsilon prevents division by zero.

Layer normalization is preferred over batch normalization in transformers because sequence lengths and batch structures vary, and the computation should remain stable even with small batch sizes.
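The formulas above can be checked directly against nn.LayerNorm, whose scale and shift parameters are initialized to one and zero:

```python
import torch
from torch import nn

D = 16
x = torch.randn(4, D)

# Manual computation over the feature dimension.
mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
manual = (x - mu) / torch.sqrt(var + 1e-5)

# nn.LayerNorm initializes gamma to ones and beta to zeros,
# so the two results should match.
ln = nn.LayerNorm(D)
assert torch.allclose(ln(x), manual, atol=1e-5)
```

Note that the statistics are computed per vector, so the result does not depend on the batch size or on other sequence positions.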

Pre-Norm and Post-Norm Encoders

There are two common ways to place layer normalization.

In a post-norm transformer layer:

Y = \text{LayerNorm}(X + \text{SelfAttention}(X)),
H = \text{LayerNorm}(Y + \text{FeedForward}(Y)).

In a pre-norm transformer layer:

Y = X + \text{SelfAttention}(\text{LayerNorm}(X)),
H = Y + \text{FeedForward}(\text{LayerNorm}(Y)).

Post-norm was used in the original transformer architecture. Pre-norm is common in modern deep transformers because it tends to improve training stability.

The difference matters most when the number of layers is large. Pre-norm gives gradients a cleaner path through residual connections.

Attention Masks

Encoder attention often uses masks. A mask controls which positions can be attended to.

The most common encoder mask is a padding mask. In a batch, sequences may have different lengths. Shorter sequences are padded so all examples have the same length. The model should ignore padding positions.

Suppose a batch contains token IDs:

tokens = torch.tensor([
    [101, 2009, 2003, 2204, 102],
    [101, 7592, 102,    0,   0],
])

Here 0 may represent padding. The attention mask is

attention_mask = (tokens != 0)
print(attention_mask)

which gives

tensor([
    [ True,  True,  True,  True,  True],
    [ True,  True,  True, False, False]
])

The mask prevents real tokens from attending to padding tokens. Padding positions still produce output vectors, but these are ignored by downstream losses and pooling.

Unlike decoder self-attention, encoder self-attention usually does not use a causal mask. The encoder is allowed to see both past and future positions.
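A small sketch of how a padding mask enters the attention computation: scores toward padded key positions are set to negative infinity, so the softmax assigns them exactly zero weight:

```python
import torch
import torch.nn.functional as F

scores = torch.randn(1, 5, 5)   # [B, T, T] raw attention scores

# True marks padding positions that should never be attended to.
padding = torch.tensor([[False, False, False, True, True]])

# Fill scores toward padded keys with -inf before the softmax.
masked = scores.masked_fill(padding[:, None, :], float("-inf"))
weights = F.softmax(masked, dim=-1)

print(weights[0, 0])  # the last two entries are exactly 0
```

This is the same mechanism nn.MultiheadAttention applies internally when it is given a key_padding_mask.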

A PyTorch Encoder Layer

PyTorch provides nn.TransformerEncoderLayer, but implementing a small encoder layer helps clarify the structure.

import torch
from torch import nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()

        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(
            embed_dim=d_model,
            num_heads=n_heads,
            dropout=dropout,
            batch_first=True,
        )
        self.drop1 = nn.Dropout(dropout)

        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )
        self.drop2 = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, key_padding_mask: torch.Tensor | None = None):
        # x: [B, T, D]
        # key_padding_mask: [B, T], True means "ignore this token"

        y = self.norm1(x)

        attn_out, _ = self.attn(
            y, y, y,
            key_padding_mask=key_padding_mask,
            need_weights=False,
        )

        x = x + self.drop1(attn_out)

        y = self.norm2(x)
        ffn_out = self.ffn(y)

        x = x + self.drop2(ffn_out)
        return x

This is a pre-norm encoder layer. It preserves shape:

B, T, D = 8, 32, 256

layer = EncoderLayer(d_model=D, n_heads=8, d_ff=1024)
x = torch.randn(B, T, D)

out = layer(x)
print(out.shape)  # torch.Size([8, 32, 256])

The input and output shapes match. This makes it easy to stack layers.

Stacking Encoder Layers

A transformer encoder usually contains L layers with identical structure but separate parameters.

H^{(0)} = X,
H^{(\ell)} = \text{EncoderLayer}^{(\ell)}(H^{(\ell-1)}), \quad \ell = 1, \ldots, L.

The final output is

H = H^{(L)}.

In PyTorch:

class TransformerEncoder(nn.Module):
    def __init__(
        self,
        vocab_size: int,
        max_len: int,
        d_model: int,
        n_heads: int,
        d_ff: int,
        n_layers: int,
        dropout: float = 0.1,
        pad_id: int = 0,
    ):
        super().__init__()

        self.pad_id = pad_id
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)

        self.layers = nn.ModuleList([
            EncoderLayer(d_model, n_heads, d_ff, dropout)
            for _ in range(n_layers)
        ])

        self.norm = nn.LayerNorm(d_model)

    def forward(self, tokens: torch.Tensor):
        # tokens: [B, T]
        B, T = tokens.shape

        positions = torch.arange(T, device=tokens.device)
        positions = positions.unsqueeze(0).expand(B, T)

        x = self.token_emb(tokens) + self.pos_emb(positions)

        key_padding_mask = tokens.eq(self.pad_id)

        for layer in self.layers:
            x = layer(x, key_padding_mask=key_padding_mask)

        return self.norm(x)

Example:

model = TransformerEncoder(
    vocab_size=30_000,
    max_len=512,
    d_model=256,
    n_heads=8,
    d_ff=1024,
    n_layers=6,
)

tokens = torch.randint(0, 30_000, (4, 128))
out = model(tokens)

print(out.shape)  # torch.Size([4, 128, 256])

The encoder returns one vector for each token position.

Using Encoder Outputs

The output of an encoder can be used in different ways.

For token-level tasks, each output vector is used directly. Named entity recognition, part-of-speech tagging, and token classification use this pattern.

If

H \in \mathbb{R}^{B \times T \times D},

a token classifier maps each position to class logits:

Z = HW_{\text{cls}} + b_{\text{cls}},

where

Z \in \mathbb{R}^{B \times T \times K} and K is the number of classes.
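A minimal sketch of such a classifier head; the class count K is an assumed example value:

```python
import torch
from torch import nn

B, T, D, K = 4, 16, 256, 9   # K classes (e.g. a NER tag set; value assumed)

hidden = torch.randn(B, T, D)      # stands in for the encoder output H
classifier = nn.Linear(D, K)       # W_cls and b_cls in one module

logits = classifier(hidden)        # [B, T, K]: one score vector per token
print(logits.shape)  # torch.Size([4, 16, 9])
```

During training, these logits are typically fed into a cross-entropy loss computed only at non-padding positions.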

For sequence-level tasks, the model needs one vector for the whole sequence. Common choices include:

| Method | Description |
| --- | --- |
| First-token pooling | Use the output at position 0 |
| Mean pooling | Average output vectors over valid tokens |
| Max pooling | Take the maximum over positions |
| Attention pooling | Learn a weighted average over positions |

BERT-style models often use a special classification token. The final representation of this token is passed to a classifier.

Mean pooling is common for sentence embedding models:

def mean_pool(hidden, attention_mask):
    # hidden: [B, T, D]
    # attention_mask: [B, T], True for valid tokens

    mask = attention_mask.unsqueeze(-1).float()
    summed = (hidden * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)

    return summed / counts
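A quick check, with the same computation inlined, that padded positions do not affect the pooled vector:

```python
import torch

hidden = torch.randn(2, 5, 8)
attention_mask = torch.tensor([
    [True, True, True, True, True],
    [True, True, True, False, False],
])

# Same computation as mean_pool above, inlined.
mask = attention_mask.unsqueeze(-1).float()
pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)

# For the second example, only the first three positions contribute.
expected = hidden[1, :3].mean(dim=0)
assert torch.allclose(pooled[1], expected, atol=1e-5)
```

The clamp guards against division by zero if a sequence were entirely padding.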

Encoder Complexity

The main computational cost of a transformer encoder comes from self-attention.

For sequence length T, attention constructs a T \times T score matrix for each example and head. Therefore the attention cost grows quadratically with sequence length:

O(T^2 D).

The feedforward network cost is roughly

O(T D D_{\text{ff}}).

For short and moderate sequences, feedforward layers may dominate compute. For very long sequences, attention memory and compute become major bottlenecks.

This quadratic cost motivates efficient transformer variants, including sparse attention, linear attention, sliding-window attention, low-rank attention, and state-space alternatives.
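These scaling behaviors can be illustrated with rough operation counts; constant factors and the projection matrices are ignored, so the numbers are only indicative:

```python
def attention_flops(T: int, D: int) -> int:
    # Rough count: the QK^T score matrix plus the weighted sum over V,
    # ignoring projections and constant factors.
    return 2 * T * T * D

def ffn_flops(T: int, D: int, d_ff: int) -> int:
    # Two matrix multiplies per position.
    return 2 * T * D * d_ff

# Doubling T quadruples attention cost but only doubles FFN cost.
assert attention_flops(1024, 768) == 4 * attention_flops(512, 768)
assert ffn_flops(1024, 768, 3072) == 2 * ffn_flops(512, 768, 3072)
```

Under this rough model with d_ff = 4D, attention overtakes the feedforward cost once T exceeds roughly 4D.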

Encoder Versus Decoder

Transformer encoders and decoders share many components, but they serve different purposes.

| Component | Encoder | Decoder |
| --- | --- | --- |
| Self-attention | Bidirectional | Causal |
| Future tokens visible | Yes | No |
| Main use | Understanding and representation | Generation |
| Typical outputs | Contextual embeddings | Next-token logits |
| Example models | BERT, ViT | GPT-style models |

An encoder is best when the full input is available before prediction. A decoder is best when the model must generate outputs step by step.

Encoder-decoder transformers combine both. The encoder reads the source sequence. The decoder generates the target sequence while attending to encoder outputs.

Vision Transformer Encoders

A Vision Transformer uses an encoder over image patches.

An image is split into patches. Each patch is flattened and projected into a vector. The patch sequence is then passed into a transformer encoder.

For an image batch

X \in \mathbb{R}^{B \times C \times H \times W},

with patch size P, the number of patches is

T = \frac{H}{P}\cdot\frac{W}{P}.

Each patch becomes one token. The encoder processes the patch sequence just as it would process a text sequence.

This shows that transformer encoders are general sequence processors. The sequence elements may be words, image patches, audio frames, graph nodes, or retrieved documents.
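A common way to build the patch sequence is a convolution whose kernel size and stride both equal the patch size, which projects each patch in one step. A minimal sketch:

```python
import torch
from torch import nn

B, C, H, W, P, D = 2, 3, 32, 32, 8, 64

images = torch.randn(B, C, H, W)

# A Conv2d with kernel size = stride = P projects each non-overlapping
# P x P patch to a D-dimensional vector.
patchify = nn.Conv2d(C, D, kernel_size=P, stride=P)

grid = patchify(images)                    # [B, D, H/P, W/P]
tokens = grid.flatten(2).transpose(1, 2)   # [B, T, D] with T = (H/P)*(W/P)

print(tokens.shape)  # torch.Size([2, 16, 64])
```

From here the patch tokens, plus positional embeddings, flow through the encoder exactly as text tokens would.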

Practical Design Choices

Important encoder hyperparameters include:

| Hyperparameter | Meaning |
| --- | --- |
| D | Model dimension |
| L | Number of encoder layers |
| h | Number of attention heads |
| D_ff | Feedforward hidden dimension |
| T | Maximum sequence length |
| Dropout rate | Regularization strength |
| Positional encoding type | How order information is represented |

Common constraints:

The model dimension D should usually be divisible by the number of heads h. Larger D increases representation capacity. Larger L increases depth. Larger T increases memory use sharply because attention scales with T^2.

A small encoder may use:

D = 256,\quad L = 6,\quad h = 8,\quad D_{\text{ff}} = 1024.

A base-size encoder may use:

D = 768,\quad L = 12,\quad h = 12,\quad D_{\text{ff}} = 3072.

A larger encoder increases accuracy on many tasks but also increases training cost, inference latency, and memory use.
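A rough parameter count for these configurations can be sketched as follows; biases, normalization parameters, and any task head are ignored, so the result is only an estimate:

```python
def encoder_params(D: int, L: int, d_ff: int, vocab: int, max_len: int) -> int:
    # Per layer: the Q/K/V/output projections (4 * D^2)
    # plus the two FFN matrices (2 * D * d_ff).
    per_layer = 4 * D * D + 2 * D * d_ff
    # Add the token and positional embedding tables.
    return L * per_layer + vocab * D + max_len * D

# The base-size configuration above lands on the order of 10^8 parameters.
print(encoder_params(D=768, L=12, d_ff=3072, vocab=30_000, max_len=512))
```

This estimate is in line with the roughly 110 million parameters commonly quoted for BERT-base-scale encoders.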

Summary

A transformer encoder maps an input sequence to a contextualized output sequence. It uses bidirectional self-attention to mix information across positions and feedforward networks to transform each position independently.

The core encoder layer consists of multi-head self-attention, a feedforward network, residual connections, layer normalization, dropout, and masking. Stacking many such layers gives a deep sequence representation model.

Encoders are suited to understanding tasks because each output position can use the full input context. They are used in text encoders, image encoders, embedding models, retrieval systems, speech models, and multimodal representation learners.