Transformer Encoders

A transformer encoder is a stack of layers that maps an input sequence to a contextual sequence representation. Each output position can contain information from every visible input position. Encoder models are useful when the full input is available before prediction.

Examples include text classification, named entity recognition, sentence embedding, document ranking, image classification with vision transformers, and many multimodal encoders.

Given an input tensor

X \in \mathbb{R}^{B \times T \times D},

a transformer encoder returns

H \in \mathbb{R}^{B \times T \times D}.

The sequence length and model dimension usually stay unchanged across encoder layers.
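
As a quick shape check, PyTorch's built-in encoder layer preserves [B, T, D]:

import torch
from torch import nn

layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)

x = torch.randn(2, 10, 64)  # [B, T, D]
print(layer(x).shape)       # torch.Size([2, 10, 64])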

Encoder Layer Structure

A standard transformer encoder layer has two main sublayers:

| Sublayer | Purpose |
| --- | --- |
| Multi-head self-attention | Allows tokens to exchange information |
| Feedforward network | Applies a nonlinear transformation to each token |

Each sublayer is wrapped with residual connections and layer normalization.

The common modern form is pre-normalization:

X_1 = X + \operatorname{SelfAttention}(\operatorname{LayerNorm}(X)),
Y = X_1 + \operatorname{FFN}(\operatorname{LayerNorm}(X_1)).

The output Y has the same shape as X.

Pre-normalization is widely used because it tends to train deep transformers more stably than the original post-normalization layout.
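
For contrast, the original post-normalization layout applies layer normalization after each residual addition:

X_1 = \operatorname{LayerNorm}(X + \operatorname{SelfAttention}(X)),
Y = \operatorname{LayerNorm}(X_1 + \operatorname{FFN}(X_1)).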

Self-Attention in the Encoder

Encoder self-attention is usually bidirectional. Each token can attend to all non-padding tokens.

For a sequence:

The cat sat on the mat

the token cat can attend to The, sat, on, the, and mat. There is no causal restriction because the whole sequence is already known.
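
A minimal single-head sketch makes this concrete. It skips the learned query, key, and value projections and reuses the input directly, so it illustrates bidirectional attention rather than a full attention layer:

import torch

x = torch.randn(1, 6, 8)                     # [B, T, D] for the six tokens
scores = x @ x.transpose(-2, -1) / 8 ** 0.5  # [B, T, T] pairwise scores
weights = scores.softmax(dim=-1)             # each row sums to 1

print(weights[0, 1])  # the row for "cat": nonzero weight on all six positions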

The only common mask is the padding mask. If the batch contains padded sequences, the attention layer should hide padding tokens:

[The, cat, sat, <pad>, <pad>]
[The, dog, ran, outside, today]

Padding positions carry no content, so attention should not let them contribute to the representations of real tokens.
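
Mechanically, the padding mask is usually applied by setting the attention scores for padding keys to negative infinity before the softmax, which drives their weights to zero. A minimal sketch, assuming raw scores of shape [B, T, T] and a mask of shape [B, T] where True marks padding:

import torch

scores = torch.randn(2, 5, 5)  # [B, T, T] raw attention scores
pad = torch.tensor([
    [False, False, False, True, True],
    [False, False, False, False, False],
])

scores = scores.masked_fill(pad.unsqueeze(1), float("-inf"))
weights = scores.softmax(dim=-1)  # padding columns receive weight 0

print(weights[0, 0])  # the last two entries are 0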

Feedforward Network

The feedforward network is applied independently to each position, using the same weights at every position.

A common form is:

\operatorname{FFN}(x) = W_2 \sigma(W_1 x + b_1) + b_2.

Here x ∈ ℝ^D, W_1 expands the dimension, σ is an activation function such as GELU or ReLU, and W_2 projects back to D.

Usually,

D_{\text{ff}} \approx 4D.

For example, if D = 768, the feedforward hidden dimension may be 3072.

In PyTorch:

from torch import nn

class FeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # expand: D -> D_ff
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),   # project back: D_ff -> D
        )

    def forward(self, x):
        # x: [B, T, D]; the same weights are applied at every position
        return self.net(x)
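
A quick shape check with the dimensions from the example above:

import torch

ffn = FeedForward(d_model=768, d_ff=3072)

x = torch.randn(4, 16, 768)  # [B, T, D]
print(ffn(x).shape)          # torch.Size([4, 16, 768])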

Although self-attention mixes information across tokens, the feedforward network transforms each token vector independently. Together, these two operations give the encoder both communication and computation.

Residual Connections

Residual connections add the input of a sublayer to its output:

Y = X + F(X).

They help gradients flow through deep networks and allow each layer to learn an incremental update rather than a complete replacement.
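
Differentiating makes the gradient argument concrete:

\frac{\partial Y}{\partial X} = I + \frac{\partial F(X)}{\partial X}.

The identity term passes gradients straight through the sublayer, regardless of what F has learned.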

In transformer encoders, residual connections preserve shape. The self-attention sublayer maps [B, T, D] to [B, T, D], so it can be added back to the input. The feedforward sublayer also maps [B, T, D] to [B, T, D].

Layer Normalization

Layer normalization normalizes each token vector across its feature dimension. For a vector x ∈ ℝ^D, layer normalization computes a normalized vector using the mean and variance of that vector’s entries.

In PyTorch:

norm = nn.LayerNorm(d_model)
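
A quick check against a manual computation over the feature dimension. At initialization the affine weight is 1 and the bias is 0, so the two results agree:

import torch
from torch import nn

d_model = 128
norm = nn.LayerNorm(d_model)

x = torch.randn(4, 16, d_model)
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)  # LayerNorm uses biased variance
manual = (x - mean) / torch.sqrt(var + norm.eps)

print(torch.allclose(norm(x), manual, atol=1e-5))  # True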

Layer normalization is preferred over batch normalization in transformers because sequence lengths can vary, batch sizes can be small, and token-level normalization is more natural than batch-level statistics.

A Minimal Encoder Layer in PyTorch

import torch
from torch import nn

class TransformerEncoderLayer(nn.Module):
    def __init__(
        self,
        d_model: int,
        num_heads: int,
        d_ff: int,
        dropout: float = 0.1,
    ):
        super().__init__()

        self.self_attn = nn.MultiheadAttention(
            embed_dim=d_model,
            num_heads=num_heads,
            dropout=dropout,
            batch_first=True,
        )

        self.ffn = FeedForward(d_model, d_ff, dropout)

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # x: [B, T, D]
        # key_padding_mask: [B, T], True means masked/padding

        h = self.norm1(x)
        attn_out, _ = self.self_attn(
            query=h,
            key=h,
            value=h,
            key_padding_mask=key_padding_mask,
            need_weights=False,
        )
        x = x + self.dropout1(attn_out)

        h = self.norm2(x)
        ffn_out = self.ffn(h)
        x = x + self.dropout2(ffn_out)

        return x

Example:

B = 4
T = 16
D = 128

x = torch.randn(B, T, D)

layer = TransformerEncoderLayer(
    d_model=128,
    num_heads=8,
    d_ff=512,
)

y = layer(x)

print(y.shape)  # torch.Size([4, 16, 128])

The encoder layer preserves the sequence shape.

Padding Masks in PyTorch

PyTorch’s nn.MultiheadAttention uses key_padding_mask with shape [B, T].

A value of True means the key position should be ignored.

tokens = torch.tensor([
    [101, 2009, 2003, 102, 0, 0],
    [101, 2023, 2003, 1037, 3231, 102],
])

pad_token_id = 0
key_padding_mask = tokens.eq(pad_token_id)

print(key_padding_mask)

Output:

tensor([[False, False, False, False,  True,  True],
        [False, False, False, False, False, False]])

The mask is then passed to the encoder layer:

x = torch.randn(2, 6, 128)

y = layer(x, key_padding_mask=key_padding_mask)

The padding tokens are still present in the output tensor, but real tokens cannot attend to them.
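
A quick way to verify this with the encoder layer defined earlier: perturb only the padding positions of the input and check that the outputs at real positions are unchanged. Eval mode disables dropout so the comparison is deterministic:

layer = TransformerEncoderLayer(d_model=128, num_heads=8, d_ff=512)
layer.eval()  # disable dropout

x2 = x.clone()
x2[key_padding_mask] = torch.randn(int(key_padding_mask.sum()), 128)

with torch.no_grad():
    y1 = layer(x, key_padding_mask=key_padding_mask)
    y2 = layer(x2, key_padding_mask=key_padding_mask)

real = ~key_padding_mask
print(torch.allclose(y1[real], y2[real], atol=1e-5))  # True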

Stacking Encoder Layers

A transformer encoder stacks multiple encoder layers:

H^{(0)} = X,
H^{(\ell+1)} = \operatorname{EncoderLayer}^{(\ell)}(H^{(\ell)}).

The depth controls how many rounds of token interaction and nonlinear transformation occur.

class TransformerEncoder(nn.Module):
    def __init__(
        self,
        num_layers: int,
        d_model: int,
        num_heads: int,
        d_ff: int,
        dropout: float = 0.1,
    ):
        super().__init__()

        self.layers = nn.ModuleList([
            TransformerEncoderLayer(
                d_model=d_model,
                num_heads=num_heads,
                d_ff=d_ff,
                dropout=dropout,
            )
            for _ in range(num_layers)
        ])

        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x, key_padding_mask=None):
        for layer in self.layers:
            x = layer(x, key_padding_mask=key_padding_mask)

        return self.final_norm(x)

Example:

encoder = TransformerEncoder(
    num_layers=6,
    d_model=256,
    num_heads=8,
    d_ff=1024,
)

x = torch.randn(2, 32, 256)
y = encoder(x)

print(y.shape)  # torch.Size([2, 32, 256])

Token Embeddings and Position Embeddings

An encoder for text usually starts with token embeddings and position embeddings.

class TextTransformerEncoder(nn.Module):
    def __init__(
        self,
        vocab_size: int,
        max_length: int,
        num_layers: int,
        d_model: int,
        num_heads: int,
        d_ff: int,
        dropout: float = 0.1,
        pad_token_id: int = 0,
    ):
        super().__init__()

        self.pad_token_id = pad_token_id

        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(max_length, d_model)

        self.dropout = nn.Dropout(dropout)

        self.encoder = TransformerEncoder(
            num_layers=num_layers,
            d_model=d_model,
            num_heads=num_heads,
            d_ff=d_ff,
            dropout=dropout,
        )

    def forward(self, token_ids):
        # token_ids: [B, T]
        B, T = token_ids.shape

        positions = torch.arange(T, device=token_ids.device)
        positions = positions.unsqueeze(0).expand(B, T)

        x = self.token_embedding(token_ids)
        p = self.position_embedding(positions)

        x = self.dropout(x + p)

        key_padding_mask = token_ids.eq(self.pad_token_id)

        return self.encoder(x, key_padding_mask=key_padding_mask)

This module returns contextual token representations:

model = TextTransformerEncoder(
    vocab_size=30_000,
    max_length=512,
    num_layers=6,
    d_model=256,
    num_heads=8,
    d_ff=1024,
)

token_ids = torch.randint(0, 30_000, (4, 128))
hidden = model(token_ids)

print(hidden.shape)  # torch.Size([4, 128, 256])

Encoder Outputs

The encoder output has one vector per input token:

H \in \mathbb{R}^{B \times T \times D}.

Different tasks use this output differently.

For token classification, such as named entity recognition, we classify each token vector:

logits = classifier(hidden)

The output shape is:

[B, T, num_labels]
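
A minimal sketch of such a head, assuming a hypothetical set of 9 entity labels and the hidden tensor from the example above:

from torch import nn

num_labels = 9  # hypothetical label set size
classifier = nn.Linear(256, num_labels)

logits = classifier(hidden)  # [4, 128, 9]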

For sequence classification, we need one vector for the whole sequence. Common choices include using a special class token, mean pooling, or attention pooling (sketched below).

With a class token:

cls_repr = hidden[:, 0, :]
logits = classifier(cls_repr)

With mean pooling over non-padding tokens:

mask = token_ids.ne(pad_token_id).float()
mask = mask.unsqueeze(-1)

pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp_min(1.0)
logits = classifier(pooled)
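
Attention pooling can be sketched as a small learned scoring layer. This is one common variant, not a standard named module:

from torch import nn

class AttentionPool(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.score = nn.Linear(d_model, 1)  # one scalar score per token

    def forward(self, hidden, real_token_mask):
        # hidden: [B, T, D]; real_token_mask: [B, T], True for real tokens
        scores = self.score(hidden).squeeze(-1)         # [B, T]
        scores = scores.masked_fill(~real_token_mask, float("-inf"))
        weights = scores.softmax(dim=-1).unsqueeze(-1)  # [B, T, 1]
        return (hidden * weights).sum(dim=1)            # [B, D]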

Encoder Models

Encoder-only transformers are common when the task requires understanding an input rather than generating text token by token.

Examples include:

| Model family | Main use |
| --- | --- |
| BERT-style encoders | Text understanding |
| RoBERTa-style encoders | Stronger masked-language pretraining |
| Sentence transformers | Embeddings and retrieval |
| Vision transformers | Image classification and visual representation |
| Audio encoders | Speech and audio representation |
| Multimodal encoders | Joint image-text or audio-text representation |

These models differ in data, pretraining objective, tokenizer, positional method, architecture size, and downstream task.

Encoder Versus Decoder

An encoder sees the full input. A decoder generates outputs autoregressively.

| Property | Encoder | Decoder |
| --- | --- | --- |
| Attention mask | Usually bidirectional | Usually causal |
| Main task | Understand input | Generate output |
| Common output | Contextual representations | Next-token logits |
| Example use | Classification, retrieval, tagging | Language modeling, text generation |

Encoder-decoder models combine both. The encoder processes the source sequence. The decoder generates the target sequence while attending to encoder outputs.
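
For contrast with the encoder's bidirectional attention, a causal mask for nn.MultiheadAttention can be built with torch.triu, where True marks positions a query may not attend to:

import torch

T = 5
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

print(causal_mask[1])  # tensor([False, False,  True,  True,  True])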

Summary

A transformer encoder maps an input sequence to contextual representations. Each layer contains bidirectional multi-head self-attention, a position-wise feedforward network, residual connections, and layer normalization.

Encoder models are well suited to tasks where the full input is available: classification, tagging, retrieval, ranking, image recognition, and representation learning.

In PyTorch, a basic encoder can be built from nn.MultiheadAttention, nn.LayerNorm, feedforward layers, residual connections, and padding masks.