Encoder-Decoder Architectures

A sequence-to-sequence model maps one sequence to another sequence. The input and output may have different lengths. This setting appears in machine translation, summarization, speech recognition, dialogue, code generation, and many other tasks.

A standard supervised sequence-to-sequence problem has an input sequence

x = (x_1, x_2, \ldots, x_S)

and an output sequence

y = (y_1, y_2, \ldots, y_T).

The input length S and output length T need not be equal. In translation, an English sentence with 12 tokens may correspond to a French sentence with 15 tokens. In summarization, a long document may map to a short paragraph. In speech recognition, thousands of audio frames may map to a short text sequence.

The central problem is to model the conditional distribution

p(y \mid x).

A sequence-to-sequence model usually factorizes this distribution autoregressively:

p(y \mid x) = \prod_{t=1}^{T} p(y_t \mid y_1, \ldots, y_{t-1}, x).

This equation says that the model predicts the next output token using the input sequence and the output tokens already generated.
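
As a concrete sketch, the product of per-step probabilities is usually computed as a sum of log-probabilities in practice. The per-step values below are made up for illustration:

import math

# Hypothetical per-step probabilities p(y_t | y_<t, x) for a 3-token output.
step_probs = [0.9, 0.6, 0.8]

# The chain rule gives p(y | x) as the product of the per-step terms.
seq_prob = math.prod(step_probs)

# In practice we work in log space to avoid underflow on long sequences.
seq_log_prob = sum(math.log(p) for p in step_probs)

print(seq_prob)      # ≈ 0.432
print(seq_log_prob)  # ≈ -0.839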

The Encoder-Decoder Pattern

The encoder-decoder architecture separates the model into two parts.

The encoder reads the input sequence and produces a representation. The decoder reads that representation and generates the output sequence.

The encoder performs the mapping

(x_1, x_2, \ldots, x_S) \longmapsto h,

where h is a learned representation of the input. The decoder then performs

h \longmapsto (y_1, y_2, \ldots, y_T).

In early recurrent sequence-to-sequence models, h was often a single fixed-length vector. This vector was expected to contain all information needed to generate the output. In modern architectures, h is usually a sequence of hidden states rather than one vector.

For example, an encoder may produce

H = (h_1, h_2, \ldots, h_S),

where each h_i represents input token x_i in context. The decoder can then attend to these states while producing each output token.

Why Use an Encoder and a Decoder

Many prediction problems can be handled by a simple feedforward classifier. Sequence-to-sequence problems require more structure.

First, the input has variable length. A model must accept sequences with different numbers of tokens or frames.

Second, the output has variable length. The model must decide when to stop generating.

Third, the output is structured. The token at position t depends on earlier output tokens. For example, in translation, word order, agreement, and phrase structure create dependencies across the whole output.

The encoder-decoder design handles these problems by giving each part a clear role. The encoder builds a representation of the source sequence. The decoder converts that representation into a target sequence, one token at a time.

Recurrent Encoder-Decoder Models

The original encoder-decoder architecture used recurrent neural networks.

Let the input tokens be embedded as vectors

e_1, e_2, \ldots, e_S.

An encoder RNN processes them in order:

h_i = f_{\text{enc}}(h_{i-1}, e_i).

After reading the whole input sequence, the final hidden state h_S becomes a summary of the input.

The decoder is another RNN. It generates one output token at each step:

s_t = f_{\text{dec}}(s_{t-1}, u_{t-1}, h_S),

where s_t is the decoder hidden state and u_{t-1} is the embedding of the previous output token y_{t-1}.

The decoder then produces logits over the output vocabulary:

z_t = W s_t + b.

The probability of the next token is obtained with softmax:

p(y_t = k \mid y_{<t}, x) = \frac{\exp(z_{t,k})}{\sum_{j=1}^{V}\exp(z_{t,j})}.

Here V is the vocabulary size.
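
In code, this step is a linear projection followed by a softmax. A minimal sketch with arbitrary dimensions:

import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_dim, V = 256, 10000

# Hypothetical decoder state s_t for one example.
s_t = torch.randn(hidden_dim)

# The output projection W s_t + b produces one logit per vocabulary entry.
proj = nn.Linear(hidden_dim, V)
z_t = proj(s_t)

# Softmax turns the logits into a distribution over the next token.
p_next = F.softmax(z_t, dim=-1)
print(p_next.shape)  # torch.Size([10000])
print(p_next.sum())  # ≈ 1, up to floating point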

Start and End Tokens

The decoder needs a first input before it has generated any output token. For this reason, sequence-to-sequence models use a special start token, often written as <bos> or <sos>.

Generation begins with

y_0 = \texttt{<bos>}.

At each step, the decoder predicts the next token. Generation stops when the decoder emits a special end token, often written as <eos>.

For example, a target sequence

I like cats

may be represented during training as

<bos> I like cats <eos>

The decoder input is

<bos> I like cats

and the decoder target is

I like cats <eos>

This offset is central to autoregressive training.
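
A minimal sketch of this shift, assuming token IDs where 1 is <bos>, 2 is <eos>, and the sentence IDs are made up:

import torch

BOS_ID, EOS_ID = 1, 2  # assumed special-token IDs

# Target sentence "I like cats" as hypothetical token IDs.
target = torch.tensor([17, 42, 99])

full = torch.cat([torch.tensor([BOS_ID]), target, torch.tensor([EOS_ID])])

decoder_input = full[:-1]   # <bos> I like cats
decoder_target = full[1:]   # I like cats <eos>

print(decoder_input)   # tensor([ 1, 17, 42, 99])
print(decoder_target)  # tensor([17, 42, 99,  2])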

Training Objective

During training, the model is given the source sequence and the correct target sequence. The objective is to maximize the conditional likelihood of the target sequence:

\max_\theta \log p_\theta(y \mid x).

Equivalently, we minimize the negative log-likelihood:

\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t}, x).

For classification over a vocabulary, this is the cross-entropy loss applied at every output position.

In PyTorch, the logits usually have shape

[B, T, V]

where B is batch size, T is target sequence length, and V is vocabulary size.

The target token IDs have shape

[B, T]

A typical loss computation reshapes the tensors:

import torch
import torch.nn.functional as F

B, T, V = 32, 20, 50000

logits = torch.randn(B, T, V)
targets = torch.randint(0, V, (B, T))

# Flatten batch and time so each position is one classification example.
loss = F.cross_entropy(
    logits.reshape(B * T, V),
    targets.reshape(B * T),
)

This treats each target position as a vocabulary classification problem.
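
In real batches, padded target positions should not contribute to the loss. One common approach, continuing the snippet above with an assumed pad ID of 0, uses the ignore_index argument:

PAD_ID = 0  # assumed padding token ID

loss = F.cross_entropy(
    logits.reshape(B * T, V),
    targets.reshape(B * T),
    ignore_index=PAD_ID,  # padded positions are skipped in the average
)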

Teacher Forcing

During training, the decoder is usually given the correct previous token, not its own previous prediction. This method is called teacher forcing.

At training time, the decoder receives

(y_0, y_1, \ldots, y_{T-1})

and learns to predict

(y_1, y_2, \ldots, y_T).

At inference time, the correct target sequence is unknown. The decoder must feed back its own previous prediction.

This creates a gap between training and inference. During training, the model sees clean prefixes. During inference, a wrong prediction can corrupt future predictions. This problem is called exposure bias.

Despite this limitation, teacher forcing remains common because it is simple, efficient, and stable.

Minimal Recurrent Encoder-Decoder in PyTorch

A simple recurrent encoder-decoder model can be written with nn.GRU.

import torch
import torch.nn as nn

class Seq2SeqGRU(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, emb_dim, hidden_dim):
        super().__init__()

        self.src_embedding = nn.Embedding(src_vocab_size, emb_dim)
        self.tgt_embedding = nn.Embedding(tgt_vocab_size, emb_dim)

        self.encoder = nn.GRU(
            input_size=emb_dim,
            hidden_size=hidden_dim,
            batch_first=True,
        )

        self.decoder = nn.GRU(
            input_size=emb_dim,
            hidden_size=hidden_dim,
            batch_first=True,
        )

        self.output = nn.Linear(hidden_dim, tgt_vocab_size)

    def forward(self, src_tokens, tgt_input_tokens):
        # [B, S] -> [B, S, emb_dim]
        src_emb = self.src_embedding(src_tokens)

        # encoder_state: [1, B, hidden_dim], the final hidden state h_S
        _, encoder_state = self.encoder(src_emb)

        # [B, T] -> [B, T, emb_dim]
        tgt_emb = self.tgt_embedding(tgt_input_tokens)

        # h_S initializes the decoder; decoder_states: [B, T, hidden_dim]
        decoder_states, _ = self.decoder(tgt_emb, encoder_state)

        # [B, T, hidden_dim] -> [B, T, tgt_vocab_size]
        logits = self.output(decoder_states)

        return logits

The source tokens have shape

[B, S]

The target input tokens have shape

[B, T]

The output logits have shape

[B, T, V_tgt]

This model uses the final encoder hidden state as the initial decoder hidden state.
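
At inference time, the decoder must feed back its own predictions, as discussed under teacher forcing. A minimal greedy-decoding sketch for this model, assuming <bos> and <eos> IDs of 1 and 2:

@torch.no_grad()
def greedy_decode(model, src_tokens, bos_id=1, eos_id=2, max_len=50):
    # Encode the source once; reuse the final state for decoding.
    src_emb = model.src_embedding(src_tokens)
    _, state = model.encoder(src_emb)

    # Start with <bos> for every sequence in the batch.
    token = torch.full(
        (src_tokens.size(0), 1), bos_id,
        dtype=torch.long, device=src_tokens.device,
    )
    outputs = []

    for _ in range(max_len):
        emb = model.tgt_embedding(token)
        dec_out, state = model.decoder(emb, state)
        logits = model.output(dec_out[:, -1])        # [B, V_tgt]
        token = logits.argmax(dim=-1, keepdim=True)  # greedy choice, fed back
        outputs.append(token)
        # A full implementation would stop once every sequence emits <eos>.

    return torch.cat(outputs, dim=1)  # [B, max_len]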

The Fixed-Vector Bottleneck

The simple recurrent encoder-decoder compresses the entire input sequence into one vector. This creates a bottleneck.

For short inputs, a single vector may be sufficient. For long inputs, the final hidden state may lose information about early tokens. This is especially problematic in translation, summarization, and speech recognition.

The bottleneck can be described as follows:

(x_1, \ldots, x_S) \longrightarrow h_S \longrightarrow (y_1, \ldots, y_T).

All information must pass through h_S. If S is large, this representation may become overloaded.

Attention was introduced to solve this problem. Instead of forcing the decoder to rely on one vector, attention allows the decoder to inspect all encoder states:

H = (h_1, h_2, \ldots, h_S).

At each output step, the decoder chooses which encoder states are most relevant.

Encoder-Decoder with Attention

With attention, the encoder produces a sequence of states:

h_1, h_2, \ldots, h_S.

At decoder step t, the decoder has state s_t. It computes a score between s_t and each encoder state h_i:

e_{t,i} = \text{score}(s_t, h_i).

These scores are normalized into attention weights:

\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{S}\exp(e_{t,j})}.

The context vector is a weighted sum of encoder states:

c_t = \sum_{i=1}^{S} \alpha_{t,i} h_i.

The decoder then predicts the next token using both s_t and c_t.

Attention changes the architecture from fixed compression to dynamic retrieval. At each generation step, the decoder retrieves the source information it needs.
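
A minimal sketch of one attention step, using a dot product as the score function (one common choice; the dimensions are arbitrary):

import torch
import torch.nn.functional as F

B, S, H = 16, 30, 256

encoder_states = torch.randn(B, S, H)  # h_1, ..., h_S
s_t = torch.randn(B, H)                # decoder state at step t

# Dot-product scores e_{t,i} between s_t and each encoder state.
scores = torch.bmm(encoder_states, s_t.unsqueeze(-1)).squeeze(-1)  # [B, S]

# Normalize into attention weights alpha_{t,i}.
weights = F.softmax(scores, dim=-1)  # [B, S]

# Context vector c_t: weighted sum of encoder states.
c_t = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)  # [B, H]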

Transformer Encoder-Decoder Models

Modern sequence-to-sequence systems often use transformers instead of RNNs.

A transformer encoder maps input token embeddings to contextual representations:

X \in \mathbb{R}^{B \times S \times D} \longmapsto H \in \mathbb{R}^{B \times S \times D}.

A transformer decoder generates target representations using two attention mechanisms.

First, masked self-attention lets each target token attend to earlier target tokens.

Second, cross-attention lets target tokens attend to encoder outputs.

The decoder computes

Y_{\text{in}} \longmapsto Z,

where

Z \in \mathbb{R}^{B \times T \times D}.

A final linear layer maps decoder states to vocabulary logits:

\text{logits} \in \mathbb{R}^{B \times T \times V}.

Transformer encoder-decoder models are used in machine translation, summarization, speech-to-text, text-to-text learning, and multimodal generation.

Causal Masking in the Decoder

The decoder must not see future target tokens during training. Otherwise, it could cheat by looking at the answer.

For target positions 1, …, T, position t may attend only to positions 1, …, t. This is enforced using a causal mask.

The causal mask has the form

M_{ij} = \begin{cases} 0, & j \leq i, \\ -\infty, & j > i. \end{cases}

The mask is added to attention scores before softmax. Future positions receive probability zero after softmax.

In PyTorch:

def causal_mask(T, device):
    # Ones strictly above the diagonal mark future positions.
    mask = torch.triu(torch.ones(T, T, device=device), diagonal=1)
    # Replace them with -inf so softmax assigns them zero probability.
    mask = mask.masked_fill(mask == 1, float("-inf"))
    return mask

This mask prevents information leakage from future tokens.
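
For example, with T = 4 the mask looks like this:

print(causal_mask(4, "cpu"))
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])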

Minimal Transformer Encoder-Decoder in PyTorch

PyTorch provides nn.Transformer as a reference implementation.

import torch
import torch.nn as nn

class Seq2SeqTransformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, nhead, num_layers):
        super().__init__()

        self.src_embedding = nn.Embedding(src_vocab_size, d_model)
        self.tgt_embedding = nn.Embedding(tgt_vocab_size, d_model)

        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=nhead,
            num_encoder_layers=num_layers,
            num_decoder_layers=num_layers,
            batch_first=True,
        )

        self.output = nn.Linear(d_model, tgt_vocab_size)

    def forward(self, src_tokens, tgt_input_tokens, tgt_mask=None):
        # [B, S] -> [B, S, d_model] and [B, T] -> [B, T, d_model]
        src = self.src_embedding(src_tokens)
        tgt = self.tgt_embedding(tgt_input_tokens)

        # Encoder runs on src; decoder runs on tgt with the causal mask.
        hidden = self.transformer(
            src=src,
            tgt=tgt,
            tgt_mask=tgt_mask,
        )

        # [B, T, d_model] -> [B, T, tgt_vocab_size]
        logits = self.output(hidden)
        return logits

Example use:

B = 16
S = 30
T = 20
src_vocab_size = 32000
tgt_vocab_size = 32000

model = Seq2SeqTransformer(
    src_vocab_size=src_vocab_size,
    tgt_vocab_size=tgt_vocab_size,
    d_model=512,
    nhead=8,
    num_layers=6,
)

src_tokens = torch.randint(0, src_vocab_size, (B, S))
tgt_input_tokens = torch.randint(0, tgt_vocab_size, (B, T))

mask = causal_mask(T, src_tokens.device)

logits = model(src_tokens, tgt_input_tokens, tgt_mask=mask)

print(logits.shape)  # torch.Size([16, 20, 32000])

This example omits positional encodings and padding masks, both of which are needed in a complete implementation.
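
A common choice for the former is the sinusoidal positional encoding from the original transformer paper. A minimal sketch, assuming an even d_model; the result would be added to the token embeddings before the transformer:

import math
import torch

def sinusoidal_positions(max_len, d_model):
    # Frequencies 1 / 10000^(2i / d_model) for each dimension pair.
    position = torch.arange(max_len).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe  # [max_len, d_model]

# Added to embeddings, broadcasting over the batch dimension, e.g.:
# src = self.src_embedding(src_tokens) + sinusoidal_positions(S, d_model)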

Padding Masks

Batches contain sequences of different lengths. To store them in a tensor, shorter sequences are padded.

For example:

[12, 90, 44, 8, 2]
[51, 17, 2, 0, 0]
[33, 74, 19, 41, 2]

Here 0 may be the padding token, and 2 may be the end token.

The model should ignore padding tokens. A padding mask marks which positions are padding.

In PyTorch, a source padding mask usually has shape

[B, S]

where True indicates a padded position for nn.Transformer.

pad_id = 0  # ID of the padding token in the vocabulary

src_key_padding_mask = src_tokens == pad_id        # [B, S], True at padding
tgt_key_padding_mask = tgt_input_tokens == pad_id  # [B, T]

These masks prevent attention from treating padding as meaningful content.
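
With nn.Transformer, the padding masks are passed alongside the causal mask. A sketch of the call inside a forward method like the one above:

# Inside Seq2SeqTransformer.forward, assuming the padding masks are
# computed from the token IDs as shown above:
hidden = self.transformer(
    src=src,
    tgt=tgt,
    tgt_mask=tgt_mask,
    src_key_padding_mask=src_key_padding_mask,
    tgt_key_padding_mask=tgt_key_padding_mask,
    # Cross-attention should also ignore padded source positions.
    memory_key_padding_mask=src_key_padding_mask,
)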

Encoder-Decoder Versus Decoder-Only Models

Encoder-decoder models and decoder-only models both generate sequences, but their conditioning structure differs.

An encoder-decoder model explicitly separates source encoding from target generation. This is natural when the input and output have distinct roles, as in translation or summarization.

A decoder-only model concatenates the input and output into one sequence and predicts tokens autoregressively. Many large language models use this design.

For example, a decoder-only model may receive:

Translate English to French:
I like cats.
French:

and then continue generating the answer.

The encoder-decoder design gives the source sequence bidirectional context through the encoder. The decoder-only design uses one unified causal sequence. Encoder-decoder models are often efficient for text-to-text tasks with long inputs and shorter outputs. Decoder-only models are simpler and scale well for general-purpose language modeling.

Summary

An encoder-decoder model maps an input sequence to an output sequence. The encoder builds a representation of the input. The decoder generates the output one token at a time.

The basic probabilistic form is

p(y \mid x) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, x).

Early models used recurrent networks and compressed the source into a fixed-length vector. Attention removed this bottleneck by letting the decoder access all encoder states. Transformer encoder-decoder models use self-attention in the encoder, masked self-attention in the decoder, and cross-attention between the decoder and encoder outputs.

In PyTorch, sequence-to-sequence models are usually trained with teacher forcing, cross-entropy loss over target tokens, causal masks for decoder self-attention, and padding masks for variable-length batches.