A sequence-to-sequence model maps one sequence to another sequence. The input and output may have different lengths. This setting appears in machine translation, summarization, speech recognition, dialogue, code generation, and many other tasks.
A standard supervised sequence-to-sequence problem has an input sequence
$$x = (x_1, x_2, \dots, x_S)$$
and an output sequence
$$y = (y_1, y_2, \dots, y_T).$$
The input length and output length need not be equal. In translation, an English sentence with 12 tokens may correspond to a French sentence with 15 tokens. In summarization, a long document may map to a short paragraph. In speech recognition, thousands of audio frames may map to a short text sequence.
The central problem is to model the conditional distribution
$$p(y \mid x).$$
A sequence-to-sequence model usually factorizes this distribution autoregressively:
$$p(y \mid x) = \prod_{t=1}^{T} p(y_t \mid y_1, \dots, y_{t-1}, x).$$
This equation says that the model predicts the next output token using the input sequence and the output tokens already generated.
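This factorization can be illustrated with a toy computation, using random logits in place of a real model's per-step predictions (all sizes and token IDs here are made up):

```python
import torch
import torch.nn.functional as F

# log p(y | x) = sum_t log p(y_t | y_<t, x):
# score each position, then sum the per-step log-probabilities.
torch.manual_seed(0)
T, V = 4, 10                      # output length, vocabulary size
step_logits = torch.randn(T, V)   # stand-in for model predictions
targets = torch.tensor([3, 1, 4, 1])

log_probs = F.log_softmax(step_logits, dim=-1)   # [T, V]
per_step = log_probs[torch.arange(T), targets]   # log p(y_t | y_<t, x)
sequence_log_prob = per_step.sum()               # log p(y | x)
```

Training maximizes exactly this sum over the gold target tokens.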
The Encoder-Decoder Pattern
The encoder-decoder architecture separates the model into two parts.
The encoder reads the input sequence and produces a representation. The decoder reads that representation and generates the output sequence.
The encoder performs the mapping
$$c = \mathrm{Encoder}(x_1, \dots, x_S),$$
where $c$ is a learned representation of the input. The decoder then performs
$$p(y_t \mid y_{<t}, c).$$
In early recurrent sequence-to-sequence models, $c$ was often a single fixed-length vector. This vector was expected to contain all information needed to generate the output. In modern architectures, $c$ is usually a sequence of hidden states rather than one vector.
For example, an encoder may produce
$$h_1, h_2, \dots, h_S,$$
where each $h_i$ represents input token $x_i$ in context. The decoder can then attend to these states while producing each output token.
Why Use an Encoder and a Decoder
Many prediction problems can be handled by a simple feedforward classifier. Sequence-to-sequence problems require more structure.
First, the input has variable length. A model must accept sequences with different numbers of tokens or frames.
Second, the output has variable length. The model must decide when to stop generating.
Third, the output is structured. The token at position $t$ depends on earlier output tokens. For example, in translation, word order, agreement, and phrase structure create dependencies across the whole output.
The encoder-decoder design handles these problems by giving each part a clear role. The encoder builds a representation of the source sequence. The decoder converts that representation into a target sequence, one token at a time.
Recurrent Encoder-Decoder Models
The original encoder-decoder architecture used recurrent neural networks.
Let the input tokens be embedded as vectors
$$e_1, e_2, \dots, e_S.$$
An encoder RNN processes them in order:
$$h_s = f(h_{s-1}, e_s), \qquad s = 1, \dots, S.$$
After reading the whole input sequence, the final hidden state $h_S$ becomes a summary of the input.
The decoder is another RNN. It generates one output token at each step:
$$s_t = g(s_{t-1}, \tilde{e}_{t-1}), \qquad s_0 = h_S,$$
where $s_t$ is the decoder hidden state and $\tilde{e}_{t-1}$ is the embedding of the previous output token $y_{t-1}$.
The decoder then produces logits over the output vocabulary:
$$o_t = W s_t + b.$$
The probability of the next token is obtained with softmax:
$$p(y_t = k \mid y_{<t}, x) = \frac{\exp(o_{t,k})}{\sum_{j=1}^{V} \exp(o_{t,j})}.$$
Here $V$ is the vocabulary size.
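As a sketch, the output layer and softmax take only a few lines in PyTorch; the dimensions below are illustrative:

```python
import torch
import torch.nn as nn

# o_t = W s_t + b, followed by softmax over the vocabulary.
hidden_dim, V = 256, 1000
output_layer = nn.Linear(hidden_dim, V)

s_t = torch.randn(1, hidden_dim)        # decoder hidden state at step t
logits = output_layer(s_t)              # [1, V] vocabulary logits
probs = torch.softmax(logits, dim=-1)   # distribution over the next token
```

Each row of probs is a valid probability distribution: non-negative and summing to one.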
Start and End Tokens
The decoder needs a first input before it has generated any output token. For this reason, sequence-to-sequence models use a special start token, often written as <bos> or <sos>.
Generation begins with
$$y_0 = \texttt{<bos>}.$$
At each step, the decoder predicts the next token. Generation stops when the decoder emits a special end token, often written as <eos>.
For example, a target sequence
I like cats
may be represented during training as
<bos> I like cats <eos>
The decoder input is
<bos> I like cats
and the decoder target is
I like cats <eos>
This offset is central to autoregressive training.
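In code, the offset amounts to prepending <bos> to form the decoder input and appending <eos> to form the decoder target. The token IDs below are hypothetical:

```python
# Hypothetical vocabulary IDs for the special tokens and words.
BOS, EOS = 1, 2
target_tokens = [37, 52, 804]             # "I like cats"

decoder_input = [BOS] + target_tokens     # <bos> I like cats
decoder_target = target_tokens + [EOS]    # I like cats <eos>
```

Position t of decoder_input lines up with position t of decoder_target, so each input token is trained to predict its successor.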
Training Objective
During training, the model is given the source sequence and the correct target sequence. The objective is to maximize the conditional likelihood of the target sequence:
$$\log p(y \mid x) = \sum_{t=1}^{T} \log p(y_t \mid y_{<t}, x).$$
Equivalently, we minimize the negative log-likelihood:
$$\mathcal{L} = -\sum_{t=1}^{T} \log p(y_t \mid y_{<t}, x).$$
For classification over a vocabulary, this is the cross-entropy loss applied at every output position.
In PyTorch, the logits usually have shape [B, T, V], where B is the batch size, T is the target sequence length, and V is the vocabulary size. The target token IDs have shape [B, T]. A typical loss computation reshapes the tensors:

```python
import torch
import torch.nn.functional as F

B, T, V = 32, 20, 50000
logits = torch.randn(B, T, V)
targets = torch.randint(0, V, (B, T))

# Flatten batch and time so each position is one classification example.
loss = F.cross_entropy(
    logits.reshape(B * T, V),
    targets.reshape(B * T),
)
```

This treats each target position as a vocabulary classification problem.
Teacher Forcing
During training, the decoder is usually given the correct previous token, not its own previous prediction. This method is called teacher forcing.
At training time, the decoder receives the shifted gold sequence
$$(\texttt{<bos>}, y_1, \dots, y_{T-1})$$
and learns to predict
$$(y_1, y_2, \dots, y_T).$$
At inference time, the correct target sequence is unknown. The decoder must feed back its own previous prediction.
This creates a gap between training and inference. During training, the model sees clean prefixes. During inference, a wrong prediction can corrupt future predictions. This problem is called exposure bias.
Despite this limitation, teacher forcing remains common because it is simple, efficient, and stable.
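The inference-time feedback loop can be sketched as follows. Here decoder_step is a stand-in that returns random logits; a real decoder would condition on the previous token and its hidden state:

```python
import torch

torch.manual_seed(0)
V, BOS, EOS, MAX_LEN = 10, 1, 2, 8

def decoder_step(prev_token, state):
    # Stand-in for a real decoder step; ignores its inputs.
    return torch.randn(V), state

generated = []
token, state = BOS, None
for _ in range(MAX_LEN):
    logits, state = decoder_step(token, state)
    token = int(logits.argmax())   # greedy choice, fed back next step
    if token == EOS:
        break
    generated.append(token)
```

Each predicted token becomes the next input, which is exactly where exposure bias enters: one wrong choice changes the conditioning of every later step.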
Minimal Recurrent Encoder-Decoder in PyTorch
A simple recurrent encoder-decoder model can be written with nn.GRU.
```python
import torch
import torch.nn as nn

class Seq2SeqGRU(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.src_embedding = nn.Embedding(src_vocab_size, emb_dim)
        self.tgt_embedding = nn.Embedding(tgt_vocab_size, emb_dim)
        self.encoder = nn.GRU(
            input_size=emb_dim,
            hidden_size=hidden_dim,
            batch_first=True,
        )
        self.decoder = nn.GRU(
            input_size=emb_dim,
            hidden_size=hidden_dim,
            batch_first=True,
        )
        self.output = nn.Linear(hidden_dim, tgt_vocab_size)

    def forward(self, src_tokens, tgt_input_tokens):
        src_emb = self.src_embedding(src_tokens)
        _, encoder_state = self.encoder(src_emb)
        tgt_emb = self.tgt_embedding(tgt_input_tokens)
        decoder_states, _ = self.decoder(tgt_emb, encoder_state)
        logits = self.output(decoder_states)
        return logits
```

The source tokens have shape [B, S], the target input tokens have shape [B, T], and the output logits have shape [B, T, V_tgt]. This model uses the final encoder hidden state as the initial decoder hidden state.
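The state handoff at the heart of this model can be checked in isolation with raw nn.GRU modules and random embeddings (the dimensions here are arbitrary):

```python
import torch
import torch.nn as nn

emb_dim, hidden_dim, B, S, T = 8, 16, 4, 5, 3
encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)

src_emb = torch.randn(B, S, emb_dim)          # embedded source tokens
_, enc_state = encoder(src_emb)               # final state, [1, B, hidden_dim]
tgt_emb = torch.randn(B, T, emb_dim)          # embedded target inputs
dec_states, _ = decoder(tgt_emb, enc_state)   # decoder starts from enc_state
```

Note that nn.GRU returns the hidden state with a leading layer dimension, which is why enc_state can be passed directly as the decoder's initial state.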
The Fixed-Vector Bottleneck
The simple recurrent encoder-decoder compresses the entire input sequence into one vector. This creates a bottleneck.
For short inputs, a single vector may be sufficient. For long inputs, the final hidden state may lose information about early tokens. This is especially problematic in translation, summarization, and speech recognition.
The bottleneck can be described as follows:
$$p(y \mid x) \approx p(y \mid h_S).$$
All information must pass through $h_S$. If the input length $S$ is large, this representation may become overloaded.
Attention was introduced to solve this problem. Instead of forcing the decoder to rely on one vector, attention allows the decoder to inspect all encoder states:
$$p(y_t \mid y_{<t}, h_1, \dots, h_S).$$
At each output step, the decoder chooses which encoder states are most relevant.
Encoder-Decoder with Attention
With attention, the encoder produces a sequence of states:
$$h_1, h_2, \dots, h_S.$$
At decoder step $t$, the decoder has state $s_t$. It computes a score between $s_t$ and each encoder state $h_i$:
$$e_{t,i} = \mathrm{score}(s_t, h_i).$$
These scores are normalized into attention weights:
$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{S} \exp(e_{t,j})}.$$
The context vector is a weighted sum of encoder states:
$$c_t = \sum_{i=1}^{S} \alpha_{t,i} h_i.$$
The decoder then predicts the next token using both $s_t$ and $c_t$.
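These equations take only a few lines with random tensors standing in for the real hidden states. A dot product is used as the scoring function here; other common choices add learned parameters:

```python
import torch

torch.manual_seed(0)
S, d = 6, 32
h = torch.randn(S, d)   # encoder states h_1 .. h_S
s_t = torch.randn(d)    # decoder state at step t

scores = h @ s_t                        # e_{t,i}: dot product of s_t and h_i
weights = torch.softmax(scores, dim=0)  # alpha_{t,i}, sums to 1
context = weights @ h                   # c_t: weighted sum of encoder states
```

The weights form a distribution over source positions, so the context vector stays in the same space as the encoder states regardless of the source length.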
Attention changes the architecture from fixed compression to dynamic retrieval. At each generation step, the decoder retrieves the source information it needs.
Transformer Encoder-Decoder Models
Modern sequence-to-sequence systems often use transformers instead of RNNs.
A transformer encoder maps input token embeddings to contextual representations:
$$H = (h_1, h_2, \dots, h_S) = \mathrm{Encoder}(x_1, \dots, x_S).$$
A transformer decoder generates target representations using two attention mechanisms.
First, masked self-attention lets each target token attend to earlier target tokens.
Second, cross-attention lets target tokens attend to encoder outputs.
The decoder computes
$$Z = \mathrm{Decoder}(Y_{\mathrm{in}}, H),$$
where $Y_{\mathrm{in}}$ is the shifted target sequence and $H$ is the sequence of encoder outputs. A final linear layer maps decoder states to vocabulary logits:
$$o_t = W z_t + b.$$
Transformer encoder-decoder models are used in machine translation, summarization, speech-to-text, text-to-text learning, and multimodal generation.
Causal Masking in the Decoder
The decoder must not see future target tokens during training. Otherwise, it could cheat by looking at the answer.
For target positions $1, \dots, T$, position $t$ may attend only to positions $1, \dots, t$. This is enforced using a causal mask.
The causal mask has the form
$$M_{ij} = \begin{cases} 0 & \text{if } j \le i, \\ -\infty & \text{if } j > i. \end{cases}$$
The mask is added to the attention scores before the softmax. Future positions receive attention weight zero after the softmax.
In PyTorch:
```python
import torch

def causal_mask(T, device):
    # 1s above the diagonal mark future positions; replace them with -inf.
    mask = torch.triu(torch.ones(T, T, device=device), diagonal=1)
    mask = mask.masked_fill(mask == 1, float("-inf"))
    return mask
```

This mask prevents information leakage from future tokens.
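For a small T the mask is easy to inspect. Building it the same way for T = 3 gives zeros on and below the diagonal and -inf above it:

```python
import torch

T = 3
mask = torch.triu(torch.ones(T, T), diagonal=1)
mask = mask.masked_fill(mask == 1, float("-inf"))
# [[0., -inf, -inf],
#  [0.,   0., -inf],
#  [0.,   0.,   0.]]
```

Row t of the mask governs which positions step t may attend to: only columns up to and including t survive the softmax.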
Minimal Transformer Encoder-Decoder in PyTorch
PyTorch provides nn.Transformer as a reference implementation.
```python
import torch
import torch.nn as nn

class Seq2SeqTransformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, nhead, num_layers):
        super().__init__()
        self.src_embedding = nn.Embedding(src_vocab_size, d_model)
        self.tgt_embedding = nn.Embedding(tgt_vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=nhead,
            num_encoder_layers=num_layers,
            num_decoder_layers=num_layers,
            batch_first=True,
        )
        self.output = nn.Linear(d_model, tgt_vocab_size)

    def forward(self, src_tokens, tgt_input_tokens, tgt_mask=None):
        src = self.src_embedding(src_tokens)
        tgt = self.tgt_embedding(tgt_input_tokens)
        hidden = self.transformer(
            src=src,
            tgt=tgt,
            tgt_mask=tgt_mask,
        )
        logits = self.output(hidden)
        return logits
```

Example use:
```python
B = 16
S = 30
T = 20
src_vocab_size = 32000
tgt_vocab_size = 32000

model = Seq2SeqTransformer(
    src_vocab_size=src_vocab_size,
    tgt_vocab_size=tgt_vocab_size,
    d_model=512,
    nhead=8,
    num_layers=6,
)

src_tokens = torch.randint(0, src_vocab_size, (B, S))
tgt_input_tokens = torch.randint(0, tgt_vocab_size, (B, T))
mask = causal_mask(T, src_tokens.device)

logits = model(src_tokens, tgt_input_tokens, tgt_mask=mask)
print(logits.shape)  # torch.Size([16, 20, 32000])
```

This example omits positional encodings and padding masks, both of which are needed in a complete implementation.
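One common choice for the missing positional information is the sinusoidal encoding. A sketch, which would be added to the token embeddings before the transformer:

```python
import math
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    # pe[pos, 2i] = sin(pos / 10000^(2i/d)), pe[pos, 2i+1] = cos(same).
    position = torch.arange(max_len).float().unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe = sinusoidal_positional_encoding(max_len=20, d_model=512)
```

Because the encoding depends only on position and dimension, it is computed once and reused; learned positional embeddings are an equally common alternative.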
Padding Masks
Batches contain sequences of different lengths. To store them in a tensor, shorter sequences are padded.
For example:

```
[12, 90, 44, 8, 2]
[51, 17, 2, 0, 0]
[33, 74, 19, 41, 2]
```

Here 0 may be the padding token, and 2 may be the end token.
The model should ignore padding tokens. A padding mask marks which positions are padding.
In PyTorch, a source padding mask usually has shape [B, S], where True indicates a padded position for nn.Transformer.

```python
src_key_padding_mask = src_tokens == pad_id
tgt_key_padding_mask = tgt_input_tokens == pad_id
```

These masks prevent attention from treating padding as meaningful content.
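A short sketch of how these masks are passed to nn.Transformer, using random embeddings in place of a full model (shapes and token IDs are illustrative):

```python
import torch
import torch.nn as nn

pad_id = 0
B, S, T, d_model = 2, 5, 4, 32
src_tokens = torch.tensor([[12, 90, 44, 8, 2],
                           [51, 17, 2, 0, 0]])
tgt_input_tokens = torch.randint(1, 100, (B, T))

src_key_padding_mask = src_tokens == pad_id         # [B, S], True = padding
tgt_key_padding_mask = tgt_input_tokens == pad_id   # [B, T]

transformer = nn.Transformer(
    d_model=d_model, nhead=4,
    num_encoder_layers=1, num_decoder_layers=1,
    batch_first=True,
)
src = torch.randn(B, S, d_model)
tgt = torch.randn(B, T, d_model)
out = transformer(
    src, tgt,
    src_key_padding_mask=src_key_padding_mask,
    tgt_key_padding_mask=tgt_key_padding_mask,
    memory_key_padding_mask=src_key_padding_mask,  # pads in encoder output
)
```

The memory_key_padding_mask reuses the source mask so that cross-attention also ignores padded source positions.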
Encoder-Decoder Versus Decoder-Only Models
Encoder-decoder models and decoder-only models both generate sequences, but their conditioning structure differs.
An encoder-decoder model explicitly separates source encoding from target generation. This is natural when the input and output have distinct roles, as in translation or summarization.
A decoder-only model concatenates the input and output into one sequence and predicts tokens autoregressively. Many large language models use this design.
For example, a decoder-only model may receive:

Translate English to French:
I like cats.
French:

and then continue generating the answer.
The encoder-decoder design gives the source sequence bidirectional context through the encoder. The decoder-only design uses one unified causal sequence. Encoder-decoder models are often efficient for text-to-text tasks with long inputs and shorter outputs. Decoder-only models are simpler and scale well for general-purpose language modeling.
Summary
An encoder-decoder model maps an input sequence to an output sequence. The encoder builds a representation of the input. The decoder generates the output one token at a time.
The basic probabilistic form is
$$p(y \mid x) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, x).$$
Early models used recurrent networks and compressed the source into a fixed-length vector. Attention removed this bottleneck by letting the decoder access all encoder states. Transformer encoder-decoder models use self-attention in the encoder, masked self-attention in the decoder, and cross-attention between the decoder and encoder outputs.
In PyTorch, sequence-to-sequence models are usually trained with teacher forcing, cross-entropy loss over target tokens, causal masks for decoder self-attention, and padding masks for variable-length batches.