Machine translation converts text from one language into another. Given a source sentence in one language, the model generates a semantically equivalent sentence in a target language.
For example:
| Source language | Target language |
|---|---|
| the cat sat on the mat | le chat s'est assis sur le tapis |
| good morning | buenos días |
| where is the station? | 駅はどこですか |
Modern neural machine translation systems are usually sequence-to-sequence models built with transformers.
The core problem is conditional sequence modeling. Given a source sequence x = (x_1, ..., x_S), the model predicts a target sequence y = (y_1, ..., y_T). The goal is to maximize the conditional probability P(y | x). The translation model predicts the target tokens one step at a time:

P(y | x) = ∏_{t=1}^{T} P(y_t | y_{<t}, x)
The next target token depends on both the source sentence and the previously generated target tokens.
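As a toy illustration of this one-step-at-a-time factorization, the probability of a whole target sentence is the product of the per-step conditional probabilities. The values below are made up purely for the example:

```python
import math

# Hypothetical per-step conditional probabilities P(y_t | y_<t, x)
# for a three-token target sentence (values are illustrative only).
step_probs = [0.9, 0.8, 0.95]

# P(y | x) is the product of the per-step conditionals.
sequence_prob = math.prod(step_probs)

# In practice systems work in log space, turning the product
# into a numerically stable sum.
sequence_logprob = sum(math.log(p) for p in step_probs)
```

Working in log space matters because real sentences have dozens of steps, and a product of many small probabilities underflows quickly.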
Sequence-to-Sequence Learning
Neural machine translation is usually formulated as sequence-to-sequence learning.
A sequence-to-sequence model contains:
| Component | Purpose |
|---|---|
| Encoder | Reads the source sequence |
| Decoder | Generates the target sequence |
| Attention | Connects decoder states to source representations |
The encoder transforms the source sequence into contextual representations h_1, ..., h_S = Encoder(x_1, ..., x_S).
The decoder then generates target tokens autoregressively.
At decoding step t, the decoder predicts P(y_t | y_{<t}, h_1, ..., h_S).
This structure became dominant because it can handle variable-length inputs and outputs. Earlier phrase-based statistical systems relied heavily on hand-engineered alignment rules and feature systems. Neural sequence models learn translation behavior directly from data.
Tokenization in Translation Systems
Both source and target languages are tokenized before training.
Suppose:
```
English: the cat sat
French:  le chat s'est assis
```

The tokenizer converts text into token IDs.

Source:

```
[12, 85, 901]
```

Target:

```
[44, 219, 777, 602]
```

Translation systems usually use subword tokenization because vocabularies must handle many languages, names, compounds, and rare words.
A multilingual tokenizer may share subword units across languages. This allows the model to reuse representations between related words and scripts.
For example:
```
international
internacional
internationale
```

may partially share subword pieces.
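A minimal sketch of how shared subword pieces can cover related words across languages. The vocabulary and the greedy longest-match-first segmenter below are hypothetical, for illustration only; real systems learn vocabularies with algorithms such as BPE:

```python
# Hypothetical shared subword vocabulary (illustrative only).
vocab = {"intern", "ation", "ational", "acion", "al", "e"}

def greedy_subword_tokenize(word, vocab, max_piece_len=10):
    """Greedy longest-match-first subword segmentation."""
    pieces = []
    i = 0
    while i < len(word):
        for length in range(min(max_piece_len, len(word) - i), 0, -1):
            piece = word[i:i + length]
            # Fall back to single characters when no piece matches.
            if piece in vocab or length == 1:
                pieces.append(piece)
                i += length
                break
    return pieces

english = greedy_subword_tokenize("international", vocab)
spanish = greedy_subword_tokenize("internacional", vocab)
```

Both words start with the shared piece `intern`, so the model can reuse that embedding across English and Spanish.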
Encoder Representations
The encoder processes the source tokens and produces contextual vectors.
If the source token tensor has shape:

```
[B, S]
```

then the encoder output usually has shape:

```
[B, S, D]
```

where:
| Symbol | Meaning |
|---|---|
| B | Batch size |
| S | Source sequence length |
| D | Hidden dimension |
Each source token receives a contextual representation. In a transformer encoder, self-attention allows every source token to interact with every other source token.
For example, the word bank in:
```
the boat reached the bank
```

receives a different contextual representation from bank in:

```
she deposited money at the bank
```

because the surrounding context changes the hidden state.
Decoder Generation
The decoder generates one target token at a time.
At decoding step t, the decoder receives:
| Input | Description |
|---|---|
| Previous target tokens | y_1, ..., y_{t-1} |
| Encoder representations | Source hidden states |
| Positional information | Token positions |
The decoder predicts logits z_t over the target vocabulary. A softmax converts the logits into probabilities:

P(y_t = v | y_{<t}, x) = softmax(z_t)_v
The highest-probability token may then be selected or sampled.
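A small numeric sketch of this step, using made-up logits over a toy five-word target vocabulary:

```python
import math

# Made-up decoder logits over a toy target vocabulary.
vocab = ["le", "chat", "dort", "noir", "<eos>"]
logits = [2.0, 0.5, -1.0, 0.1, -2.0]

# Softmax: exponentiate (subtracting the max for numerical
# stability), then normalize so the values sum to 1.
m = max(logits)
exps = [math.exp(z - m) for z in logits]
total = sum(exps)
probs = [e / total for e in exps]

# Greedy selection: take the highest-probability token.
best = vocab[probs.index(max(probs))]
```

Sampling would instead draw from `probs`, trading determinism for diversity.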
Teacher Forcing
During training, the decoder usually receives the true previous target tokens rather than its own predictions. This is called teacher forcing.
Suppose the target sentence is:
```
le chat dort
```

Training proceeds as:
| Decoder input | Prediction target |
|---|---|
| <bos> | le |
| <bos> le | chat |
| <bos> le chat | dort |
| <bos> le chat dort | <eos> |
Teacher forcing stabilizes training because the decoder always receives correct history tokens. Without it, prediction errors early in the sequence could quickly corrupt later predictions.
However, teacher forcing creates a mismatch between training and inference. During inference, the model must condition on its own generated outputs.
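The shifted input/target pairs used in teacher forcing can be built directly from the token IDs. A minimal sketch, assuming hypothetical IDs with 1 = `<bos>` and 2 = `<eos>`:

```python
# Hypothetical target token IDs: le=44, chat=219, dort=602.
# Special tokens (assumed): <bos>=1, <eos>=2.
target = [44, 219, 602]
BOS, EOS = 1, 2

# Decoder input: gold tokens shifted right, with <bos> prepended.
decoder_input = [BOS] + target

# Prediction targets: gold tokens with <eos> appended.
labels = target + [EOS]

# At position t, the decoder sees decoder_input[:t+1]
# and must predict labels[t].
```

This shift-by-one construction is what lets every position be trained in parallel from a single forward pass.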
Attention Mechanisms
Attention allows the decoder to focus on relevant source positions while generating each target token.
Suppose the source sentence is:
```
the black cat sleeps
```

When generating the French word:

```
chat
```

the decoder should focus strongly on cat.
Attention computes alignment scores between the decoder state and encoder states.
If q_t is the decoder query vector at step t, and k_1, ..., k_S are encoder key vectors, then attention scores are computed as:

a_{t,i} = softmax_i(q_t · k_i / √d)

The decoder then forms a weighted combination of encoder value vectors:

c_t = Σ_i a_{t,i} v_i
The context vector summarizes the relevant source information for the current decoding step.
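A compact NumPy sketch of these two steps, with random vectors standing in for the decoder query and the encoder keys and values:

```python
import numpy as np

rng = np.random.default_rng(0)
S, d = 4, 8                      # source length, hidden dimension
q = rng.normal(size=(d,))        # decoder query at the current step
K = rng.normal(size=(S, d))      # encoder key vectors
V = rng.normal(size=(S, d))      # encoder value vectors

# Alignment scores: scaled dot products between the query and each key.
scores = K @ q / np.sqrt(d)

# Softmax turns scores into attention weights over source positions.
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# Context vector: weighted combination of encoder values.
context = weights @ V
```

In a real transformer this runs for all steps and all heads at once as batched matrix multiplications, but the per-step arithmetic is exactly this.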
Transformer Translation Models
Modern translation systems usually use transformer architectures.
A transformer translation model contains:
| Component | Description |
|---|---|
| Encoder self-attention | Source tokens attend to source tokens |
| Decoder self-attention | Target tokens attend to earlier target tokens |
| Cross-attention | Decoder attends to encoder outputs |
The decoder uses a causal mask so that future target tokens remain hidden during training.
The overall architecture is:
```
source tokens
-> encoder
-> encoder hidden states
-> decoder with cross-attention
-> target logits
```

Transformers replaced recurrent translation systems because they parallelize efficiently and model long-range dependencies better.
A Minimal Seq2Seq Transformer Interface
A translation model often receives:
```
source_ids:  [B, S]
source_mask: [B, S]
target_ids:  [B, T]
target_mask: [B, T]
```

The model outputs:

```
logits: [B, T, V]
```

where V is the target vocabulary size.
A simplified PyTorch-style interface:
```python
import torch.nn as nn

class TranslationModel(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(
        self,
        source_ids,
        source_mask,
        target_ids,
        target_mask,
    ):
        # Encode the source sentence once.
        encoder_hidden = self.encoder(
            input_ids=source_ids,
            attention_mask=source_mask,
        )
        # The decoder cross-attends to the encoder states
        # while processing the target tokens.
        logits = self.decoder(
            input_ids=target_ids,
            attention_mask=target_mask,
            encoder_hidden_states=encoder_hidden,
            encoder_attention_mask=source_mask,
        )
        return logits
```

The encoder processes the source sentence once. The decoder repeatedly attends to those encoder representations while generating the translation.
Training Objective
Translation models are usually trained with token-level cross-entropy loss.
Suppose:
```
logits:  [B, T, V]
targets: [B, T]
```

We flatten the batch and sequence dimensions:

```python
loss_fn = nn.CrossEntropyLoss(ignore_index=0)

B, T, V = logits.shape
loss = loss_fn(
    logits.reshape(B * T, V),
    targets.reshape(B * T),
)
```

The target tensor contains the expected next token at each position. Padding tokens (assumed here to have ID 0, via `ignore_index=0`) are ignored.
Greedy Decoding
During inference, the decoder generates tokens autoregressively.
The simplest decoding strategy is greedy decoding.
At each step:
The model selects the highest-probability next token.
Example:
| Input so far | Output |
|---|---|
| <bos> | le |
| le | chat |
| le chat | dort |
| le chat dort | <eos> |
Greedy decoding is simple and fast, but it may produce suboptimal translations because it commits to local decisions too early.
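A greedy decoding loop can be sketched with a stub in place of the real model. Here `next_token_probs` and its hard-coded distributions are hypothetical; a real system would call the decoder at each step:

```python
# Hypothetical next-token distributions keyed by the prefix so far
# (a stand-in for real decoder calls).
TRANSITIONS = {
    (): {"le": 0.9, "chat": 0.1},
    ("le",): {"chat": 0.8, "dort": 0.2},
    ("le", "chat"): {"dort": 0.7, "<eos>": 0.3},
    ("le", "chat", "dort"): {"<eos>": 0.95, "dort": 0.05},
}

def next_token_probs(prefix):
    return TRANSITIONS[tuple(prefix)]

def greedy_decode(max_len=10):
    output = []
    for _ in range(max_len):
        probs = next_token_probs(output)
        token = max(probs, key=probs.get)  # pick the most likely token
        if token == "<eos>":
            break
        output.append(token)
    return output

translation = greedy_decode()
```

Note that each `max(...)` commits irrevocably to a local choice, which is exactly the weakness described above.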
Beam Search
Beam search keeps several candidate translations instead of only one.
At each step:
- Expand each candidate with possible next tokens.
- Compute cumulative log probabilities.
- Keep the top k sequences.

The parameter k is called the beam width.

Beam search approximates:

y* = argmax_y log P(y | x)
It usually improves translation quality compared with greedy decoding.
However, very large beam sizes can reduce diversity and sometimes produce repetitive or overly generic translations.
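The three steps above can be sketched with the same kind of stubbed distribution used for greedy decoding. This is a minimal beam search over cumulative log probabilities, not a production decoder:

```python
import math

# Hypothetical next-token distributions keyed by the prefix so far.
TRANSITIONS = {
    (): {"le": 0.6, "un": 0.4},
    ("le",): {"chat": 0.9, "<eos>": 0.1},
    ("un",): {"chat": 0.5, "<eos>": 0.5},
    ("le", "chat"): {"<eos>": 1.0},
    ("un", "chat"): {"<eos>": 1.0},
}

def beam_search(k=2, max_len=5):
    # Each candidate is (tokens, cumulative log probability).
    beams = [((), 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == "<eos>":
                candidates.append((tokens, score))  # finished: carry over
                continue
            # Expand the candidate with every possible next token.
            for token, p in TRANSITIONS[tokens].items():
                candidates.append((tokens + (token,), score + math.log(p)))
        # Keep only the top-k candidates by cumulative log probability.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if all(t and t[-1] == "<eos>" for t, _ in beams):
            break
    return beams

best_tokens, best_score = beam_search()[0]
```

With k = 1 this reduces exactly to greedy decoding; larger k explores more of the search space at proportionally higher cost.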
Length Normalization
Longer sequences tend to have lower cumulative probabilities because probabilities are multiplied across many steps.
Without correction, beam search may prefer short outputs.
Length normalization adjusts sequence scores:

score(y) = log P(y | x) / |y|^α

The parameter α controls the strength of normalization.
This helps prevent beam search from terminating too early.
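A sketch of this score, assuming the common formulation score(y) = log P(y | x) / |y|^α with α = 0.6 (the probabilities below are made up for the example):

```python
import math

def normalized_score(logprob, length, alpha=0.6):
    """Length-normalized sequence score: log P(y|x) / |y|^alpha."""
    return logprob / (length ** alpha)

# A short candidate with higher raw probability vs. a longer one.
short = normalized_score(math.log(0.5), length=2)
long_ = normalized_score(math.log(0.3), length=8)
```

Without normalization the short candidate wins (log 0.5 > log 0.3); after dividing by |y|^α the longer candidate scores higher, which is exactly the bias correction described above.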
Exposure Bias
Teacher forcing creates a mismatch between training and inference.
During training:
```
decoder input = gold tokens
```

During inference:

```
decoder input = model predictions
```

This mismatch is called exposure bias.
A mistake during inference can push the model into contexts it never saw during training.
Several methods attempt to reduce exposure bias:
| Method | Idea |
|---|---|
| Scheduled sampling | Occasionally feed model predictions during training |
| Sequence-level training | Optimize full generated sequences |
| Reinforcement learning | Use task-level rewards |
| Data augmentation | Expose model to noisy histories |
In practice, standard teacher forcing with large datasets and transformers often works surprisingly well.
Evaluation Metrics
Translation quality is difficult to measure automatically because many valid translations exist.
Suppose the reference translation is:
```
the cat is sleeping
```

The model output:

```
the cat sleeps
```

may still be correct.
The most common automatic metric is BLEU.
BLEU measures overlap between generated and reference n-grams while penalizing overly short outputs.
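A simplified sketch of those two ingredients: modified n-gram precision plus a brevity penalty. Real BLEU combines precisions for n = 1 through 4 with a geometric mean over a whole corpus; this toy version uses unigrams and bigrams on a single sentence pair:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    # Clip each candidate n-gram count by its count in the reference.
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum(min(count, ref[g]) for g, count in cand.items())
    return overlap / max(sum(cand.values()), 1)

def toy_bleu(candidate, reference):
    p1 = modified_precision(candidate, reference, 1)
    p2 = modified_precision(candidate, reference, 2)
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(candidate) >= len(reference) \
        else math.exp(1 - len(reference) / len(candidate))
    return bp * math.sqrt(p1 * p2)

score = toy_bleu("the cat sleeps".split(), "the cat is sleeping".split())
```

The example above scores well below 1.0 even though the translation is arguably correct, which illustrates why n-gram overlap correlates imperfectly with human judgment.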
Other metrics include:
| Metric | Main idea |
|---|---|
| BLEU | N-gram overlap |
| ROUGE | Recall-oriented overlap |
| METEOR | Flexible matching with stemming and synonyms |
| chrF | Character-level overlap |
| COMET | Learned neural evaluation metric |
| BLEURT | Learned semantic similarity metric |
Modern evaluation increasingly uses learned metrics because simple n-gram overlap correlates imperfectly with human judgment.
Multilingual Translation
Multilingual translation systems train on many language pairs simultaneously.
A single model may support:
```
English -> French
English -> German
French -> English
German -> Spanish
```

and many more.
A common strategy prepends a language-control token:
```
<fr> the cat sleeps
```

The decoder then generates French text.
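Mechanically this is just prepending a control token, either to the text before tokenization or at the ID level. A minimal sketch with hypothetical token IDs:

```python
# Hypothetical language-control token IDs (illustrative only).
LANG_TOKENS = {"fr": 3, "de": 4, "es": 5}

def add_language_token(source_ids, target_lang):
    """Prepend the target-language control token to the source IDs."""
    return [LANG_TOKENS[target_lang]] + source_ids

ids = add_language_token([12, 85, 901], "fr")
```

The model learns during training that this first token determines the output language.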
Multilingual systems can transfer knowledge across languages. High-resource languages may improve low-resource languages through shared representations.
However, multilingual systems also face challenges:
| Challenge | Description |
|---|---|
| Vocabulary imbalance | Some languages receive many more tokens |
| Script diversity | Different alphabets and writing systems |
| Data imbalance | High-resource languages dominate training |
| Long-tail morphology | Complex word structures |
| Token efficiency | Some languages require more tokens |
Alignment and Attention Maps
Translation attention patterns often reveal approximate word alignments.
For example:
| Source | Attention target |
|---|---|
| cat | chat |
| black | noir |
| sleeps | dort |
Attention matrices can sometimes be visualized as heatmaps.
However, transformer attention is not always a reliable explanation mechanism. Multiple attention heads interact, and the model’s behavior depends on deeper nonlinear computations.
Still, attention maps are often useful debugging tools.
Hallucinations and Translation Errors
Translation systems may produce:
| Error type | Example |
|---|---|
| Omission | Missing information |
| Addition | Hallucinated content |
| Reordering error | Incorrect syntax |
| Agreement error | Wrong grammatical agreement |
| Named entity corruption | Wrong person or location names |
| Literal translation | Grammatically unnatural output |
Hallucinations are especially dangerous in medical, legal, or financial translation systems.
A translation model may generate fluent but incorrect text. Fluency does not guarantee factual accuracy.
Low-Resource Translation
Some language pairs have limited parallel training data.
Common strategies include:
| Method | Idea |
|---|---|
| Multilingual pretraining | Share knowledge across languages |
| Back-translation | Generate synthetic parallel data |
| Transfer learning | Fine-tune from related languages |
| Unsupervised translation | Use monolingual corpora only |
| Retrieval augmentation | Retrieve similar bilingual examples |
Back-translation is particularly important. A reverse translation model generates synthetic source sentences from target-language monolingual data, increasing training data size.
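The back-translation data flow can be sketched with a stub standing in for the reverse model. The dictionary below is a hypothetical stand-in; a real pipeline would decode each sentence with a trained target-to-source translation model:

```python
# Stub reverse model: maps a French sentence to a synthetic English
# source (in practice, a trained target-to-source translation model).
REVERSE_STUB = {
    "le chat dort": "the cat sleeps",
    "le chien court": "the dog runs",
}

def back_translate(monolingual_target_sentences, reverse_model):
    """Create synthetic (source, target) pairs from target-only data."""
    pairs = []
    for target in monolingual_target_sentences:
        synthetic_source = reverse_model(target)
        pairs.append((synthetic_source, target))
    return pairs

synthetic_pairs = back_translate(list(REVERSE_STUB), REVERSE_STUB.get)
```

The key point is that the target side of each synthetic pair is genuine human text, so the forward model still learns to produce fluent output even though the source side is machine-generated.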
Translation with Large Language Models
Large language models can perform translation without task-specific fine-tuning.
For example:
```
Translate to German:
The weather is beautiful today.
```

The model generates:

```
Das Wetter ist heute schön.
```

Instruction tuning and multilingual pretraining allow general-purpose models to perform translation as one capability among many.
However, specialized translation systems may still outperform general LLMs on domain accuracy, terminology consistency, and latency.
Summary
Machine translation maps source sequences to target sequences. Modern systems are usually transformer-based sequence-to-sequence models with encoder-decoder architectures and attention mechanisms.
Tokenization converts text into subword units. The encoder computes contextual source representations. The decoder autoregressively predicts target tokens while attending to the source sequence.
Training usually uses teacher forcing and cross-entropy loss. Inference uses greedy decoding or beam search. Evaluation relies on metrics such as BLEU and learned semantic scoring systems. Multilingual and large language models extend translation to many languages and tasks within a unified architecture.