Machine Translation

Machine translation converts text from one language into another. Given a source sentence in one language, the model generates a semantically equivalent sentence in a target language.

For example:

| Source language | Target language |
| --- | --- |
| the cat sat on the mat | le chat s'est assis sur le tapis |
| good morning | buenos días |
| where is the station? | 駅はどこですか |

Modern neural machine translation systems are usually sequence-to-sequence models built with transformers.

The core problem is conditional sequence modeling. Given a source sequence

x = (x_1, x_2, \dots, x_S),

the model predicts a target sequence

y = (y_1, y_2, \dots, y_T).

The goal is to maximize:

P(y \mid x).

The translation model predicts the target tokens one step at a time:

P(y \mid x) = \prod_{t=1}^{T} P(y_t \mid y_{<t}, x).

The next target token depends on both the source sentence and the previously generated target tokens.
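
As a toy illustration, the chained factorization can be evaluated numerically; the per-step probabilities below are invented for the example.

import torch

# Hypothetical per-step probabilities P(y_t | y_<t, x) for a 3-token translation.
step_probs = torch.tensor([0.9, 0.7, 0.8])

# The sequence probability is the product of the per-step conditionals;
# in practice it is accumulated in log space for numerical stability.
sequence_prob = step_probs.prod()           # 0.504
sequence_log_prob = step_probs.log().sum()  # log(0.504)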

Sequence-to-Sequence Learning

Neural machine translation is usually formulated as sequence-to-sequence learning.

A sequence-to-sequence model contains:

| Component | Purpose |
| --- | --- |
| Encoder | Reads the source sequence |
| Decoder | Generates the target sequence |
| Attention | Connects decoder states to source representations |

The encoder transforms the source sequence into contextual representations:

H = (h_1, h_2, \dots, h_S).

The decoder then generates target tokens autoregressively.

At decoding step t, the decoder predicts:

P(y_t \mid y_{<t}, H).

This structure became dominant because it can handle variable-length inputs and outputs. Earlier phrase-based statistical systems relied heavily on hand-engineered alignment rules and feature systems. Neural sequence models learn translation behavior directly from data.

Tokenization in Translation Systems

Both source and target languages are tokenized before training.

Suppose:

English:  the cat sat
French:   le chat s'est assis

The tokenizer converts text into token IDs.

Source:

[12, 85, 901]

Target:

[44, 219, 777, 602]

Translation systems usually use subword tokenization because vocabularies must handle many languages, names, compounds, and rare words.

A multilingual tokenizer may share subword units across languages. This allows the model to reuse representations between related words and scripts.

For example:

international
internacional
internationale

may partially share subword pieces.
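
As a minimal sketch, a multilingual subword tokenizer can be inspected with the Hugging Face transformers library; the Helsinki-NLP/opus-mt-en-fr checkpoint is only one possible choice, and the exact subword pieces and IDs it produces may differ from the examples above.

from transformers import AutoTokenizer

# Any multilingual subword tokenizer works here; this checkpoint is one example.
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-fr")

# Related words often decompose into overlapping subword pieces.
print(tokenizer.tokenize("international"))
print(tokenizer.tokenize("internationale"))

# Tokenization maps raw text to the integer IDs the model actually consumes.
print(tokenizer("the cat sat").input_ids)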

Encoder Representations

The encoder processes the source tokens and produces contextual vectors.

If the source token tensor has shape:

[B, S]

then the encoder output usually has shape:

[B, S, D]

where:

| Symbol | Meaning |
| --- | --- |
| B | Batch size |
| S | Source sequence length |
| D | Hidden dimension |

Each source token receives a contextual representation. In a transformer encoder, self-attention allows every source token to interact with every other source token.

For example, the word bank in:

the boat reached the bank

receives a different contextual representation from bank in:

she deposited money at the bank

because the surrounding context changes the hidden state.
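
A minimal sketch of these shapes using PyTorch's built-in transformer encoder; positional encodings and padding masks are omitted for brevity, and the sizes are illustrative.

import torch
import torch.nn as nn

B, S, D, vocab_size = 2, 5, 512, 32000   # illustrative sizes

embed = nn.Embedding(vocab_size, D)
layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)

source_ids = torch.randint(0, vocab_size, (B, S))   # [B, S]
encoder_hidden = encoder(embed(source_ids))          # [B, S, D]
print(encoder_hidden.shape)                          # torch.Size([2, 5, 512])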

Decoder Generation

The decoder generates one target token at a time.

At decoding step t, the decoder receives:

| Input | Description |
| --- | --- |
| Previous target tokens | y_1, \dots, y_{t-1} |
| Encoder representations | Source hidden states |
| Positional information | Token positions |

The decoder predicts logits over the target vocabulary:

z_t \in \mathbb{R}^{|V|}.

A softmax converts logits into probabilities:

P(y_t \mid y_{<t}, x) = \operatorname{softmax}(z_t).

The highest-probability token may then be selected or sampled.
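
A small sketch of this last step, using invented logits over a six-word target vocabulary:

import torch
import torch.nn.functional as F

# Invented logits z_t over a 6-token target vocabulary at one decoding step.
z_t = torch.tensor([1.2, 0.3, -0.5, 2.1, 0.0, -1.0])

probs = F.softmax(z_t, dim=-1)                           # P(y_t | y_<t, x)
greedy_token = probs.argmax()                            # pick the most likely token
sampled_token = torch.multinomial(probs, num_samples=1)  # or sample stochastically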

Teacher Forcing

During training, the decoder usually receives the true previous target tokens rather than its own predictions. This is called teacher forcing.

Suppose the target sentence is:

le chat dort

Training proceeds as:

| Decoder input | Prediction target |
| --- | --- |
| <bos> | le |
| <bos> le | chat |
| <bos> le chat | dort |
| <bos> le chat dort | <eos> |

Teacher forcing stabilizes training because the decoder always receives correct history tokens. Without it, prediction errors early in the sequence could quickly corrupt later predictions.

However, teacher forcing creates a mismatch between training and inference. During inference, the model must condition on its own generated outputs.
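
A minimal sketch of how teacher-forcing pairs are built from one tokenized target sentence; all token IDs here are hypothetical.

BOS, EOS = 1, 2               # hypothetical special-token IDs
target = [44, 219, 777]       # "le chat dort" as hypothetical token IDs

decoder_input = [BOS] + target   # [<bos>, le, chat, dort]
labels = target + [EOS]          # [le, chat, dort, <eos>]

# At position t the decoder sees decoder_input[: t + 1] and is trained to
# predict labels[t], regardless of what it would have generated on its own.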

Attention Mechanisms

Attention allows the decoder to focus on relevant source positions while generating each target token.

Suppose the source sentence is:

the black cat sleeps

When generating the French word:

chat

the decoder should focus strongly on cat.

Attention computes alignment scores between the decoder state and encoder states.

If:

q_t

is the decoder query vector at step t, and:

K = (k_1, \dots, k_S)

are encoder key vectors, then attention scores are computed as:

\alpha_{ti} = \operatorname{softmax}(q_t^\top k_i).

The decoder then forms a weighted combination of encoder value vectors:

c_t = \sum_{i=1}^{S} \alpha_{ti} v_i.

The context vector c_t summarizes the relevant source information for the current decoding step.
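
A minimal numeric sketch of one attention step with random vectors; transformer implementations additionally scale the scores by the square root of the key dimension, which is included here.

import torch
import torch.nn.functional as F

S, d = 4, 64                        # source length and key dimension (illustrative)
q_t = torch.randn(d)                # decoder query at step t
K = torch.randn(S, d)               # encoder keys
V = torch.randn(S, d)               # encoder values

scores = K @ q_t / d ** 0.5         # one score per source position, shape [S]
alpha = F.softmax(scores, dim=-1)   # attention weights sum to 1 over the source
c_t = alpha @ V                     # context vector, shape [d]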

Transformer Translation Models

Modern translation systems usually use transformer architectures.

A transformer translation model contains:

| Component | Description |
| --- | --- |
| Encoder self-attention | Source tokens attend to source tokens |
| Decoder self-attention | Target tokens attend to earlier target tokens |
| Cross-attention | Decoder attends to encoder outputs |

The decoder uses a causal mask so that future target tokens remain hidden during training.
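
The causal mask is typically an upper-triangular boolean matrix; a minimal PyTorch sketch, where True marks positions the decoder must not attend to:

import torch

T = 5  # target length
# True above the diagonal marks future positions that stay hidden during training.
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
print(causal_mask)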

The overall architecture is:

source tokens
-> encoder
-> encoder hidden states
-> decoder with cross-attention
-> target logits

Transformers replaced recurrent translation systems because they parallelize efficiently and model long-range dependencies better.

A Minimal Seq2Seq Transformer Interface

A translation model often receives:

source_ids: [B, S]
source_mask: [B, S]

target_ids: [B, T]
target_mask: [B, T]

The model outputs:

logits: [B, T, V]

where V is the target vocabulary size.

A simplified PyTorch-style interface:

import torch.nn as nn


class TranslationModel(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(
        self,
        source_ids,    # [B, S]
        source_mask,   # [B, S]
        target_ids,    # [B, T]
        target_mask,   # [B, T]
    ):
        # Encode the source sentence once into contextual states of shape [B, S, D].
        encoder_hidden = self.encoder(
            input_ids=source_ids,
            attention_mask=source_mask,
        )

        # The decoder attends to the encoder states through cross-attention
        # and returns next-token logits of shape [B, T, V].
        logits = self.decoder(
            input_ids=target_ids,
            attention_mask=target_mask,
            encoder_hidden_states=encoder_hidden,
            encoder_attention_mask=source_mask,
        )

        return logits

The encoder processes the source sentence once. The decoder repeatedly attends to those encoder representations while generating the translation.

Training Objective

Translation models are usually trained with token-level cross-entropy loss.

Suppose:

logits: [B, T, V]
targets: [B, T]

We flatten the batch and sequence dimensions:

# ignore_index=0 assumes the padding token ID is 0, so padded positions
# contribute nothing to the loss.
loss_fn = nn.CrossEntropyLoss(ignore_index=0)

B, T, V = logits.shape

loss = loss_fn(
    logits.reshape(B * T, V),
    targets.reshape(B * T),
)

The target tensor contains the expected next token at each position.

Padding tokens are ignored.

Greedy Decoding

During inference, the decoder generates tokens autoregressively.

The simplest decoding strategy is greedy decoding.

At each step:

y_t = \arg\max_k P(k \mid y_{<t}, x).

The model selects the highest-probability next token.

Example:

| Decoder input so far | Output |
| --- | --- |
| <bos> | le |
| le | chat |
| le chat | dort |
| le chat dort | <eos> |

Greedy decoding is simple and fast, but it may produce suboptimal translations because it commits to local decisions too early.
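
A minimal greedy decoding loop, assuming the TranslationModel interface sketched earlier and hypothetical bos_id/eos_id special tokens:

import torch

@torch.no_grad()
def greedy_decode(model, source_ids, source_mask, bos_id, eos_id, max_len=50):
    # Start every sequence in the batch with the <bos> token.
    target_ids = torch.full(
        (source_ids.size(0), 1), bos_id, dtype=torch.long, device=source_ids.device
    )
    for _ in range(max_len):
        target_mask = torch.ones_like(target_ids)
        logits = model(source_ids, source_mask, target_ids, target_mask)  # [B, T, V]
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)        # [B, 1]
        target_ids = torch.cat([target_ids, next_token], dim=1)
        if (next_token == eos_id).all():
            break
    return target_ids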

Beam Search

Beam search keeps several candidate translations instead of only one.

At each step:

  1. Expand each candidate with possible next tokens.
  2. Compute cumulative log probabilities.
  3. Keep the top k sequences.

The parameter k is called the beam width.

Beam search approximates:

\arg\max_y P(y \mid x).

It usually improves translation quality compared with greedy decoding.

However, very large beam sizes can reduce diversity and sometimes produce repetitive or overly generic translations.
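
A compact sketch of the procedure; step_fn is a stand-in for one decoder step conditioned on the encoded source, not a real API.

def beam_search(step_fn, bos_id, eos_id, beam_width=4, max_len=50):
    # step_fn(prefix) -> tensor of log-probabilities over the target vocabulary.
    beams = [([bos_id], 0.0)]            # (token sequence, cumulative log-prob)
    finished = []

    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == eos_id:
                finished.append((tokens, score))
                continue
            log_probs = step_fn(tokens)                   # shape [V]
            top_lp, top_ids = log_probs.topk(beam_width)
            for lp, idx in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((tokens + [idx], score + lp))
        if not candidates:
            break
        # Keep only the best `beam_width` partial hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]

    finished.extend(b for b in beams if b[0][-1] != eos_id)
    return max(finished, key=lambda c: c[1])[0]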

Length Normalization

Longer sequences tend to have lower cumulative probabilities because probabilities are multiplied across many steps.

Without correction, beam search may prefer short outputs.

Length normalization adjusts sequence scores:

\frac{1}{T^\alpha} \sum_{t=1}^{T} \log P(y_t \mid y_{<t}, x).

The parameter α controls the strength of normalization.

This helps prevent beam search from terminating too early.
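
A small worked example with invented per-token log-probabilities; on raw sums the short candidate wins, while length normalization (α = 0.6 here) lets the longer one compete.

def length_normalized_score(token_log_probs, alpha=0.6):
    # alpha = 0 recovers the plain sum of log-probabilities.
    return sum(token_log_probs) / (len(token_log_probs) ** alpha)

short = [-0.5, -0.6]                  # sum = -1.1
long = [-0.4, -0.4, -0.4, -0.4]       # sum = -1.6

print(length_normalized_score(short))   # about -0.73
print(length_normalized_score(long))    # about -0.70, now the better score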

Exposure Bias

Teacher forcing creates a mismatch between training and inference.

During training:

decoder input = gold tokens

During inference:

decoder input = model predictions

This mismatch is called exposure bias.

A mistake during inference can push the model into contexts it never saw during training.

Several methods attempt to reduce exposure bias:

| Method | Idea |
| --- | --- |
| Scheduled sampling | Occasionally feed model predictions during training |
| Sequence-level training | Optimize full generated sequences |
| Reinforcement learning | Use task-level rewards |
| Data augmentation | Expose model to noisy histories |

In practice, standard teacher forcing with large datasets and transformers often works surprisingly well.

Evaluation Metrics

Translation quality is difficult to measure automatically because many valid translations exist.

Suppose the reference translation is:

the cat is sleeping

The model output:

the cat sleeps

may still be correct.

The most common automatic metric is BLEU.

BLEU measures overlap between generated and reference n-grams while penalizing overly short outputs.

Other metrics include:

| Metric | Main idea |
| --- | --- |
| BLEU | N-gram overlap |
| ROUGE | Recall-oriented overlap |
| METEOR | Flexible matching with stemming and synonyms |
| chrF | Character-level overlap |
| COMET | Learned neural evaluation metric |
| BLEURT | Learned semantic similarity metric |

Modern evaluation increasingly uses learned metrics because simple n-gram overlap correlates imperfectly with human judgment.
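
A minimal sketch of computing a corpus BLEU score, assuming the sacrebleu package is available:

import sacrebleu

hypotheses = ["the cat sleeps"]
references = [["the cat is sleeping"]]   # one reference stream, aligned to hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)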

Multilingual Translation

Multilingual translation systems train on many language pairs simultaneously.

A single model may support:

English -> French
English -> German
French -> English
German -> Spanish

and many more.

A common strategy prepends a language-control token:

<fr> the cat sleeps

The decoder then generates French text.
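
A minimal sketch of the language-control-token strategy; the token IDs below are hypothetical.

# Hypothetical IDs for <fr> and <de> control tokens and for the source words.
LANG_TOKEN = {"fr": 5, "de": 6}
source_ids = [12, 85, 901]          # "the cat sleeps" as hypothetical token IDs

# Prepending <fr> tells the shared model which target language to produce.
model_input = [LANG_TOKEN["fr"]] + source_ids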

Multilingual systems can transfer knowledge across languages. High-resource languages may improve low-resource languages through shared representations.

However, multilingual systems also face challenges:

| Challenge | Description |
| --- | --- |
| Vocabulary imbalance | Some languages receive many more tokens |
| Script diversity | Different alphabets and writing systems |
| Data imbalance | High-resource languages dominate training |
| Long-tail morphology | Complex word structures |
| Token efficiency | Some languages require more tokens |

Alignment and Attention Maps

Translation attention patterns often reveal approximate word alignments.

For example:

| Source | Attention target |
| --- | --- |
| cat | chat |
| black | noir |
| sleeps | dort |

Attention matrices can sometimes be visualized as heatmaps.

However, transformer attention is not always a reliable explanation mechanism. Multiple attention heads interact, and the model’s behavior depends on deeper nonlinear computations.

Still, attention maps are often useful debugging tools.

Hallucinations and Translation Errors

Translation systems may produce:

| Error type | Example |
| --- | --- |
| Omission | Missing information |
| Addition | Hallucinated content |
| Reordering error | Incorrect syntax |
| Agreement error | Wrong grammatical agreement |
| Named entity corruption | Wrong person or location names |
| Literal translation | Grammatically unnatural output |

Hallucinations are especially dangerous in medical, legal, or financial translation systems.

A translation model may generate fluent but incorrect text. Fluency does not guarantee factual accuracy.

Low-Resource Translation

Some language pairs have limited parallel training data.

Common strategies include:

| Method | Idea |
| --- | --- |
| Multilingual pretraining | Share knowledge across languages |
| Back-translation | Generate synthetic parallel data |
| Transfer learning | Fine-tune from related languages |
| Unsupervised translation | Use monolingual corpora only |
| Retrieval augmentation | Retrieve similar bilingual examples |

Back-translation is particularly important. A reverse translation model generates synthetic source sentences from target-language monolingual data, increasing training data size.
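
A hedged sketch of the back-translation loop; translate_target_to_source stands in for a reverse translation model and is not a real API.

def build_synthetic_pairs(monolingual_target_sentences, translate_target_to_source):
    # translate_target_to_source: hypothetical function mapping a target-language
    # sentence back to a synthetic source-language sentence.
    pairs = []
    for target_sentence in monolingual_target_sentences:
        synthetic_source = translate_target_to_source(target_sentence)
        # The forward model then trains on (synthetic source, real target) pairs.
        pairs.append((synthetic_source, target_sentence))
    return pairs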

Translation with Large Language Models

Large language models can perform translation without task-specific fine-tuning.

For example:

Translate to German:
The weather is beautiful today.

The model generates:

Das Wetter ist heute schön.

Instruction tuning and multilingual pretraining allow general-purpose models to perform translation as one capability among many.

However, specialized translation systems may still outperform general LLMs on domain accuracy, terminology consistency, and latency.

Summary

Machine translation maps source sequences to target sequences. Modern systems are usually transformer-based sequence-to-sequence models with encoder-decoder architectures and attention mechanisms.

Tokenization converts text into subword units. The encoder computes contextual source representations. The decoder autoregressively predicts target tokens while attending to the source sequence.

Training usually uses teacher forcing and cross-entropy loss. Inference uses greedy decoding or beam search. Evaluation relies on metrics such as BLEU and learned semantic scoring systems. Multilingual and large language models extend translation to many languages and tasks within a unified architecture.