Machine Translation

Machine translation converts text from one language into another. Given a source sentence in one language, the model generates a semantically equivalent sentence in a target language.

For example:

| Source language | Target language |
| --- | --- |
| the cat sat on the mat | le chat s'est assis sur le tapis |
| good morning | buenos días |
| where is the station? | 駅はどこですか |

Modern neural machine translation systems are usually sequence-to-sequence models built with transformers.

The core problem is conditional sequence modeling. Given a source sequence

x = (x_1, x_2, \dots, x_S),

the model predicts a target sequence

y = (y_1, y_2, \dots, y_T).

The goal is to maximize:

P(y \mid x).

The translation model predicts the target tokens one step at a time:

P(y \mid x) = \prod_{t=1}^{T} P(y_t \mid y_{<t}, x).

The next target token depends on both the source sentence and the previously generated target tokens.
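
As a toy illustration, the chained factorization can be evaluated numerically; the per-step probabilities below are invented for the example.

import torch

# Hypothetical per-step probabilities P(y_t | y_<t, x) for a 3-token translation.
step_probs = torch.tensor([0.9, 0.7, 0.8])

# The sequence probability is the product of the per-step conditionals;
# in practice it is accumulated in log space for numerical stability.
sequence_prob = step_probs.prod()           # 0.504
sequence_log_prob = step_probs.log().sum()  # log(0.504)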

Sequence-to-Sequence Learning

Neural machine translation is usually formulated as sequence-to-sequence learning.

A sequence-to-sequence model contains:

| Component | Purpose |
| --- | --- |
| Encoder | Reads the source sequence |
| Decoder | Generates the target sequence |
| Attention | Connects decoder states to source representations |

The encoder transforms the source sequence into contextual representations:

H = (h_1, h_2, \dots, h_S).

The decoder then generates target tokens autoregressively.

At decoding step t, the decoder predicts:

P(y_t \mid y_{<t}, H).

This structure became dominant because it can handle variable-length inputs and outputs. Earlier phrase-based statistical systems relied heavily on hand-engineered alignment rules and feature systems. Neural sequence models learn translation behavior directly from data.

Tokenization in Translation Systems

Both source and target languages are tokenized before training.

Suppose:

English:  the cat sat
French:   le chat s'est assis

The tokenizer converts text into token IDs.

Source:

[12, 85, 901]

Target:

[44, 219, 777, 602]

Translation systems usually use subword tokenization because vocabularies must handle many languages, names, compounds, and rare words.

A multilingual tokenizer may share subword units across languages. This allows the model to reuse representations between related words and scripts.

For example:

international
internacional
internationale

may partially share subword pieces.
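
As a minimal sketch, a multilingual subword tokenizer can be inspected with the Hugging Face transformers library; the Helsinki-NLP/opus-mt-en-fr checkpoint is only one possible choice, and the exact subword pieces and IDs it produces may differ from the examples above.

from transformers import AutoTokenizer

# Any multilingual subword tokenizer works here; this checkpoint is one example.
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-fr")

# Related words often decompose into overlapping subword pieces.
print(tokenizer.tokenize("international"))
print(tokenizer.tokenize("internationale"))

# Tokenization maps raw text to the integer IDs the model actually consumes.
print(tokenizer("the cat sat").input_ids)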

Encoder Representations

The encoder processes the source tokens and produces contextual vectors.

If the source token tensor has shape:

[B, S]

then the encoder output usually has shape:

[B, S, D]

where:

| Symbol | Meaning |
| --- | --- |
| B | Batch size |
| S | Source sequence length |
| D | Hidden dimension |

Each source token receives a contextual representation. In a transformer encoder, self-attention allows every source token to interact with every other source token.

For example, the word bank in:

the boat reached the bank

receives a different contextual representation from bank in:

she deposited money at the bank

because the surrounding context changes the hidden state.
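
A minimal sketch of these shapes using PyTorch's built-in transformer encoder; positional encodings and padding masks are omitted for brevity, and the sizes are illustrative.

import torch
import torch.nn as nn

B, S, D, vocab_size = 2, 5, 512, 32000   # illustrative sizes

embed = nn.Embedding(vocab_size, D)
layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)

source_ids = torch.randint(0, vocab_size, (B, S))   # [B, S]
encoder_hidden = encoder(embed(source_ids))          # [B, S, D]
print(encoder_hidden.shape)                          # torch.Size([2, 5, 512])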

Decoder Generation

The decoder generates one target token at a time.

At decoding step t, the decoder receives:

| Input | Description |
| --- | --- |
| Previous target tokens | y_1, \dots, y_{t-1} |
| Encoder representations | Source hidden states |
| Positional information | Token positions |

The decoder predicts logits over the target vocabulary:

z_t \in \mathbb{R}^{|V|}.

A softmax converts logits into probabilities:

P(y_t \mid y_{<t}, x) = \operatorname{softmax}(z_t).

The highest-probability token may then be selected or sampled.
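
A small sketch of this last step, using invented logits over a six-word target vocabulary:

import torch
import torch.nn.functional as F

# Invented logits z_t over a 6-token target vocabulary at one decoding step.
z_t = torch.tensor([1.2, 0.3, -0.5, 2.1, 0.0, -1.0])

probs = F.softmax(z_t, dim=-1)                           # P(y_t | y_<t, x)
greedy_token = probs.argmax()                            # pick the most likely token
sampled_token = torch.multinomial(probs, num_samples=1)  # or sample stochastically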

Teacher Forcing

During training, the decoder usually receives the true previous target tokens rather than its own predictions. This is called teacher forcing.

Suppose the target sentence is:

le chat dort

Training proceeds as:

| Decoder input | Prediction target |
| --- | --- |
| <bos> | le |
| <bos> le | chat |
| <bos> le chat | dort |
| <bos> le chat dort | <eos> |

Teacher forcing stabilizes training because the decoder always receives correct history tokens. Without it, prediction errors early in the sequence could quickly corrupt later predictions.

However, teacher forcing creates a mismatch between training and inference. During inference, the model must condition on its own generated outputs.
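
A minimal sketch of how teacher-forcing pairs are built from one tokenized target sentence; all token IDs here are hypothetical.

BOS, EOS = 1, 2               # hypothetical special-token IDs
target = [44, 219, 777]       # "le chat dort" as hypothetical token IDs

decoder_input = [BOS] + target   # [<bos>, le, chat, dort]
labels = target + [EOS]          # [le, chat, dort, <eos>]

# At position t the decoder sees decoder_input[: t + 1] and is trained to
# predict labels[t], regardless of what it would have generated on its own.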

Attention Mechanisms

Attention allows the decoder to focus on relevant source positions while generating each target token.

Suppose the source sentence is:

the black cat sleeps

When generating the French word:

chat

the decoder should focus strongly on cat.

Attention computes alignment scores between the decoder state and encoder states.

If:

q_t

is the decoder query vector at step t, and:

K = (k_1, \dots, k_S)

are encoder key vectors, then attention scores are computed as:

\alpha_{ti} = \operatorname{softmax}(q_t^\top k_i).

The decoder then forms a weighted combination of encoder value vectors:

c_t = \sum_{i=1}^{S} \alpha_{ti} v_i.

The context vector c_t summarizes the relevant source information for the current decoding step.
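
A minimal numeric sketch of one attention step with random vectors; transformer implementations additionally scale the scores by the square root of the key dimension, which is included here.

import torch
import torch.nn.functional as F

S, d = 4, 64                        # source length and key dimension (illustrative)
q_t = torch.randn(d)                # decoder query at step t
K = torch.randn(S, d)               # encoder keys
V = torch.randn(S, d)               # encoder values

scores = K @ q_t / d ** 0.5         # one score per source position, shape [S]
alpha = F.softmax(scores, dim=-1)   # attention weights sum to 1 over the source
c_t = alpha @ V                     # context vector, shape [d]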

Transformer Translation Models

Modern translation systems usually use transformer architectures.

A transformer translation model contains:

| Component | Description |
| --- | --- |
| Encoder self-attention | Source tokens attend to source tokens |
| Decoder self-attention | Target tokens attend to earlier target tokens |
| Cross-attention | Decoder attends to encoder outputs |

The decoder uses a causal mask so that future target tokens remain hidden during training.
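
The causal mask is typically an upper-triangular boolean matrix; a minimal PyTorch sketch, where True marks positions the decoder must not attend to:

import torch

T = 5  # target length
# True above the diagonal marks future positions that stay hidden during training.
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
print(causal_mask)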

The overall architecture is:

source tokens
-> encoder
-> encoder hidden states
-> decoder with cross-attention
-> target logits

Transformers replaced recurrent translation systems because they parallelize efficiently and model long-range dependencies better.

A Minimal Seq2Seq Transformer Interface

A translation model often receives:

source_ids: [B, S]
source_mask: [B, S]

target_ids: [B, T]
target_mask: [B, T]

The model outputs:

logits: [B, T, V]

where V is the target vocabulary size.

A simplified PyTorch-style interface:

import torch.nn as nn


class TranslationModel(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(
        self,
        source_ids,    # [B, S]
        source_mask,   # [B, S]
        target_ids,    # [B, T]
        target_mask,   # [B, T]
    ):
        # Encode the source sentence once into contextual states of shape [B, S, D].
        encoder_hidden = self.encoder(
            input_ids=source_ids,
            attention_mask=source_mask,
        )

        # The decoder attends to the encoder states through cross-attention
        # and returns next-token logits of shape [B, T, V].
        logits = self.decoder(
            input_ids=target_ids,
            attention_mask=target_mask,
            encoder_hidden_states=encoder_hidden,
            encoder_attention_mask=source_mask,
        )

        return logits

The encoder processes the source sentence once. The decoder repeatedly attends to those encoder representations while generating the translation.

Training Objective

Translation models are usually trained with token-level cross-entropy loss.

Suppose:

logits: [B, T, V]
targets: [B, T]

We flatten the batch and sequence dimensions:

# ignore_index=0 assumes the padding token ID is 0, so padded positions
# contribute nothing to the loss.
loss_fn = nn.CrossEntropyLoss(ignore_index=0)

B, T, V = logits.shape

loss = loss_fn(
    logits.reshape(B * T, V),
    targets.reshape(B * T),
)

The target tensor contains the expected next token at each position.

Padding tokens are ignored.

Greedy Decoding

During inference, the decoder generates tokens autoregressively.

The simplest decoding strategy is greedy decoding.

At each step:

y_t = \arg\max_k P(k \mid y_{<t}, x).

The model selects the highest-probability next token.

Example:

| Decoder input so far | Output |
| --- | --- |
| <bos> | le |
| le | chat |
| le chat | dort |
| le chat dort | <eos> |

Greedy decoding is simple and fast, but it may produce suboptimal translations because it commits to local decisions too early.
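
A minimal greedy decoding loop, assuming the TranslationModel interface sketched earlier and hypothetical bos_id/eos_id special tokens:

import torch

@torch.no_grad()
def greedy_decode(model, source_ids, source_mask, bos_id, eos_id, max_len=50):
    # Start every sequence in the batch with the <bos> token.
    target_ids = torch.full(
        (source_ids.size(0), 1), bos_id, dtype=torch.long, device=source_ids.device
    )
    for _ in range(max_len):
        target_mask = torch.ones_like(target_ids)
        logits = model(source_ids, source_mask, target_ids, target_mask)  # [B, T, V]
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)        # [B, 1]
        target_ids = torch.cat([target_ids, next_token], dim=1)
        if (next_token == eos_id).all():
            break
    return target_ids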

Beam Search

Beam search keeps several candidate translations instead of only one.

At each step:

  1. Expand each candidate with possible next tokens.
  2. Compute cumulative log probabilities.
  3. Keep the top k sequences.

The parameter k is called the beam width.

Beam search approximates:

\arg\max_y P(y \mid x).

It usually improves translation quality compared with greedy decoding.

However, very large beam sizes can reduce diversity and sometimes produce repetitive or overly generic translations.
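
A compact sketch of the procedure; step_fn is a stand-in for one decoder step conditioned on the encoded source, not a real API.

def beam_search(step_fn, bos_id, eos_id, beam_width=4, max_len=50):
    # step_fn(prefix) -> tensor of log-probabilities over the target vocabulary.
    beams = [([bos_id], 0.0)]            # (token sequence, cumulative log-prob)
    finished = []

    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == eos_id:
                finished.append((tokens, score))
                continue
            log_probs = step_fn(tokens)                   # shape [V]
            top_lp, top_ids = log_probs.topk(beam_width)
            for lp, idx in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((tokens + [idx], score + lp))
        if not candidates:
            break
        # Keep only the best `beam_width` partial hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]

    finished.extend(b for b in beams if b[0][-1] != eos_id)
    return max(finished, key=lambda c: c[1])[0]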

Length Normalization

Longer sequences tend to have lower cumulative probabilities because probabilities are multiplied across many steps.

Without correction, beam search may prefer short outputs.

Length normalization adjusts sequence scores:

\frac{1}{T^\alpha} \sum_{t=1}^{T} \log P(y_t \mid y_{<t}, x).

The parameter α controls the strength of normalization.

This helps prevent beam search from terminating too early.
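
A small worked example with invented per-token log-probabilities; on raw sums the short candidate wins, while length normalization (α = 0.6 here) lets the longer one compete.

def length_normalized_score(token_log_probs, alpha=0.6):
    # alpha = 0 recovers the plain sum of log-probabilities.
    return sum(token_log_probs) / (len(token_log_probs) ** alpha)

short = [-0.5, -0.6]                  # sum = -1.1
long = [-0.4, -0.4, -0.4, -0.4]       # sum = -1.6

print(length_normalized_score(short))   # about -0.73
print(length_normalized_score(long))    # about -0.70, now the better score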

Exposure Bias

Teacher forcing creates a mismatch between training and inference.

During training:

decoder input = gold tokens

During inference:

decoder input = model predictions

This mismatch is called exposure bias.

A mistake during inference can push the model into contexts it never saw during training.

Several methods attempt to reduce exposure bias:

| Method | Idea |
| --- | --- |
| Scheduled sampling | Occasionally feed model predictions during training |
| Sequence-level training | Optimize full generated sequences |
| Reinforcement learning | Use task-level rewards |
| Data augmentation | Expose model to noisy histories |

In practice, standard teacher forcing with large datasets and transformers often works surprisingly well.

Evaluation Metrics

Translation quality is difficult to measure automatically because many valid translations exist.

Suppose the reference translation is:

the cat is sleeping

The model output:

the cat sleeps

may still be correct.

The most common automatic metric is BLEU.

BLEU measures overlap between generated and reference n-grams while penalizing overly short outputs.

Other metrics include:

| Metric | Main idea |
| --- | --- |
| BLEU | N-gram overlap |
| ROUGE | Recall-oriented overlap |
| METEOR | Flexible matching with stemming and synonyms |
| chrF | Character-level overlap |
| COMET | Learned neural evaluation metric |
| BLEURT | Learned semantic similarity metric |

Modern evaluation increasingly uses learned metrics because simple n-gram overlap correlates imperfectly with human judgment.
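
A minimal sketch of computing a corpus BLEU score, assuming the sacrebleu package is available:

import sacrebleu

hypotheses = ["the cat sleeps"]
references = [["the cat is sleeping"]]   # one reference stream, aligned to hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)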

Multilingual Translation

Multilingual translation systems train on many language pairs simultaneously.

A single model may support:

English -> French
English -> German
French -> English
German -> Spanish

and many more.

A common strategy prepends a language-control token:

<fr> the cat sleeps

The decoder then generates French text.
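
A minimal sketch of the language-control-token strategy; the token IDs below are hypothetical.

# Hypothetical IDs for <fr> and <de> control tokens and for the source words.
LANG_TOKEN = {"fr": 5, "de": 6}
source_ids = [12, 85, 901]          # "the cat sleeps" as hypothetical token IDs

# Prepending <fr> tells the shared model which target language to produce.
model_input = [LANG_TOKEN["fr"]] + source_ids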

Multilingual systems can transfer knowledge across languages. High-resource languages may improve low-resource languages through shared representations.

However, multilingual systems also face challenges:

| Challenge | Description |
| --- | --- |
| Vocabulary imbalance | Some languages receive many more tokens |
| Script diversity | Different alphabets and writing systems |
| Data imbalance | High-resource languages dominate training |
| Long-tail morphology | Complex word structures |
| Token efficiency | Some languages require more tokens |

Alignment and Attention Maps

Translation attention patterns often reveal approximate word alignments.

For example:

| Source | Attention target |
| --- | --- |
| cat | chat |
| black | noir |
| sleeps | dort |

Attention matrices can sometimes be visualized as heatmaps.

However, transformer attention is not always a reliable explanation mechanism. Multiple attention heads interact, and the model’s behavior depends on deeper nonlinear computations.

Still, attention maps are often useful debugging tools.

Hallucinations and Translation Errors

Translation systems may produce:

| Error type | Example |
| --- | --- |
| Omission | Missing information |
| Addition | Hallucinated content |
| Reordering error | Incorrect syntax |
| Agreement error | Wrong grammatical agreement |
| Named entity corruption | Wrong person or location names |
| Literal translation | Grammatically unnatural output |

Hallucinations are especially dangerous in medical, legal, or financial translation systems.

A translation model may generate fluent but incorrect text. Fluency does not guarantee factual accuracy.

Low-Resource Translation

Some language pairs have limited parallel training data.

Common strategies include:

| Method | Idea |
| --- | --- |
| Multilingual pretraining | Share knowledge across languages |
| Back-translation | Generate synthetic parallel data |
| Transfer learning | Fine-tune from related languages |
| Unsupervised translation | Use monolingual corpora only |
| Retrieval augmentation | Retrieve similar bilingual examples |

Back-translation is particularly important. A reverse translation model generates synthetic source sentences from target-language monolingual data, increasing training data size.
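
A hedged sketch of the back-translation loop; translate_target_to_source stands in for a reverse translation model and is not a real API.

def build_synthetic_pairs(monolingual_target_sentences, translate_target_to_source):
    # translate_target_to_source: hypothetical function mapping a target-language
    # sentence back to a synthetic source-language sentence.
    pairs = []
    for target_sentence in monolingual_target_sentences:
        synthetic_source = translate_target_to_source(target_sentence)
        # The forward model then trains on (synthetic source, real target) pairs.
        pairs.append((synthetic_source, target_sentence))
    return pairs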

Translation with Large Language Models

Large language models can perform translation without task-specific fine-tuning.

For example:

Translate to German:
The weather is beautiful today.

The model generates:

Das Wetter ist heute schön.

Instruction tuning and multilingual pretraining allow general-purpose models to perform translation as one capability among many.

However, specialized translation systems may still outperform general LLMs on domain accuracy, terminology consistency, and latency.

Summary

Machine translation maps source sequences to target sequences. Modern systems are usually transformer-based sequence-to-sequence models with encoder-decoder architectures and attention mechanisms.

Tokenization converts text into subword units. The encoder computes contextual source representations. The decoder autoregressively predicts target tokens while attending to the source sequence.

Training usually uses teacher forcing and cross-entropy loss. Inference uses greedy decoding or beam search. Evaluation relies on metrics such as BLEU and learned semantic scoring systems. Multilingual and large language models extend translation to many languages and tasks within a unified architecture.