# Machine Translation

Machine translation converts text from one language into another. Given a source sentence in one language, the model generates a semantically equivalent sentence in a target language.

For example:

| Source language | Target language |
|---|---|
| `the cat sat on the mat` | `le chat s'est assis sur le tapis` |
| `good morning` | `buenos días` |
| `where is the station?` | `駅はどこですか` |

Modern neural machine translation systems are usually sequence-to-sequence models built with transformers.

The core problem is conditional sequence modeling. Given a source sequence

$$
x = (x_1, x_2, \dots, x_S),
$$

the model predicts a target sequence

$$
y = (y_1, y_2, \dots, y_T).
$$

The goal is to maximize:

$$
P(y \mid x).
$$

The translation model predicts the target tokens one step at a time:

$$
P(y \mid x) =
\prod_{t=1}^{T}
P(y_t \mid y_{<t}, x).
$$

The next target token depends on both the source sentence and the previously generated target tokens.
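This factorization can be checked numerically. A minimal sketch, assuming invented per-step probabilities for a four-token target:

```python
import math

# Hypothetical per-step probabilities P(y_t | y_<t, x) for a 4-token target.
step_probs = [0.9, 0.8, 0.95, 0.7]

# The sequence probability is the product of the per-step conditionals.
seq_prob = math.prod(step_probs)

# In practice, systems sum log-probabilities instead, for numerical stability.
seq_logprob = sum(math.log(p) for p in step_probs)

print(round(seq_prob, 4))  # 0.4788
```

Summing log-probabilities avoids underflow: a product of hundreds of small probabilities would round to zero in floating point.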

### Sequence-to-Sequence Learning

Neural machine translation is usually formulated as sequence-to-sequence learning.

A sequence-to-sequence model contains:

| Component | Purpose |
|---|---|
| Encoder | Reads the source sequence |
| Decoder | Generates the target sequence |
| Attention | Connects decoder states to source representations |

The encoder transforms the source sequence into contextual representations:

$$
H =
(h_1, h_2, \dots, h_S).
$$

The decoder then generates target tokens autoregressively.

At decoding step $t$, the decoder predicts:

$$
P(y_t \mid y_{<t}, H).
$$

This structure became dominant because it can handle variable-length inputs and outputs. Earlier phrase-based statistical systems relied heavily on hand-engineered alignment rules and feature systems. Neural sequence models learn translation behavior directly from data.

### Tokenization in Translation Systems

Both source and target languages are tokenized before training.

Suppose:

```text id="30f34z"
English:  the cat sat
French:   le chat s'est assis
```

The tokenizer converts text into token IDs.

Source:

```text id="m3wb3p"
[12, 85, 901]
```

Target:

```text id="u2rttv"
[44, 219, 777, 602]
```

Translation systems usually use subword tokenization because vocabularies must handle many languages, names, compounds, and rare words.

A multilingual tokenizer may share subword units across languages. This allows the model to reuse representations between related words and scripts.

For example:

```text id="nyx9tt"
international
internacional
internationale
```

may partially share subword pieces.
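A toy greedy longest-match segmenter (a simplified stand-in for BPE, with an invented shared vocabulary) illustrates how the three words end up sharing pieces:

```python
def segment(word, vocab):
    """Greedy longest-match segmentation (a toy stand-in for BPE)."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest piece first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])         # fall back to a single character
            i += 1
    return pieces

# An invented shared multilingual subword vocabulary.
vocab = {"inter", "nation", "nacion", "al", "ale"}

print(segment("international", vocab))   # ['inter', 'nation', 'al']
print(segment("internacional", vocab))   # ['inter', 'nacion', 'al']
print(segment("internationale", vocab))  # ['inter', 'nation', 'ale']
```

All three words reuse the piece `inter`, and the English and French forms also share `nation`, so the model can reuse the corresponding embeddings.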

### Encoder Representations

The encoder processes the source tokens and produces contextual vectors.

If the source token tensor has shape:

```text id="6zzfaj"
[B, S]
```

then the encoder output usually has shape:

```text id="jdhktb"
[B, S, D]
```

where:

| Symbol | Meaning |
|---|---|
| $B$ | Batch size |
| $S$ | Source sequence length |
| $D$ | Hidden dimension |

Each source token receives a contextual representation. In a transformer encoder, self-attention allows every source token to interact with every other source token.

For example, the word `bank` in:

```text id="pmup73"
the boat reached the bank
```

receives a different contextual representation from `bank` in:

```text id="3j1gg4"
she deposited money at the bank
```

because the surrounding context changes the hidden state.
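The `[B, S]` to `[B, S, D]` shape transformation can be checked with a few lines of PyTorch (toy sizes, untrained weights; `batch_first=True` keeps the batch dimension first):

```python
import torch
import torch.nn as nn

B, S, D = 2, 5, 32  # toy batch size, source length, hidden dimension
vocab_size = 100

embed = nn.Embedding(vocab_size, D)
layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

source_ids = torch.randint(0, vocab_size, (B, S))  # [B, S]
hidden = encoder(embed(source_ids))                # [B, S, D]

print(tuple(hidden.shape))  # (2, 5, 32)
```

Each of the `S` positions in the output is a contextual vector of dimension `D`, not a lookup of the token alone.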

### Decoder Generation

The decoder generates one target token at a time.

At decoding step $t$, the decoder receives:

| Input | Description |
|---|---|
| Previous target tokens | $y_1, \dots, y_{t-1}$ |
| Encoder representations | Source hidden states |
| Positional information | Token positions |

The decoder predicts logits over the target vocabulary:

$$
z_t \in \mathbb{R}^{|V|}.
$$

A softmax converts logits into probabilities:

$$
P(y_t \mid y_{<t}, x) =
\operatorname{softmax}(z_t).
$$

The next token can then be chosen greedily (highest probability) or sampled from this distribution.
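The softmax step can be sketched in a few lines (toy logits with invented values):

```python
import math

def softmax(logits):
    """Convert logits to probabilities; subtracting the max keeps exp stable."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits z_t over a 4-word target vocabulary (invented values).
z_t = [2.0, 1.0, 0.1, -1.0]
probs = softmax(z_t)

print(probs.index(max(probs)))  # 0 -> the greedy choice is token 0
```

The probabilities sum to one, and the ordering of the logits is preserved, so the argmax of the logits and of the probabilities coincide.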

### Teacher Forcing

During training, the decoder usually receives the true previous target tokens rather than its own predictions. This is called teacher forcing.

Suppose the target sentence is:

```text id="a3mn1s"
le chat dort
```

Training proceeds as:

| Decoder input | Prediction target |
|---|---|
| `<bos>` | `le` |
| `<bos> le` | `chat` |
| `<bos> le chat` | `dort` |
| `<bos> le chat dort` | `<eos>` |

Teacher forcing stabilizes training because the decoder always receives correct history tokens. Without it, prediction errors early in the sequence could quickly corrupt later predictions.

However, teacher forcing creates a mismatch between training and inference. During inference, the model must condition on its own generated outputs.
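The training pairs in the table above can be built mechanically from the target sentence:

```python
def teacher_forcing_pairs(target_tokens):
    """Build (decoder input, prediction target) pairs for each training step."""
    seq = ["<bos>"] + target_tokens + ["<eos>"]
    return [(seq[:t], seq[t]) for t in range(1, len(seq))]

for history, nxt in teacher_forcing_pairs(["le", "chat", "dort"]):
    print(" ".join(history), "->", nxt)
# <bos> -> le
# <bos> le -> chat
# <bos> le chat -> dort
# <bos> le chat dort -> <eos>
```

In practice all positions are trained in parallel: the decoder input is the target sequence shifted right by one, and the causal mask enforces that each position only sees its own history.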

### Attention Mechanisms

Attention allows the decoder to focus on relevant source positions while generating each target token.

Suppose the source sentence is:

```text id="1tyy0m"
the black cat sleeps
```

When generating the French word:

```text id="zhg1d0"
chat
```

the decoder should focus strongly on `cat`.

Attention computes alignment scores between the decoder state and encoder states.

If:

$$
q_t
$$

is the decoder query vector at step $t$, and:

$$
K = (k_1, \dots, k_S)
$$

are encoder key vectors, then the attention weights are obtained by normalizing the dot-product scores over the source positions:

$$
\alpha_{ti} =
\frac{\exp(q_t^\top k_i)}{\sum_{j=1}^{S} \exp(q_t^\top k_j)}.
$$

The decoder then forms a weighted combination of encoder value vectors:

$$
c_t =
\sum_{i=1}^{S}
\alpha_{ti} v_i.
$$

The context vector $c_t$ summarizes the relevant source information for the current decoding step.
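A minimal sketch of this computation, using toy two-dimensional vectors (no scaling or learned projections):

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Single-query dot-product attention over encoder states."""
    scores = [sum(qd * kd for qd, kd in zip(q, k)) for k in keys]  # q^T k_i
    weights = softmax(scores)                                      # alpha_ti
    context = [sum(w * v[d] for w, v in zip(weights, values))      # c_t
               for d in range(len(values[0]))]
    return context, weights

# Toy 2-dimensional encoder states; the second position matches the query best.
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = keys
q = [0.0, 2.0]

context, weights = attend(q, keys, values)
print(max(range(len(weights)), key=weights.__getitem__))  # 1
```

The query aligns most strongly with the second key, so most of the attention mass (and most of the context vector) comes from the second source position.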

### Transformer Translation Models

Modern translation systems usually use transformer architectures.

A transformer translation model contains:

| Component | Description |
|---|---|
| Encoder self-attention | Source tokens attend to source tokens |
| Decoder self-attention | Target tokens attend to earlier target tokens |
| Cross-attention | Decoder attends to encoder outputs |

The decoder uses a causal mask so that future target tokens remain hidden during training.

The overall architecture is:

```text id="gnm4tq"
source tokens
-> encoder
-> encoder hidden states
-> decoder with cross-attention
-> target logits
```

Transformers replaced recurrent translation systems because they parallelize efficiently and model long-range dependencies better.
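The causal mask mentioned above can be built directly; here `True` marks the positions a target step may attend to:

```python
def causal_mask(T):
    """mask[t][s] is True when target position t may attend to position s."""
    return [[s <= t for s in range(T)] for t in range(T)]

# Lower-triangular pattern for T = 4 target positions.
for row in causal_mask(4):
    print("".join("x" if allowed else "." for allowed in row))
# x...
# xx..
# xxx.
# xxxx
```

The lower-triangular pattern is exactly what lets teacher forcing train all target positions in one parallel forward pass.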

### A Minimal Seq2Seq Transformer Interface

A translation model often receives:

```text id="b1n0r4"
source_ids: [B, S]
source_mask: [B, S]

target_ids: [B, T]
target_mask: [B, T]
```

The model outputs:

```text id="t9a5z4"
logits: [B, T, V]
```

where $V$ is the target vocabulary size.

A simplified PyTorch-style interface:

```python id="yr4c65"
import torch.nn as nn


class TranslationModel(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(
        self,
        source_ids,
        source_mask,
        target_ids,
        target_mask,
    ):
        encoder_hidden = self.encoder(
            input_ids=source_ids,
            attention_mask=source_mask,
        )

        logits = self.decoder(
            input_ids=target_ids,
            attention_mask=target_mask,
            encoder_hidden_states=encoder_hidden,
            encoder_attention_mask=source_mask,
        )

        return logits
```

The encoder processes the source sentence once. The decoder repeatedly attends to those encoder representations while generating the translation.

### Training Objective

Translation models are usually trained with token-level cross-entropy loss.

Suppose:

```text id="v1gq0q"
logits: [B, T, V]
targets: [B, T]
```

We flatten the batch and sequence dimensions:

```python id="3j1q5t"
import torch.nn as nn

# Padding positions (assumed pad token id 0) are excluded from the loss.
loss_fn = nn.CrossEntropyLoss(ignore_index=0)

B, T, V = logits.shape

# CrossEntropyLoss expects [N, V] logits and [N] class indices.
loss = loss_fn(
    logits.reshape(B * T, V),
    targets.reshape(B * T),
)
```

The target tensor contains the expected next token at each position.

Padding tokens are ignored.

### Greedy Decoding

During inference, the decoder generates tokens autoregressively.

The simplest decoding strategy is greedy decoding.

At each step:

$$
y_t = \arg\max_k P(k \mid y_{<t}, x).
$$

The model selects the highest-probability next token.

Example:

| Decoder input | Next token |
|---|---|
| `<bos>` | `le` |
| `le` | `chat` |
| `le chat` | `dort` |
| `le chat dort` | `<eos>` |

Greedy decoding is simple and fast, but it may produce suboptimal translations because it commits to local decisions too early.
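The loop can be sketched with a stand-in for the model (an invented lookup table of next-token probabilities):

```python
def greedy_decode(step_fn, max_len=10, bos="<bos>", eos="<eos>"):
    """Greedy decoding: pick the argmax token at every step.

    step_fn(history) stands in for the model: it returns a dict of
    next-token probabilities given the tokens generated so far.
    """
    history = [bos]
    while len(history) < max_len:
        probs = step_fn(history)
        token = max(probs, key=probs.get)  # argmax over the vocabulary
        history.append(token)
        if token == eos:
            break
    return history[1:]

# A toy deterministic "model" (invented probabilities).
table = {
    ("<bos>",): {"le": 0.9, "la": 0.1},
    ("<bos>", "le"): {"chat": 0.8, "chien": 0.2},
    ("<bos>", "le", "chat"): {"dort": 0.7, "mange": 0.3},
    ("<bos>", "le", "chat", "dort"): {"<eos>": 0.95, "dort": 0.05},
}

print(greedy_decode(lambda h: table[tuple(h)]))  # ['le', 'chat', 'dort', '<eos>']
```

Each step commits to a single token and never revisits the choice, which is exactly the weakness beam search addresses.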

### Beam Search

Beam search keeps several candidate translations instead of only one.

At each step:

1. Expand each candidate with possible next tokens.
2. Compute cumulative log probabilities.
3. Keep the top $k$ sequences.

The parameter $k$ is called the beam width.

Beam search approximates:

$$
\arg\max_y P(y \mid x).
$$

It usually improves translation quality compared with greedy decoding.

However, very large beam sizes can reduce diversity and sometimes produce repetitive or overly generic translations.
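A compact sketch of the procedure, using the same kind of stand-in model as for greedy decoding (invented probabilities):

```python
import math

def beam_search(step_fn, beam_width=2, max_len=6, bos="<bos>", eos="<eos>"):
    """Keep the beam_width highest-scoring partial hypotheses at each step.

    step_fn(tokens) stands in for the model: it returns a dict of
    next-token probabilities given the tokens generated so far.
    """
    beams = [([bos], 0.0)]  # (tokens, cumulative log probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == eos:           # finished hypotheses carry over
                candidates.append((tokens, score))
                continue
            for token, p in step_fn(tokens).items():
                candidates.append((tokens + [token], score + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(tokens[-1] == eos for tokens, _ in beams):
            break
    return beams[0][0][1:]  # best hypothesis, <bos> stripped

# A toy deterministic "model" (invented probabilities).
table = {
    ("<bos>",): {"le": 0.6, "la": 0.4},
    ("<bos>", "le"): {"chat": 0.9, "<eos>": 0.1},
    ("<bos>", "la"): {"nuit": 0.9, "<eos>": 0.1},
    ("<bos>", "le", "chat"): {"<eos>": 1.0},
    ("<bos>", "la", "nuit"): {"<eos>": 1.0},
}

print(beam_search(lambda t: table[tuple(t)]))  # ['le', 'chat', '<eos>']
```

Production decoders add length normalization and stopping heuristics on top of this basic loop.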

### Length Normalization

Longer sequences tend to have lower cumulative probabilities because probabilities are multiplied across many steps.

Without correction, beam search may prefer short outputs.

Length normalization adjusts sequence scores:

$$
\frac{1}{T^\alpha}
\sum_{t=1}^{T}
\log P(y_t \mid y_{<t}, x).
$$

The parameter $\alpha$ controls the strength of normalization; values around 0.6 to 1.0 are common in practice.

This helps prevent beam search from terminating too early.
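The effect can be demonstrated with invented per-token probabilities for a short and a long hypothesis:

```python
import math

def normalized_score(token_probs, alpha=0.6):
    """Sum of token log probs divided by length^alpha."""
    logp = sum(math.log(p) for p in token_probs)
    return logp / (len(token_probs) ** alpha)

# Invented per-token probabilities: a 2-token and a 4-token hypothesis.
short = [0.7, 0.7]
long_ = [0.8, 0.8, 0.8, 0.8]

# The raw log probability favors the short hypothesis...
raw_short = sum(math.log(p) for p in short)
raw_long = sum(math.log(p) for p in long_)
print(raw_short > raw_long)  # True

# ...but the length-normalized score favors the long one.
print(normalized_score(long_) > normalized_score(short))  # True
```

Even though every token of the long hypothesis is more probable, its raw score loses simply because it accumulates more negative log terms; dividing by $T^\alpha$ corrects this bias.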

### Exposure Bias

Teacher forcing creates a mismatch between training and inference.

During training:

```text id="wyv5n7"
decoder input = gold tokens
```

During inference:

```text id="zjvw0d"
decoder input = model predictions
```

This mismatch is called exposure bias.

A mistake during inference can push the model into contexts it never saw during training.

Several methods attempt to reduce exposure bias:

| Method | Idea |
|---|---|
| Scheduled sampling | Occasionally feed model predictions during training |
| Sequence-level training | Optimize full generated sequences |
| Reinforcement learning | Use task-level rewards |
| Data augmentation | Expose model to noisy histories |

In practice, standard teacher forcing with large datasets and transformers often works surprisingly well.
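Scheduled sampling, the first method in the table above, can be sketched as a per-position coin flip (a simplified version of the original annealed schedule):

```python
import random

def mix_history(gold, predicted, p_gold, rng=random):
    """Scheduled sampling sketch: at each position, feed the gold token with
    probability p_gold, otherwise the model's own earlier prediction.
    In practice p_gold starts near 1.0 and decays during training."""
    return [g if rng.random() < p_gold else p
            for g, p in zip(gold, predicted)]

gold = ["le", "chat", "dort"]
predicted = ["le", "chien", "mange"]  # hypothetical model outputs

print(mix_history(gold, predicted, 1.0))  # pure teacher forcing -> gold tokens
print(mix_history(gold, predicted, 0.0))  # pure free-running -> predictions
```

At `p_gold = 1.0` this reduces to teacher forcing; at `p_gold = 0.0` the decoder conditions entirely on its own outputs, as at inference time.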

### Evaluation Metrics

Translation quality is difficult to measure automatically because many valid translations exist.

Suppose the reference translation is:

```text id="2v8i4l"
the cat is sleeping
```

The model output:

```text id="k2t7r0"
the cat sleeps
```

may still be correct.

The most common automatic metric is BLEU.

BLEU measures overlap between generated and reference n-grams while penalizing overly short outputs.
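The core quantity, clipped n-gram precision, can be sketched for a single reference (a toy version with no brevity penalty or smoothing):

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: candidate n-gram counts are capped by
    their counts in the reference."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    overlap = sum(min(count, ref[g]) for g, count in cand.items())
    return overlap / max(sum(cand.values()), 1)

reference = "the cat is sleeping".split()
candidate = "the cat sleeps".split()

print(modified_precision(candidate, reference, 1))  # 2/3: 'the' and 'cat' match
print(modified_precision(candidate, reference, 2))  # 1/2: only ('the', 'cat')
```

Full BLEU combines the geometric mean of precisions for n = 1 to 4 with a brevity penalty, and is usually computed at the corpus level rather than per sentence.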

Other metrics include:

| Metric | Main idea |
|---|---|
| BLEU | N-gram overlap |
| ROUGE | Recall-oriented overlap |
| METEOR | Flexible matching with stemming and synonyms |
| chrF | Character-level overlap |
| COMET | Learned neural evaluation metric |
| BLEURT | Learned semantic similarity metric |

Modern evaluation increasingly uses learned metrics because simple n-gram overlap correlates imperfectly with human judgment.

### Multilingual Translation

Multilingual translation systems train on many language pairs simultaneously.

A single model may support:

```text id="2b0pt3"
English -> French
English -> German
French -> English
German -> Spanish
```

and many more.

A common strategy prepends a language-control token:

```text id="ztfzsv"
<fr> the cat sleeps
```

The decoder then generates French text.

Multilingual systems can transfer knowledge across languages. High-resource languages may improve low-resource languages through shared representations.

However, multilingual systems also face challenges:

| Challenge | Description |
|---|---|
| Vocabulary imbalance | Some languages receive many more tokens |
| Script diversity | Different alphabets and writing systems |
| Data imbalance | High-resource languages dominate training |
| Long-tail morphology | Complex word structures |
| Token efficiency | Some languages require more tokens |

### Alignment and Attention Maps

Translation attention patterns often reveal approximate word alignments.

For example:

| Source | Attention target |
|---|---|
| `cat` | `chat` |
| `black` | `noir` |
| `sleeps` | `dort` |

Attention matrices can sometimes be visualized as heatmaps.

However, transformer attention is not always a reliable explanation mechanism. Multiple attention heads interact, and the model’s behavior depends on deeper nonlinear computations.

Still, attention maps are often useful debugging tools.

### Hallucinations and Translation Errors

Translation systems may produce:

| Error type | Example |
|---|---|
| Omission | Missing information |
| Addition | Hallucinated content |
| Reordering error | Incorrect syntax |
| Agreement error | Wrong grammatical agreement |
| Named entity corruption | Wrong person or location names |
| Literal translation | Grammatically unnatural output |

Hallucinations are especially dangerous in medical, legal, or financial translation systems.

A translation model may generate fluent but incorrect text. Fluency does not guarantee factual accuracy.

### Low-Resource Translation

Some language pairs have limited parallel training data.

Common strategies include:

| Method | Idea |
|---|---|
| Multilingual pretraining | Share knowledge across languages |
| Back-translation | Generate synthetic parallel data |
| Transfer learning | Fine-tune from related languages |
| Unsupervised translation | Use monolingual corpora only |
| Retrieval augmentation | Retrieve similar bilingual examples |

Back-translation is particularly important. A reverse translation model generates synthetic source sentences from target-language monolingual data, increasing training data size.

### Translation with Large Language Models

Large language models can perform translation without task-specific fine-tuning.

For example:

```text id="x04d40"
Translate to German:
The weather is beautiful today.
```

The model generates:

```text id="76zv2r"
Das Wetter ist heute schön.
```

Instruction tuning and multilingual pretraining allow general-purpose models to perform translation as one capability among many.

However, specialized translation systems may still outperform general LLMs on domain accuracy, terminology consistency, and latency.

### Summary

Machine translation maps source sequences to target sequences. Modern systems are usually transformer-based sequence-to-sequence models with encoder-decoder architectures and attention mechanisms.

Tokenization converts text into subword units. The encoder computes contextual source representations. The decoder autoregressively predicts target tokens while attending to the source sequence.

Training usually uses teacher forcing and cross-entropy loss. Inference uses greedy decoding or beam search. Evaluation relies on metrics such as BLEU and learned semantic scoring systems. Multilingual and large language models extend translation to many languages and tasks within a unified architecture.

