Neural Machine Translation

Neural machine translation maps a sentence in one language to a sentence in another language using a neural sequence model. The model receives a source sentence and generates a target sentence.

For example:

source: I like cats.
target: J'aime les chats.

This is a conditional generation problem. The model learns the probability of a target sentence given a source sentence:

p(y \mid x) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, x).

Here x is the source sentence and y is the target sentence.

From Phrase Tables to Neural Translation

Older statistical machine translation systems used phrase tables, alignment models, and hand-engineered features. They decomposed translation into many separate components: word alignment, phrase extraction, language modeling, reordering, and decoding.

Neural machine translation replaces most of this pipeline with one trainable model. The encoder reads the source sentence. The decoder generates the target sentence. Attention or cross-attention lets the decoder focus on relevant source tokens.

The result is a single differentiable system trained with parallel text pairs.

Parallel Corpora

Machine translation requires paired examples:

(x^{(i)}, y^{(i)}),

where x^{(i)} is a sentence in the source language and y^{(i)} is its translation.

A dataset may contain examples such as:

Source                       Target
I like cats.                 J'aime les chats.
Where is the station?        Où est la gare ?
The book is on the table.    Le livre est sur la table.

The quality of the corpus matters. Good translation data should have correct sentence alignment, consistent language direction, little boilerplate, and limited duplication. Noisy pairs teach the model bad correspondences.
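
A simple heuristic that catches some noisy pairs is a length-ratio filter. The sketch below drops empty pairs, very long sentences, and pairs whose token counts differ by a large factor; the thresholds are illustrative, not standard values.

def keep_pair(src: str, tgt: str, max_ratio: float = 2.5, max_len: int = 250) -> bool:
    # Heuristic filter: reject empty pairs, very long sentences,
    # and pairs whose lengths differ by more than max_ratio.
    src_len = len(src.split())
    tgt_len = len(tgt.split())
    if src_len == 0 or tgt_len == 0:
        return False
    if src_len > max_len or tgt_len > max_len:
        return False
    ratio = max(src_len, tgt_len) / min(src_len, tgt_len)
    return ratio <= max_ratio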

Tokenization for Translation

Translation systems usually tokenize text into subword units rather than whole words. Subword tokenization helps handle rare words, morphology, and open vocabulary generation.

Common methods include byte-pair encoding, WordPiece, SentencePiece, and byte-level tokenization.

A word such as

unbelievable

may be split into

un ##believ ##able

or

un believe able

depending on the tokenizer.

Subwords reduce out-of-vocabulary problems. The model can translate unfamiliar words by composing known pieces.
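
As a toy illustration, a greedy longest-match segmenter over a small hand-picked subword vocabulary (not a real BPE or WordPiece implementation) splits an unseen word into known pieces:

def greedy_segment(word, vocab):
    # Greedy longest-match segmentation: repeatedly take the longest
    # vocabulary piece that starts the remaining text.
    pieces = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab:
                pieces.append(piece)
                i = j
                break
        else:
            pieces.append(word[i])  # fall back to a single character
            i += 1
    return pieces

toy_vocab = {"un", "believ", "able"}
print(greedy_segment("unbelievable", toy_vocab))  # ['un', 'believ', 'able']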

Shared and Separate Vocabularies

A translation model may use separate vocabularies for the source and target languages, or one shared vocabulary.

Separate vocabularies are natural when languages use different scripts. For example, English-to-Japanese may use separate source and target vocabularies.

A shared vocabulary can help when languages share scripts or many words. For example, English, French, Spanish, and German share numbers, punctuation, names, and some word pieces.

In multilingual models, shared tokenization is common because a single model must process many languages.

Encoder-Decoder Transformer for Translation

Modern neural translation systems usually use transformer encoder-decoder models.

The encoder receives source token IDs:

X \in \mathbb{N}^{B \times S}.

After embedding and positional encoding, the encoder produces contextual source states:

H \in \mathbb{R}^{B \times S \times D}.

The decoder receives shifted target tokens:

Y_{\text{in}} \in \mathbb{N}^{B \times T}.

It produces decoder states:

Z \in \mathbb{R}^{B \times T \times D}.

A final linear layer maps these states to target vocabulary logits:

\text{logits} \in \mathbb{R}^{B \times T \times V}.

The decoder uses three information sources: previous target tokens through masked self-attention, source sentence representations through cross-attention, and positional information through positional encodings.
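
A minimal sketch of such a model using torch.nn.Transformer is shown below. The dimensions and class name are illustrative, and positional encodings are omitted for brevity; a real system would add them to the embeddings.

import torch
import torch.nn as nn

class TranslationModel(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model=256, nhead=4, num_layers=3):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab_size, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=nhead,
            num_encoder_layers=num_layers,
            num_decoder_layers=num_layers,
            batch_first=True,
        )
        self.out_proj = nn.Linear(d_model, tgt_vocab_size)

    def forward(self, src_tokens, tgt_tokens, src_padding_mask=None, tgt_padding_mask=None):
        # Causal mask so each target position only attends to earlier target positions.
        causal_mask = self.transformer.generate_square_subsequent_mask(
            tgt_tokens.size(1)
        ).to(tgt_tokens.device)
        decoder_states = self.transformer(
            self.src_embed(src_tokens),   # [B, S, D]
            self.tgt_embed(tgt_tokens),   # [B, T, D]
            tgt_mask=causal_mask,
            src_key_padding_mask=src_padding_mask,
            tgt_key_padding_mask=tgt_padding_mask,
            memory_key_padding_mask=src_padding_mask,
        )
        return self.out_proj(decoder_states)  # [B, T, V]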

Training Objective

Given a source sentence x and target sentence y, the model minimizes negative log-likelihood:

\mathcal{L} = -\sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t}, x).

This is token-level cross-entropy.

In PyTorch, the training step has the same shifted-target structure described earlier:

import torch.nn.functional as F

# Shifted targets: the decoder input drops the final token,
# and the prediction target drops <bos>.
tgt_input = tgt_tokens[:, :-1]
tgt_target = tgt_tokens[:, 1:]

logits = model(src_tokens, tgt_input, src_padding_mask, tgt_padding_mask)

loss = F.cross_entropy(
    logits.reshape(-1, logits.size(-1)),
    tgt_target.reshape(-1),
    ignore_index=pad_id,
)

The source padding mask prevents the encoder and cross-attention from attending to padding. The target padding mask prevents the decoder from treating padding as real text. The causal mask prevents the decoder from seeing future target tokens.

Teacher Forcing in Translation

During training, the decoder receives the correct previous target tokens. For the target sentence

<bos> J'aime les chats . <eos>

the decoder input is

<bos> J'aime les chats .

and the prediction target is

J'aime les chats . <eos>

This makes translation training efficient because all target positions can be trained in parallel.

During inference, the correct target sentence is unknown. The model begins with <bos> and generates one token at a time until it emits <eos>.
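
A minimal greedy generation loop, a sketch assuming the model signature used in the training step above, makes this step-by-step process concrete. For simplicity it stops only once every sequence in the batch has just emitted <eos>; a real implementation would track finished sequences individually.

import torch

@torch.no_grad()
def greedy_translate(model, src_tokens, src_padding_mask, bos_id, eos_id, max_len=64):
    model.eval()
    batch_size = src_tokens.size(0)
    # Every hypothesis starts with <bos>; one token is appended per step.
    out = torch.full((batch_size, 1), bos_id, dtype=torch.long, device=src_tokens.device)
    for _ in range(max_len):
        tgt_padding_mask = torch.zeros_like(out, dtype=torch.bool)  # no padding while generating
        logits = model(src_tokens, out, src_padding_mask, tgt_padding_mask)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        out = torch.cat([out, next_token], dim=1)
        if (next_token.squeeze(1) == eos_id).all():
            break
    return out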

Decoding Translation Outputs

Translation usually uses greedy search or beam search.

Greedy search chooses the most likely token at each step. It is fast but can miss better complete translations.

Beam search keeps several candidate translations. It often improves translation quality, especially for moderate-length sentences.

A typical translation decoding configuration may use:

Setting                  Typical value
Beam size                4 to 8
Length penalty           0.6 to 1.0
Maximum length           Source length times 1.5 to 2.0
Repetition constraint    Optional

Length penalty matters because raw log probability favors shorter translations.
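
One common correction is the length normalization introduced with GNMT, where a hypothesis with summed log probability is divided by a penalty that grows with its length; the values below are illustrative.

def length_penalty(length, alpha=0.6):
    # GNMT-style length normalization: ((5 + |y|) / 6) ** alpha.
    return ((5 + length) / 6) ** alpha

# Longer hypotheses are divided by a larger penalty, which offsets
# the fact that each extra token adds a negative log probability.
normalized_score = -7.2 / length_penalty(12)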

Alignment and Cross-Attention

In attention-based translation, decoder tokens attend to source tokens. This often creates a soft alignment between source and target words.

For example, when producing the French word

chats

the decoder may attend strongly to the English word

cats

Cross-attention weights are not guaranteed to be exact linguistic alignments, but they often provide useful diagnostic information.

In transformer translation, each decoder layer has cross-attention. Each target position can query the source sequence and combine information from relevant source tokens.
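
If a model exposes its cross-attention weights, here a hypothetical tensor attn of shape [B, num_heads, T, S] from one decoder layer, a rough alignment can be read off by averaging over heads and taking the argmax over source positions:

# attn: cross-attention weights from one decoder layer, shape [B, num_heads, T, S].
# This tensor is assumed to be returned by the model; the exact API varies.
attn_mean = attn.mean(dim=1)               # average over heads -> [B, T, S]
aligned_source = attn_mean.argmax(dim=-1)  # most-attended source position per target token, [B, T]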

Handling Word Order

Languages differ in word order. English usually follows subject-verb-object order. Japanese often follows subject-object-verb order. Translation requires reordering information.

The encoder-decoder architecture handles this because the decoder generates the target sentence in target-language order. Cross-attention lets each target position retrieve the relevant source content, even if the source token appears far away.

For example:

English: I eat sushi.
Japanese: 私は寿司を食べます。

The verb appears earlier in English and later in Japanese. The decoder controls target ordering while cross-attention retrieves source meaning.

Morphology and Agreement

Translation must handle morphology: tense, number, gender, case, politeness, and other grammatical features.

For example, English adjectives do not usually change for gender, but French adjectives may:

English: a small house
French: une petite maison

The target adjective petite agrees with the feminine noun maison.

The model learns such patterns from parallel data. Good tokenization, sufficient data, and strong contextual representations are important for handling morphology.

Low-Resource Translation

Many language pairs have little parallel data. This is called low-resource translation.

Common techniques include transfer learning, multilingual training, back-translation, data filtering, and synthetic data generation.

Back-translation is especially common. Suppose we want to train English-to-Vietnamese translation but have limited English-Vietnamese data. If we have monolingual Vietnamese text, we can translate it back into English using a Vietnamese-to-English model. This creates synthetic parallel pairs.

synthetic English -> real Vietnamese

The forward model then trains on both real and synthetic data.
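
A rough sketch of this loop, reusing the greedy_translate helper from earlier and assuming a reverse_model (Vietnamese-to-English), monolingual batches, and the usual special token IDs, might look like this:

# monolingual_vi_batches, reverse_model, bos_id, eos_id, and pad_id are assumed to exist.
synthetic_pairs = []
for vi_tokens in monolingual_vi_batches:
    vi_padding_mask = vi_tokens == pad_id
    en_tokens = greedy_translate(reverse_model, vi_tokens, vi_padding_mask, bos_id, eos_id)
    # Synthetic English source paired with real Vietnamese target.
    synthetic_pairs.append((en_tokens, vi_tokens))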

Multilingual Translation

A multilingual translation model handles many language pairs with one model.

The input often includes a target-language tag:

<to_fr> I like cats.
<to_de> I like cats.
<to_vi> I like cats.

This tells the model which language to generate.
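
In practice the tag is often just another token ID prepended to the source sequence. The snippet below assumes to_fr_id is the vocabulary ID of the <to_fr> token; the value is hypothetical.

import torch

to_fr_id = 5  # hypothetical vocabulary ID for the <to_fr> tag token
tag_column = torch.full((src_tokens.size(0), 1), to_fr_id, dtype=src_tokens.dtype)
src_tokens_tagged = torch.cat([tag_column, src_tokens], dim=1)  # [B, S + 1]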

Multilingual models can transfer knowledge across related languages. They are useful when some languages have large datasets and others have small datasets.

However, multilingual training also introduces capacity competition. High-resource languages may dominate training unless data sampling is carefully controlled.

Domain Adaptation

A translation model trained on news text may perform poorly on legal contracts, medical documents, source code comments, or chat messages. This is a domain shift problem.

Domain adaptation methods include fine-tuning on in-domain parallel data, mixing general and domain-specific corpora, using terminology constraints, and retrieval-augmented translation memories.

Terminology is important in specialized domains. For example, the same English word may require different translations in legal and medical contexts.

Evaluation

Translation quality can be evaluated automatically and manually.

Common automatic metrics include BLEU, chrF, TER, COMET, and BLEURT. BLEU measures n-gram overlap with reference translations. chrF uses character n-grams and is useful for morphologically rich languages. Learned metrics such as COMET and BLEURT often correlate better with human judgments.
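
BLEU and chrF can be computed with the sacrebleu library. The sketch below assumes detokenized hypothesis and reference strings; the example sentences are illustrative.

import sacrebleu

hypotheses = ["J'aime les chats.", "Où est la gare ?"]
references = ["J'aime les chats.", "Où est la gare ?"]

# corpus_bleu and corpus_chrf take a list of hypotheses and a list of reference lists.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
print(bleu.score, chrf.score)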

Automatic metrics are useful for model development, but they are incomplete. They may miss adequacy, fluency, terminology accuracy, style, and meaning preservation.

Human evaluation usually considers adequacy and fluency. Adequacy asks whether the translation preserves meaning. Fluency asks whether the output is natural in the target language.

Common Translation Errors

Neural translation models make several recurring errors.

They may omit source content, especially in long sentences. They may hallucinate content that was not present. They may mistranslate rare names, numbers, dates, or technical terms. They may choose the wrong level of formality. They may produce fluent text that changes the meaning.

For high-stakes translation, output should be checked by a human or constrained with domain-specific terminology tools.

Minimal Translation Batch Example

A batch for translation often contains padded source and target token tensors.

import torch

pad_id = 0
bos_id = 1
eos_id = 2

src_tokens = torch.tensor([
    [10, 25, 31, 2, 0, 0],   # shorter source padded with pad_id
    [14, 90, 87, 33, 51, 2],
])

tgt_tokens = torch.tensor([
    [1, 44, 18, 72, 2, 0],   # <bos> ... <eos> <pad>
    [1, 60, 12, 19, 84, 2],  # <bos> ... <eos>
])

src_padding_mask = src_tokens == pad_id

tgt_input = tgt_tokens[:, :-1]
tgt_target = tgt_tokens[:, 1:]

tgt_padding_mask = tgt_input == pad_id

The model receives src_tokens and tgt_input. The loss is computed against tgt_target.

The source shape is

[B, S]

The decoder input shape is

[B, T]

The logits shape is

[B, T, V]

Practical PyTorch Translation Skeleton

A simplified training step has the following structure:

import torch.nn.functional as F


def training_step(model, batch, optimizer, pad_id):
    model.train()

    src_tokens = batch["src_tokens"]
    tgt_tokens = batch["tgt_tokens"]

    tgt_input = tgt_tokens[:, :-1]
    tgt_target = tgt_tokens[:, 1:]

    src_padding_mask = src_tokens == pad_id
    tgt_padding_mask = tgt_input == pad_id

    logits = model(
        src_tokens=src_tokens,
        tgt_tokens=tgt_input,
        src_padding_mask=src_padding_mask,
        tgt_padding_mask=tgt_padding_mask,
    )

    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        tgt_target.reshape(-1),
        ignore_index=pad_id,
    )

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    return loss.item()

This is the core of supervised neural machine translation training.

Summary

Neural machine translation is a sequence-to-sequence problem. The source sentence is encoded, and the target sentence is generated autoregressively.

Modern systems usually use transformer encoder-decoder models. They are trained with parallel corpora, subword tokenization, teacher forcing, causal masking, padding masks, and token-level cross-entropy loss.

At inference time, translation is generated with greedy decoding or beam search. Practical systems must handle length bias, domain shift, terminology, low-resource languages, multilingual transfer, and evaluation beyond simple n-gram overlap.