Cross-Lingual Transfer

Cross-lingual transfer is the ability of a model trained or adapted in one language to work in another language. It is important because labeled data is unevenly distributed across languages. English has many datasets and benchmarks. Many other languages have limited annotation, limited digital text, or domain-specific data that is expensive to label.

The goal is to share knowledge across languages. A model may learn sentiment classification, named entity recognition, question answering, or retrieval from high-resource languages, then apply that knowledge to lower-resource languages.

A cross-lingual model learns a function

f_\theta(x, \ell) \rightarrow y,

where x is the input text, \ell is the language, and y is the task output. In many systems, the language ID is implicit rather than explicitly provided.

Why Cross-Lingual Transfer Works

Cross-lingual transfer works because many languages share structure at several levels.

At the semantic level, different languages can express the same meaning:

English: The movie was excellent.
Vietnamese: Bộ phim rất xuất sắc.
French: Le film était excellent.

At the task level, the labels may be shared. The sentiment label is positive in all three cases.

At the representation level, multilingual models learn embeddings where similar meanings across languages are close together. This allows a classifier trained on one language to generalize to another.

The model does not need identical words. It needs representations that preserve meaning and task-relevant distinctions.

Multilingual Tokenization

Cross-lingual models usually rely on shared tokenization. A multilingual tokenizer contains subword units from many languages.

For example, a sentence is converted into token IDs:

input_ids      # [B, T]
attention_mask # [B, T]

The vocabulary may contain Latin characters, accents, CJK characters, Arabic script, Cyrillic script, punctuation, numbers, and frequent subwords from many languages.

Subword tokenization helps with rare words. A word unseen during training can be broken into smaller known pieces.

However, tokenization quality varies by language. Some languages require more tokens per word or sentence. This increases sequence length and computation. It can also reduce model quality because information is fragmented across many subwords.
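
The sketch below illustrates this with a Hugging Face multilingual tokenizer; xlm-roberta-base is an illustrative choice, not a requirement. Comparing token counts across languages is a quick way to see the per-language differences described above.

from transformers import AutoTokenizer

# Assumption: xlm-roberta-base stands in for any shared-vocabulary multilingual tokenizer.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

sentences = {
    "en": "The movie was excellent.",
    "vi": "Bộ phim rất xuất sắc.",
    "fr": "Le film était excellent.",
}

for lang, text in sentences.items():
    encoded = tokenizer(text, return_tensors="pt")
    print(lang, encoded["input_ids"].shape)      # [B, T]: batch of 1, T subword tokens
    print(lang, tokenizer.tokenize(text))        # subword pieces; the split can differ sharply by language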

Shared Embedding Spaces

A multilingual model maps text from different languages into a shared representation space.

For an input sequence,

x = (x_1, x_2, \ldots, x_T),

the encoder produces hidden states

H \in \mathbb{R}^{T \times D}.

If the model is well aligned, sentences with similar meanings have similar representations even when written in different languages.

For sentence-level tasks, we may pool hidden states into a single vector:

h = \operatorname{pool}(H).

A classifier then predicts:

z = Wh + b.

The same classifier head can be applied across languages if the representations are aligned.
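
A minimal sketch of this pipeline is shown below, assuming a Hugging Face encoder (xlm-roberta-base as an example) and mean pooling; the pooling choice and the untrained linear head are illustrative assumptions, not a fixed recipe.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")   # assumed example encoder
encoder = AutoModel.from_pretrained("xlm-roberta-base")

def encode(text):
    batch = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        H = encoder(**batch).last_hidden_state        # H: [1, T, D]
    mask = batch["attention_mask"].unsqueeze(-1)      # ignore padding positions when pooling
    return (H * mask).sum(1) / mask.sum(1)            # h = pool(H), mean pooling here

h_en = encode("The movie was excellent.")
h_vi = encode("Bộ phim rất xuất sắc.")
print(torch.cosine_similarity(h_en, h_vi))            # high if the two languages are well aligned

# The same linear head can score either vector: z = W h + b.
classifier = torch.nn.Linear(encoder.config.hidden_size, 2)
print(classifier(h_en).shape, classifier(h_vi).shape)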

Zero-Shot Cross-Lingual Transfer

Zero-shot transfer means training on one language and evaluating on another language without labeled examples in the target language.

A common setup is:

  1. Fine-tune a multilingual model on English labeled data.
  2. Evaluate directly on Vietnamese, Arabic, Swahili, Hindi, or another target language.

For example:

Train: English sentiment dataset
Test: Vietnamese sentiment dataset

The model succeeds only if the multilingual encoder maps English and Vietnamese inputs into compatible task representations.

Zero-shot transfer is useful when target-language labels are unavailable. It is also fragile. Performance depends on language similarity, tokenizer coverage, pretraining data, domain match, and task difficulty.
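
Below is a minimal sketch of this setup, assuming a Hugging Face multilingual encoder and a toy two-example dataset in place of real corpora; the training loop is deliberately stripped down and not a production recipe.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumption: tiny toy datasets stand in for real English training data and a Vietnamese test set.
train_en = [("The movie was excellent.", 1), ("The movie was terrible.", 0)]
test_vi  = [("Bộ phim rất xuất sắc.", 1), ("Bộ phim rất tệ.", 0)]

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Step 1: fine-tune on English labeled data only.
model.train()
for epoch in range(3):
    for text, label in train_en:
        batch = tokenizer(text, return_tensors="pt")
        loss = model(**batch, labels=torch.tensor([label])).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Step 2: evaluate directly on Vietnamese, with no Vietnamese labels seen during training.
model.eval()
correct = 0
with torch.no_grad():
    for text, label in test_vi:
        logits = model(**tokenizer(text, return_tensors="pt")).logits
        correct += int(logits.argmax(-1).item() == label)
print("vi accuracy:", correct / len(test_vi))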

Few-Shot Cross-Lingual Transfer

Few-shot transfer uses a small number of labeled examples in the target language.

A typical procedure:

  1. Start with a multilingual pretrained model.
  2. Fine-tune on a high-resource language.
  3. Continue fine-tuning on a small target-language dataset.

Few-shot adaptation often gives large gains over zero-shot transfer. Even a few hundred examples can teach the model target-language label conventions, domain vocabulary, and annotation style.

Few-shot transfer is especially useful when labels are cheap to obtain in small quantities but expensive at scale.
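
A minimal sketch of step 3 is shown below. The checkpoint path sentiment-en-finetuned is a placeholder for a model already fine-tuned on the high-resource language, the two Vietnamese examples stand in for a small labeled set, and the learning rate and epoch count are illustrative.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
# Placeholder path: assumed to hold a checkpoint already fine-tuned on English labels.
model = AutoModelForSequenceClassification.from_pretrained("sentiment-en-finetuned", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)   # smaller LR for the short target-language pass

few_shot_vi = [("Dịch vụ rất tốt.", 1), ("Sản phẩm quá tệ.", 0)]   # stands in for a few hundred examples

model.train()
for epoch in range(2):
    for text, label in few_shot_vi:
        batch = tokenizer(text, return_tensors="pt")
        loss = model(**batch, labels=torch.tensor([label])).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()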

Translate-Train and Translate-Test

Machine translation provides another route to cross-lingual transfer.

In translate-train, labeled examples from a source language are translated into the target language. The model is trained on the translated data.

English labeled data
→ translate to Vietnamese
→ train Vietnamese classifier

In translate-test, target-language inputs are translated into the source language. A source-language model is then applied.

Vietnamese test input
→ translate to English
→ English classifier

Translate-train can create large synthetic labeled datasets. Translate-test can be useful when a strong source-language model already exists.

Both methods depend on translation quality. Translation may distort named entities, sentiment, domain terms, legal meaning, or cultural references. Errors in translation become errors in supervision or inference.
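
One possible translate-train sketch, assuming a Hugging Face translation pipeline; the Helsinki-NLP/opus-mt-en-vi model name is an illustrative assumption, and carrying labels through translation is exactly the step where errors can enter the supervision.

from transformers import pipeline

# Assumption: an off-the-shelf English-to-Vietnamese MT model; any MT system could be substituted.
translate_en_vi = pipeline("translation", model="Helsinki-NLP/opus-mt-en-vi")

english_labeled = [("The movie was excellent.", 1), ("The movie was terrible.", 0)]

# Translate-train: translate the inputs, keep the labels, then train a target-language classifier.
vietnamese_labeled = [
    (translate_en_vi(text)[0]["translation_text"], label)
    for text, label in english_labeled
]
print(vietnamese_labeled)

# Translate-test would instead translate Vietnamese test inputs into English
# and score them with an existing English classifier.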

Cross-Lingual Named Entity Recognition

Named entity recognition is difficult in cross-lingual settings because entity boundaries and surface forms differ across languages.

Consider:

English: President Macron visited Hanoi.
Vietnamese: Tổng thống Macron đã thăm Hà Nội.

The entity Macron transfers easily. The location Hà Nội introduces diacritics and different subword boundaries.

Cross-lingual NER must handle:

| Issue | Example |
| --- | --- |
| Different scripts | English, Arabic, Chinese, Hindi |
| Transliteration | Moscow, Moskva, Москва |
| Inflection | Names change form by case or suffix |
| Token boundaries | Some languages lack whitespace segmentation |
| Local entities | Target language contains entities absent from source data |

A common method is to fine-tune a multilingual encoder with BIO labels. Another is to project labels onto translated text through word alignment.
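
A sketch of the first method's preprocessing step is shown below, under the assumption of a fast multilingual tokenizer: word-level BIO labels are expanded to subword labels with word_ids(), which is exactly where diacritics and token boundaries tend to cause trouble.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")   # assumed example; must be a "fast" tokenizer

words  = ["Tổng", "thống", "Macron", "đã", "thăm", "Hà", "Nội", "."]
labels = ["O", "O", "B-PER", "O", "O", "B-LOC", "I-LOC", "O"]    # word-level BIO tags

encoded = tokenizer(words, is_split_into_words=True)

# Every subword of a word gets that word's tag (a common simplification);
# special tokens are marked IGNORE (in training they would get label id -100).
subword_labels = [
    labels[word_id] if word_id is not None else "IGNORE"
    for word_id in encoded.word_ids(batch_index=0)
]
print(list(zip(tokenizer.convert_ids_to_tokens(encoded["input_ids"]), subword_labels)))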

Cross-Lingual Retrieval

Cross-lingual retrieval finds documents in one language using queries in another language.

Example:

Query: symptoms of dengue fever
Documents: Vietnamese medical articles

The query and documents must be represented in a shared embedding space.

A dense retriever computes:

h_q = f_\theta(q_{\text{English}}), \qquad h_d = g_\theta(d_{\text{Vietnamese}}),

and ranks documents by

s(q,d) = h_q^\top h_d.

Cross-lingual retrieval is useful for multilingual search, international legal research, patent search, news monitoring, and RAG over multilingual corpora.

Hybrid retrieval can help. Dense retrieval handles semantic transfer. Sparse retrieval handles names, numbers, citations, and exact terms.
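
A minimal dense-retrieval sketch, assuming a multilingual sentence-embedding model from the sentence-transformers library (the model name is an illustrative assumption) and dot-product scoring as in s(q,d) = h_q^\top h_d.

import numpy as np
from sentence_transformers import SentenceTransformer

# Assumption: one shared multilingual encoder for queries and documents (f_theta = g_theta);
# asymmetric query/document encoders are also common.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

query = "symptoms of dengue fever"
docs = [
    "Triệu chứng của sốt xuất huyết gồm sốt cao, đau đầu và phát ban.",   # dengue symptoms (relevant)
    "Hướng dẫn bảo trì xe máy định kỳ.",                                   # motorbike maintenance (irrelevant)
]

h_q = model.encode([query])            # [1, D]
h_d = model.encode(docs)               # [N, D]
scores = h_q @ h_d.T                   # s(q, d) = h_q^T h_d for every document
ranking = np.argsort(-scores[0])       # best-scoring document first
print(ranking, scores[0])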

Cross-Lingual Question Answering

Cross-lingual question answering may involve several language configurations.

| Question language | Context language | Answer language |
| --- | --- | --- |
| English | English | English |
| English | Vietnamese | English |
| Vietnamese | English | Vietnamese |
| Vietnamese | Vietnamese | Vietnamese |
| English | many languages | English |

The system may retrieve evidence in multiple languages, translate passages, or generate an answer in the user’s language.

A retrieval-augmented cross-lingual QA pipeline may look like:

User question
→ multilingual retrieval
→ passage translation or direct multilingual reading
→ answer generation
→ citation preservation

The system must preserve source grounding. If evidence is in one language and the answer is in another, citations and translated claims must remain aligned.
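
The sketch below mirrors that pipeline as plain functions. retrieve_passages, translate, and generate_answer are hypothetical placeholders for whatever retriever, MT system, and generator a deployment actually uses; the point is only how grounding metadata is threaded through each step.

from dataclasses import dataclass

@dataclass
class Passage:
    text: str
    language: str
    source_id: str        # citation target that must survive translation

def answer(question: str, user_language: str,
           retrieve_passages, translate, generate_answer) -> dict:
    """Hypothetical cross-lingual RAG pipeline: retrieve, optionally translate, generate, cite."""
    passages = retrieve_passages(question)                     # multilingual retrieval
    readable = [
        p if p.language == user_language
        else Passage(translate(p.text, p.language, user_language), user_language, p.source_id)
        for p in passages                                      # translation keeps source_id attached
    ]
    draft = generate_answer(question, [p.text for p in readable], user_language)
    return {"answer": draft, "citations": [p.source_id for p in passages]}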

Language Imbalance

Multilingual models are affected by language imbalance. High-resource languages dominate pretraining data. Low-resource languages receive fewer updates and often have worse tokenization.

This leads to several effects:

| Effect | Consequence |
| --- | --- |
| Poor tokenization | More tokens per sentence |
| Weak representations | Lower downstream accuracy |
| Script bias | Some scripts receive weaker coverage |
| Domain gaps | Formal text works better than colloquial text |
| Cultural gaps | Idioms and local references are misunderstood |

Language imbalance is not just a data quantity problem. It also affects vocabulary construction, benchmark design, and evaluation fairness.

Code-Switching

Many users mix languages in the same sentence or conversation.

Example:

Deploy model này lên GPU server như thế nào?

This sentence asks how to deploy a model to a GPU server; it mixes English technical terms with Vietnamese grammar.

Code-switching is common in multilingual communities, technical work, social media, and customer support. Models must handle mixed vocabulary, mixed syntax, and mixed scripts.

Training data for code-switching is often limited. Useful methods include multilingual pretraining, synthetic code-switched data, and evaluation sets built from real user text.
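
One crude but common way to synthesize code-switched data is lexicon substitution. The sketch below assumes a tiny hand-written Vietnamese-to-English term dictionary; a real pipeline would use larger lexicons or word-alignment models.

import random

# Assumption: a toy bilingual dictionary of technical terms, for illustration only.
vi_to_en = {"mô hình": "model", "máy chủ": "server", "triển khai": "deploy"}

def synth_code_switch(sentence_vi: str, p: float = 0.5) -> str:
    """Randomly replace dictionary terms with their English equivalents."""
    out = sentence_vi
    for vi, en in vi_to_en.items():
        if vi in out and random.random() < p:
            out = out.replace(vi, en)
    return out

print(synth_code_switch("Làm thế nào để triển khai mô hình lên máy chủ GPU?"))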

Evaluation

Cross-lingual systems should be evaluated separately for each language. A single aggregate score can hide poor performance on low-resource languages.

Important evaluation dimensions include:

| Dimension | Question |
| --- | --- |
| Per-language accuracy | Which languages work well or poorly? |
| Zero-shot transfer | Does training on the source language generalize? |
| Few-shot adaptation | How much target data is needed? |
| Domain robustness | Does it work outside benchmark text? |
| Token efficiency | How many tokens are needed per input? |
| Script coverage | Does performance differ by writing system? |
| Fairness | Are some groups or dialects underserved? |

For cross-lingual retrieval, evaluate recall@k and nDCG by language pair. For cross-lingual QA, evaluate both answer correctness and evidence grounding.
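
A small sketch of per-language reporting, with recall@k computed separately for each language pair; the data layout used here is an assumption made for illustration.

from collections import defaultdict

def recall_at_k(results, k=10):
    """results: list of dicts like
    {"lang_pair": "en-vi", "relevant_ids": {...}, "retrieved_ids": [...]} (assumed layout)."""
    per_pair = defaultdict(list)
    for r in results:
        hits = len(r["relevant_ids"] & set(r["retrieved_ids"][:k]))
        per_pair[r["lang_pair"]].append(hits / max(len(r["relevant_ids"]), 1))
    # Report each language pair separately; an aggregate average can hide weak pairs.
    return {pair: sum(v) / len(v) for pair, v in per_pair.items()}

example = [
    {"lang_pair": "en-vi", "relevant_ids": {"d1"}, "retrieved_ids": ["d1", "d7"]},
    {"lang_pair": "en-sw", "relevant_ids": {"d9"}, "retrieved_ids": ["d3", "d4"]},
]
print(recall_at_k(example, k=2))   # {"en-vi": 1.0, "en-sw": 0.0}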

Practical Design Choices

A practical cross-lingual system should choose among several strategies.

| Setting | Recommended approach |
| --- | --- |
| No target labels | Zero-shot multilingual fine-tuning |
| Small target labels | Few-shot target adaptation |
| Good machine translation available | Translate-train or translate-test |
| Multilingual search | Dense multilingual retrieval plus sparse exact matching |
| High-stakes domain | Human-reviewed translation and target-language evaluation |
| Code-switching users | Include real mixed-language examples |
| Low-resource language | Collect targeted data and inspect tokenizer behavior |

There is no universal best method. The right choice depends on data availability, translation quality, latency, risk, and language coverage.

Common Failure Modes

Cross-lingual systems fail in predictable ways.

| Failure mode | Description |
| --- | --- |
| Literal translation error | Idioms or sentiment are translated incorrectly |
| Entity corruption | Names are mistranslated or normalized wrongly |
| Script mismatch | Model fails on non-Latin text |
| Token explosion | Input becomes too long after tokenization |
| Cultural mismatch | Model misses local meaning or context |
| Domain mismatch | Pretraining lacks target-domain language |
| False semantic match | Dense retriever returns related but wrong passages |
| Evaluation masking | Average score hides weak languages |
| Code-switching failure | Mixed-language input is parsed incorrectly |

Error analysis should be done by language and by task type. A model that works for English and French may fail badly for Vietnamese legal documents or Arabic social media.

Summary

Cross-lingual transfer allows knowledge learned in one language to support tasks in another language. It relies on multilingual tokenization, shared representation spaces, translation, and task-level alignment.

Zero-shot transfer is useful when no target-language labels exist. Few-shot transfer is usually stronger when small labeled datasets are available. Translate-train and translate-test are practical alternatives when machine translation quality is good.

Reliable cross-lingual systems require per-language evaluation, attention to tokenization, domain-specific testing, and explicit handling of low-resource languages, code-switching, and entity preservation.