Cross-lingual transfer is the ability of a model trained or adapted in one language to work in another language. It is important because labeled data is unevenly distributed across languages. English has many datasets and benchmarks. Many other languages have limited annotation, limited digital text, or domain-specific data that is expensive to label.
The goal is to share knowledge across languages. A model may learn sentiment classification, named entity recognition, question answering, or retrieval from high-resource languages, then apply that knowledge to lower-resource languages.
A cross-lingual model learns a function

f(x, ℓ) → y

where x is the input text, ℓ is the language, and y is the task output. In many systems, the language ID ℓ is implicit rather than explicitly provided.
Why Cross-Lingual Transfer Works
Cross-lingual transfer works because many languages share structure at several levels.
At the semantic level, different languages can express the same meaning:
English: The movie was excellent.
Vietnamese: Bộ phim rất xuất sắc.
French: Le film était excellent.

At the task level, the labels may be shared. The sentiment label is positive in all three cases.
At the representation level, multilingual models learn embeddings where similar meanings across languages are close together. This allows a classifier trained on one language to generalize to another.
The model does not need identical words. It needs representations that preserve meaning and task-relevant distinctions.
Multilingual Tokenization
Cross-lingual models usually rely on shared tokenization. A multilingual tokenizer contains subword units from many languages.
For example, a sentence is converted into token IDs:
input_ids       # [B, T]
attention_mask  # [B, T]

The vocabulary may contain Latin characters, accents, CJK characters, Arabic script, Cyrillic script, punctuation, numbers, and frequent subwords from many languages.
Subword tokenization helps with rare words. A word unseen during training can be broken into smaller known pieces.
However, tokenization quality varies by language. Some languages require more tokens per word or sentence. This increases sequence length and computation. It can also reduce model quality because information is fragmented across many subwords.
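The fragmentation effect can be made concrete with a toy tokenizer. The sketch below is a greedy longest-match subword tokenizer with a tiny hypothetical vocabulary (not any real library's API); it shows how a word well covered by the vocabulary stays whole while an uncovered accented word splinters into many pieces.

```python
# Toy greedy longest-match subword tokenizer (a sketch, not a real
# library). The vocabulary is hypothetical and tiny; it only serves to
# show how coverage affects tokens-per-word "fertility".

VOCAB = {"excellent", "ex", "cel", "lent", "xu", "ất", "s", "c", " "}

def tokenize(text, vocab):
    """Greedily match the longest vocabulary entry at each position,
    falling back to single characters (as real subword tokenizers do
    with byte/character fallback)."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown-character fallback
            i += 1
    return tokens

print(tokenize("excellent", VOCAB))  # ['excellent'] — one token
print(tokenize("xuất sắc", VOCAB))   # many small pieces
```

In a real multilingual model, a language whose text routinely fragments like the second example pays more compute per sentence and gives the model a harder reconstruction problem.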
Shared Embedding Spaces
A multilingual model maps text from different languages into a shared representation space.
For an input sequence

x = (x₁, …, x_T),

the encoder produces hidden states

H = Encoder(x), with H ∈ ℝ^{T×d}.

If the model is well aligned, sentences with similar meanings have similar representations even when written in different languages.

For sentence-level tasks, we may pool hidden states into a single vector:

v = pool(H)

A classifier then predicts:

ŷ = softmax(W v + b)
The same classifier head can be applied across languages if the representations are aligned.
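The pooling and classification steps above can be sketched in a few lines of NumPy. This is a minimal illustration with random values standing in for encoder outputs; the shapes follow the text (T tokens, hidden size d, one shared head for all languages).

```python
# Mask-aware mean pooling plus a shared softmax classifier head.
# Hidden states are random stand-ins for real encoder outputs.
import numpy as np

def mean_pool(hidden, mask):
    """Average hidden states over non-padding positions only."""
    m = mask[:, None].astype(float)        # [T, 1]
    return (hidden * m).sum(0) / m.sum()   # [d]

def classify(vec, W, b):
    """Linear head + softmax; the same W, b serve every language."""
    logits = W @ vec + b
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 8))             # T=5 tokens, d=8
mask = np.array([1, 1, 1, 0, 0])             # last two positions are padding
W, b = rng.normal(size=(3, 8)), np.zeros(3)  # 3 output classes

probs = classify(mean_pool(hidden, mask), W, b)
print(probs)  # a probability distribution over the 3 classes
```

Because W and b never see the language ID, this head transfers across languages exactly to the extent that the pooled vectors are aligned.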
Zero-Shot Cross-Lingual Transfer
Zero-shot transfer means training on one language and evaluating on another language without labeled examples in the target language.
A common setup is:
- Fine-tune a multilingual model on English labeled data.
- Evaluate directly on Vietnamese, Arabic, Swahili, Hindi, or another target language.
For example:
Train: English sentiment dataset
Test: Vietnamese sentiment dataset

The model succeeds only if the multilingual encoder maps English and Vietnamese inputs into compatible task representations.
Zero-shot transfer is useful when target-language labels are unavailable. It is also fragile. Performance depends on language similarity, tokenizer coverage, pretraining data, domain match, and task difficulty.
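The mechanism can be illustrated with synthetic points in an already-aligned space. In the sketch below, the "embeddings" are toy 2-D vectors (a real system would get them from a multilingual encoder); a nearest-centroid classifier is fit on source-language vectors only and applied unchanged to slightly shifted target-language vectors.

```python
# Toy zero-shot transfer: fit on "English" vectors, evaluate on
# "Vietnamese" vectors from the same (imperfectly) aligned space.
import numpy as np

rng = np.random.default_rng(1)
centers = np.array([[2.0, 0.0], [-2.0, 0.0]])  # class 0 / class 1 centers

# Source-language training vectors around the two class centers.
X_src = np.vstack([c + rng.normal(scale=0.3, size=(20, 2)) for c in centers])
y_src = np.array([0] * 20 + [1] * 20)

# "Training": one centroid per class, from the source language only.
fit = np.stack([X_src[y_src == c].mean(0) for c in (0, 1)])

def predict(x):
    return int(np.argmin(((fit - x) ** 2).sum(1)))

# Target-language test vectors: same space, shifted by 0.2 to mimic
# imperfect cross-lingual alignment.
X_tgt = centers[[0, 1, 0, 1]] + 0.2 + rng.normal(scale=0.3, size=(4, 2))
preds = [predict(x) for x in X_tgt]
print(preds)
```

Transfer succeeds here because the alignment shift (0.2) is small relative to the class separation (4.0); as the shift grows, zero-shot accuracy collapses, which is the fragility the text describes.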
Few-Shot Cross-Lingual Transfer
Few-shot transfer uses a small number of labeled examples in the target language.
A typical procedure:
- Start with a multilingual pretrained model.
- Fine-tune on a high-resource language.
- Continue fine-tuning on a small target-language dataset.
Few-shot adaptation often gives large gains over zero-shot transfer. Even a few hundred examples can teach the model target-language label conventions, domain vocabulary, and annotation style.
Few-shot transfer is especially useful when labels are cheap to obtain in small quantities but expensive at scale.
Translate-Train and Translate-Test
Machine translation provides another route to cross-lingual transfer.
In translate-train, labeled examples from a source language are translated into the target language. The model is trained on the translated data.
English labeled data
→ translate to Vietnamese
→ train Vietnamese classifier

In translate-test, target-language inputs are translated into the source language. A source-language model is then applied.
Vietnamese test input
→ translate to English
→ English classifier

Translate-train can create large synthetic labeled datasets. Translate-test can be useful when a strong source-language model already exists.
Both methods depend on translation quality. Translation may distort named entities, sentiment, domain terms, legal meaning, or cultural references. Errors in translation become errors in supervision or inference.
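A translate-train pipeline can be sketched as follows. The word-level `translate` function and `TOY_LEXICON` below are hypothetical stand-ins for a real machine translation system; the point is only the data flow: labels are carried over unchanged while the text is translated.

```python
# Translate-train sketch: translate source-language text, keep labels.
# TOY_LEXICON and translate() are toy stand-ins for a real MT system.

TOY_LEXICON = {"the": "", "movie": "phim", "was": "rất",
               "excellent": "xuất sắc", "terrible": "tệ"}

def translate(sentence):
    """Word-by-word toy translation (illustration only); real MT
    handles reordering, morphology, and context."""
    out = [TOY_LEXICON.get(w, w) for w in sentence.lower().split()]
    return " ".join(w for w in out if w)

english_data = [("The movie was excellent", "positive"),
                ("The movie was terrible", "negative")]

# Each translated sentence inherits the original gold label.
translated_train = [(translate(text), label) for text, label in english_data]
print(translated_train)
```

The toy translator also demonstrates the risk the text raises: any word it mistranslates becomes a mislabeled training example, so supervision quality is bounded by translation quality.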
Cross-Lingual Named Entity Recognition
Named entity recognition is difficult in cross-lingual settings because entity boundaries and surface forms differ across languages.
Consider:
English: President Macron visited Hanoi.
Vietnamese: Tổng thống Macron đã thăm Hà Nội.

The entity Macron transfers easily. The location Hà Nội involves accents and tokenization differences.
Cross-lingual NER must handle:
| Issue | Example |
|---|---|
| Different scripts | English, Arabic, Chinese, Hindi |
| Transliteration | Moscow, Moskva, Москва |
| Inflection | Names change form by case or suffix |
| Token boundaries | Some languages lack whitespace segmentation |
| Local entities | Target language contains entities absent from source data |
A common method is multilingual encoder fine-tuning with BIO labels. Another method projects labels through word alignment from translated text.
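The label-projection method can be sketched concretely. The alignment below is assumed given (real systems obtain it from a word aligner such as fast_align, or from MT attention); the code only shows the projection step, including the repair needed when one source entity fans out to several target tokens.

```python
# Project BIO labels through a given word alignment (a sketch; the
# alignment itself is assumed to come from an external aligner).

def project_bio(src_labels, alignment, tgt_len):
    """Copy each source token's BIO tag to its aligned target tokens.

    alignment: list of (src_index, tgt_index) pairs.
    Unaligned target tokens stay 'O'. A 'B-X' immediately following a
    tag of the same type X is repaired to 'I-X', so a fanned-out entity
    remains a single span.
    """
    tgt = ["O"] * tgt_len
    for s, t in alignment:
        tgt[t] = src_labels[s]
    for i in range(1, tgt_len):
        if (tgt[i].startswith("B-") and tgt[i - 1] != "O"
                and tgt[i - 1][2:] == tgt[i][2:]):
            tgt[i] = "I-" + tgt[i][2:]
    return tgt

# English: President Macron visited Hanoi
src = ["O", "B-PER", "O", "B-LOC"]
# Vietnamese: Tổng thống Macron đã thăm Hà Nội  (7 tokens)
align = [(0, 0), (0, 1), (1, 2), (2, 3), (2, 4), (3, 5), (3, 6)]
print(project_bio(src, align, 7))
```

Here "Hanoi" aligns to the two tokens "Hà" and "Nội", so the projected labels become `B-LOC I-LOC`, keeping a single location span on the Vietnamese side.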
Cross-Lingual Retrieval
Cross-lingual retrieval finds documents in one language using queries in another language.
Example:
Query: symptoms of dengue fever
Documents: Vietnamese medical articles

The query and documents must be represented in a shared embedding space.
A dense retriever computes embeddings

q = E(query),  dᵢ = E(docᵢ)

and ranks documents by similarity, for example

score(q, dᵢ) = cos(q, dᵢ)
Cross-lingual retrieval is useful for multilingual search, international legal research, patent search, news monitoring, and RAG over multilingual corpora.
Hybrid retrieval can help. Dense retrieval handles semantic transfer. Sparse retrieval handles names, numbers, citations, and exact terms.
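Score fusion for hybrid retrieval can be sketched as a weighted sum. The vectors, documents, and the simple term-overlap function below are toy values (a real system would use a multilingual embedding model and BM25); the sketch shows why the sparse component rescues queries whose exact terms matter.

```python
# Hybrid retrieval sketch: dense cosine similarity fused with a toy
# sparse exact-match score. All documents and vectors are synthetic.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def exact_match_score(query_terms, doc_terms):
    """Fraction of query terms appearing verbatim in the document —
    a crude stand-in for BM25, useful for names, numbers, citations."""
    q = set(query_terms)
    return len(q & set(doc_terms)) / len(q)

def hybrid_rank(q_vec, q_terms, docs, alpha=0.5):
    """Rank documents by alpha * dense + (1 - alpha) * sparse."""
    scored = [(alpha * cosine(q_vec, d_vec)
               + (1 - alpha) * exact_match_score(q_terms, d_terms), name)
              for name, d_vec, d_terms in docs]
    return [name for _, name in sorted(scored, reverse=True)]

docs = [("doc_a", np.array([1.0, 0.0]), ["sốt", "xuất", "huyết"]),  # semantic only
        ("doc_b", np.array([0.7, 0.7]), ["dengue", "fever"])]       # exact terms
order = hybrid_rank(np.array([1.0, 0.0]), ["dengue", "fever"], docs)
print(order)
```

With alpha = 0.5, `doc_b` wins despite a weaker dense score because it contains the exact query terms; tuning alpha trades semantic transfer against exact matching.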
Cross-Lingual Question Answering
Cross-lingual question answering may involve several language configurations.
| Question language | Context language | Answer language |
|---|---|---|
| English | English | English |
| English | Vietnamese | English |
| Vietnamese | English | Vietnamese |
| Vietnamese | Vietnamese | Vietnamese |
| English | many languages | English |
The system may retrieve evidence in multiple languages, translate passages, or generate an answer in the user’s language.
A retrieval-augmented cross-lingual QA pipeline may look like:
User question
→ multilingual retrieval
→ passage translation or direct multilingual reading
→ answer generation
→ citation preservation

The system must preserve source grounding. If evidence is in one language and the answer is in another, citations and translated claims must remain aligned.
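The pipeline steps above can be sketched as a skeleton. Every component here is a hypothetical stand-in (a two-document corpus, word-overlap retrieval, answer-by-extraction); the point is the flow and the citation bookkeeping, not the components.

```python
# Skeleton of a retrieval-augmented cross-lingual QA pipeline.
# The corpus and retrieval are toys; real systems use multilingual
# dense retrieval and a generator, but the grounding bookkeeping is
# the same: evidence and citations travel together.

CORPUS = {"vi-001": "Sốt xuất huyết gây sốt cao và đau đầu.",
          "en-001": "Dengue fever causes high fever and headache."}

def _words(text):
    return {w.strip("?.,") for w in text.lower().split()}

def retrieve(question, k=1):
    """Toy retrieval: rank documents by shared-word overlap."""
    q = _words(question)
    scored = sorted(CORPUS.items(),
                    key=lambda kv: len(q & _words(kv[1])), reverse=True)
    return scored[:k]

def answer(question):
    # Keep (doc_id, passage) pairs together so citations stay aligned
    # with the evidence actually used, even across languages.
    evidence = [{"doc_id": i, "passage": t} for i, t in retrieve(question)]
    return {"answer": evidence[0]["passage"],
            "citations": [e["doc_id"] for e in evidence]}

result = answer("What are the symptoms of dengue fever?")
print(result["citations"])
```

Returning document IDs alongside the answer, rather than regenerating them afterward, is what keeps translated claims traceable to their sources.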
Language Imbalance
Multilingual models are affected by language imbalance. High-resource languages dominate pretraining data. Low-resource languages receive fewer updates and often have worse tokenization.
This leads to several effects:
| Effect | Consequence |
|---|---|
| Poor tokenization | More tokens per sentence |
| Weak representations | Lower downstream accuracy |
| Script bias | Some scripts receive weaker coverage |
| Domain gaps | Formal text works better than colloquial text |
| Cultural gaps | Idioms and local references are misunderstood |
Language imbalance is not just a data quantity problem. It also affects vocabulary construction, benchmark design, and evaluation fairness.
Code-Switching
Many users mix languages in the same sentence or conversation.
Example:
Deploy model này lên GPU server như thế nào?

This sentence ("How do I deploy this model to the GPU server?") mixes English technical terms with Vietnamese grammar.
Code-switching is common in multilingual communities, technical work, social media, and customer support. Models must handle mixed vocabulary, mixed syntax, and mixed scripts.
Training data for code-switching is often limited. Useful methods include multilingual pretraining, synthetic code-switched data, and evaluation sets built from real user text.
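One simple synthetic-data recipe is lexicon-based word swapping. The lexicon and the uniform swap rule below are toy assumptions; real pipelines constrain switches with part-of-speech tags and observed switch-point statistics from user text.

```python
# Sketch of synthetic code-switched data generation: replace selected
# words via a small bilingual lexicon. Lexicon and swap rule are toys.
import random

LEXICON = {"deploy": "triển khai", "model": "mô hình", "server": "máy chủ"}

def code_switch(sentence, lexicon, p=0.5, seed=0):
    """Replace each translatable word with probability p."""
    rng = random.Random(seed)  # seeded for reproducible augmentation
    out = []
    for word in sentence.split():
        if word.lower() in lexicon and rng.random() < p:
            out.append(lexicon[word.lower()])
        else:
            out.append(word)
    return " ".join(out)

print(code_switch("deploy the model on the server", LEXICON, p=1.0))
```

Varying `p` produces a spectrum from monolingual to heavily mixed text, which can be used to augment training data; evaluation should still rely on real user-written code-switched text.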
Evaluation
Cross-lingual systems should be evaluated separately for each language. A single aggregate score can hide poor performance on low-resource languages.
Important evaluation dimensions include:
| Dimension | Question |
|---|---|
| Per-language accuracy | Which languages work well or poorly? |
| Zero-shot transfer | Does training on source language generalize? |
| Few-shot adaptation | How much target data is needed? |
| Domain robustness | Does it work outside benchmark text? |
| Token efficiency | How many tokens are needed per input? |
| Script coverage | Does performance differ by writing system? |
| Fairness | Are some groups or dialects underserved? |
For cross-lingual retrieval, evaluate recall@k and nDCG by language pair. For cross-lingual QA, evaluate both answer correctness and evidence grounding.
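Both retrieval metrics are short to implement. The sketch below uses binary relevance (a document is either gold or not) and reports each metric per language pair, as recommended above; the rankings and gold sets are toy examples.

```python
# Minimal recall@k and nDCG@k with binary relevance, reported per
# language pair. Rankings and gold sets below are toy examples.
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of gold documents found in the top k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """DCG of the ranking divided by the DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(ranked[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(len(relevant), k)))
    return dcg / ideal

runs = {("en", "vi"): (["d2", "d7", "d1"], {"d1", "d2"}),
        ("en", "ar"): (["d9", "d3", "d4"], {"d4"})}

for pair, (ranked, relevant) in runs.items():
    print(pair,
          round(recall_at_k(ranked, relevant, 3), 2),
          round(ndcg_at_k(ranked, relevant, 3), 2))
```

Note that both en-vi and en-ar reach recall@3 = 1.0, yet their nDCG differs because en-ar ranks its only gold document last; keeping the metrics separate per pair is what exposes this kind of gap.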
Practical Design Choices
A practical cross-lingual system should choose among several strategies.
| Setting | Recommended approach |
|---|---|
| No target labels | Zero-shot multilingual fine-tuning |
| Small target labels | Few-shot target adaptation |
| Good machine translation available | Translate-train or translate-test |
| Multilingual search | Dense multilingual retrieval plus sparse exact matching |
| High-stakes domain | Human-reviewed translation and target-language evaluation |
| Code-switching users | Include real mixed-language examples |
| Low-resource language | Collect targeted data and inspect tokenizer behavior |
There is no universal best method. The right choice depends on data availability, translation quality, latency, risk, and language coverage.
Common Failure Modes
Cross-lingual systems fail in predictable ways.
| Failure mode | Description |
|---|---|
| Literal translation error | Idioms or sentiment are translated incorrectly |
| Entity corruption | Names are mistranslated or normalized wrongly |
| Script mismatch | Model fails on non-Latin text |
| Token explosion | Input becomes too long after tokenization |
| Cultural mismatch | Model misses local meaning or context |
| Domain mismatch | Pretraining lacks target-domain language |
| False semantic match | Dense retriever returns related but wrong passages |
| Evaluation masking | Average score hides weak languages |
| Code-switching failure | Mixed-language input is parsed incorrectly |
Error analysis should be done by language and by task type. A model that works for English and French may fail badly for Vietnamese legal documents or Arabic social media.
Summary
Cross-lingual transfer allows knowledge learned in one language to support tasks in another language. It relies on multilingual tokenization, shared representation spaces, translation, and task-level alignment.
Zero-shot transfer is useful when no target-language labels exist. Few-shot transfer is usually stronger when small labeled datasets are available. Translate-train and translate-test are practical alternatives when machine translation quality is good.
Reliable cross-lingual systems require per-language evaluation, attention to tokenization, domain-specific testing, and explicit handling of low-resource languages, code-switching, and entity preservation.