Cross-Lingual Transfer

Cross-lingual transfer is the ability of a model trained or adapted in one language to work in another language. It is important because labeled data is unevenly distributed across languages. English has many datasets and benchmarks. Many other languages have limited annotation, limited digital text, or domain-specific data that is expensive to label.

The goal is to share knowledge across languages. A model may learn sentiment classification, named entity recognition, question answering, or retrieval from high-resource languages, then apply that knowledge to lower-resource languages.

A cross-lingual model learns a function

f_\theta(x, \ell) \rightarrow y,

where x is the input text, \ell is the language, and y is the task output. In many systems, the language ID is implicit rather than explicitly provided.

Why Cross-Lingual Transfer Works

Cross-lingual transfer works because many languages share structure at several levels.

At the semantic level, different languages can express the same meaning:

English: The movie was excellent.
Vietnamese: Bộ phim rất xuất sắc.
French: Le film était excellent.

At the task level, the labels may be shared. The sentiment label is positive in all three cases.

At the representation level, multilingual models learn embeddings where similar meanings across languages are close together. This allows a classifier trained on one language to generalize to another.

The model does not need identical words. It needs representations that preserve meaning and task-relevant distinctions.

Multilingual Tokenization

Cross-lingual models usually rely on shared tokenization. A multilingual tokenizer contains subword units from many languages.

For example, a sentence is converted into token IDs:

input_ids      # [B, T]
attention_mask # [B, T]

The vocabulary may contain Latin characters, accents, CJK characters, Arabic script, Cyrillic script, punctuation, numbers, and frequent subwords from many languages.

Subword tokenization helps with rare words. A word unseen during training can be broken into smaller known pieces.

However, tokenization quality varies by language. Some languages require more tokens per word or sentence. This increases sequence length and computation. It can also reduce model quality because information is fragmented across many subwords.
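
The sketch below illustrates this with a Hugging Face multilingual tokenizer; xlm-roberta-base is an illustrative choice, not a requirement. Comparing token counts across languages is a quick way to see the per-language differences described above.

from transformers import AutoTokenizer

# Assumption: xlm-roberta-base stands in for any shared-vocabulary multilingual tokenizer.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

sentences = {
    "en": "The movie was excellent.",
    "vi": "Bộ phim rất xuất sắc.",
    "fr": "Le film était excellent.",
}

for lang, text in sentences.items():
    encoded = tokenizer(text, return_tensors="pt")
    print(lang, encoded["input_ids"].shape)      # [B, T]: batch of 1, T subword tokens
    print(lang, tokenizer.tokenize(text))        # subword pieces; the split can differ sharply by language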

Shared Embedding Spaces

A multilingual model maps text from different languages into a shared representation space.

For an input sequence,

x = (x_1, x_2, \ldots, x_T),

the encoder produces hidden states

H \in \mathbb{R}^{T \times D}.

If the model is well aligned, sentences with similar meanings have similar representations even when written in different languages.

For sentence-level tasks, we may pool hidden states into a single vector:

h = \operatorname{pool}(H).

A classifier then predicts:

z = Wh + b.

The same classifier head can be applied across languages if the representations are aligned.
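
A minimal sketch of this pipeline is shown below, assuming a Hugging Face encoder (xlm-roberta-base as an example) and mean pooling; the pooling choice and the untrained linear head are illustrative assumptions, not a fixed recipe.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")   # assumed example encoder
encoder = AutoModel.from_pretrained("xlm-roberta-base")

def encode(text):
    batch = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        H = encoder(**batch).last_hidden_state        # H: [1, T, D]
    mask = batch["attention_mask"].unsqueeze(-1)      # ignore padding positions when pooling
    return (H * mask).sum(1) / mask.sum(1)            # h = pool(H), mean pooling here

h_en = encode("The movie was excellent.")
h_vi = encode("Bộ phim rất xuất sắc.")
print(torch.cosine_similarity(h_en, h_vi))            # high if the two languages are well aligned

# The same linear head can score either vector: z = W h + b.
classifier = torch.nn.Linear(encoder.config.hidden_size, 2)
print(classifier(h_en).shape, classifier(h_vi).shape)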

Zero-Shot Cross-Lingual Transfer

Zero-shot transfer means training on one language and evaluating on another language without labeled examples in the target language.

A common setup is:

  1. Fine-tune a multilingual model on English labeled data.
  2. Evaluate directly on Vietnamese, Arabic, Swahili, Hindi, or another target language.

For example:

Train: English sentiment dataset
Test: Vietnamese sentiment dataset

The model succeeds only if the multilingual encoder maps English and Vietnamese inputs into compatible task representations.

Zero-shot transfer is useful when target-language labels are unavailable. It is also fragile. Performance depends on language similarity, tokenizer coverage, pretraining data, domain match, and task difficulty.
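
Below is a minimal sketch of this setup, assuming a Hugging Face multilingual encoder and a toy two-example dataset in place of real corpora; the training loop is deliberately stripped down and not a production recipe.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumption: tiny toy datasets stand in for real English training data and a Vietnamese test set.
train_en = [("The movie was excellent.", 1), ("The movie was terrible.", 0)]
test_vi  = [("Bộ phim rất xuất sắc.", 1), ("Bộ phim rất tệ.", 0)]

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Step 1: fine-tune on English labeled data only.
model.train()
for epoch in range(3):
    for text, label in train_en:
        batch = tokenizer(text, return_tensors="pt")
        loss = model(**batch, labels=torch.tensor([label])).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Step 2: evaluate directly on Vietnamese, with no Vietnamese labels seen during training.
model.eval()
correct = 0
with torch.no_grad():
    for text, label in test_vi:
        logits = model(**tokenizer(text, return_tensors="pt")).logits
        correct += int(logits.argmax(-1).item() == label)
print("vi accuracy:", correct / len(test_vi))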

Few-Shot Cross-Lingual Transfer

Few-shot transfer uses a small number of labeled examples in the target language.

A typical procedure:

  1. Start with a multilingual pretrained model.
  2. Fine-tune on a high-resource language.
  3. Continue fine-tuning on a small target-language dataset.

Few-shot adaptation often gives large gains over zero-shot transfer. Even a few hundred examples can teach the model target-language label conventions, domain vocabulary, and annotation style.

Few-shot transfer is especially useful when labels are cheap to obtain in small quantities but expensive at scale.
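
A minimal sketch of step 3 is shown below. The checkpoint path sentiment-en-finetuned is a placeholder for a model already fine-tuned on the high-resource language, the two Vietnamese examples stand in for a small labeled set, and the learning rate and epoch count are illustrative.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
# Placeholder path: assumed to hold a checkpoint already fine-tuned on English labels.
model = AutoModelForSequenceClassification.from_pretrained("sentiment-en-finetuned", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)   # smaller LR for the short target-language pass

few_shot_vi = [("Dịch vụ rất tốt.", 1), ("Sản phẩm quá tệ.", 0)]   # stands in for a few hundred examples

model.train()
for epoch in range(2):
    for text, label in few_shot_vi:
        batch = tokenizer(text, return_tensors="pt")
        loss = model(**batch, labels=torch.tensor([label])).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()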

Translate-Train and Translate-Test

Machine translation provides another route to cross-lingual transfer.

In translate-train, labeled examples from a source language are translated into the target language. The model is trained on the translated data.

English labeled data
→ translate to Vietnamese
→ train Vietnamese classifier

In translate-test, target-language inputs are translated into the source language. A source-language model is then applied.

Vietnamese test input
→ translate to English
→ English classifier

Translate-train can create large synthetic labeled datasets. Translate-test can be useful when a strong source-language model already exists.

Both methods depend on translation quality. Translation may distort named entities, sentiment, domain terms, legal meaning, or cultural references. Errors in translation become errors in supervision or inference.
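
One possible translate-train sketch, assuming a Hugging Face translation pipeline; the Helsinki-NLP/opus-mt-en-vi model name is an illustrative assumption, and carrying labels through translation is exactly the step where errors can enter the supervision.

from transformers import pipeline

# Assumption: an off-the-shelf English-to-Vietnamese MT model; any MT system could be substituted.
translate_en_vi = pipeline("translation", model="Helsinki-NLP/opus-mt-en-vi")

english_labeled = [("The movie was excellent.", 1), ("The movie was terrible.", 0)]

# Translate-train: translate the inputs, keep the labels, then train a target-language classifier.
vietnamese_labeled = [
    (translate_en_vi(text)[0]["translation_text"], label)
    for text, label in english_labeled
]
print(vietnamese_labeled)

# Translate-test would instead translate Vietnamese test inputs into English
# and score them with an existing English classifier.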

Cross-Lingual Named Entity Recognition

Named entity recognition is difficult in cross-lingual settings because entity boundaries and surface forms differ across languages.

Consider:

English: President Macron visited Hanoi.
Vietnamese: Tổng thống Macron đã thăm Hà Nội.

The entity Macron transfers easily. The location Hà Nội introduces diacritics and different subword boundaries.

Cross-lingual NER must handle:

| Issue | Example |
| --- | --- |
| Different scripts | English, Arabic, Chinese, Hindi |
| Transliteration | Moscow, Moskva, Москва |
| Inflection | Names change form by case or suffix |
| Token boundaries | Some languages lack whitespace segmentation |
| Local entities | Target language contains entities absent from source data |

A common method is to fine-tune a multilingual encoder with BIO labels. Another is to project labels onto translated text through word alignment.
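
A sketch of the first method's preprocessing step is shown below, under the assumption of a fast multilingual tokenizer: word-level BIO labels are expanded to subword labels with word_ids(), which is exactly where diacritics and token boundaries tend to cause trouble.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")   # assumed example; must be a "fast" tokenizer

words  = ["Tổng", "thống", "Macron", "đã", "thăm", "Hà", "Nội", "."]
labels = ["O", "O", "B-PER", "O", "O", "B-LOC", "I-LOC", "O"]    # word-level BIO tags

encoded = tokenizer(words, is_split_into_words=True)

# Every subword of a word gets that word's tag (a common simplification);
# special tokens are marked IGNORE (in training they would get label id -100).
subword_labels = [
    labels[word_id] if word_id is not None else "IGNORE"
    for word_id in encoded.word_ids(batch_index=0)
]
print(list(zip(tokenizer.convert_ids_to_tokens(encoded["input_ids"]), subword_labels)))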

Cross-Lingual Retrieval

Cross-lingual retrieval finds documents in one language using queries in another language.

Example:

Query: symptoms of dengue fever
Documents: Vietnamese medical articles

The query and documents must be represented in a shared embedding space.

A dense retriever computes:

h_q = f_\theta(q_{\text{English}}), \qquad h_d = g_\theta(d_{\text{Vietnamese}}),

and ranks documents by

s(q,d) = h_q^\top h_d.

Cross-lingual retrieval is useful for multilingual search, international legal research, patent search, news monitoring, and RAG over multilingual corpora.

Hybrid retrieval can help. Dense retrieval handles semantic transfer. Sparse retrieval handles names, numbers, citations, and exact terms.
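
A minimal dense-retrieval sketch, assuming a multilingual sentence-embedding model from the sentence-transformers library (the model name is an illustrative assumption) and dot-product scoring as in s(q,d) = h_q^\top h_d.

import numpy as np
from sentence_transformers import SentenceTransformer

# Assumption: one shared multilingual encoder for queries and documents (f_theta = g_theta);
# asymmetric query/document encoders are also common.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

query = "symptoms of dengue fever"
docs = [
    "Triệu chứng của sốt xuất huyết gồm sốt cao, đau đầu và phát ban.",   # dengue symptoms (relevant)
    "Hướng dẫn bảo trì xe máy định kỳ.",                                   # motorbike maintenance (irrelevant)
]

h_q = model.encode([query])            # [1, D]
h_d = model.encode(docs)               # [N, D]
scores = h_q @ h_d.T                   # s(q, d) = h_q^T h_d for every document
ranking = np.argsort(-scores[0])       # best-scoring document first
print(ranking, scores[0])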

Cross-Lingual Question Answering

Cross-lingual question answering may involve several language configurations.

| Question language | Context language | Answer language |
| --- | --- | --- |
| English | English | English |
| English | Vietnamese | English |
| Vietnamese | English | Vietnamese |
| Vietnamese | Vietnamese | Vietnamese |
| English | many languages | English |

The system may retrieve evidence in multiple languages, translate passages, or generate an answer in the user’s language.

A retrieval-augmented cross-lingual QA pipeline may look like:

User question
→ multilingual retrieval
→ passage translation or direct multilingual reading
→ answer generation
→ citation preservation

The system must preserve source grounding. If evidence is in one language and the answer is in another, citations and translated claims must remain aligned.
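
The sketch below mirrors that pipeline as plain functions. retrieve_passages, translate, and generate_answer are hypothetical placeholders for whatever retriever, MT system, and generator a deployment actually uses; the point is only how grounding metadata is threaded through each step.

from dataclasses import dataclass

@dataclass
class Passage:
    text: str
    language: str
    source_id: str        # citation target that must survive translation

def answer(question: str, user_language: str,
           retrieve_passages, translate, generate_answer) -> dict:
    """Hypothetical cross-lingual RAG pipeline: retrieve, optionally translate, generate, cite."""
    passages = retrieve_passages(question)                     # multilingual retrieval
    readable = [
        p if p.language == user_language
        else Passage(translate(p.text, p.language, user_language), user_language, p.source_id)
        for p in passages                                      # translation keeps source_id attached
    ]
    draft = generate_answer(question, [p.text for p in readable], user_language)
    return {"answer": draft, "citations": [p.source_id for p in passages]}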

Language Imbalance

Multilingual models are affected by language imbalance. High-resource languages dominate pretraining data. Low-resource languages receive fewer updates and often have worse tokenization.

This leads to several effects:

| Effect | Consequence |
| --- | --- |
| Poor tokenization | More tokens per sentence |
| Weak representations | Lower downstream accuracy |
| Script bias | Some scripts receive weaker coverage |
| Domain gaps | Formal text works better than colloquial text |
| Cultural gaps | Idioms and local references are misunderstood |

Language imbalance is not just a data quantity problem. It also affects vocabulary construction, benchmark design, and evaluation fairness.

Code-Switching

Many users mix languages in the same sentence or conversation.

Example:

Deploy model này lên GPU server như thế nào?

This sentence asks how to deploy a model to a GPU server; it mixes English technical terms with Vietnamese grammar.

Code-switching is common in multilingual communities, technical work, social media, and customer support. Models must handle mixed vocabulary, mixed syntax, and mixed scripts.

Training data for code-switching is often limited. Useful methods include multilingual pretraining, synthetic code-switched data, and evaluation sets built from real user text.
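
One crude but common way to synthesize code-switched data is lexicon substitution. The sketch below assumes a tiny hand-written Vietnamese-to-English term dictionary; a real pipeline would use larger lexicons or word-alignment models.

import random

# Assumption: a toy bilingual dictionary of technical terms, for illustration only.
vi_to_en = {"mô hình": "model", "máy chủ": "server", "triển khai": "deploy"}

def synth_code_switch(sentence_vi: str, p: float = 0.5) -> str:
    """Randomly replace dictionary terms with their English equivalents."""
    out = sentence_vi
    for vi, en in vi_to_en.items():
        if vi in out and random.random() < p:
            out = out.replace(vi, en)
    return out

print(synth_code_switch("Làm thế nào để triển khai mô hình lên máy chủ GPU?"))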

Evaluation

Cross-lingual systems should be evaluated separately for each language. A single aggregate score can hide poor performance on low-resource languages.

Important evaluation dimensions include:

| Dimension | Question |
| --- | --- |
| Per-language accuracy | Which languages work well or poorly? |
| Zero-shot transfer | Does training on the source language generalize? |
| Few-shot adaptation | How much target data is needed? |
| Domain robustness | Does it work outside benchmark text? |
| Token efficiency | How many tokens are needed per input? |
| Script coverage | Does performance differ by writing system? |
| Fairness | Are some groups or dialects underserved? |

For cross-lingual retrieval, evaluate recall@k and nDCG by language pair. For cross-lingual QA, evaluate both answer correctness and evidence grounding.
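
A small sketch of per-language reporting, with recall@k computed separately for each language pair; the data layout used here is an assumption made for illustration.

from collections import defaultdict

def recall_at_k(results, k=10):
    """results: list of dicts like
    {"lang_pair": "en-vi", "relevant_ids": {...}, "retrieved_ids": [...]} (assumed layout)."""
    per_pair = defaultdict(list)
    for r in results:
        hits = len(r["relevant_ids"] & set(r["retrieved_ids"][:k]))
        per_pair[r["lang_pair"]].append(hits / max(len(r["relevant_ids"]), 1))
    # Report each language pair separately; an aggregate average can hide weak pairs.
    return {pair: sum(v) / len(v) for pair, v in per_pair.items()}

example = [
    {"lang_pair": "en-vi", "relevant_ids": {"d1"}, "retrieved_ids": ["d1", "d7"]},
    {"lang_pair": "en-sw", "relevant_ids": {"d9"}, "retrieved_ids": ["d3", "d4"]},
]
print(recall_at_k(example, k=2))   # {"en-vi": 1.0, "en-sw": 0.0}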

Practical Design Choices

A practical cross-lingual system should choose among several strategies.

| Setting | Recommended approach |
| --- | --- |
| No target labels | Zero-shot multilingual fine-tuning |
| Small target labels | Few-shot target adaptation |
| Good machine translation available | Translate-train or translate-test |
| Multilingual search | Dense multilingual retrieval plus sparse exact matching |
| High-stakes domain | Human-reviewed translation and target-language evaluation |
| Code-switching users | Include real mixed-language examples |
| Low-resource language | Collect targeted data and inspect tokenizer behavior |

There is no universal best method. The right choice depends on data availability, translation quality, latency, risk, and language coverage.

Common Failure Modes

Cross-lingual systems fail in predictable ways.

| Failure mode | Description |
| --- | --- |
| Literal translation error | Idioms or sentiment are translated incorrectly |
| Entity corruption | Names are mistranslated or normalized wrongly |
| Script mismatch | Model fails on non-Latin text |
| Token explosion | Input becomes too long after tokenization |
| Cultural mismatch | Model misses local meaning or context |
| Domain mismatch | Pretraining lacks target-domain language |
| False semantic match | Dense retriever returns related but wrong passages |
| Evaluation masking | Average score hides weak languages |
| Code-switching failure | Mixed-language input is parsed incorrectly |

Error analysis should be done by language and by task type. A model that works for English and French may fail badly for Vietnamese legal documents or Arabic social media.

Summary

Cross-lingual transfer allows knowledge learned in one language to support tasks in another language. It relies on multilingual tokenization, shared representation spaces, translation, and task-level alignment.

Zero-shot transfer is useful when no target-language labels exist. Few-shot transfer is usually stronger when small labeled datasets are available. Translate-train and translate-test are practical alternatives when machine translation quality is good.

Reliable cross-lingual systems require per-language evaluation, attention to tokenization, domain-specific testing, and explicit handling of low-resource languages, code-switching, and entity preservation.