Just Use XML: Revisiting Joint Translation and Label Projection
Summary
LabelPigeon is a new framework that jointly performs machine translation and label projection using XML tags, challenging the prior assumption that combining these tasks degrades translation quality. The framework fine-tunes the NLLB-200 3.3B model on a modified Salesforce Localization XML MT dataset and demonstrates improved translation quality across 203 languages. In direct label projection evaluations on the XQuAD and MLQA datasets, LabelPigeon outperforms baselines such as Awesome-align, Gemma 3 27B IT, and EasyProject. In cross-lingual transfer, it shows consistent gains across 27 languages and three downstream NLP tasks, Named Entity Recognition (NER), Question Answering (QA), and Coreference Resolution (CR), with improvements of up to +39.9 F1 on NER, and it incurs no additional computational overhead at inference.
Key takeaway
Research scientists developing multilingual NLP applications should reconsider multi-stage pipelines for label projection. Adopting LabelPigeon's XML-tag-based joint translation and label projection can yield better cross-lingual transfer performance, particularly for span-level tasks like NER, while also improving translation quality and reducing engineering complexity. To maximize these benefits, fine-tune on high-quality, XML-tagged parallel data.
Key insights
Joint translation and label projection with XML tags can improve both translation quality and label transfer.
Principles
- XML tags enable direct correspondence for span labels.
- Fine-tuning on XML-tagged data improves both translation quality and label projection.
- Less idiomatic translations do not imply substantial quality loss.
Method
Fine-tune a base translation model (NLLB-200 3.3B) on XML-tagged parallel corpora, then use an off-the-shelf XML parser to extract tags after translation.
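The extraction step can be sketched with Python's standard-library XML parser. This is a minimal illustration, not the paper's implementation: the function name `extract_spans`, the single-letter tag names, and the German example sentence are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def extract_spans(tagged_text):
    """Return (plain_text, spans) for a translated string with inline XML tags.

    spans is a list of (tag, start, end) character offsets into plain_text.
    Hypothetical helper for illustration; tag names are assumed, not from the paper.
    """
    # Wrap the fragment in a dummy root so it parses as one well-formed document.
    root = ET.fromstring(f"<root>{tagged_text}</root>")
    pieces = []
    spans = []
    pos = 0

    def walk(node):
        nonlocal pos
        start = pos
        if node.text:
            pieces.append(node.text)
            pos += len(node.text)
        for child in node:
            walk(child)
            # Text that follows a child's closing tag belongs to the parent.
            if child.tail:
                pieces.append(child.tail)
                pos += len(child.tail)
        if node is not root:
            spans.append((node.tag, start, pos))

    walk(root)
    return "".join(pieces), spans


text, spans = extract_spans("<a>Angela Merkel</a> besuchte <b>Paris</b>.")
print(text)   # Angela Merkel besuchte Paris.
print(spans)  # [('a', 0, 13), ('b', 23, 28)]
```

If the model emits malformed markup, `ET.fromstring` raises `xml.etree.ElementTree.ParseError`, so a real pipeline would likely need a fallback for such outputs.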
In practice
- Use XML tags for span-level annotations.
- Fine-tune on high-quality, XML-tagged parallel data.
- Prioritize three high-resource languages for fine-tuning.
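As a rough sketch of the first recommendation, span-level annotations can be turned into inline XML tags before the sentence is sent to the translation model. The helper name `tag_spans`, the `(start, end, tag)` span format, and the tag names are assumptions for illustration, and the sketch assumes spans do not overlap.

```python
from xml.sax.saxutils import escape


def tag_spans(text, spans):
    """Wrap each (start, end, tag) character span of text in <tag>...</tag>.

    Hypothetical helper: assumes non-overlapping spans and escapes
    XML-special characters so the result stays well-formed.
    """
    out = []
    cursor = 0
    for start, end, tag in sorted(spans):
        out.append(escape(text[cursor:start]))           # untagged text before the span
        out.append(f"<{tag}>{escape(text[start:end])}</{tag}>")
        cursor = end
    out.append(escape(text[cursor:]))                    # remaining untagged text
    return "".join(out)


source = "Angela Merkel visited Paris."
print(tag_spans(source, [(0, 13, "a"), (22, 27, "b")]))
# <a>Angela Merkel</a> visited <b>Paris</b>.
```

Escaping characters such as `&` and `<` in the source text matters here, because the translated output is later read back with an XML parser.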
Topics
- Label Projection
- Joint Translation
- XML Tagging
- Cross-lingual Transfer
- Named Entity Recognition
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.