Just Use XML: Revisiting Joint Translation and Label Projection

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, extended

Summary

LabelPigeon is a new framework that jointly performs machine translation and label projection using XML tags, challenging the prior assumption that combining these tasks degrades translation quality. The framework fine-tunes the NLLB-200 3.3B model on a modified Salesforce Localization XML MT dataset and demonstrates improved translation quality across 203 languages together with superior label projection accuracy. In direct label projection evaluations on the XQuAD and MLQA datasets, LabelPigeon outperforms baselines such as Awesome-align, Gemma 3 27B IT, and EasyProject. In downstream tasks such as Named Entity Recognition (NER), it achieves improvements of up to +39.9 F1. It also shows consistent gains in cross-lingual transfer across 27 languages and three downstream NLP tasks, including Question Answering (QA) and Coreference Resolution (CR), without incurring additional computational overhead at inference.

Key takeaway

Research scientists developing multilingual NLP applications should reconsider multi-stage pipelines for label projection. Adopting LabelPigeon's XML-tag-based joint translation and label projection can yield better cross-lingual transfer, particularly for span-level tasks like NER, while also improving translation quality and reducing engineering complexity. To maximize these benefits, focus on fine-tuning with high-quality, XML-tagged parallel data.

Key insights

Joint translation and label projection with XML tags can improve both translation quality and label transfer.

Method

Fine-tune a base translation model (NLLB-200 3.3B) on XML-tagged parallel corpora so that it learns to carry the tags through translation, then use an off-the-shelf XML parser to recover the labeled spans from the translated output.
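
The tagging and parsing steps around the translation model can be sketched as follows. This is a minimal illustration using only Python's standard library, not the authors' implementation. The helper names `tag_spans` and `extract_spans`, the tag names `a` and `b`, and the German sentence standing in for model output are all hypothetical, and the translation call itself is omitted.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def tag_spans(text, spans):
    """Wrap labeled character spans in XML tags before translation.

    spans: non-overlapping (start, end, tag) tuples, end exclusive.
    Tag names are placeholders chosen for this sketch.
    """
    out, cursor = [], 0
    for start, end, tag in sorted(spans):
        out.append(escape(text[cursor:start]))
        out.append(f"<{tag}>{escape(text[start:end])}</{tag}>")
        cursor = end
    out.append(escape(text[cursor:]))
    return "".join(out)


def extract_spans(tagged):
    """Parse tagged output; return (plain_text, [(start, end, tag), ...]).

    Raises xml.etree.ElementTree.ParseError if the tags are malformed.
    """
    # Wrap in a synthetic root so the fragment is a well-formed document.
    root = ET.fromstring(f"<root>{tagged}</root>")
    parts, spans, pos = [], [], 0

    def walk(node):
        nonlocal pos
        start = pos
        if node.text:
            parts.append(node.text)
            pos += len(node.text)
        for child in node:
            walk(child)
            if child.tail:  # text that follows the child's closing tag
                parts.append(child.tail)
                pos += len(child.tail)
        if node is not root:
            spans.append((start, pos, node.tag))

    walk(root)
    return "".join(parts), sorted(spans)


if __name__ == "__main__":
    source = "Marie Curie was born in Warsaw."
    tagged_source = tag_spans(source, [(0, 11, "a"), (24, 30, "b")])
    print(tagged_source)

    # Hypothetical stand-in for the translation model's tagged output.
    translated = "<a>Marie Curie</a> wurde in <b>Warschau</b> geboren."
    print(extract_spans(translated))
```

Because the labels travel inside the translated string, projection reduces to a single parse with no separate alignment step. A pipeline built this way would likely want to catch `ET.ParseError` and fall back or retry when the output is not well-formed.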

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.