Multilingual Coreference Resolution via Cycle-Consistent Machine Translation
Summary
A coreference resolution (CR) pipeline improves performance in low-resource languages by using machine translation (MT) to generate or expand training data. The system extends the Maverick model with a multilingual encoder (mmBERT-base) and uses Claude Sonnet 4.6 to translate annotated English samples into a target language and then back into English. The quality of each translated sample is validated automatically with BERTScore, which measures cosine similarity between the original and back-translated English texts in a BERT model's latent space. The resulting similarity score s enters the loss function as a per-sample weighting factor (s^p) during training. Experiments on French, Hungarian, Romanian, and Russian show substantial performance gains and enable accurate CR even in Romanian, where no prior corpus existed.
Key takeaway
NLP engineers building coreference resolution systems for low-resource languages should consider a cycle-consistent machine translation pipeline. The approach generates or augments training data even for languages with no existing corpora, which significantly boosts CR performance. Weighting training samples by back-translation quality mitigates noise from translation artifacts and improves model precision.
Key insights
Coreference resolution in low-resource languages can be significantly improved by cycle-consistent machine translation data augmentation.
Principles
- Weighting translated training samples by back-translation cycle consistency improves model precision.
- Multilingual encoders enable a single CR model across diverse languages.
- Zero-shot LLMs lag specialized CR models by 10-20% F1 on benchmarks like CoNLL-2012/OntoNotes.
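The first principle above can be illustrated with a small arithmetic sketch. It assumes that each training sample has a scalar loss and a similarity score s between 0 and 1, that p is a non-negative exponent chosen as a hyperparameter, and that the weighted losses are averaged over the batch. The function name and the normalization are illustrative assumptions, not details taken from the original article.

```python
from typing import Sequence


def weighted_loss(losses: Sequence[float], scores: Sequence[float], p: float = 1.0) -> float:
    """Average per-sample losses, each scaled by its weight s ** p.

    Assumption: scores lie in [0, 1] and the batch is normalized by its size;
    the source only states that s^p acts as a weighting factor in the loss.
    """
    if len(losses) != len(scores):
        raise ValueError("losses and scores must have the same length")
    if not losses:
        raise ValueError("need at least one sample")
    # A low-similarity sample gets a small weight, so it contributes less.
    weights = [s ** p for s in scores]
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)
```

With p = 0 every weight is 1 and the result is the plain mean loss. Raising p discounts imperfect translations more sharply: a sample with s = 0.5 keeps half of its loss at p = 1 and a quarter at p = 2.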
Method
The pipeline translates English CR data to a target language, back-translates it to English, computes BERTScore for cycle consistency, and weights the CR model's loss function with s^p during training.
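The data-preparation part of this method can be sketched as a loop over English texts. In the sketch below the translation and scoring steps are passed in as plain callables; `to_target`, `to_english`, `similarity`, and `WeightedSample` are hypothetical names used for illustration, and no real MT or BERTScore API is called. How the coreference annotations are carried through translation is not described in this summary and is left out.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class WeightedSample:
    source_en: str    # original English text
    target_text: str  # translation into the target language
    back_en: str      # back-translation into English
    score: float      # cycle-consistency similarity s
    weight: float     # s ** p, intended to scale this sample's training loss


def build_weighted_samples(
    english_texts: Iterable[str],
    to_target: Callable[[str], str],          # hypothetical: English -> target language
    to_english: Callable[[str], str],         # hypothetical: target language -> English
    similarity: Callable[[str, str], float],  # hypothetical: e.g. a BERTScore wrapper
    p: float = 1.0,                           # assumed exponent hyperparameter
) -> List[WeightedSample]:
    """Translate, back-translate, score, and attach a weight to each sample."""
    samples: List[WeightedSample] = []
    for text in english_texts:
        target = to_target(text)
        back = to_english(target)
        s = similarity(text, back)  # compare original with back-translation
        samples.append(WeightedSample(text, target, back, s, s ** p))
    return samples
```

Passing the three steps as arguments keeps the loop independent of any particular translation model or metric, so stub functions can stand in for them during testing.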
In practice
- Generate CR training data for languages lacking resources using LLM-based MT.
- Expand existing small CR datasets with cycle-consistent translated samples.
- Employ BERTScore over BLEU for semantic similarity in back-translation quality assessment.
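The preference for BERTScore in the last item follows from how the metric is computed: it matches token embeddings by cosine similarity rather than counting shared n-grams, so a back-translation that paraphrases the original can still score well. Below is a minimal sketch of the greedy-matching core of BERTScore, using hand-written toy vectors in place of real embeddings. In practice the vectors would come from a BERT model; optional extras such as IDF weighting and baseline rescaling are omitted, and the function names are illustrative.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match_f1(reference: Sequence[Vector], candidate: Sequence[Vector]) -> float:
    """F1 over greedy token matches, as in the core of BERTScore.

    reference: token embeddings of the original English text
    candidate: token embeddings of the back-translated English text
    """
    if not reference or not candidate:
        raise ValueError("both texts need at least one token embedding")
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in candidate) for r in reference) / len(reference)
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in reference) for c in candidate) / len(candidate)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Two identical embedding sequences give a score of 1.0. If the back-translation drops a token whose embedding has no close counterpart, recall falls while precision stays high, and the F1 decreases accordingly.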
Topics
- Coreference Resolution
- Low-Resource Languages
- Machine Translation
- Data Augmentation
- BERTScore
- Claude Sonnet 4.6
- Maverick Model
Best for: Research Scientist, AI Engineer, AI Scientist, NLP Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.