XQ-MEval: A Dataset with Cross-lingual Parallel Quality for Benchmarking Translation Metrics
Summary
XQ-MEval is a dataset for benchmarking translation metrics in multilingual settings, built to expose cross-lingual scoring bias. It covers nine translation directions and was constructed semi-automatically: MQM-defined errors were injected into gold translations, the results were filtered by native speakers, and errors were merged to produce pseudo translations of controlled quality. Each pseudo translation is combined with its source and reference into a triplet used to evaluate translation metrics. Experiments on nine representative metrics show that averaged metric scores are inconsistent with human judgment, which is empirical evidence of cross-lingual scoring bias. The authors also propose a normalization strategy, derived from XQ-MEval, that aligns score distributions across languages and makes multilingual metric evaluation fairer and more reliable.
Key takeaway
For research scientists developing or evaluating multilingual translation systems, understanding and mitigating cross-lingual scoring bias is crucial. Averaging metric scores across languages may be unreliable. Consider using a dataset like XQ-MEval to test empirically for this bias, and apply a normalization strategy so that evaluations of translation quality are fairer and more reliable across languages.
Key insights
Cross-lingual scoring bias in translation metrics can be identified and mitigated using controlled, parallel-quality datasets.
Principles
- Averaging raw metric scores across languages can be unreliable.
- Controlled error injection can simulate translation quality.
- Normalization improves multilingual metric fairness.
Method
XQ-MEval constructs pseudo translations in three steps: MQM-defined errors are injected into gold translations, native speakers filter the injected errors, and the surviving errors are merged so that each pseudo translation has a controlled quality level for metric benchmarking.
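The merging step can be illustrated with a minimal sketch. Everything here is hypothetical and not taken from the original article: the `InjectedError` type, the `merge_errors` helper, the character-offset representation, and the severity weights (minor = 1, major = 5, a common MQM convention) are all assumptions made for illustration. The idea is only that applying a chosen set of non-overlapping errors to a gold translation yields a pseudo translation whose total penalty is known in advance.

```python
from dataclasses import dataclass

# Assumed severity weights (a common MQM convention, not confirmed by the article).
SEVERITY_PENALTY = {"minor": 1, "major": 5}


@dataclass(frozen=True)
class InjectedError:
    """Hypothetical record for one filtered error on a gold translation."""
    start: int        # character offset in the gold translation (inclusive)
    end: int          # character offset in the gold translation (exclusive)
    replacement: str  # erroneous text that replaces gold[start:end]
    severity: str     # "minor" or "major"


def merge_errors(gold, errors):
    """Apply non-overlapping errors to gold; return (pseudo_translation, penalty)."""
    ordered = sorted(errors, key=lambda e: e.start)
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.start < prev.end:
            raise ValueError("overlapping error spans cannot be merged")
    text = gold
    # Apply from right to left so that earlier offsets remain valid.
    for err in reversed(ordered):
        text = text[:err.start] + err.replacement + text[err.end:]
    penalty = sum(SEVERITY_PENALTY[e.severity] for e in errors)
    return text, penalty


gold = "The cat sat on the mat."
errors = [
    InjectedError(4, 7, "dog", "major"),     # "cat" -> "dog"
    InjectedError(19, 22, "mats", "minor"),  # "mat" -> "mats"
]
pseudo, penalty = merge_errors(gold, errors)
print(pseudo, penalty)
```

Choosing which errors to merge, rather than sampling them freely, is what would let two pseudo translations in different languages carry the same penalty and so be treated as parallel in quality.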
In practice
- Use XQ-MEval to benchmark new translation metrics.
- Apply normalization strategies to multilingual metric scores.
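The summary does not spell out the normalization strategy, so the following is only a sketch of one plausible approach, per-language standardization, and should not be read as the authors' method. The function names `fit_calibration` and `normalize`, the language-pair keys, and all scores are invented for illustration. It assumes that a metric's scores on a parallel-quality calibration set are available per language, and uses their mean and standard deviation to put new scores on a shared scale before averaging.

```python
from statistics import mean, pstdev


def fit_calibration(calibration_scores):
    """Return {lang: (mean, std)} computed from per-language calibration scores."""
    return {
        lang: (mean(scores), pstdev(scores))
        for lang, scores in calibration_scores.items()
    }


def normalize(score, lang, calibration):
    """Map a raw metric score to a z-score using that language's calibration stats."""
    mu, sigma = calibration[lang]
    if sigma == 0:
        return 0.0  # degenerate calibration set: every score was identical
    return (score - mu) / sigma


# Invented metric scores on items of matching quality in two directions.
calibration_scores = {
    "en-de": [0.80, 0.85, 0.90],
    "en-zh": [0.40, 0.50, 0.60],
}
calibration = fit_calibration(calibration_scores)

# Raw scores differ widely, yet both sit at the middle of their own distribution.
z_de = normalize(0.85, "en-de", calibration)
z_zh = normalize(0.50, "en-zh", calibration)
print(z_de, z_zh)
```

In this toy setup the raw scores 0.85 and 0.50 would pull a cross-language average in different directions, while the normalized values are both close to zero, which is the kind of alignment the normalization is meant to achieve.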
Topics
- XQ-MEval Dataset
- Translation Metrics
- Cross-lingual Scoring Bias
- Multilingual Translation
- MQM Errors
Best for: Research Scientist, AI Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.