XQ-MEval: A Dataset with Cross-lingual Parallel Quality for Benchmarking Translation Metrics

2026-04-16 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

XQ-MEval is a new dataset for benchmarking translation metrics in multilingual settings. It targets cross-lingual scoring bias, where translations of comparable quality receive different metric scores depending on the language. The dataset covers nine translation directions and was built semi-automatically: MQM-defined errors were injected into gold translations, native speakers filtered the results, and the surviving errors were merged to create pseudo translations of controlled quality. Each pseudo translation is paired with its source and reference to form a triplet for metric evaluation. Experiments on nine representative metrics show inconsistencies between averaged metric scores and human judgment, which the authors present as empirical evidence of cross-lingual scoring bias. The research also proposes a normalization strategy, derived from XQ-MEval, that aligns score distributions across languages to make multilingual metric evaluation fairer and more reliable.

Key takeaway

For research scientists developing or evaluating multilingual translation systems, understanding and mitigating cross-lingual scoring bias is crucial. The common practice of averaging metric scores across languages may be unreliable. Consider using datasets like XQ-MEval to test for this bias empirically, and apply normalization strategies to make evaluations of translation quality fairer and more reliable across diverse languages.

Key insights

Cross-lingual scoring bias in translation metrics can be identified and mitigated using controlled, parallel-quality datasets.

Principles

Averaging raw metric scores across languages is unreliable when score distributions differ by language.
Controlled error injection can simulate translation quality.
Normalization improves multilingual metric fairness.
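
As a rough illustration of the first principle, the sketch below checks how far a metric's mean score drifts between languages when the underlying quality level is held fixed. The record layout (language, quality level, metric score) and the function name are assumptions made for this example, not the dataset's actual schema or the authors' analysis.

```python
from collections import defaultdict
from statistics import mean


def quality_gap_by_level(records):
    """Return, per quality level, the gap between the highest and lowest
    per-language mean metric score.

    `records` is an iterable of (language, quality_level, metric_score)
    tuples. This layout is hypothetical and chosen only for illustration.
    """
    # Group scores first by quality level, then by language.
    by_level = defaultdict(lambda: defaultdict(list))
    for language, level, score in records:
        by_level[level][language].append(score)

    gaps = {}
    for level, per_language in by_level.items():
        language_means = [mean(scores) for scores in per_language.values()]
        # A large gap at a fixed quality level suggests scoring bias.
        gaps[level] = max(language_means) - min(language_means)
    return gaps
```

A gap near zero at every level would mean the metric scores equally good translations alike across languages; a large gap means raw scores are not directly comparable.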

Method

XQ-MEval constructs pseudo translations in three steps: it injects MQM-defined errors into gold translations, has native speakers filter the injected errors, and merges the remaining errors to control the quality of each pseudo translation used for metric benchmarking.
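
The merging step can be pictured with a minimal sketch. It assumes each verified error is stored as a character-span replacement on the gold translation and that quality is controlled by how many errors are merged; the `ErrorEdit` type, its fields, and both function names are hypothetical and do not come from the dataset's released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorEdit:
    """One verified error, stored as a span replacement (hypothetical layout)."""
    start: int        # inclusive character offset in the gold translation
    end: int          # exclusive character offset
    replacement: str  # erroneous text that replaces gold[start:end]
    category: str     # MQM category label, e.g. "mistranslation"


def merge_errors(gold: str, edits: list[ErrorEdit]) -> str:
    """Apply several non-overlapping error edits to one gold translation."""
    ordered = sorted(edits, key=lambda e: e.start)
    for previous, current in zip(ordered, ordered[1:]):
        if current.start < previous.end:
            raise ValueError("overlapping error spans cannot be merged")
    text = gold
    # Apply right to left so earlier offsets stay valid.
    for edit in reversed(ordered):
        text = text[:edit.start] + edit.replacement + text[edit.end:]
    return text


def build_triplet(source: str, reference: str, gold: str,
                  edits: list[ErrorEdit]) -> dict:
    """Bundle a pseudo translation with its source and reference."""
    return {
        "source": source,
        "reference": reference,
        "hypothesis": merge_errors(gold, edits),
        "num_errors": len(edits),  # assumed proxy for the quality level
    }
```

Under these assumptions, merging the same number of errors in every translation direction is what would make quality levels parallel across languages.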

In practice

Use XQ-MEval to benchmark new translation metrics.
Apply normalization strategies to multilingual metric scores.
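
The summary does not spell out the proposed normalization, so the sketch below substitutes a plain per-language z-score as a stand-in: calibration scores for each language (for instance, metric scores on a parallel-quality benchmark) give a mean and standard deviation, and new scores are rescaled with them. The function names and the choice of z-scoring are assumptions, not the authors' strategy.

```python
from statistics import mean, pstdev


def fit_calibration(calibration_scores: dict[str, list[float]]) -> dict[str, tuple[float, float]]:
    """Estimate (mean, standard deviation) of metric scores per language.

    `calibration_scores` maps a language or direction code to metric scores
    collected on calibration data. The mapping layout is hypothetical.
    """
    params = {}
    for language, scores in calibration_scores.items():
        mu = mean(scores)
        sigma = pstdev(scores)
        if sigma == 0:
            raise ValueError(f"no score variation for {language}")
        params[language] = (mu, sigma)
    return params


def normalize(score: float, language: str,
              params: dict[str, tuple[float, float]]) -> float:
    """Rescale a raw metric score to a per-language z-score."""
    mu, sigma = params[language]
    return (score - mu) / sigma
```

Scores normalized this way are expressed relative to each language's own distribution, so averaging them across languages is less sensitive to language-specific offsets than averaging raw scores.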

Topics

XQ-MEval Dataset
Translation Metrics
Cross-lingual Scoring Bias
Multilingual Translation
MQM Errors

Best for: Research Scientist, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.