Lost in Translation: Do LVLM Judges Generalize Across Languages?
Summary
Researchers have introduced MM-JudgeBench, the first large-scale benchmark for evaluating multilingual, multimodal judge models, specifically large vision-language models (LVLMs) used as judges. The benchmark comprises over 60,000 pairwise preference instances across 25 typologically diverse languages. It has two subsets: a general vision-language preference evaluation subset that extends VL-RewardBench, and a chart-centric visual-text reasoning subset derived from OpenCQA. The project also releases a multilingual training set drawn from MM-RewardBench and kept separate from the evaluation data, to support domain adaptation. Evaluating 22 LVLMs (15 open-source, 7 proprietary) on MM-JudgeBench revealed significant cross-lingual performance variance. The analysis indicates that model size and architecture are unreliable predictors of multilingual robustness: even advanced LVLM judges behave inconsistently across languages.
Key takeaway
Research scientists developing or deploying LVLM judge models should prioritize multilingual evaluation using benchmarks like MM-JudgeBench. Relying solely on English-centric assessments, or assuming that model size guarantees robustness, risks unreliable cross-lingual performance. Integrating multilingual training data can help improve generalization and make judge behavior more consistent across linguistic contexts.
Key insights
LVLM judge models exhibit substantial cross-lingual performance variance, challenging current evaluation practices.
Principles
- Multilingual benchmarks are crucial for reliable automated evaluators.
- Model size does not predict multilingual robustness.
Method
MM-JudgeBench combines general vision-language preference evaluation with chart-centric visual-text reasoning across 25 languages, using over 60,000 pairwise preference instances for evaluation and a separate multilingual training set for domain adaptation.
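To make the idea of cross-lingual performance variance concrete, here is a minimal sketch of how per-language pairwise accuracy and its spread could be computed for a judge. The record layout (language, gold choice, judge choice), the function names, and the toy data are all assumptions for illustration; they are not MM-JudgeBench's actual data format or the authors' scoring code.

```python
from collections import defaultdict
from statistics import mean, pstdev


def per_language_accuracy(records):
    """Return pairwise-preference accuracy per language.

    `records` is a hypothetical iterable of
    (language, gold_choice, judge_choice) tuples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for language, gold_choice, judge_choice in records:
        total[language] += 1
        correct[language] += int(gold_choice == judge_choice)
    return {language: correct[language] / total[language] for language in total}


def cross_lingual_summary(accuracy_by_language):
    """Summarize how much accuracy varies across languages."""
    values = list(accuracy_by_language.values())
    return {
        "mean": mean(values),
        "std": pstdev(values),             # population standard deviation
        "gap": max(values) - min(values),  # best language minus worst language
    }


# Toy judgments (invented): the judge picks response "A" or "B" per pair.
records = [
    ("en", "A", "A"), ("en", "B", "B"), ("en", "A", "A"), ("en", "B", "A"),
    ("sw", "A", "B"), ("sw", "B", "B"), ("sw", "A", "B"), ("sw", "B", "A"),
]

accuracy = per_language_accuracy(records)
summary = cross_lingual_summary(accuracy)
print(accuracy)
print(summary)
```

A large gap or standard deviation in such a summary would be one simple signal that a judge's reliability depends on the language of the inputs, which an English-only evaluation would not surface.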
In practice
- Use MM-JudgeBench for LVLM multilingual evaluation.
- Consider multilingual training for LVLM judges.
Topics
- LVLM Judges
- Multilingual Evaluation
- Multimodal Benchmarks
- Reward Models
- Cross-lingual Generalization
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.