Lost in Translation: Do LVLM Judges Generalize Across Languages?

2026-04-21 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Researchers have introduced MM-JudgeBench, described as the first large-scale benchmark for evaluating multilingual, multimodal judge models, specifically large vision-language models (LVLMs) used as judges. The benchmark comprises over 60,000 pairwise preference instances across 25 typologically diverse languages, organized into two subsets: a general vision-language preference evaluation subset that extends VL-RewardBench, and a chart-centric visual-text reasoning subset derived from OpenCQA. The project also releases a multilingual training set from MM-RewardBench, kept separate from the evaluation data, to support domain adaptation. An evaluation of 22 LVLMs (15 open-source, 7 proprietary) on MM-JudgeBench revealed significant cross-lingual performance variance. The analysis indicates that model size and architecture are unreliable predictors of multilingual robustness: even advanced LVLM judges behave inconsistently from one language to another.

Key takeaway

Research scientists developing or deploying LVLM judge models should prioritize multilingual evaluation using benchmarks like MM-JudgeBench. Relying solely on English-centric assessments, or assuming that model size guarantees robustness, risks unreliable cross-lingual performance. Integrating multilingual training data may improve generalization and yield more consistent behavior across languages.

Key insights

LVLM judge models exhibit substantial cross-lingual performance variance, which calls English-centric evaluation practices into question.

Principles

Multilingual benchmarks are crucial for reliable automated evaluators.
Model size does not predict multilingual robustness.

Method

MM-JudgeBench combines a general vision-language preference evaluation subset with a chart-centric visual-text reasoning subset across 25 languages, totaling over 60K pairwise preference instances, and adds a separate training set for domain adaptation.
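
A pairwise preference benchmark of this kind is typically scored by checking whether the judge picks the human-preferred response, then comparing that accuracy across languages. The sketch below is a minimal illustration of that idea, not the authors' evaluation code: the record fields (`language`, `preferred`, `judged`) and the toy data are assumptions made for the example and do not reflect the actual MM-JudgeBench schema or results.

```python
from collections import defaultdict
from statistics import mean, pstdev


def per_language_accuracy(records):
    """Fraction of pairwise instances where the judge agrees with the label.

    `records` is an iterable of dicts with hypothetical keys:
    'language' (e.g. "en"), 'preferred' (gold choice, "A" or "B"),
    and 'judged' (the judge model's choice, "A" or "B").
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        totals[r["language"]] += 1
        if r["judged"] == r["preferred"]:
            hits[r["language"]] += 1
    return {lang: hits[lang] / totals[lang] for lang in totals}


def cross_lingual_summary(acc):
    """Summarize how much judge accuracy varies across languages."""
    values = list(acc.values())
    return {
        "mean": mean(values),
        "std": pstdev(values),               # population std over languages
        "gap": max(values) - min(values),    # best-to-worst language gap
        "worst": min(acc, key=acc.get),      # language with lowest accuracy
    }


# Toy data for illustration only (not benchmark results).
records = [
    {"language": "en", "preferred": "A", "judged": "A"},
    {"language": "en", "preferred": "B", "judged": "B"},
    {"language": "sw", "preferred": "A", "judged": "B"},
    {"language": "sw", "preferred": "B", "judged": "B"},
]

acc = per_language_accuracy(records)
summary = cross_lingual_summary(acc)
print(acc)
print(summary)
```

Reporting the best-to-worst gap and the worst language alongside the mean keeps a strong English score from masking weak performance elsewhere, which is the failure mode the benchmark is meant to expose.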

In practice

Use MM-JudgeBench for LVLM multilingual evaluation.
Consider multilingual training for LVLM judges.

Topics

LVLM Judges
Multilingual Evaluation
Multimodal Benchmarks
Reward Models
Cross-lingual Generalization

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.