Beyond Monolingual Deep Research: Evaluating Agents and Retrievers with Cross-Lingual BrowseComp-Plus
Summary
XBCP (Cross-lingual BrowseComp-Plus) is a new benchmark for evaluating deep research agents and retrievers when the supporting evidence is not in the same language as the user's query. Unlike existing browsing benchmarks, which assume monolingual evidence, XBCP keeps the question-and-answer space in English while varying the language of the supporting documents. It has two settings: a cross-lingual setting, where each query is paired with evidence in a single assigned language, and a multilingual setting, where the evidence corpus is distributed across 12 languages. Evaluations of four deep research agents with sparse and dense multilingual retrievers showed significant performance degradation when the evidence is translated. Even strong dense retrievers lost evidence recall, and agents became less well calibrated and less reliable in their citations. Notably, accuracy remained lower even when all gold evidence was provided directly, which points to two separate problems: retrieval failures, and an independent agent-side difficulty in integrating language-mismatched evidence.
Key takeaway
If you are an AI scientist or machine learning engineer building deep research agents for global information retrieval, expect substantial performance degradation when the evidence is in a different language from the query. Prioritize two things: making multilingual retrievers more robust, and improving the agent-side mechanisms that integrate language-mismatched evidence. The second matters even when retrieval is perfect, because agents independently struggle to use evidence that is not in the query's language.
Key insights
Agent performance degrades significantly when deep research must draw on cross-lingual evidence, exposing challenges in both retrieval and agent-side evidence integration.
Principles
- Cross-lingual evidence degrades agent performance.
- Retrieval failures are distinct from agent integration issues.
- Even strong retrievers lose recall cross-lingually.
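One way to keep the two failure sources apart is to compare accuracy under three conditions. The sketch below is illustrative only: the condition names, the `AccuracyByCondition` class, and the `decompose_gap` function are hypothetical, and the subtraction-based split is an assumption rather than the analysis used in the original article. It assumes you have an accuracy with gold evidence supplied in English, an accuracy with gold evidence supplied in the target language, and an accuracy when the agent must retrieve target-language evidence itself.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class AccuracyByCondition:
    """Accuracies (0.0 to 1.0) for one agent; all field names are hypothetical."""
    english_oracle: float    # gold evidence supplied directly, in English
    target_oracle: float     # gold evidence supplied directly, in the target language
    target_retrieved: float  # agent retrieves target-language evidence itself


def decompose_gap(acc: AccuracyByCondition) -> Dict[str, float]:
    """Split the total accuracy drop into an agent-side and a retrieval-side part.

    agent_side:     loss that remains even with perfect (oracle) evidence
    retrieval_side: additional loss once the agent has to retrieve evidence
    """
    agent_side = acc.english_oracle - acc.target_oracle
    retrieval_side = acc.target_oracle - acc.target_retrieved
    return {
        "agent_side": agent_side,
        "retrieval_side": retrieval_side,
        "total": agent_side + retrieval_side,
    }


if __name__ == "__main__":
    # Placeholder numbers for illustration only; they are NOT results from XBCP.
    example = AccuracyByCondition(
        english_oracle=0.8, target_oracle=0.7, target_retrieved=0.5
    )
    print(decompose_gap(example))
```

A non-zero `agent_side` value corresponds to the observation that accuracy stays lower even when all gold evidence is provided, while `retrieval_side` captures what is lost on top of that when evidence must be found first.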
Method
XBCP evaluates deep research agents by keeping questions and answers in English while varying the evidence language across two settings: cross-lingual (one assigned language per query) and multilingual (the corpus spread across 12 languages). It measures accuracy, evidence recall, and calibration.
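For readers unfamiliar with the measures named above, here is a minimal sketch of evidence recall and a calibration error. The function names are hypothetical, and the binned expected calibration error shown is one common formulation that may differ from the benchmark's own definition.

```python
from typing import List, Sequence, Set, Tuple


def evidence_recall_at_k(retrieved: Sequence[str], gold: Set[str], k: int) -> float:
    """Fraction of gold evidence document IDs that appear in the top-k retrieved IDs."""
    if not gold:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & gold) / len(gold)


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Size-weighted mean gap between stated confidence and accuracy per bin.

    confidences are expected in [0.0, 1.0]; correct marks whether each answer was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Group predictions into equal-width confidence bins.
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece
```

In the cross-lingual setting, `retrieved` and `gold` would hold IDs of documents in the assigned evidence language, so the same recall function applies regardless of language; only the corpus changes.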
In practice
- Test agents with evidence in languages other than the query language.
- Prioritize cross-lingual retrieval robustness.
- Improve how agents integrate language-mismatched evidence.
Topics
- Deep Research Agents
- Multilingual Retrieval
- Cross-lingual Evaluation
- BrowseComp-Plus
- Language Mismatch
- Information Retrieval Benchmarks
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.