Beyond Monolingual Deep Research: Evaluating Agents and Retrievers with Cross-Lingual BrowseComp-Plus

Source: Computation and Language · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Natural Language Processing) · Depth: Expert, quick

Summary

XBCP (Cross-lingual BrowseComp-Plus) is a new benchmark for evaluating deep research agents and retrievers when the supporting evidence is not in the same language as the user's query. Existing browsing benchmarks assume monolingual evidence; XBCP instead keeps the question-and-answer space in English and varies the language of the supporting documents. It has two settings: a cross-lingual setting, in which each query is paired with evidence in a single assigned language, and a multilingual setting, in which the evidence corpus is distributed across 12 languages. Evaluations of four deep research agents, paired with sparse and dense multilingual retrievers, showed significant performance degradation once the evidence was no longer in English. Even strong dense retrievers lost evidence recall, and agents became less well calibrated and less reliable in their citations. Notably, accuracy stayed lower even when all gold evidence was provided directly, which points to two separate problems: retrieval failures and an independent agent-side difficulty in integrating language-mismatched evidence.
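The difference between the two settings can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the summary does not name the 12 languages or say how languages are assigned, so the placeholder codes `lang_01` to `lang_12`, the round-robin assignment, and the function names `cross_lingual_plan` and `multilingual_plan` are hypothetical and are not the benchmark's actual construction procedure.

```python
from typing import Dict, List, Sequence

# Placeholder language codes: the summary does not name the 12 languages.
LANGUAGES: List[str] = [f"lang_{i:02d}" for i in range(1, 13)]


def cross_lingual_plan(
    evidence_by_query: Dict[str, List[str]],
    languages: Sequence[str],
) -> Dict[str, str]:
    """Cross-lingual setting: all evidence for a query shares ONE language.

    Round-robin assignment is an assumption. A document shared by several
    queries simply takes the language of the last query processed.
    """
    plan: Dict[str, str] = {}
    for i, (query_id, doc_ids) in enumerate(sorted(evidence_by_query.items())):
        language = languages[i % len(languages)]
        for doc_id in doc_ids:
            plan[doc_id] = language
    return plan


def multilingual_plan(
    doc_ids: Sequence[str],
    languages: Sequence[str],
) -> Dict[str, str]:
    """Multilingual setting: documents are spread across all languages.

    Round-robin over sorted document ids is again an assumption.
    """
    return {
        doc_id: languages[i % len(languages)]
        for i, doc_id in enumerate(sorted(doc_ids))
    }


if __name__ == "__main__":
    evidence = {"q1": ["d1", "d2"], "q2": ["d3"]}
    print(cross_lingual_plan(evidence, LANGUAGES))
    print(multilingual_plan(["d1", "d2", "d3"], LANGUAGES))
```

In the cross-lingual plan, the evidence for a given query always lands in one language, whereas in the multilingual plan the evidence for a single query may be split across several languages.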

Key takeaway

AI scientists and machine learning engineers building deep research agents for global information retrieval should expect substantial performance degradation when the evidence is in a different language from the query. Development effort should go into two areas: making multilingual retrievers more robust, and improving the agent-side mechanisms that integrate language-mismatched evidence. The second matters even when retrieval is perfect, because agents remain less accurate when given all gold evidence directly.

Key insights

Agent performance degrades significantly in cross-lingual deep research, exposing both retrieval failures and agent-side challenges in integrating evidence.

Method

XBCP evaluates deep research agents by keeping questions and answers in English while varying the evidence language across two settings: cross-lingual (one assigned language per query) and multilingual (the corpus spread across 12 languages). It measures answer accuracy, evidence recall, and calibration.
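The three measured quantities can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's scoring code: the summary does not say how calibration is computed, so expected calibration error (ECE) with equal-width bins is used here as one common choice, and the function names and input shapes are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def accuracy(correct: Sequence[bool]) -> float:
    """Fraction of questions answered correctly."""
    return sum(correct) / len(correct) if correct else 0.0


def evidence_recall(retrieved_doc_ids: Sequence[str], gold_doc_ids: Set[str]) -> float:
    """Fraction of gold evidence documents that appear among the retrieved ones."""
    if not gold_doc_ids:
        return 0.0
    return len(gold_doc_ids & set(retrieved_doc_ids)) / len(gold_doc_ids)


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """ECE: size-weighted mean gap between stated confidence and accuracy.

    Assumption: confidences lie in [0, 1] and are grouped into equal-width bins.
    """
    total = len(confidences)
    if total == 0:
        return 0.0
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for confidence, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, is_correct))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_confidence = sum(c for c, _ in members) / len(members)
        bin_accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_confidence - bin_accuracy)
    return ece


if __name__ == "__main__":
    print(accuracy([True, False, True, True]))
    print(evidence_recall(["d1", "d3", "d9"], {"d1", "d2"}))
    print(expected_calibration_error([0.95, 0.95, 0.15, 0.15], [True, False, False, False]))
```

Computing evidence recall separately from accuracy is what makes it possible to tell a retrieval failure apart from an agent-side failure: low recall implicates the retriever, while low accuracy with gold evidence supplied implicates the agent.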

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.