Beyond Monolingual Deep Research: Evaluating Agents and Retrievers with Cross-Lingual BrowseComp-Plus
Summary
Cross-lingual BrowseComp-Plus (XBCP) is a new benchmark for evaluating deep research agents and multilingual retrievers in cross-lingual settings. XBCP extends BrowseComp-Plus by translating its evidence corpus into 12 languages, spanning high- and low-resource ones, while keeping the questions and answers in English. Experiments with four agents (GPT-OSS-20B, GPT-OSS-120B, Qwen3.6-35B-A3B, DeepSeek-V4-Pro) and five retrievers (BM25, Qwen3-Embedding-4B/8B, Multilingual-E5-Large, Arctic-Embed-L-2.0) show substantial degradation when the evidence is translated: accuracy drops by 16-23 percentage points, evidence recall falls, agents become less well calibrated, and citation fidelity declines. Even with oracle retrieval, accuracy remains lower than with the original English evidence, which points to an agent-side difficulty in integrating language-mismatched evidence. The additional penalty for low-resource languages is attributed primarily to retrieval failures rather than to agent reasoning.
Key takeaway
AI architects and NLP engineers designing deep research agents for global applications should recognize that cross-lingual performance is not merely a retrieval problem: both retrieval and agent-side evidence integration are bottlenecks and both need to be addressed. Focus on developing language-aware agentic search systems that adapt dynamically to the language of the evidence, and invest in multilingual pretraining to strengthen agents' intrinsic reasoning over linguistically diverse inputs.
Key insights
Cross-lingual deep research agents face dual bottlenecks: retrieval failure and agent-side evidence integration.
Principles
- Language mismatch significantly degrades agent accuracy and evidence recall.
- Low-resource language penalties primarily stem from retrieval failures.
- English serves as the agent's "native language" for instruction following.
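The two quantities behind these principles, evidence recall and calibration, can be illustrated with a minimal sketch. It assumes set-based recall over gold evidence document IDs and a standard binned expected calibration error; the function names are hypothetical, and the exact metric definitions used by XBCP may differ.

```python
from typing import Iterable, Sequence


def evidence_recall(retrieved: Iterable[str], gold: Iterable[str]) -> float:
    """Fraction of gold evidence documents that appear among the retrieved ones."""
    gold_set = set(gold)
    if not gold_set:
        return 0.0
    return len(gold_set & set(retrieved)) / len(gold_set)


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Bin-weighted mean gap between stated confidence and observed accuracy.

    Assumption: confidences are in [0, 1]; this is the common equal-width
    binned ECE, not necessarily the calibration measure used in the paper.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Group (confidence, correctness) pairs into equal-width bins.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece
```

Computing both metrics once on the English corpus and once per translated corpus would give the per-language gaps that the principles describe.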
Method
XBCP extends BrowseComp-Plus by translating its evidence corpus into 12 languages for cross-lingual and multilingual evaluation of deep research agents.
In practice
- Evaluate cross-lingual retrievers within iterative agent search loops.
- Utilize agent reasoning traces for query expansion to improve retrieval.
- Prioritize stronger multilingual pretraining for agents over prompt translation.
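As a rough illustration of using reasoning traces for query expansion, the sketch below appends frequent terms from an agent's trace to an English query before searching a translated corpus. Everything here is a hypothetical toy: the term-overlap scorer stands in for a real retriever (it is not BM25 or any model from the paper), and the stopword list, function names, and example documents are assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption, not from the source).
STOPWORDS = {"the", "and", "for", "that", "this", "with", "was", "are"}


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-word characters."""
    return re.findall(r"\w+", text.lower())


def expand_query(query: str, reasoning_trace: str, max_terms: int = 5) -> str:
    """Append the most frequent new terms from the trace to the query."""
    seen = set(tokenize(query))
    counts = Counter(
        tok
        for tok in tokenize(reasoning_trace)
        if tok not in seen and tok not in STOPWORDS and len(tok) > 2
    )
    extra = [tok for tok, _ in counts.most_common(max_terms)]
    return f"{query} {' '.join(extra)}" if extra else query


def overlap_score(query: str, document: str) -> int:
    """Toy lexical score: occurrences of query terms in the document."""
    doc_counts = Counter(tokenize(document))
    return sum(doc_counts[tok] for tok in set(tokenize(query)))


def search(query: str, corpus: dict[str, str], k: int = 2) -> list[str]:
    """Return the IDs of the k highest-scoring documents."""
    ranked = sorted(
        corpus,
        key=lambda doc_id: overlap_score(query, corpus[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical Spanish-language corpus queried in English.
corpus = {
    "d_recipe": "La receta lleva tomate y cebolla",
    "d_hubble": "El telescopio Hubble fue lanzado en 1990 por la NASA",
}
query = "which space telescope launched first"
trace = "The question likely refers to Hubble, launched by NASA in 1990. Hubble is a candidate."

expanded = expand_query(query, trace)
top_hit = search(expanded, corpus, k=1)
```

In this toy setup the plain English query shares no tokens with the Spanish documents, while the expanded query picks up named entities such as "hubble" and "nasa" that survive translation. Whether similar gains hold for dense retrievers or inside a full agent loop is something to verify empirically.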
Topics
- Deep Research Agents
- Cross-lingual Retrieval
- Multilingual LLMs
- Benchmarking
- Evidence Integration
- Information Retrieval
Best for: Research Scientist, AI Engineer, Machine Learning Engineer, AI Scientist, NLP Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.