Beyond Monolingual Deep Research: Evaluating Agents and Retrievers with Cross-Lingual BrowseComp-Plus

Source: cs.CL updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems) · Depth: Expert, extended

Summary

A new benchmark, Cross-lingual BrowseComp-Plus (XBCP), evaluates deep research agents and multilingual retrievers in cross-lingual settings. XBCP extends BrowseComp-Plus by translating its evidence corpus into 12 languages, spanning both high- and low-resource languages, while keeping the questions and answers in English. Experiments with four agents (GPT-OSS-20B, GPT-OSS-120B, Qwen3.6-35B-A3B, DeepSeek-V4-Pro) and five retrievers (BM25, Qwen3-Embedding-4B/8B, Multilingual-E5-Large, Arctic-Embed-L-2.0) show substantial degradation when the evidence is translated: accuracy drops by 16-23 percentage points, evidence recall decreases, agents become less well calibrated, and citation fidelity declines. Even with oracle retrieval, accuracy remains lower, which points to an agent-side difficulty in integrating language-mismatched evidence. The additional penalty for low-resource languages is attributed primarily to retrieval failures rather than to agent reasoning.
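To make the reported metrics concrete, here is a minimal sketch of how evidence recall and calibration could be computed for one evidence language. The summary does not say which calibration metric the authors use; expected calibration error (ECE) is assumed here as one common choice. The function names and the document-id format are hypothetical and are not taken from the XBCP code.

```python
from typing import Sequence, Set


def evidence_recall(retrieved: Sequence[str], gold: Set[str]) -> float:
    """Fraction of gold evidence document ids found among the retrieved ids."""
    if not gold:
        return 0.0
    return len(gold.intersection(retrieved)) / len(gold)


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Binned ECE (assumed metric): weighted mean of |accuracy - confidence|."""
    total = len(confidences)
    if total == 0:
        return 0.0

    # Group predictions into equal-width confidence bins over [0, 1].
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece
```

Computing both values once on the English corpus and once on each translated corpus would give the kind of per-language comparison the summary describes.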

Key takeaway

AI architects and NLP engineers designing deep research agents for global applications should recognize that cross-lingual performance is not merely a retrieval problem: both retrieval and agent-side evidence integration are bottlenecks, and both must be addressed. Focus on language-aware agentic search systems that adapt dynamically to the language of the evidence, and invest in multilingual pretraining to strengthen agents' reasoning over linguistically diverse inputs.

Key insights

Cross-lingual deep research agents face dual bottlenecks: retrieval failure and agent-side evidence integration.
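One way to reason about the two bottlenecks is to split the overall accuracy drop using oracle-retrieval runs. The sketch below is an illustrative assumption, not the paper's stated procedure: it treats the drop that persists under oracle retrieval as the agent-side share and assigns the remainder to retrieval. The function and key names are hypothetical.

```python
from typing import Dict


def decompose_gap(
    acc_english: float,
    acc_translated: float,
    oracle_english: float,
    oracle_translated: float,
) -> Dict[str, float]:
    """Split the cross-lingual accuracy drop into two assumed components.

    All inputs are accuracies in [0, 1]. The additive split is a
    simplifying assumption made for illustration only.
    """
    total_drop = acc_english - acc_translated
    # With gold evidence supplied, any remaining drop cannot be blamed on retrieval.
    agent_side = oracle_english - oracle_translated
    retrieval_side = total_drop - agent_side
    return {
        "total_drop": total_drop,
        "agent_side": agent_side,
        "retrieval_side": retrieval_side,
    }
```

Under this reading, a low-resource language whose oracle-retrieval accuracy stays close to that of other languages while its end-to-end accuracy falls further would show a large retrieval share, which is consistent with the summary's attribution.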

Method

XBCP extends BrowseComp-Plus by translating its evidence corpus into 12 languages for cross-lingual and multilingual evaluation of deep research agents.

Best for: Research Scientist, AI Engineer, Machine Learning Engineer, AI Scientist, NLP Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.