Beyond Monolingual Deep Research: Evaluating Agents and Retrievers with Cross-Lingual BrowseComp-Plus

2026-06-18 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

A new benchmark, Cross-lingual BrowseComp-Plus (XBCP), evaluates deep research agents and multilingual retrievers in cross-lingual settings. XBCP extends BrowseComp-Plus by translating its evidence corpus into 12 languages, spanning high- and low-resource languages, while keeping the questions and answers in English. Experiments with four agents (GPT-OSS-20B, GPT-OSS-120B, Qwen3.6-35B-A3B, DeepSeek-V4-Pro) and five retrievers (BM25, Qwen3-Embedding-4B/8B, Multilingual-E5-Large, Arctic-Embed-L-2.0) show substantial degradation when the evidence is translated: accuracy drops by 16-23 percentage points, evidence recall falls, agents become less well calibrated, and citation fidelity declines. Accuracy remains lower even with oracle retrieval, which points to an agent-side difficulty in integrating language-mismatched evidence. The penalty for low-resource languages is attributed primarily to retrieval failures rather than to agent reasoning.
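Two of the quantities named in the summary, evidence recall and calibration, can be made concrete with a short sketch. The function names are illustrative, and expected calibration error (ECE) is used here only as one common calibration measure; this summary does not say which calibration metric the paper reports.

```python
from typing import Iterable, Sequence


def evidence_recall(retrieved_ids: Iterable[str], gold_ids: Iterable[str]) -> float:
    """Fraction of gold evidence documents that appear among the retrieved ones."""
    gold = set(gold_ids)
    if not gold:
        return 0.0
    return len(gold & set(retrieved_ids)) / len(gold)


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """ECE with equal-width bins: weighted mean of |confidence - accuracy| per bin.

    Assumption: the agent reports a confidence in [0, 1] for each answer.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Group (confidence, correctness) pairs into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece
```

Computing both metrics once on the English corpus and once on a translated corpus, with the same questions, gives the kind of paired comparison the summary describes.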

Key takeaway

AI architects and NLP engineers designing deep research agents for global applications should recognize that cross-lingual performance is not merely a retrieval problem: both retrieval and agent-side evidence integration are bottlenecks. Focus on language-aware agentic search systems that adapt dynamically to the language of the evidence, and invest in multilingual pretraining to strengthen agents' reasoning over diverse linguistic inputs.

Key insights

Cross-lingual deep research agents face dual bottlenecks: retrieval failure and agent-side evidence integration.
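The oracle-retrieval comparison suggests a simple piece of bookkeeping for telling the two bottlenecks apart. The sketch below is an assumption about how one might tabulate it, not the paper's procedure: the function name and the four inputs are hypothetical, and the subtraction is a rough accounting rather than a causal decomposition.

```python
from typing import Dict


def bottleneck_gaps(
    acc_en_retrieved: float,
    acc_xl_retrieved: float,
    acc_en_oracle: float,
    acc_xl_oracle: float,
) -> Dict[str, float]:
    """Compare English vs. cross-lingual accuracy under real and oracle retrieval.

    All four inputs are accuracies in [0, 1] that the caller has measured:
      *_retrieved : the agent searches with a real retriever
      *_oracle    : the agent is handed the gold evidence directly
    """
    end_to_end_gap = acc_en_retrieved - acc_xl_retrieved
    # With retrieval taken out of the picture, any remaining gap is agent-side.
    oracle_gap = acc_en_oracle - acc_xl_oracle
    return {
        "end_to_end_gap": end_to_end_gap,
        "oracle_gap": oracle_gap,
        # Part of the end-to-end gap that is not already present under oracle retrieval.
        "gap_beyond_oracle": end_to_end_gap - oracle_gap,
    }
```

A nonzero `oracle_gap` corresponds to the agent-side integration difficulty; a large `gap_beyond_oracle` would be consistent with retrieval failure dominating, as the summary reports for low-resource languages.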

Principles

Language mismatch significantly degrades agent accuracy and evidence recall.
Low-resource language penalties primarily stem from retrieval failures.
English serves as the agent's "native language" for instruction following.

Method

XBCP extends BrowseComp-Plus by translating its evidence corpus into 12 languages for cross-lingual and multilingual evaluation of deep research agents.
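A minimal sketch of that construction, assuming a caller-supplied `translate(text, target_lang)` function (hypothetical; the translation system used for XBCP is not described here). Document IDs are kept unchanged so that existing relevance judgments still apply to the translated corpus, which is also an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Doc:
    doc_id: str
    lang: str
    text: str


def build_cross_lingual_corpus(
    english_docs: List[Doc],
    target_lang: str,
    translate: Callable[[str, str], str],
) -> List[Doc]:
    """Translate every evidence document into target_lang.

    Queries and answers are left untouched (English); only the corpus changes.
    Keeping doc_id stable lets the original gold-evidence labels be reused.
    """
    return [
        Doc(doc_id=doc.doc_id, lang=target_lang, text=translate(doc.text, target_lang))
        for doc in english_docs
    ]
```

Running the same English questions against each translated corpus then isolates the effect of the evidence language. Any language code passed as `target_lang` is illustrative, since the 12 languages are not listed in this summary.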

In practice

Evaluate cross-lingual retrievers within iterative agent search loops.
Utilize agent reasoning traces for query expansion to improve retrieval.
Prioritize stronger multilingual pretraining for agents over prompt translation.
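The first two recommendations can be combined in a toy harness: run the retriever inside a multi-turn loop, expand each new query with terms from the agent's reasoning trace, and track cumulative evidence recall per turn. Everything here is an illustrative assumption: the term-overlap retriever stands in for BM25 or a dense retriever, `propose_trace` is a hypothetical callable standing in for the agent, and the expansion heuristic is deliberately naive.

```python
from typing import Callable, Dict, List, Set


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def overlap_retrieve(query: str, corpus: Dict[str, str], k: int = 2) -> List[str]:
    """Toy retriever: rank documents by number of shared tokens with the query."""
    q_terms = set(tokenize(query))
    scored = [(len(q_terms & set(tokenize(text))), doc_id) for doc_id, text in corpus.items()]
    scored = [(score, doc_id) for score, doc_id in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, ties by id
    return [doc_id for _, doc_id in scored[:k]]


def expand_query(query: str, trace: str, max_new_terms: int = 5) -> str:
    """Append up to max_new_terms unseen tokens (longer than 3 chars) from the trace."""
    seen = set(tokenize(query))
    extra: List[str] = []
    for tok in tokenize(trace):
        if len(extra) == max_new_terms:
            break
        if tok not in seen and len(tok) > 3:
            seen.add(tok)
            extra.append(tok)
    return " ".join([query] + extra)


def run_search_loop(
    question: str,
    corpus: Dict[str, str],
    gold_ids: Set[str],
    propose_trace: Callable[[str, List[str]], str],
    max_turns: int = 3,
    k: int = 2,
) -> List[float]:
    """Return cumulative evidence recall after each search turn."""
    if not gold_ids:
        raise ValueError("gold_ids must not be empty")
    found: Set[str] = set()
    recall_per_turn: List[float] = []
    query = question
    for _ in range(max_turns):
        hits = overlap_retrieve(query, corpus, k)
        found.update(hits)
        recall_per_turn.append(len(found & gold_ids) / len(gold_ids))
        # The agent reasons over what it has read; its trace seeds the next query.
        trace = propose_trace(question, [corpus[h] for h in hits])
        query = expand_query(question, trace)
    return recall_per_turn
```

Swapping `overlap_retrieve` for a real multilingual retriever and `propose_trace` for an actual agent call keeps the measurement the same, so per-turn recall can be compared across evidence languages.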

Topics

Deep Research Agents
Cross-lingual Retrieval
Multilingual LLMs
Benchmarking
Evidence Integration
Information Retrieval

Code references

Alibaba-NLP/DeepResearch

Best for: Research Scientist, AI Engineer, Machine Learning Engineer, AI Scientist, NLP Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.