The BEST Deep Research AI is ...
Summary
Peking University's new "Deep Web Benchmark," published May 2026, evaluates AI agents on complex deep web research. Its tasks require evidence gathered across many sources and long-horizon derivation. Structured as an 8x8 matrix, the benchmark assesses four capability families: retrieval, multi-step derivation, cross-source conflict resolution (calibration), and reasoning. Initial results show OpenAI GPT-5.5 (run through Codex CLI) and Claude Opus 4.7 tied for the highest overall score at 31.84%. DeepSeek v4 Pro and GLM 5.1 surprisingly outperformed Claude Sonnet 4.6. The key finding is that retrieval is not the primary bottleneck: derivation accuracy and calibration behavior account for nearly 70% of failures. Performance also varies sharply from task to task, with even a top model like Claude Opus 4.7 ranging from 3.9% to 85% success. Weaker models show higher hallucination rates.
Key takeaway
AI Scientists and ML Engineers developing or deploying deep research agents should recognize that retrieval is rarely the bottleneck. Prioritize multi-step derivation accuracy and cross-source calibration behavior instead, since these account for nearly 70% of failures. Because per-task performance varies so widely, run critical queries multiple times (e.g., 10-100) to average out statistical fluctuations. A single run, even from a top model like Claude Opus 4.7 or GPT-5.5, is weak evidence that the answer is reliable.
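The advice to repeat a query 10-100 times can be made concrete with a little arithmetic. The sketch below is not part of the benchmark. It assumes each run can be scored as simply correct or incorrect, and it uses the standard Wilson score interval to show how uncertain a measured success rate remains after a given number of runs. The function name `wilson_interval` and the example counts are illustrative assumptions.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n_runs: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a success rate (z=1.96 gives about 95% coverage).

    Assumes runs are independent and each run is scored pass/fail.
    """
    if n_runs <= 0:
        raise ValueError("n_runs must be positive")
    if not 0 <= successes <= n_runs:
        raise ValueError("successes must be between 0 and n_runs")
    p = successes / n_runs
    denom = 1 + z * z / n_runs
    center = (p + z * z / (2 * n_runs)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n_runs + z * z / (4 * n_runs ** 2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Hypothetical counts: the same 50% observed success rate at two sample sizes.
    for successes, n_runs in [(5, 10), (50, 100)]:
        low, high = wilson_interval(successes, n_runs)
        print(f"{successes}/{n_runs} correct -> plausible rate {low:.2f} to {high:.2f}")
```

With 5 correct answers out of 10 runs, the plausible success rate spans roughly 0.24 to 0.76. With 50 out of 100 it narrows to roughly 0.40 to 0.60, which is why a handful of runs says little about how an agent behaves on a given task.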
Key insights
AI deep research performance is bottlenecked by multi-step derivation and cross-source calibration, not retrieval.
Principles
- Retrieval is not the bottleneck for deep web research.
- Derivation and calibration account for nearly 70% of agent failures on the benchmark.
- Model performance varies widely per task.
Method
The Deep Web Benchmark assesses AI agents using an 8x8 matrix across four capability families: retrieval, multi-step derivation, cross-source calibration, and complex reasoning.
In practice
- Run critical agent queries multiple times (e.g., 10-100) and compare the answers rather than trusting a single run.
- Focus on improving derivation and calibration in AI agents.
- Select models based on specific task specializations.
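One way to act on the first point above is to wrap the agent call in a small repeat-and-aggregate helper. This is a minimal sketch under stated assumptions, not a method described by the benchmark: `agent` stands for any callable that takes a query string and returns an answer string, answers are compared after simple lowercasing and whitespace stripping, and `flaky_agent` is a made-up stand-in whose 70% hit rate is purely illustrative.

```python
import random
from collections import Counter
from typing import Callable, Dict, Union


def run_repeated(agent: Callable[[str], str], query: str,
                 n_runs: int = 10) -> Dict[str, Union[str, float, int]]:
    """Call `agent` n_runs times on one query and summarize how often it agrees with itself."""
    if n_runs < 1:
        raise ValueError("n_runs must be at least 1")
    # Normalize lightly so trivial formatting differences do not count as disagreement.
    answers = [agent(query).strip().lower() for _ in range(n_runs)]
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    return {
        "answer": top_answer,                # most frequent answer (majority vote)
        "agreement": top_count / n_runs,     # share of runs that gave that answer
        "distinct_answers": len(counts),     # how many different answers appeared
    }


def flaky_agent(query: str) -> str:
    """Hypothetical stand-in for a real agent call; the 70% rate is invented."""
    return "42" if random.random() < 0.7 else random.choice(["41", "43"])


if __name__ == "__main__":
    print(run_repeated(flaky_agent, "example query", n_runs=50))
```

A low agreement score or many distinct answers is a signal to treat the result with caution. Note that high agreement alone does not prove correctness, since an agent can be consistently wrong, so exact string matching would likely need to be replaced by a task-appropriate comparison in real use.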
Topics
- Deep Research AI
- AI Benchmarking
- LLM Agents
- Model Context Protocol
- Hallucination Resistance
- Derivation Accuracy
Best for: AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Discover AI.