The BEST Deep Research AI is ...

2026-05-24 · Source: Discover AI · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, long

Summary

Peking University's "Deep Web Benchmark," published in May 2026, evaluates AI agents on complex deep web research tasks that demand extensive cross-source evidence and long-horizon derivation. The benchmark is structured as an 8x8 matrix and assesses four capability families: retrieval, multi-step derivation, cross-source conflict resolution (calibration), and reasoning. In the initial results, OpenAI GPT-5.5 (run through Codex CLI) and Claude Opus 4.7 tied for the highest overall score at 31.84%, and DeepSeek v4 Pro and GLM 5.1 surprisingly outperformed Claude Sonnet 4.6. The key finding is that retrieval is not the primary bottleneck: derivation accuracy and calibration behavior together account for nearly 70% of failures. Performance also varies sharply from task to task, with even a top model like Claude Opus 4.7 ranging from 3.9% to 85% success. Weaker models show higher hallucination rates.

Key takeaway

AI scientists and ML engineers who develop or deploy deep research agents should recognize that retrieval is rarely the bottleneck. The larger gains lie in multi-step derivation accuracy and cross-source calibration behavior, which together account for nearly 70% of failures. Because per-task performance varies so widely, run critical queries multiple times (e.g., 10-100 runs) to smooth out statistical fluctuations, even with top models like Claude Opus 4.7 or GPT-5.5.

Key insights

AI deep research performance is bottlenecked by multi-step derivation and cross-source calibration, not retrieval.

Principles

Retrieval is not the bottleneck for deep web research.
Derivation and calibration errors account for nearly 70% of agent failures.
Model performance varies widely per task.

Method

The Deep Web Benchmark assesses AI agents using an 8x8 matrix across four capability families: retrieval, multi-step derivation, cross-source calibration, and complex reasoning.

In practice

Run critical agent queries multiple times (10-100 runs) and compare the results, since single runs are unreliable.
Focus engineering effort on derivation and calibration rather than retrieval.
Select models by their per-task strengths, not by overall score alone.
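
The advice to repeat queries can be sketched in a few lines of Python. This is a minimal illustration, not anything taken from the benchmark: `run_once` stands in for whatever call invokes your agent, and `repeat_query`, `wilson_interval`, and the stub agent are hypothetical names introduced here. The sketch counts how often each answer appears across runs and, when answers can be graded, puts a Wilson score interval around the observed success rate.

```python
import math
import random
from collections import Counter
from typing import Callable, Dict, Tuple


def repeat_query(run_once: Callable[[], str], n_runs: int = 20) -> Dict:
    """Call the agent n_runs times and summarize how much the answers agree.

    run_once is a placeholder for your own agent invocation (an assumption).
    """
    if n_runs < 1:
        raise ValueError("n_runs must be at least 1")
    answers = [run_once() for _ in range(n_runs)]
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    return {
        "answer": top_answer,              # most frequent answer
        "agreement": top_count / n_runs,   # share of runs that gave it
        "distinct_answers": len(counts),
        "counts": dict(counts),
    }


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a success rate (z = 1.96 is roughly 95%)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Stub agent for demonstration only: answers "42" about 70% of the time.
    rng = random.Random(0)
    stub_agent = lambda: "42" if rng.random() < 0.7 else "41"

    summary = repeat_query(stub_agent, n_runs=50)
    print(summary)

    # If the correct answer is known, bound the success rate.
    correct = summary["counts"].get("42", 0)
    print(wilson_interval(correct, 50))
```

Low agreement or a wide interval suggests the query needs more runs, or that a single run's answer should not be trusted on its own.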

Topics

Deep Research AI
AI Benchmarking
LLM Agents
Model Context Protocol
Hallucination Resistance
Derivation Accuracy

Best for: AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Discover AI.