LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling
Summary
LoHoSearch is a new benchmark for long-horizon search agents, designed to break through the difficulty ceiling of human-authored benchmarks such as BrowseComp, which models quickly saturated (accuracy above 90%). Developed by Meituan, LoHoSearch comprises 544 human-verified questions across 11 domains, constructed by an automated pipeline that draws on a knowledge graph of over 7 million Wikipedia entities. The pipeline systematically maximizes both search space size and structural complexity. In evaluations, even the strongest model, GPT-5.5, reaches only 34.74% accuracy on LoHoSearch. Existing context management strategies yield only a 6.8% improvement, significantly less than on prior benchmarks, and correct trajectories require 1.7x more tool calls than on BrowseComp.
Key takeaway
If you are an AI engineer building search agents or a director of AI/ML evaluating agent performance, saturated benchmarks like BrowseComp no longer give you a discriminative measure of capability. Prioritize research into novel context management and reasoning architectures, because existing strategies offer minimal gains on complex, long-horizon tasks. Relying on older benchmarks will inflate performance estimates and slow progress on long-horizon agent capability.
Key insights
Automated, knowledge graph-driven benchmark generation creates significantly harder search tasks for advanced AI agents.
Principles
- Human-authored benchmarks inherently limit difficulty.
- Search difficulty increases with search space size.
- Structural complexity prevents problem decomposition.
Method
An automated pipeline constructs a knowledge graph, samples structurally complex subgraphs, generates natural language questions, and verifies uniqueness and difficulty.
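The uniqueness check in the method above can be illustrated with a toy example. This is a minimal sketch, not the authors' pipeline: the triples, entity names, and the helper functions `build_index`, `candidates`, and `is_unique` are all hypothetical, and it only covers the uniqueness step (not subgraph sampling, question generation, or difficulty verification). It assumes a question can be modeled as a set of (relation, object) constraints on a hidden answer entity.

```python
from collections import defaultdict

# Toy (subject, relation, object) triples; every name here is hypothetical.
TRIPLES = [
    ("A", "born_in", "X"),
    ("A", "field", "physics"),
    ("B", "born_in", "X"),
    ("B", "field", "chemistry"),
    ("C", "born_in", "Y"),
    ("C", "field", "physics"),
]

def build_index(triples):
    """Map each subject to the set of (relation, object) facts about it."""
    index = defaultdict(set)
    for subj, rel, obj in triples:
        index[subj].add((rel, obj))
    return index

def candidates(index, constraints):
    """Return, in sorted order, the entities that satisfy every constraint."""
    required = set(constraints)
    return sorted(e for e, facts in index.items() if required <= facts)

def is_unique(index, constraints, answer):
    """Keep a question only if its constraints single out exactly the answer."""
    return candidates(index, constraints) == [answer]

index = build_index(TRIPLES)
loose = [("born_in", "X")]                        # matches both A and B
tight = [("born_in", "X"), ("field", "physics")]  # matches only A
```

In this toy graph, the `loose` constraint set would be rejected because two entities satisfy it, while `tight` would pass. A real pipeline over millions of entities would need a far more scalable index than an in-memory dictionary.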
In practice
- Evaluate search agents on LoHoSearch for a more discriminative measure of their capabilities.
- Develop novel context management strategies.
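For the evaluation step, a per-domain accuracy breakdown is one simple way to see where an agent fails. The sketch below is an assumption-laden illustration, not the benchmark's official scorer: the record schema (`domain`, `gold`, `predicted`), the domain names, and the normalized exact-match rule are all hypothetical, and the actual benchmark may grade answers differently.

```python
from collections import defaultdict

def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())

def accuracy_by_domain(records):
    """Compute exact-match accuracy per domain.

    records: iterable of dicts with 'domain', 'gold', and 'predicted' keys
    (a hypothetical schema chosen for this sketch).
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in records:
        totals[r["domain"]] += 1
        if normalize(r["predicted"]) == normalize(r["gold"]):
            correct[r["domain"]] += 1
    return {d: correct[d] / totals[d] for d in totals}

# Made-up records for illustration only; these are not benchmark questions.
records = [
    {"domain": "history", "gold": "Marie Curie", "predicted": "marie  curie"},
    {"domain": "history", "gold": "Niels Bohr", "predicted": "Max Planck"},
    {"domain": "sports", "gold": "1998", "predicted": "1998"},
]
scores = accuracy_by_domain(records)
```

Reporting per-domain scores alongside the overall number makes it easier to tell whether a context management change helps broadly or only in a few domains.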
Topics
- LoHoSearch
- Search Agents
- Benchmark Design
- Knowledge Graphs
- Context Management
- Long-Horizon Reasoning
Best for: Research Scientist, AI Scientist, AI Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.