LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling
Summary
LoHoSearch is a new benchmark for long-horizon search agents that addresses the saturation of existing benchmarks such as BrowseComp, where top models exceed 90% accuracy. Prior human-authored benchmarks face a difficulty ceiling because annotators cannot systematically maximize search space size and structural complexity. LoHoSearch comprises 544 human-verified questions across 11 domains, constructed by an automated pipeline built on a knowledge graph (KG) of over 7 million Wikipedia entities. The pipeline selects relations with large search spaces and assembles structurally complex questions whose unique answers are verified against the KG. Initial evaluations show that the best reported accuracy on LoHoSearch is only 34.74%, and that existing context management strategies yield much smaller gains (at most +6.8%) than they do on previous benchmarks. The benchmark therefore sets a more demanding standard for evaluating long-horizon reasoning and context management in search agents.
Key takeaway
Machine learning engineers developing or evaluating advanced search agents should treat traditional benchmarks like BrowseComp as saturated: top models exceed 90% accuracy there, yet the best reported accuracy on LoHoSearch is only 34.74%. Prioritize agents capable of robust long-horizon reasoning and more effective context management, since existing context management strategies gain at most +6.8% on this new, more complex benchmark.
Key insights
Human authoring caps benchmark difficulty, so challenging long-horizon search tasks call for automated, knowledge-graph-driven generation.
Principles
- Human-authored benchmarks hit a difficulty ceiling.
- Automated generation can create complex search spaces.
- Knowledge graphs enable systematic question construction.
Method
LoHoSearch uses an automated pipeline built on a knowledge graph of 7M+ Wikipedia entities to select relations with large search spaces and assemble complex questions with KG-verified unique answers.
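The two ideas in this description, measuring a relation's search space and checking that a combined question has exactly one answer in the KG, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy triples, the entity names, and the helper functions (`build_index`, `rank_constraints`, `answers`, `has_unique_answer`) are hypothetical and are not taken from the LoHoSearch pipeline, which operates on a much larger graph.

```python
from collections import defaultdict

# Hypothetical toy knowledge graph: (subject, relation, object) triples.
TRIPLES = [
    ("person_a", "born_in", "city_x"),
    ("person_b", "born_in", "city_x"),
    ("person_c", "born_in", "city_x"),
    ("person_a", "occupation", "mathematician"),
    ("person_b", "occupation", "mathematician"),
    ("person_c", "occupation", "painter"),
    ("person_a", "award", "medal_1"),
    ("person_b", "award", "medal_2"),
]


def build_index(triples):
    """Map each (relation, object) constraint to the set of subjects satisfying it."""
    index = defaultdict(set)
    for subject, relation, obj in triples:
        index[(relation, obj)].add(subject)
    return index


def rank_constraints(index):
    """Order constraints by search space size (candidate count), largest first."""
    return sorted(index, key=lambda c: (-len(index[c]), c))


def answers(index, constraints):
    """Entities that satisfy every constraint in the question."""
    candidate_sets = [index.get(c, set()) for c in constraints]
    return set.intersection(*candidate_sets) if candidate_sets else set()


def has_unique_answer(index, constraints):
    """KG-side check that the assembled question has exactly one answer."""
    return len(answers(index, constraints)) == 1


index = build_index(TRIPLES)
question = [
    ("born_in", "city_x"),            # large search space: 3 candidates
    ("occupation", "mathematician"),  # narrows to 2 candidates
    ("award", "medal_1"),             # narrows to 1 candidate
]
print(rank_constraints(index)[0])          # constraint with the largest search space
print(has_unique_answer(index, question))  # whether the conjunction is unambiguous
```

In this sketch each constraint on its own leaves several candidates, so an agent cannot answer from a single lookup, while the conjunction still resolves to one entity. That is the property the summary describes as a large search space with a KG-verified unique answer.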
In practice
- Evaluate search agents on complex, multi-hop queries.
- Test context management strategies rigorously.
- Develop new agents for long-horizon reasoning.
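A context management strategy can be compared against a baseline by scoring both runs on the same question set and reporting the difference in percentage points. The sketch below is hedged: it assumes normalized exact-match grading, which may not be how LoHoSearch grades answers, and the question IDs, answers, and function names are hypothetical.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def accuracy(predictions, gold):
    """Fraction of gold questions answered correctly; missing predictions count as wrong."""
    if not gold:
        raise ValueError("gold answer set is empty")
    correct = sum(
        normalize(predictions.get(qid, "")) == normalize(answer)
        for qid, answer in gold.items()
    )
    return correct / len(gold)


def gain_in_points(baseline_acc, strategy_acc):
    """Accuracy gain of a strategy over the baseline, in percentage points."""
    return (strategy_acc - baseline_acc) * 100


# Hypothetical data: gold answers and two agent runs over the same questions.
gold = {"q1": "Paris", "q2": "Marie Curie", "q3": "1969", "q4": "Nile"}
baseline_run = {"q1": "paris", "q2": "Pierre Curie"}
strategy_run = {"q1": " Paris ", "q2": "marie  curie", "q3": "1970"}

base = accuracy(baseline_run, gold)
managed = accuracy(strategy_run, gold)
print(f"baseline={base:.2%} strategy={managed:.2%} gain={gain_in_points(base, managed):+.1f} points")
```

Counting unanswered questions as wrong keeps the denominator fixed across runs, so a strategy cannot look better simply by skipping hard questions.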
Topics
- Long-Horizon Search
- Search Agents
- Benchmarking
- Knowledge Graphs
- Context Management
- Wikipedia Entities
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.