LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling

Source: Computation and Language · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Natural Language Processing, Data Science & Analytics) · Depth: Expert, quick

Summary

LoHoSearch is a new benchmark for long-horizon search agents that addresses the saturation of existing benchmarks such as BrowseComp, where top models exceed 90% accuracy. Human-authored benchmarks face a difficulty ceiling because annotators cannot systematically maximize search space size and structural complexity. LoHoSearch comprises 544 human-verified questions across 11 domains, constructed by an automated pipeline built on a knowledge graph (KG) of over 7 million Wikipedia entities. The pipeline selects relations with large search spaces and assembles structurally complex questions whose unique answers are verified against the KG. In initial evaluations, even the strongest models reach only 34.74% accuracy on LoHoSearch, and existing context management strategies yield much smaller gains than on previous benchmarks (at most +6.8%). The benchmark therefore sets a more demanding standard for evaluating long-horizon reasoning and context management in search agents.

Key takeaway

For Machine Learning Engineers developing or evaluating advanced search agents: traditional benchmarks like BrowseComp are no longer sufficient, since top models already exceed 90% accuracy on them. LoHoSearch reveals significant gaps, with the strongest models achieving only 34.74%. Prioritize agents capable of robust long-horizon reasoning and more effective context management, because current context management strategies improve results by at most +6.8% on this benchmark.

Key insights

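Human-authored benchmarks hit a difficulty ceiling because annotators cannot systematically maximize search space size and structural complexity; challenging long-horizon search tasks therefore call for automated, knowledge-graph-driven generation.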

Method

LoHoSearch uses an automated pipeline built on a knowledge graph of 7M+ Wikipedia entities to select relations with large search spaces and assemble complex questions with KG-verified unique answers.
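
The two ideas in this step, preferring relations with large search spaces and checking that a composed question has exactly one answer in the KG, can be illustrated with a minimal sketch. It assumes a toy in-memory triple store; the summary does not describe the actual pipeline, so the triples, the function names (`build_index`, `relation_search_space`, `answers`, `has_unique_answer`), and the fan-out heuristic used as a stand-in for "search space size" are all hypothetical and not the authors' implementation.

```python
from collections import defaultdict

# Toy knowledge graph as (subject, relation, object) triples.
# Every entity and relation name here is invented for illustration.
TRIPLES = [
    ("film_a", "directed_by", "dir_x"),
    ("film_b", "directed_by", "dir_x"),
    ("film_c", "directed_by", "dir_y"),
    ("film_a", "released_in", "1999"),
    ("film_b", "released_in", "2004"),
    ("film_c", "released_in", "1999"),
    ("film_a", "country", "fr"),
    ("film_b", "country", "fr"),
    ("film_c", "country", "fr"),
]


def build_index(triples):
    """Map each (relation, object) pair to the set of subjects that have it."""
    index = defaultdict(set)
    for subject, relation, obj in triples:
        index[(relation, obj)].add(subject)
    return index


def relation_search_space(index):
    """Assumed proxy for search space size: the mean number of candidate
    subjects per (relation, object) pair, computed for each relation."""
    sizes = defaultdict(list)
    for (relation, _obj), subjects in index.items():
        sizes[relation].append(len(subjects))
    return {rel: sum(counts) / len(counts) for rel, counts in sizes.items()}


def answers(index, constraints):
    """Return the entities that satisfy every (relation, object) constraint."""
    candidate_sets = [index.get(c, set()) for c in constraints]
    return set.intersection(*candidate_sets) if candidate_sets else set()


def has_unique_answer(index, constraints):
    """A composed question is kept only if the KG yields exactly one answer."""
    return len(answers(index, constraints)) == 1


if __name__ == "__main__":
    index = build_index(TRIPLES)
    # Relations ranked by the assumed fan-out proxy, largest first.
    ranked = sorted(relation_search_space(index).items(), key=lambda kv: -kv[1])
    print(ranked)
    question = [("country", "fr"), ("released_in", "1999"), ("directed_by", "dir_x")]
    print(has_unique_answer(index, question), answers(index, question))
```

In this toy graph, a high fan-out constraint such as `("country", "fr")` leaves many candidates on its own, so a solver has to combine it with further constraints before the answer set narrows to a single entity. That is the intuition behind pairing large search spaces with a uniqueness check, under the assumptions stated above.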

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.