LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling

2026-06-11 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing, Data Science & Analytics · Depth: Expert, quick

Summary

LoHoSearch is a new benchmark for long-horizon search agents that addresses the saturation of existing benchmarks such as BrowseComp, where top models exceed 90% accuracy. Human-authored benchmarks hit a difficulty ceiling because annotators cannot systematically maximize search space size and structural complexity. LoHoSearch comprises 544 human-verified questions across 11 domains, constructed by an automated pipeline built on a knowledge graph (KG) of over 7 million Wikipedia entities. The pipeline selects relations with large search spaces and assembles structurally complex questions whose answers are verified against the KG to be unique. In initial evaluations, even the strongest models reach only 34.74% accuracy on LoHoSearch, and existing context management strategies yield much smaller gains (at best +6.8%) than they do on previous benchmarks. The benchmark therefore sets a more demanding standard for evaluating long-horizon reasoning and context management in search agents.

Key takeaway

Machine Learning Engineers developing or evaluating advanced search agents should recognize that established benchmarks like BrowseComp are saturated: top models exceed 90% accuracy on them, yet reach only 34.74% on LoHoSearch. Prioritize agents capable of robust long-horizon reasoning and more effective context management, since existing context management strategies yield limited gains (at best +6.8%) on this more complex benchmark.

Key insights

Human-authored benchmarks limit complexity, necessitating automated knowledge graph-driven generation for challenging long-horizon search tasks.

Principles

Human-authored benchmarks hit a difficulty ceiling.
Automated generation can create complex search spaces.
Knowledge graphs enable systematic question construction.

Method

LoHoSearch uses an automated pipeline built on a knowledge graph of 7M+ Wikipedia entities to select relations with large search spaces and assemble complex questions with KG-verified unique answers.
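
The two ideas in this step, preferring relations with large search spaces and checking that a question has exactly one answer in the KG, can be illustrated with a toy sketch. This is not the authors' pipeline: the triples, the fan-out proxy for "search space size", and every function name below are assumptions made up for illustration.

```python
from collections import defaultdict

# Toy (subject, relation, object) triples; all entities are invented.
TRIPLES = [
    ("Film_A", "directed_by", "Director_X"),
    ("Film_B", "directed_by", "Director_X"),
    ("Film_C", "directed_by", "Director_Y"),
    ("Film_A", "released_in", "1999"),
    ("Film_B", "released_in", "2004"),
    ("Film_C", "released_in", "1999"),
    ("Film_A", "country", "France"),
    ("Film_B", "country", "France"),
    ("Film_C", "country", "France"),
]


def build_index(triples):
    """Map each (relation, object) pair to the set of matching subjects."""
    index = defaultdict(set)
    for subject, relation, obj in triples:
        index[(relation, obj)].add(subject)
    return index


def search_space_size(index, relation):
    """Assumed proxy: the largest candidate set any value of `relation` yields."""
    sizes = [len(subjects) for (r, _), subjects in index.items() if r == relation]
    return max(sizes, default=0)


def rank_relations(index):
    """Order relations from largest to smallest search space (ties by name)."""
    relations = {r for r, _ in index}
    return sorted(relations, key=lambda r: (-search_space_size(index, r), r))


def answers(index, constraints):
    """Entities that satisfy every (relation, object) constraint."""
    candidate_sets = [index.get(c, set()) for c in constraints]
    return set.intersection(*candidate_sets) if candidate_sets else set()


def has_unique_answer(index, constraints):
    """A question is kept only if exactly one entity satisfies it."""
    return len(answers(index, constraints)) == 1


if __name__ == "__main__":
    idx = build_index(TRIPLES)
    print(rank_relations(idx))
    question = [
        ("country", "France"),
        ("directed_by", "Director_X"),
        ("released_in", "1999"),
    ]
    print(answers(idx, question), has_unique_answer(idx, question))
```

In this sketch a single constraint such as `("country", "France")` leaves several candidates, so it would be rejected; constraints are stacked until the intersection narrows to one entity. How LoHoSearch actually measures search space size or structural complexity is not specified in this summary.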

In practice

Evaluate search agents on complex, multi-hop queries.
Test context management strategies rigorously.
Develop new agents for long-horizon reasoning.
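
A minimal sketch of how the points above might be measured: score an agent with and without a context management strategy and report the difference. Exact-match scoring, the normalization rule, and all names here are assumptions; the summary does not state how LoHoSearch grades answers or whether the reported +6.8% is an absolute or a relative gain.

```python
def normalize(answer):
    """Assumed normalization: case-insensitive, surrounding whitespace ignored."""
    return answer.strip().lower()


def accuracy(predictions, gold):
    """Fraction of gold questions answered correctly (exact match, assumed)."""
    if not gold:
        raise ValueError("gold answer set is empty")
    correct = sum(
        normalize(predictions.get(qid, "")) == normalize(answer)
        for qid, answer in gold.items()
    )
    return correct / len(gold)


def gain_points(baseline_acc, strategy_acc):
    """Absolute gain in percentage points of a strategy over the baseline."""
    return (strategy_acc - baseline_acc) * 100


if __name__ == "__main__":
    # Hypothetical question ids and answers, for illustration only.
    gold = {"q1": "Paris", "q2": "1999", "q3": "Film_A", "q4": "Director_X"}
    baseline = {"q1": "paris", "q2": "2004", "q3": "Film_B"}
    with_strategy = {"q1": "Paris", "q2": "1999", "q3": "Film_B", "q4": ""}
    base_acc = accuracy(baseline, gold)
    strat_acc = accuracy(with_strategy, gold)
    print(base_acc, strat_acc, gain_points(base_acc, strat_acc))
```

Unanswered questions count as wrong here, which keeps the denominator fixed across runs so that baseline and strategy scores stay comparable.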

Topics

Long-Horizon Search
Search Agents
Benchmarking
Knowledge Graphs
Context Management
Wikipedia Entities

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.