LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling

2026-06-18 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Data Science & Analytics · Depth: Expert, extended

Summary

LoHoSearch is a new benchmark for long-horizon search agents, designed to overcome the difficulty ceiling of human-authored benchmarks such as BrowseComp, on which models have rapidly saturated (exceeding 90% accuracy). Developed by Meituan, LoHoSearch comprises 544 human-verified questions across 11 domains, constructed by an automated pipeline that draws on a knowledge graph of over 7 million Wikipedia entities. The pipeline systematically maximizes both search space size and structural complexity. In the reported evaluations, even the strongest model, GPT-5.5, reaches only 34.74% accuracy on LoHoSearch. Existing context management strategies yield only a 6.8% improvement, significantly less than on prior benchmarks, and correct trajectories require 1.7x more tool calls than on BrowseComp.

Key takeaway

For AI Engineers developing search agents and Directors of AI/ML evaluating agent performance: benchmarks like BrowseComp no longer discriminate between strong agents. Prioritize research into novel context management and reasoning architectures, since existing strategies offer only limited gains on complex, long-horizon tasks. Relying on older, saturated benchmarks will inflate performance estimates and slow progress on agentic search.

Key insights

Automated, knowledge graph-driven benchmark generation creates significantly harder search tasks for advanced AI agents.

Principles

Human-authored benchmarks inherently limit difficulty.
Search difficulty increases with search space size.
Structural complexity prevents problem decomposition.
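
One way to see why the last two principles matter is a toy cost model. The sketch below is an illustration only, not the paper's difficulty measure: the function names and the candidate counts are hypothetical. It assumes that a decomposable question can be resolved one constraint at a time (costs add), while an entangled question forces the agent to consider combinations of candidates jointly (costs multiply).

```python
import math

def decomposable_cost(candidates_per_step):
    """Toy model: each constraint is resolved independently, so costs add."""
    return sum(candidates_per_step)

def entangled_cost(candidates_per_step):
    """Toy model: constraints must be satisfied jointly, so costs multiply."""
    return math.prod(candidates_per_step)

# Hypothetical question with three constraints and 50 candidates for each.
steps = [50, 50, 50]
print(decomposable_cost(steps))  # 150
print(entangled_cost(steps))     # 125000
```

Under these assumptions, enlarging the candidate sets raises both costs, but only the entangled case grows multiplicatively, which is the intuition behind combining a large search space with a structure that resists decomposition.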

Method

An automated pipeline runs in four stages: it constructs a knowledge graph, samples structurally complex subgraphs from it, generates a natural language question from each subgraph, and verifies that the answer is unique and the question sufficiently difficult.
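
The pipeline stages can be sketched on a toy graph. This is a minimal illustration under stated assumptions, not the authors' implementation: the triples, entity names, relation names, and all function names are hypothetical, the subgraph is reduced to a flat list of constraints on the answer, and question generation is a fixed template rather than a language model. The difficulty check is omitted.

```python
# Toy knowledge graph as (subject, relation, object) triples.
# All entities and relations are made up for illustration.
TRIPLES = [
    ("A", "born_in", "X"), ("A", "field", "physics"), ("A", "award", "P1"),
    ("B", "born_in", "X"), ("B", "field", "physics"), ("B", "award", "P2"),
    ("C", "born_in", "Y"), ("C", "field", "physics"), ("C", "award", "P1"),
]

def build_index(triples):
    """Map each (relation, object) pair to the set of subjects that have it."""
    index = {}
    for subj, rel, obj in triples:
        index.setdefault((rel, obj), set()).add(subj)
    return index

def candidates(index, constraints):
    """Return every entity that satisfies all (relation, object) constraints."""
    if not constraints:
        return set()
    sets = [index.get(c, set()) for c in constraints]
    return set.intersection(*sets)

def is_unique(index, constraints, answer):
    """Uniqueness check: the constraints must pin down exactly the answer."""
    return candidates(index, constraints) == {answer}

def render_question(constraints):
    """Template stand-in for the question-generation stage."""
    clauses = [f"{rel.replace('_', ' ')} {obj}" for rel, obj in constraints]
    return "Which entity satisfies all of: " + "; ".join(clauses) + "?"

index = build_index(TRIPLES)
loose = [("born_in", "X"), ("field", "physics")]
tight = loose + [("award", "P1")]
print(is_unique(index, loose, "A"))  # False
print(is_unique(index, tight, "A"))  # True
print(render_question(tight))
```

In this toy setup the first constraint set matches both A and B and would be rejected, while adding a third constraint leaves A as the only match. A real pipeline would run the same kind of check against millions of entities, which is where the size of the search space comes from.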

In practice

Evaluate search agents on LoHoSearch to measure capability that saturated benchmarks no longer separate.
Develop novel context management strategies.

Topics

LoHoSearch
Search Agents
Benchmark Design
Knowledge Graphs
Context Management
Long-Horizon Reasoning

Best for: Research Scientist, AI Scientist, AI Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.