LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling

Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Data Science & Analytics · Depth: Expert, extended

Summary

LoHoSearch is a benchmark for long-horizon search agents, designed to break through the difficulty ceiling of human-authored benchmarks such as BrowseComp, on which models saturated quickly (exceeding 90% accuracy). Developed by Meituan, LoHoSearch comprises 544 human-verified questions across 11 domains. The questions are constructed by an automated pipeline built on a knowledge graph of over 7 million Wikipedia entities, and the pipeline systematically maximizes search space size and structural complexity. In the reported evaluations, even the strongest model, GPT-5.5, reaches only 34.74% accuracy on LoHoSearch. Existing context management strategies yield only a 6.8% improvement, significantly less than on prior benchmarks, and correct trajectories require 1.7x more tool calls than on BrowseComp.

Key takeaway

For AI Engineers developing search agents, or Directors of AI/ML evaluating agent performance, saturated benchmarks like BrowseComp no longer discriminate between strong systems. Prioritize research into new context management and reasoning architectures, since existing strategies offer only limited gains on complex, long-horizon tasks. Relying on older benchmarks will inflate performance estimates and hide the gaps that remain in long-horizon search.

Key insights

Automated, knowledge graph-driven benchmark generation creates significantly harder search tasks for advanced AI agents.

Method

An automated pipeline constructs a knowledge graph, samples structurally complex subgraphs, generates natural language questions, and verifies uniqueness and difficulty.
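
The four stages above can be illustrated with a toy sketch. This is not the LoHoSearch pipeline: the triples, the function names, the templated question wording, and the difficulty proxy (every single clue must be ambiguous on its own) are all assumptions made for illustration, and a real system would work over millions of entities and use a language model to phrase the questions.

```python
import random
from collections import defaultdict
from itertools import combinations

# Hypothetical toy knowledge graph as (subject, relation, object) triples.
TRIPLES = [
    ("Ada", "born_in", "Lund"), ("Ada", "field", "chemistry"), ("Ada", "award", "Prize X"),
    ("Bo", "born_in", "Lund"), ("Bo", "field", "physics"), ("Bo", "award", "Prize X"),
    ("Cy", "born_in", "Oslo"), ("Cy", "field", "chemistry"), ("Cy", "award", "Prize Y"),
]


def build_graph(triples):
    """Stage 1: index triples by subject, entity -> set of (relation, object)."""
    graph = defaultdict(set)
    for subj, rel, obj in triples:
        graph[subj].add((rel, obj))
    return graph


def matching_entities(graph, constraints):
    """Return, in sorted order, every entity that satisfies all constraints."""
    wanted = set(constraints)
    return sorted(e for e, facts in graph.items() if wanted <= facts)


def is_unique(graph, constraints, answer):
    """Stage 4a: the constraints together must identify exactly the answer."""
    return matching_entities(graph, constraints) == [answer]


def is_hard(graph, constraints):
    """Stage 4b (assumed proxy): no single clue pins down the answer alone."""
    return all(len(matching_entities(graph, [c])) > 1 for c in constraints)


def render_question(constraints):
    """Stage 3: templated question text that never names the answer."""
    clauses = [f"has {rel.replace('_', ' ')} '{obj}'" for rel, obj in sorted(constraints)]
    return "Which entity " + " and ".join(clauses) + "?"


def generate(graph, answer, k, seed=0):
    """Stage 2 plus verification: try k-fact subgraphs around the answer in
    random order and return the first one that passes both checks."""
    rng = random.Random(seed)
    candidates = list(combinations(sorted(graph[answer]), k))
    rng.shuffle(candidates)
    for constraints in candidates:
        if is_unique(graph, constraints, answer) and is_hard(graph, constraints):
            return render_question(constraints), list(constraints)
    return None  # no subgraph of this size yields a valid question


if __name__ == "__main__":
    kg = build_graph(TRIPLES)
    print(generate(kg, "Ada", k=2))
```

In this toy graph each clue about "Ada" matches two entities, so a question needs at least two clues combined before the answer becomes unique; requesting a single clue (`k=1`) returns `None`. That is the intuition behind growing the search space and the structural complexity at the same time.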

Best for: Research Scientist, AI Scientist, AI Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.