The Amazing Agent Race: Strong Tool Users, Weak Navigators

2026-04-21 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, extended

Summary

The Amazing Agent Race (AAR) is a new benchmark that evaluates the navigation, tool-use, and reasoning capabilities of LLM agents on non-linear tasks structured as directed acyclic graphs (DAGs). Existing benchmarks are predominantly linear (55-100% of instances). AAR instead provides 1,400 instances in two variants, sequential (800 legs) and compositional (600 DAG legs), which require agents to navigate Wikipedia, execute multi-step tool chains, and aggregate results. Legs are procedurally generated from Wikipedia seeds across four difficulty levels and validated against live APIs. In an evaluation of three agent frameworks (Codex CLI, Claude Code, mini-swe-agent), the best agent achieves only 37.2% accuracy. Navigation errors are the dominant failure mode (27-52% of trials), while tool-use errors remain below 17%. Agent architecture proves as critical as model scale: Claude Code matches Codex CLI's performance using 6x fewer tokens.

Key takeaway

For AI Architects and NLP Engineers developing LLM agents, this research shows that navigation, not tool calling, is the main bottleneck, especially in tasks that require discovering information across complex, non-linear paths. Shift your focus from improving tool-calling competence alone to building more robust, targeted retrieval mechanisms and handling compositional arithmetic more reliably. Consider agent architectures that balance exploration depth against computational cost, as Claude Code's comparable performance with fewer tokens suggests, to improve both task accuracy and resource utilization.

Key insights

LLM agents struggle with navigation in complex, non-linear tasks more than with tool execution.

Principles

Compositional task structures amplify navigation challenges.
Agent architecture significantly impacts performance and token efficiency.
Extended internal reasoning can hinder performance in time-constrained agentic tasks.

Method

The AAR benchmark uses an eight-step automated pipeline to generate DAG-structured tasks from Wikipedia seeds, incorporating fork-merge diamond patterns and live API calls, with three decomposed metrics for failure diagnosis.
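To make the fork-merge diamond pattern concrete, here is a minimal sketch of how such a task could be represented and executed in dependency order. The step names, the toy step functions, and the run_dag helper are all hypothetical illustrations, not the benchmark's actual pipeline or data format.

```python
from graphlib import TopologicalSorter

# Hypothetical diamond-shaped leg: one seed step forks into two independent
# lookups whose results merge in a final aggregation step.
# Keys are steps; values are the steps they depend on.
DIAMOND = {
    "seed": set(),
    "branch_a": {"seed"},
    "branch_b": {"seed"},
    "merge": {"branch_a", "branch_b"},
}

# Stand-in step functions (assumptions for illustration only); a real leg
# would navigate pages or call live APIs instead of doing toy arithmetic.
STEPS = {
    "seed": lambda deps: 10,
    "branch_a": lambda deps: deps["seed"] + 5,
    "branch_b": lambda deps: deps["seed"] * 2,
    "merge": lambda deps: deps["branch_a"] + deps["branch_b"],
}


def run_dag(graph, steps):
    """Run each step only after all of its dependencies have produced a value."""
    results = {}
    for node in TopologicalSorter(graph).static_order():
        deps = {d: results[d] for d in graph[node]}
        results[node] = steps[node](deps)
    return results


results = run_dag(DIAMOND, STEPS)
```

In this layout the merge step cannot run until both branches have finished, so a wrong value on either branch carries through to the final answer. That is one plausible reading of why compositional legs are harder than sequential ones.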

In practice

In agent design, prioritize targeted retrieval over increased search volume.
Implement arithmetic verification for final computation steps.
Calibrate exploration depth with adaptive step budgets.
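Two of the recommendations above, arithmetic verification and adaptive step budgets, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names, the tolerance, and the budget constants (base, per_step, cap) are hypothetical and do not come from the paper.

```python
import math


def verify_final_answer(claimed, intermediate_values, combine, rel_tol=1e-9):
    """Recompute the final aggregation from logged intermediate values.

    Returns (matches, recomputed) so a caller can reject or repair an answer
    whose arithmetic disagrees with the values the agent actually gathered.
    """
    recomputed = combine(intermediate_values)
    return math.isclose(claimed, recomputed, rel_tol=rel_tol), recomputed


def step_budget(num_steps, base=5, per_step=3, cap=60):
    """Hypothetical budget rule: allow more agent actions for longer legs, up to a cap."""
    return min(cap, base + per_step * num_steps)


# The agent claims 35 after gathering 15 and 20 and summing them: consistent.
ok, recomputed = verify_final_answer(35, [15, 20], sum)

# The agent claims 36 from the same inputs: flagged as an arithmetic slip.
bad, _ = verify_final_answer(36, [15, 20], sum)

# A four-step leg gets a modest budget; a very long leg is capped.
short_budget = step_budget(4)
long_budget = step_budget(100)
```

The verification step is deliberately independent of the agent's own reasoning: it re-derives the answer only from recorded inputs. The linear budget rule is just one simple way to scale exploration with task size, and a real system might tune it by difficulty level instead.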

Topics

The Amazing Agent Race
LLM Agent Benchmarking
Directed Acyclic Graph
Wikipedia Navigation
Compositional Tool Use

Best for: AI Architect, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.