The Amazing Agent Race: Strong Tool Users, Weak Navigators

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, extended

Summary

The Amazing Agent Race (AAR) is a new benchmark that evaluates the navigation, tool-use, and reasoning capabilities of LLM agents, focusing on non-linear tasks structured as directed acyclic graphs (DAGs). Existing benchmarks are predominantly linear (55-100% of instances). AAR instead offers 1,400 instances in two variants, sequential (800 legs) and compositional (600 DAG legs), which require agents to navigate Wikipedia, execute multi-step tool chains, and aggregate results. Legs are procedurally generated from Wikipedia seeds across four difficulty levels and validated against live APIs. Across three agent frameworks (Codex CLI, Claude Code, mini-swe-agent), the best agent achieves only 37.2% accuracy. Navigation errors are the dominant failure mode (27-52% of trials), while tool-use errors remain below 17%. Agent architecture proves as critical as model scale: Claude Code matches Codex CLI's performance using 6x fewer tokens.

Key takeaway

For AI Architects and NLP Engineers developing LLM agents, this research highlights a critical need to enhance navigation capabilities, especially in tasks requiring information discovery across complex, non-linear paths. Your focus should shift from merely improving tool-calling competence to developing more robust, targeted retrieval mechanisms and better handling of compositional arithmetic. Consider agent architectures that balance exploration depth with computational efficiency, as demonstrated by Claude Code's performance with fewer tokens, to improve overall task accuracy and resource utilization.

Key insights

LLM agents struggle with navigation in complex, non-linear tasks more than with tool execution.

Method

The AAR benchmark uses an eight-step automated pipeline to generate DAG-structured tasks from Wikipedia seeds, incorporating fork-merge diamond patterns and live API calls, with three decomposed metrics for failure diagnosis.
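As a rough illustration of the fork-merge diamond pattern mentioned above, the sketch below models one leg as a small DAG and runs its steps in dependency order. This is not the benchmark's actual pipeline: the step names (`navigate`, `tool_a`, `tool_b`, `aggregate`), the `run_leg` helper, and the toy values are all hypothetical, chosen only to show why the final step cannot run until both branches have finished.

```python
from graphlib import TopologicalSorter

# Hypothetical diamond-shaped leg: one navigation step forks into two
# independent tool steps whose results merge in a final aggregation step.
# Each key maps a step to the set of steps it depends on.
diamond = {
    "navigate": set(),
    "tool_a": {"navigate"},
    "tool_b": {"navigate"},
    "aggregate": {"tool_a", "tool_b"},
}


def run_leg(dag, handlers):
    """Run every step after its dependencies, passing upstream results in."""
    results = {}
    order = list(TopologicalSorter(dag).static_order())
    for step in order:
        inputs = {dep: results[dep] for dep in dag[step]}
        results[step] = handlers[step](inputs)
    return order, results


# Toy handlers standing in for page lookups and tool calls (made-up values).
handlers = {
    "navigate": lambda inputs: {"year": 1969, "population": 1200},
    "tool_a": lambda inputs: inputs["navigate"]["year"] % 100,
    "tool_b": lambda inputs: inputs["navigate"]["population"] // 100,
    "aggregate": lambda inputs: inputs["tool_a"] + inputs["tool_b"],
}

order, results = run_leg(diamond, handlers)
```

In a linear task each step has exactly one predecessor, so a wrong page only derails the steps after it. In the diamond above, an error in `navigate` propagates into both branches and the merged answer, which is one plausible reading of why navigation failures weigh so heavily on compositional legs.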

Best for: AI Architect, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.