GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows

2026-04-21 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, medium

Summary

GTA-2 is a new hierarchical benchmark designed to evaluate General Tool Agents (GTAs) across both atomic tool use and complex, open-ended workflows. Developed by researchers from Shanghai Jiao Tong University, Shanghai AI Laboratory, and Tencent, GTA-2 addresses limitations in existing benchmarks, which often rely on AI-generated queries and dummy tools and involve only limited system coordination. The benchmark comprises two parts: GTA-Atomic, which assesses short-horizon, closed-ended tool-use precision using real user queries and deployed tools, and GTA-Workflow, a novel component for long-horizon, open-ended productivity tasks. GTA-Workflow introduces a recursive checkpoint-based evaluation mechanism that decomposes objectives into verifiable sub-goals, enabling unified assessment of both model capabilities and agent execution frameworks. Initial experiments reveal a significant performance gap: frontier models struggle on atomic tasks (below 50% success) and largely fail on workflows, achieving only 14.39% success. The results highlight the critical role of execution harness design.

Key takeaway

For AI Architects and Research Scientists developing general-purpose LLM agents, GTA-2's findings indicate that current frontier models are insufficient for complex, real-world workflows. You should prioritize research and development into robust execution harnesses and system-level designs, such as those seen in Manus and OpenClaw, rather than focusing solely on underlying model capacity. Your evaluation strategies should also incorporate checkpoint-guided feedback to better assess long-horizon task completion.

Key insights

GTA-2 benchmarks LLM agents on real-world atomic tool use and complex, open-ended workflows, revealing significant capability gaps.

Principles

Authenticity requires real user queries, deployed tools, and multimodal contexts.
Workflow evaluation needs recursive checkpoint-based sub-goal decomposition.
Workflow completion depends critically on execution harness design, not on model capacity alone.

Method

GTA-2 evaluates agents using GTA-Atomic for short-horizon tasks and GTA-Workflow for long-horizon, open-ended tasks. GTA-Workflow employs a recursive checkpoint-based mechanism to assess deliverables via verifiable sub-goals.
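
The recursive checkpoint idea can be illustrated with a small sketch. This summary does not specify GTA-2's actual scoring rule, so the `Checkpoint` class, the pass/fail flag on leaves, and the unweighted averaging of child scores below are all assumptions made for illustration, not the benchmark's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Checkpoint:
    """Hypothetical node in a checkpoint tree (not GTA-2's actual schema)."""
    name: str
    passed: Optional[bool] = None  # only meaningful on leaf checkpoints
    children: List["Checkpoint"] = field(default_factory=list)


def score(cp: Checkpoint) -> float:
    """Recursively score a checkpoint in [0, 1].

    Leaf: 1.0 if the sub-goal was verified, else 0.0.
    Internal node: unweighted mean of its children's scores (assumed rule).
    """
    if not cp.children:
        return 1.0 if cp.passed else 0.0
    return sum(score(child) for child in cp.children) / len(cp.children)


# Hypothetical workflow objective decomposed into verifiable sub-goals.
report = Checkpoint("quarterly report", children=[
    Checkpoint("data collected", passed=True),
    Checkpoint("analysis", children=[
        Checkpoint("chart generated", passed=True),
        Checkpoint("summary table correct", passed=False),
    ]),
])

print(score(report))
```

In this toy tree the "analysis" branch earns partial credit, so a deliverable that completes some sub-goals scores above zero even when the overall objective is not fully met.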

In practice

Focus on improving agent execution frameworks.
Implement checkpoint-guided feedback in agent designs.
Prioritize real-world data for tool-use training.
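
One way to read "checkpoint-guided feedback" is a retry loop in which the names of unmet checkpoints are passed back to the agent on the next attempt. The sketch below is a hedged illustration under that reading: `run_with_feedback`, the `Checker` type, and `toy_agent` are hypothetical names, and the toy agent stands in for a real LLM-backed agent.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical type: a checker maps an agent output to pass/fail.
Checker = Callable[[str], bool]


def run_with_feedback(
    agent: Callable[[str, List[str]], str],
    task: str,
    checkpoints: Dict[str, Checker],
    max_rounds: int = 3,
) -> Tuple[str, List[str]]:
    """Re-run the agent, feeding back the names of unmet checkpoints.

    Returns the last output and the checkpoints still unmet.
    Assumes max_rounds >= 1.
    """
    feedback: List[str] = []
    output = ""
    for _ in range(max_rounds):
        output = agent(task, feedback)
        feedback = [name for name, check in checkpoints.items()
                    if not check(output)]
        if not feedback:
            break  # every checkpoint verified
    return output, feedback


def toy_agent(task: str, feedback: List[str]) -> str:
    # Stand-in for an LLM agent: it simply adds whatever the feedback names.
    return " ".join(["draft"] + feedback)


checks = {
    "title": lambda out: "title" in out,
    "chart": lambda out: "chart" in out,
}
output, unmet = run_with_feedback(toy_agent, "make a report", checks)
print(output, unmet)
```

The design choice here is that verification lives in the harness rather than in the model, which matches the summary's emphasis on execution frameworks.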

Topics

General Tool Agents
LLM Evaluation
Agent Execution Frameworks
Long-horizon Workflows
Checkpoint-based Evaluation

Code references

open-compass/GTA

Best for: Research Scientist, AI Architect, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.