Do More Agents Help? Controlled and Protocol-Aligned Evaluation of LLM Agent Workflows

2025-04-14 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Software Development & Engineering · Depth: Expert, extended

Summary

BenchAgent is an evaluation framework for comparing LLM agent workflows under normalized conditions. Its results show that multi-agent system (MAS) workflows do not consistently outperform single-agent setups. The framework compared single-agent, fixed MAS, and evolving MAS workflows using GPT-4.1 across ten reasoning, coding, and tool-use benchmarks. Only one of the six tested MAS, EvoAgent, numerically exceeded the single-agent anchor on benchmark-balanced average accuracy (75.56% vs. 74.12%), a marginal +1.44-point gain that falls within the uncertainty of a single run. The other five MAS trailed the anchor by 2.56–11.29 points and showed less favorable accuracy–cost trade-offs. In contrast, a Protocol-Aligned External (PAE) GAIA study of a Claude-Code-style runtime workflow reached 66.72% overall accuracy and 69.23% on Level 3, surpassing the strongest non-Claude baseline (Jarvis, 46.66%) by just over 20 points. This suggests that sophisticated runtime-generated workflows have potential on complex tasks.
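
The "benchmark-balanced average" above can be read as a macro average: each benchmark contributes equally, regardless of how many tasks it contains. The sketch below illustrates that reading and reproduces the +1.44-point gap from the two reported averages. The function names and the toy counts are assumptions made for illustration, not BenchAgent's code or data.

```python
from statistics import mean

def benchmark_balanced_accuracy(per_benchmark):
    """Macro average: mean of per-benchmark accuracies, in percent.

    per_benchmark maps a benchmark name to (correct, total).
    """
    return 100.0 * mean(correct / total for correct, total in per_benchmark.values())

def pooled_accuracy(per_benchmark):
    """Micro average: all tasks pooled, so large benchmarks dominate."""
    correct = sum(c for c, _ in per_benchmark.values())
    total = sum(t for _, t in per_benchmark.values())
    return 100.0 * correct / total

# Hypothetical counts, chosen only to show how the two averages diverge.
toy_results = {"small_bench": (45, 50), "large_bench": (500, 1000)}
balanced = benchmark_balanced_accuracy(toy_results)
pooled = pooled_accuracy(toy_results)

# Gap between the two averages reported in the summary (percentage points).
evoagent_avg, single_agent_avg = 75.56, 74.12
gap = round(evoagent_avg - single_agent_avg, 2)

print(f"balanced={balanced:.2f} pooled={pooled:.2f} gap={gap:+.2f}")
```

With the toy counts, the small benchmark lifts the balanced average well above the pooled one, which is why the choice of averaging scheme matters when benchmarks differ in size.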

Key takeaway

AI scientists and machine learning engineers designing LLM agent systems should critically evaluate the claimed benefits of multi-agent systems (MAS). Simply adding agents does not guarantee a performance lift; focus instead on workflow organization and task-protocol fit. Consider advanced runtime-generated workflows for complex, long-horizon tasks, as they demonstrate significant accuracy and cost advantages. Prioritize rigorous, protocol-aligned evaluation to avoid misleading comparisons.

Key insights

Multi-agent LLM workflows do not consistently outperform single-agent setups under controlled evaluation, except for advanced runtime-generated systems.

Principles

Workflow organization, not agent count, drives performance.
MAS gains are task-dependent: they appear where the workflow matches the task's error modes.

Method

BenchAgent normalizes LLM agent workflow evaluation by aligning benchmark loading, tool access, answer contracts, usage accounting, and trajectory logging for SI and PAE comparisons.
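
As a rough illustration of what such normalization could look like, the sketch below runs every workflow through one harness that fixes the task list, the tool set, the answer contract, token accounting, and trajectory logging. All names here (`RunRecord`, `evaluate`, `normalize_answer`, the `lookup` tool, and the two stub workflows) are hypothetical and are not taken from BenchAgent.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

@dataclass
class RunRecord:
    """What every workflow must return: answer, usage, and a step log."""
    answer: str
    tokens_used: int = 0
    trajectory: List[str] = field(default_factory=list)

Tools = Dict[str, Callable[[str], str]]
Workflow = Callable[[str, Tools], RunRecord]

def normalize_answer(text: str) -> str:
    """Shared answer contract: case- and whitespace-insensitive match."""
    return " ".join(text.strip().lower().split())

def evaluate(workflow: Workflow, tasks: List[Tuple[str, str]], tools: Tools) -> dict:
    """Run one workflow on the shared tasks with the shared tools."""
    correct, tokens, logs = 0, 0, []
    for question, gold in tasks:
        record = workflow(question, tools)
        correct += normalize_answer(record.answer) == normalize_answer(gold)
        tokens += record.tokens_used
        logs.append(record.trajectory)
    return {"accuracy": 100.0 * correct / len(tasks), "tokens": tokens, "trajectories": logs}

# Stub workflows standing in for real LLM-backed agents (assumed for the demo).
def single_agent(question: str, tools: Tools) -> RunRecord:
    answer = tools["lookup"](question)
    return RunRecord(answer, tokens_used=10, trajectory=[f"solver: {answer}"])

def two_agent(question: str, tools: Tools) -> RunRecord:
    draft = tools["lookup"](question)
    check = tools["lookup"](question)  # the verifier repeats the same tool call
    answer = draft if draft == check else ""
    return RunRecord(answer, tokens_used=25,
                     trajectory=[f"solver: {draft}", f"verifier: {check}"])

facts = {"2+2": "4", "3*3": "9"}
tools: Tools = {"lookup": lambda q: facts.get(q, "unknown")}
tasks = [("2+2", "4"), ("3*3", " 9 "), ("5-1", "4")]

single = evaluate(single_agent, tasks, tools)
double = evaluate(two_agent, tasks, tools)
print(single["accuracy"], single["tokens"], double["accuracy"], double["tokens"])
```

Because both stubs see identical tasks, tools, and scoring, any difference in the returned accuracy or token count comes from the workflow itself. In this toy case the second agent adds cost without adding accuracy.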

In practice

Implement a strong single-agent baseline first.
Evaluate MAS against task-specific error modes.
Prioritize runtime-generated workflows for complex tasks.

Topics

LLM Agent Workflows
Multi-Agent Systems
BenchAgent Framework
GAIA Benchmark
Workflow Evaluation
Claude Code

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.