ADK Arena: Evaluating Agent Development Kits via LLM-as-a-Developer

2026-06-05 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

ADK Arena introduces LLM-as-a-Developer, a methodology for empirically evaluating 51 popular Python Agent Development Kits (ADKs) by replacing human developers with an LLM coding agent. The automated pipeline learns each framework's API from documentation, writes agent code, and iteratively repairs it through a validate-and-feedback loop. Generation succeeds in 57% of runs, and the cost per agent varies 5.6x across frameworks (from 0.6 to 3.4 USD), which the study treats as a signal of API complexity. No single framework dominates: the best ADK agents resolve up to 80% of tasks and can outperform general-purpose frontier coding agents at a fraction of the cost, but the median framework resolves only 32%. The LLM used to *write* the agent significantly affects performance, with Opus-authored agents resolving roughly twice as many tasks as GPT-authored ones. Information-source ablations show that documentation, source code, and parametric knowledge are largely substitutable, with genuine framework usage staying within a 28-40% band.

Key takeaway

AI engineers evaluating Agent Development Kits should prioritize frameworks that show lower generation costs in automated evaluations, since lower cost signals a more usable API. The LLM used to *generate* the agent code significantly affects agent performance, often more than the execution model does. Raw source code access can be more effective for LLM-driven development than curated documentation and may lead to more native framework usage.

Key insights

LLM-as-a-Developer provides a scalable, unbiased method to evaluate Agent Development Kits and quantify API usability.

Principles

Generation cost quantifies API complexity.
Developer LLM choice impacts agent performance more than execution LLM choice.
Information sources (docs, source, parametric) are largely interchangeable for LLM developers.

Method

The LLM-as-a-Developer learns framework APIs from documentation, writes agent code, and iteratively repairs it via a three-level validation pipeline until tests pass.
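
The validate-and-feedback loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the summary does not say what the three validation levels are, so the syntax, entry-point, and behavior checks here are assumptions, as are the names `generate_with_repair`, `first_failure`, and `stub_writer`. The stub stands in for the LLM coding agent so the example runs without any model access.

```python
import ast
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical validator type: returns an error message, or None on success.
Validator = Callable[[str], Optional[str]]


@dataclass
class GenerationResult:
    code: str
    succeeded: bool
    attempts: int


def check_syntax(code: str) -> Optional[str]:
    """Level 1 (assumed): the code must compile."""
    try:
        compile(code, "<agent>", "exec")
    except SyntaxError as exc:
        return f"syntax error: {exc.msg}"
    return None


def check_entry_point(code: str) -> Optional[str]:
    """Level 2 (assumed): the code must define a top-level run() function."""
    tree = ast.parse(code)
    names = {n.name for n in tree.body if isinstance(n, ast.FunctionDef)}
    return None if "run" in names else "missing entry point: run"


def check_behavior(code: str) -> Optional[str]:
    """Level 3 (assumed): run() must pass a small behavioral test."""
    namespace: dict = {}
    exec(code, namespace)  # acceptable for a sketch; sandbox this in practice
    return None if namespace["run"](2) == 4 else "run(2) should return 4"


def first_failure(code: str, validators: List[Validator]) -> Optional[str]:
    """Run validators in order and return the first error message, if any."""
    for validate in validators:
        error = validate(code)
        if error is not None:
            return error
    return None


def generate_with_repair(
    write_code: Callable[[str, Optional[str]], str],
    validators: List[Validator],
    task: str,
    max_attempts: int = 5,
) -> GenerationResult:
    """Ask the writer for code, feeding back the first failure each round."""
    feedback: Optional[str] = None
    code = ""
    for attempt in range(1, max_attempts + 1):
        code = write_code(task, feedback)
        feedback = first_failure(code, validators)
        if feedback is None:
            return GenerationResult(code, True, attempt)
    return GenerationResult(code, False, max_attempts)


def stub_writer(task: str, feedback: Optional[str]) -> str:
    """Stand-in for the LLM: broken on the first try, fixed after feedback."""
    if feedback is None:
        return "def run(:"
    return "def run(x):\n    return x * 2\n"


result = generate_with_repair(
    stub_writer,
    [check_syntax, check_entry_point, check_behavior],
    task="double a number",
)
print(result.succeeded, result.attempts)
```

Ordering the validators from cheapest to most expensive means that a syntax error is reported before any generated code is executed, which keeps the feedback given to the writer short and specific.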

In practice

Use generation cost as a proxy for API complexity when selecting ADKs.
Prioritize developer LLM quality for agent code generation.
Consider raw source access for LLM agent development over curated docs.
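
As a rough illustration of the first recommendation, the sketch below shortlists frameworks under a generation-cost budget and ranks the survivors by resolve rate. The framework names and all numbers are made-up placeholders, not results from the paper, and `AdkResult`, `cost_spread`, and `shortlist` are hypothetical helper names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AdkResult:
    name: str
    generation_cost_usd: float  # mean cost to generate one agent
    resolve_rate: float         # fraction of benchmark tasks resolved


def cost_spread(results: List[AdkResult]) -> float:
    """Ratio of the most expensive to the cheapest generation cost."""
    costs = [r.generation_cost_usd for r in results]
    return max(costs) / min(costs)


def shortlist(results: List[AdkResult], max_cost_usd: float) -> List[AdkResult]:
    """Keep frameworks within the cost budget, best resolve rate first."""
    affordable = [r for r in results if r.generation_cost_usd <= max_cost_usd]
    return sorted(affordable, key=lambda r: r.resolve_rate, reverse=True)


# Placeholder data for illustration only; not taken from the study.
results = [
    AdkResult("framework_a", 0.5, 0.40),
    AdkResult("framework_b", 2.0, 0.80),
    AdkResult("framework_c", 1.0, 0.25),
]

print(cost_spread(results))
print([r.name for r in shortlist(results, max_cost_usd=1.0)])
```

Filtering on cost before ranking on resolve rate reflects the takeaway that generation cost is a usability signal in its own right, separate from how well the finished agent performs.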

Topics

Agent Development Kits
LLM-as-a-Developer
API Usability
Framework Evaluation
Autonomous Agents
Software Engineering Benchmarks

Best for: AI Architect, Research Scientist, AI Scientist, AI Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.