ADK Arena: Evaluating Agent Development Kits via LLM-as-a-Developer
Summary
ADK Arena introduces LLM-as-a-Developer, a methodology for empirically evaluating 51 popular Python Agent Development Kits (ADKs) by replacing human developers with an LLM coding agent. The automated pipeline learns framework APIs from documentation, writes agent code, and iteratively repairs it through a validate-and-feedback loop. Generation succeeds in 57% of runs, with costs varying 5.6x (\$0.6 to \$3.4 per agent), a spread the study reads as a signal of API complexity. No single framework dominates: the best ADK agents resolve up to 80% of tasks and can outperform general-purpose frontier coding agents at a fraction of the cost, but the median framework resolves only 32%. The LLM used to *write* the agent significantly affects performance, with Opus-authored agents resolving roughly twice as many tasks as GPT-authored ones. Information-source ablations showed that documentation, source code, and parametric knowledge are largely substitutable, with genuine framework usage staying within a 28-40% band.
Key takeaway
AI engineers evaluating Agent Development Kits should prioritize frameworks that show lower generation costs in automated evaluations, since lower cost signals better API usability. The LLM used to *generate* the agent code significantly affects performance, often more than the execution model does. Raw source code access can be more effective for LLM-driven development than curated documentation and may lead to more native framework usage.
Key insights
LLM-as-a-Developer provides a scalable, unbiased method to evaluate Agent Development Kits and quantify API usability.
Principles
- Generation cost quantifies API complexity.
- Developer LLM choice impacts agent performance more than execution LLM.
- Information sources (docs, source, parametric) are largely interchangeable for LLM developers.
Method
The LLM developer agent learns framework APIs from documentation, writes agent code, and iteratively repairs it, feeding failures from a three-level validation pipeline back into the next attempt until the tests pass.
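The validate-and-feedback loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the summary does not say what the three validation levels are, so the checks used here (`check_syntax`, `check_entry_point`, `check_behavior`), the `run` entry point, and the `stub_developer` function standing in for the LLM are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class ValidationResult:
    passed: bool
    feedback: str = ""


def check_syntax(code: str) -> ValidationResult:
    # Level 1 (assumed): the generated source must at least compile.
    try:
        compile(code, "<agent>", "exec")
        return ValidationResult(True)
    except SyntaxError as exc:
        return ValidationResult(False, f"syntax error: {exc.msg}")


def check_entry_point(code: str) -> ValidationResult:
    # Level 2 (assumed): the module must define a callable named `run`.
    namespace: dict = {}
    exec(code, namespace)
    if callable(namespace.get("run")):
        return ValidationResult(True)
    return ValidationResult(False, "missing callable `run`")


def check_behavior(code: str) -> ValidationResult:
    # Level 3 (assumed): a tiny behavioral test on the agent.
    namespace: dict = {}
    exec(code, namespace)
    if namespace["run"]("ping") == "pong":
        return ValidationResult(True)
    return ValidationResult(False, "run('ping') should return 'pong'")


def generate_with_repair(
    write_code: Callable[[Optional[str]], str],
    validators: List[Callable[[str], ValidationResult]],
    max_rounds: int = 5,
) -> Tuple[Optional[str], int]:
    """Ask the developer for code, validate it, and feed failures back."""
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        code = write_code(feedback)
        for validate in validators:
            result = validate(code)
            if not result.passed:
                feedback = result.feedback  # first failing level drives the repair
                break
        else:
            return code, round_no  # every level passed
    return None, max_rounds  # gave up: counts as a failed generation


def stub_developer(feedback: Optional[str]) -> str:
    # Hypothetical stand-in for the LLM: first draft is wrong, repair is right.
    if feedback is None:
        return "def run(msg):\n    return msg"
    return "def run(msg):\n    return 'pong' if msg == 'ping' else msg"


LEVELS = [check_syntax, check_entry_point, check_behavior]
final_code, rounds_used = generate_with_repair(stub_developer, LEVELS)
```

With the stub above, the first draft fails the behavioral check and the second attempt passes. A run that exhausts `max_rounds` returns `None`, which corresponds to an unsuccessful generation in the success-rate figures.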
In practice
- Use generation cost as a proxy for API complexity when selecting ADKs.
- Prioritize developer LLM quality for agent code generation.
- Consider raw source access for LLM agent development over curated docs.
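As a rough illustration of the first point, the sketch below ranks frameworks by mean generation cost per agent. The framework names and dollar figures are hypothetical placeholders, not results from the study, and `rank_by_generation_cost` is an illustrative helper rather than part of any ADK.

```python
from statistics import mean
from typing import Dict, List, Tuple


def rank_by_generation_cost(runs: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Group per-run costs by framework and sort by mean cost, cheapest first."""
    costs: Dict[str, List[float]] = {}
    for adk, cost in runs:
        costs.setdefault(adk, []).append(cost)
    return sorted(((adk, mean(c)) for adk, c in costs.items()), key=lambda p: p[1])


# Hypothetical (framework, USD per generated agent) records.
runs = [("adk_a", 0.5), ("adk_a", 1.0), ("adk_b", 3.0), ("adk_c", 1.5)]

ranking = rank_by_generation_cost(runs)
# Ratio between the most and least expensive framework in this sample.
spread = ranking[-1][1] / ranking[0][1]
```

Under the study's premise, a framework near the top of such a ranking is cheaper for an LLM to learn and use, which is taken as a proxy for a simpler API. Cost alone says nothing about how well the resulting agent performs, so it is best read alongside the task resolution rate.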
Topics
- Agent Development Kits
- LLM-as-a-Developer
- API Usability
- Framework Evaluation
- Autonomous Agents
- Software Engineering Benchmarks
Best for: AI Architect, Research Scientist, AI Scientist, AI Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.