ADK Arena: Evaluating Agent Development Kits via LLM-as-a-Developer
Summary
ADK Arena introduces LLM-as-a-Developer, a methodology and automated pipeline for empirically evaluating Agent Development Kits (ADKs). It addresses a gap in understanding how framework choice affects the performance of LLM-powered autonomous agents. LLM-as-a-Developer uses an LLM coding agent to learn framework APIs from documentation, write agent code, and iteratively repair it. ADK Arena combines Docker isolation, a three-level validation pipeline, and benchmark adapters for SWE-bench, τ²-bench, Terminal-Bench, and MCP-Atlas. The study evaluated 51 popular Python ADK frameworks across all four benchmarks (204 agent-benchmark pairs). Generation succeeded for 57% of pairs, and generation cost varied 5.6-fold ($0.6 to $3.4 per agent), though cost does not predict success. No single framework dominates: top ADK agents resolve up to 80% of tasks and can outperform general-purpose frontier coding agents at lower cost, yet the median agent resolves only 32%. Framework usage stays in a similar range (28-40%) regardless of the information source, suggesting that documentation, source code, and parametric knowledge are largely substitutable.
Key takeaway
For AI engineers choosing an Agent Development Kit for autonomous agent development, this research indicates that framework choice affects both agent performance and development cost. Some ADK agents resolve up to 80% of tasks and outperform general-purpose coding agents, but median performance is much lower and no single framework dominates. Rather than relying on general popularity, run targeted evaluations with a methodology such as LLM-as-a-Developer to assess API complexity and agent effectiveness for your specific use cases.
Key insights
LLM-as-a-Developer provides an automated, empirical method to evaluate Agent Development Kits, revealing varied performance and API complexity.
Principles
- Generation cost is a quantitative proxy for API complexity.
- No single ADK framework universally dominates task resolution.
- Documentation, source code, and parametric knowledge are largely substitutable information sources for LLM agents.
Method
LLM-as-a-Developer replaces human developers with an LLM coding agent that learns APIs from documentation, writes code, and iteratively repairs it via a validate-and-feedback loop.
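The validate-and-feedback loop described above can be sketched in Python as follows. This is a minimal illustration, not the ADK Arena implementation: the names `develop_agent`, `check_syntax`, `check_entry_point`, and `stub_generate`, the required `run_agent` entry point, and the retry budget are all assumptions made for the example, and only two illustrative checks stand in for the three-level validation pipeline, whose actual levels the summary does not specify.

```python
import ast
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class ValidationResult:
    passed: bool
    feedback: str = ""


Validator = Callable[[str], ValidationResult]


def check_syntax(code: str) -> ValidationResult:
    """Hypothetical first check: the generated code must parse."""
    try:
        ast.parse(code)
    except SyntaxError as exc:
        return ValidationResult(False, f"syntax error: {exc.msg} (line {exc.lineno})")
    return ValidationResult(True)


def check_entry_point(code: str) -> ValidationResult:
    """Hypothetical second check: a top-level run_agent function must exist."""
    tree = ast.parse(code)
    names = {node.name for node in tree.body if isinstance(node, ast.FunctionDef)}
    if "run_agent" not in names:
        return ValidationResult(False, "missing top-level function run_agent(task)")
    return ValidationResult(True)


def develop_agent(
    generate: Callable[[Optional[str]], str],
    validators: List[Validator],
    max_rounds: int = 3,
) -> Tuple[Optional[str], int]:
    """Generate code, validate it, and feed the first failure back for repair."""
    feedback: Optional[str] = None
    for attempt in range(1, max_rounds + 1):
        code = generate(feedback)
        failure = None
        for validate in validators:
            result = validate(code)
            if not result.passed:
                failure = result
                break  # stop at the first failing check
        if failure is None:
            return code, attempt
        feedback = failure.feedback
    return None, max_rounds  # repair budget exhausted


def stub_generate(feedback: Optional[str]) -> str:
    """Stand-in for the LLM coding agent: repairs its output once it gets feedback."""
    if feedback is None:
        return "def helper(task):\n    return task\n"
    return "def run_agent(task):\n    return task\n"
```

Calling `develop_agent(stub_generate, [check_syntax, check_entry_point])` returns the repaired code after two attempts. In a real pipeline the stub would be replaced by a call to a coding agent, and later checks would run the generated agent inside an isolated container.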
In practice
- Use LLM-as-a-Developer for automated API usability testing.
- Benchmark ADK agents against frontier coding agents.
- Evaluate framework effectiveness using generation effort.
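As a rough sketch of how these practices might be tallied, the snippet below aggregates per-agent records into the kinds of summary statistics the article reports (generation success rate, cost spread, median resolve rate). The `AgentRun` record, the `summarize` function, and every number in `example_runs` are made-up placeholders for illustration, not results from the paper.

```python
import statistics
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AgentRun:
    framework: str              # hypothetical framework label
    generated: bool             # did code generation pass validation?
    generation_cost_usd: float  # cost of producing the agent
    tasks_resolved: int
    tasks_total: int


def summarize(runs: List[AgentRun]) -> Dict[str, float]:
    """Aggregate generation success, cost spread, and median resolve rate."""
    generated = [r for r in runs if r.generated]
    costs = [r.generation_cost_usd for r in runs]
    rates = [r.tasks_resolved / r.tasks_total for r in generated]
    return {
        "generation_success": len(generated) / len(runs),
        "cost_spread": max(costs) / min(costs),
        "median_resolve_rate": statistics.median(rates),
    }


# Placeholder data only; these are not measurements from the study.
example_runs = [
    AgentRun("framework_a", True, 1.0, 8, 10),
    AgentRun("framework_b", True, 2.0, 3, 10),
    AgentRun("framework_c", False, 4.0, 0, 10),
    AgentRun("framework_d", True, 0.5, 5, 10),
]

summary = summarize(example_runs)
```

Keeping generation cost and resolve rate as separate columns matters here, since the article reports that cost does not predict success.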
Topics
- Agent Development Kits
- LLM-as-a-Developer
- Autonomous Agents
- API Usability
- LLM Benchmarking
- Software Engineering
Best for: Machine Learning Engineer, Research Scientist, AI Scientist, AI Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.