ADK Arena: Evaluating Agent Development Kits via LLM-as-a-Developer

Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

ADK Arena introduces LLM-as-a-Developer, a methodology for empirically evaluating 51 popular Python Agent Development Kits (ADKs) by replacing human developers with an LLM coding agent. The automated pipeline learns each framework's API from documentation, writes agent code, and iteratively repairs it through a validate-and-feedback loop. Generation succeeds in 57% of runs, and cost per agent varies 5.6x ($0.6 to $3.4), a spread the study treats as a signal of differing API complexity. No single framework dominates: the best ADK agents resolve up to 80% of tasks and can outperform general-purpose frontier coding agents at a fraction of the cost, while the median framework resolves only 32%. The LLM used to *write* the agent significantly affects performance, with Opus-authored agents resolving roughly twice as many tasks as GPT-authored ones. Information-source ablations show that documentation, source code, and parametric knowledge are largely substitutable, with genuine framework usage staying within a 28-40% band.

Key takeaway

AI engineers evaluating Agent Development Kits should prioritize frameworks that show lower generation costs in automated evaluations, since lower cost signals better API usability. The LLM used to *generate* the agent code significantly affects the agent's performance, often more than the execution model does. Raw source code access can be more effective for LLM-driven development than curated documentation, and may lead to more native framework usage.

Key insights

LLM-as-a-Developer provides a scalable, unbiased method to evaluate Agent Development Kits and quantify API usability.

Principles

Method

The LLM-as-a-Developer learns framework APIs from documentation, writes agent code, and iteratively repairs it via a three-level validation pipeline until tests pass.
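The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's pipeline: the summary does not say what the three validation levels are, so the levels here (syntax, entry point, behavior test) are guesses, and `run_agent`, `develop_agent`, and `scripted_llm` are hypothetical names. The scripted generator stands in for a real LLM call, and a real pipeline would run generated code in a sandbox rather than with `exec`.

```python
import ast
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class ValidationResult:
    ok: bool
    level: str
    feedback: str = ""


def check_syntax(code: str) -> ValidationResult:
    """Level 1 (assumed): the generated source must parse."""
    try:
        ast.parse(code)
    except SyntaxError as exc:
        return ValidationResult(False, "syntax", f"SyntaxError: {exc.msg} (line {exc.lineno})")
    return ValidationResult(True, "syntax")


def _load(code: str) -> dict:
    # Illustration only: real generated code should run in a sandbox.
    namespace: dict = {}
    exec(compile(code, "<agent>", "exec"), namespace)
    return namespace


def check_entry_point(code: str) -> ValidationResult:
    """Level 2 (assumed): the module loads and exposes run_agent(task)."""
    try:
        namespace = _load(code)
    except Exception as exc:
        return ValidationResult(False, "load", f"{type(exc).__name__}: {exc}")
    if not callable(namespace.get("run_agent")):
        return ValidationResult(False, "load", "expected a callable run_agent(task)")
    return ValidationResult(True, "load")


def check_behavior(code: str) -> ValidationResult:
    """Level 3 (assumed): a toy functional test on the agent's output."""
    try:
        answer = _load(code)["run_agent"]("ping")
    except Exception as exc:
        return ValidationResult(False, "test", f"{type(exc).__name__}: {exc}")
    if answer != "pong":
        return ValidationResult(False, "test", f"run_agent('ping') returned {answer!r}, expected 'pong'")
    return ValidationResult(True, "test")


LEVELS = (check_syntax, check_entry_point, check_behavior)


def develop_agent(
    generate: Callable[[str], str], max_rounds: int = 5
) -> Tuple[Optional[str], int, List[str]]:
    """Generate code, validate level by level, feed the first failure back."""
    feedback = ""
    history: List[str] = []
    for round_no in range(1, max_rounds + 1):
        code = generate(feedback)
        for check in LEVELS:
            result = check(code)
            if not result.ok:
                feedback = f"[{result.level}] {result.feedback}"
                history.append(feedback)
                break  # repair the earliest failing level first
        else:
            return code, round_no, history  # every level passed
    return None, max_rounds, history  # repair budget exhausted


def scripted_llm() -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM: replays four canned attempts."""
    attempts = iter([
        "def run_agent(task:\n    return 'pong'",   # fails to parse
        "def run(task):\n    return 'pong'",        # wrong entry-point name
        "def run_agent(task):\n    return task",    # loads, wrong behavior
        "def run_agent(task):\n    return 'pong'",  # passes all levels
    ])
    return lambda feedback: next(attempts)


code, rounds, history = develop_agent(scripted_llm())
print(rounds, history)
```

With the scripted attempts, each round fails one level later than the previous one until the fourth attempt passes. In a real setup, `generate` would pass the feedback string to the coding model, and the number of rounds consumed would contribute to the per-agent generation cost the study reports.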

Best for: AI Architect, Research Scientist, AI Scientist, AI Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.