Asuka-Bench: Benchmarking Code Agents on Underspecified User Intent and Multi-Round Refinement
Summary
Asuka-Bench is a benchmark for evaluating code agents on real-world web development scenarios, targeting two conditions that one-shot benchmarks miss: underspecified user intent and multi-round iterative refinement. It runs a closed loop in which a Code Agent generates a web project, a UI Agent executes browser-based test cases against it, and a User LLM provides natural-language feedback that drives the next round of refinement. The benchmark comprises 50 web tasks with 784 evaluation criteria and 2,402 expected outcomes, and is used to assess 8 LLMs across 2 agent frameworks. Results show a 38-percentage-point spread in weighted Task Pass Rate, pointing to large differences in how well models repair code from feedback. Even the strongest model completes only 52% of projects after three rounds, so the benchmark is far from saturated.
Key takeaway
For ML engineers developing code agents for web applications, one-shot benchmarks alone are insufficient. Prioritize evaluating how well your agents handle underspecified user intent and iteratively refine code based on feedback. Focus development effort on repair-from-feedback capabilities, particularly for robustness and complex functionality, since these areas currently show the largest performance gaps and most clearly differentiate agents.
Key insights
Real-world code generation requires iteratively refining code that starts from underspecified user intent, and Asuka-Bench is designed to measure that capability.
Principles
- Iterative refinement is crucial for real-world code agents.
- Repair-from-feedback is a distinct model capability.
- Browser behavior grounds web development evaluation.
Method
The benchmark runs a closed loop: the Code Agent generates a web project, the UI Agent tests its behavior in the browser, the User LLM synthesizes feedback from the results, and the Code Agent refines the project in the next round.
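The loop described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: the function `run_closed_loop`, the `TestResult` type, the three callables, and the toy stubs are all hypothetical names introduced here. The only detail taken from the summary is the three-round budget.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class TestResult:
    criterion: str  # hypothetical label for one evaluation criterion
    passed: bool


def run_closed_loop(
    task_prompt: str,
    code_agent: Callable[[str, str, str], str],    # (prompt, previous project, feedback) -> project
    ui_agent: Callable[[str], List[TestResult]],   # project -> browser test results
    user_llm: Callable[[List[TestResult]], str],   # failed results -> natural-language feedback
    max_rounds: int = 3,                           # the summary reports results after three rounds
) -> Tuple[str, List[List[TestResult]]]:
    project, feedback = "", ""
    history: List[List[TestResult]] = []
    for _ in range(max_rounds):
        # Generate in round one, refine from feedback in later rounds.
        project = code_agent(task_prompt, project, feedback)
        results = ui_agent(project)
        history.append(results)
        failures = [r for r in results if not r.passed]
        if not failures:
            break  # every criterion passed, no further refinement needed
        feedback = user_llm(failures)
    return project, history


# Toy stand-ins (assumptions for illustration only; real agents would call LLMs and a browser).
def toy_code_agent(prompt: str, previous_project: str, feedback: str) -> str:
    features = set(previous_project.split(",")) - {""}
    features.add("layout")
    features.update(f for f in feedback.split(",") if f)
    return ",".join(sorted(features))


def toy_ui_agent(project: str) -> List[TestResult]:
    built = set(project.split(","))
    return [TestResult(c, c in built) for c in ("layout", "login", "search")]


def toy_user_llm(failures: List[TestResult]) -> str:
    # Reports only the first failure, mimicking terse and partial user feedback.
    return failures[0].criterion


if __name__ == "__main__":
    final_project, rounds = run_closed_loop(
        "build a site", toy_code_agent, toy_ui_agent, toy_user_llm
    )
    print(final_project, len(rounds))
```

In this toy setup the user reports one problem per round, so the project only passes every criterion in the third round. That is the kind of repair-from-feedback behavior a one-shot evaluation would never observe.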
In practice
- Employ DAG-aware evaluation for focused feedback.
- Test agents with underspecified prompts.
- Focus on robustness and functionality tasks.
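The summary does not say how DAG-aware evaluation works, so the following is only one plausible reading: evaluation criteria form a dependency graph, and a criterion whose prerequisite did not pass is marked as blocked rather than failed, so that feedback names root causes only. The function names, status labels, and example criteria are all assumptions made for illustration.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def evaluate_dag(
    prerequisites: Dict[str, Set[str]],   # criterion -> criteria it depends on (assumed structure)
    run_check: Callable[[str], bool],     # hypothetical hook that runs one browser check
) -> Dict[str, str]:
    status: Dict[str, str] = {}
    # static_order() yields each criterion after all of its prerequisites.
    for criterion in TopologicalSorter(prerequisites).static_order():
        deps = prerequisites.get(criterion, set())
        if any(status[d] != "passed" for d in deps):
            status[criterion] = "blocked"  # skip the check: an upstream criterion did not pass
        else:
            status[criterion] = "passed" if run_check(criterion) else "failed"
    return status


def root_failures(status: Dict[str, str]) -> List[str]:
    # Only genuine failures are worth reporting; blocked criteria are downstream symptoms.
    return [c for c, s in status.items() if s == "failed"]


if __name__ == "__main__":
    graph = {
        "page_loads": set(),
        "login_form": {"page_loads"},
        "dashboard": {"login_form"},
        "search": {"page_loads"},
    }
    result = evaluate_dag(graph, lambda c: c != "login_form")
    print(result)
    print(root_failures(result))
```

Under this reading, a broken login form produces one feedback item instead of a cascade of downstream failures, which keeps the next refinement round focused.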
Topics
- Code Agents
- LLM Benchmarking
- Web Development
- Iterative Refinement
- Underspecified User Intent
- Automated Evaluation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.