Asuka-Bench: Benchmarking Code Agents on Underspecified User Intent and Multi-Round Refinement

· Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, extended

Summary

Asuka-Bench is a benchmark for evaluating code agents on real-world web development scenarios, targeting two things one-shot benchmarks miss: underspecified user intent and multi-round iterative refinement. It runs a closed loop in which a Code Agent generates a web project, a UI Agent executes browser-based test cases against it, and a User LLM provides natural-language feedback for the next round of refinement. The benchmark comprises 50 web tasks with 784 evaluation criteria and 2,402 expected outcomes, and is used to assess 8 LLMs across 2 agent frameworks. Results show a 38-percentage-point spread in weighted Task Pass Rate, reflecting large differences in how well models repair code from feedback. Even the strongest model completes only 52% of projects after three rounds, so the benchmark is far from saturated.

Key takeaway

For ML Engineers developing code agents for web applications, one-shot benchmarks alone are not enough. Evaluate how your agent handles underspecified user intent and how well it refines code across rounds of feedback. Focus development on repair-from-feedback, particularly for robustness and complex functionality, where the performance gaps are currently largest and agents differ most.

Key insights

Real-world code generation requires iterative refinement from underspecified user intent, and Asuka-Bench is built to measure that capability.

Method

A closed loop: Code Agent generates, UI Agent tests browser behavior, User LLM synthesizes feedback, then Code Agent refines iteratively.
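
The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual harness: the names `TestResult`, `run_closed_loop`, and the three `toy_*` functions are hypothetical, the agents are deterministic stand-ins for real LLM and browser calls, and treating "three rounds" as three generate-and-test passes is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class TestResult:
    """Outcome of one browser-level check (hypothetical structure)."""
    criterion: str
    passed: bool


# Hypothetical role signatures; real agents would wrap LLM and browser calls.
CodeAgent = Callable[[str, str], str]        # (task, feedback) -> project
UIAgent = Callable[[str], List[TestResult]]  # project -> test results
UserLLM = Callable[[List[TestResult]], str]  # failures -> feedback text


def run_closed_loop(
    task: str,
    code_agent: CodeAgent,
    ui_agent: UIAgent,
    user_llm: UserLLM,
    max_rounds: int = 3,  # assumption: one round = one generate-and-test pass
) -> Tuple[str, List[List[TestResult]]]:
    """Generate, test, collect feedback, and refine until all checks pass."""
    feedback = ""
    project = ""
    history: List[List[TestResult]] = []
    for _ in range(max_rounds):
        project = code_agent(task, feedback)  # generate or refine
        results = ui_agent(project)           # run browser test cases
        history.append(results)
        failures = [r for r in results if not r.passed]
        if not failures:
            break                             # every criterion satisfied
        feedback = user_llm(failures)         # natural-language feedback
    return project, history


# Toy stand-ins so the sketch runs without any model or browser.
REQUIRED = ["login form", "logout button"]


def toy_code_agent(task: str, feedback: str) -> str:
    # Misses the logout button until the feedback mentions it.
    features = ["login form"]
    if "logout button" in feedback:
        features.append("logout button")
    return "; ".join(features)


def toy_ui_agent(project: str) -> List[TestResult]:
    return [TestResult(c, c in project) for c in REQUIRED]


def toy_user_llm(failures: List[TestResult]) -> str:
    return "These did not work: " + ", ".join(r.criterion for r in failures)
```

Calling `run_closed_loop("Build a site with sign-in", toy_code_agent, toy_ui_agent, toy_user_llm)` stops after the second round, once the feedback has prompted the toy agent to add the missing feature. Keeping each role behind a plain callable makes it straightforward to swap in different models or agent frameworks without touching the loop itself.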

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.