Asuka-Bench: Benchmarking Code Agents on Underspecified User Intent and Multi-Round Refinement

2026-06-05 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, extended

Summary

Asuka-Bench is a benchmark for evaluating code agents on real-world web development scenarios, targeting two conditions that one-shot benchmarks omit: underspecified user intent and multi-round iterative refinement. It runs a closed loop in which a Code Agent generates a web project, a UI Agent executes browser-based test cases against it, and a User LLM turns the results into natural-language feedback for the next round. The benchmark comprises 50 web tasks with 784 evaluation criteria and 2,402 expected outcomes, and is used to assess 8 LLMs across 2 agent frameworks. Weighted Task Pass Rate varies by 38 percentage points across the evaluated configurations, indicating large differences in how well models repair code from feedback. Even the strongest model completes only 52% of projects after three rounds, so the benchmark is far from saturated.

Key takeaway

For ML engineers building code agents for web applications, one-shot benchmarks alone are insufficient. Evaluate how well your agents handle underspecified user intent and how well they refine code iteratively from feedback. Prioritize repair-from-feedback capability, particularly for robustness and complex functionality, since these areas currently show the largest performance gaps and differentiate agents the most.

Key insights

Real-world code generation demands iterative refinement from underspecified user intent, a capability Asuka-Bench is designed to measure.

Principles

Iterative refinement is crucial for real-world code agents.
Repair-from-feedback is a distinct model capability.
Browser behavior grounds web development evaluation.

Method

A closed loop: Code Agent generates, UI Agent tests browser behavior, User LLM synthesizes feedback, then Code Agent refines iteratively.
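
The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the benchmark's implementation: the function names (`refine_loop`, `toy_generate`, `toy_run_ui_tests`, `toy_write_feedback`), the `TestResult` type, and the toy feature set are all hypothetical stand-ins for the Code Agent, UI Agent, and User LLM. The three-round cap mirrors the three rounds mentioned in the summary.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set, Tuple


@dataclass
class TestResult:
    case: str      # name of the browser test case (hypothetical)
    passed: bool   # whether the observed behavior matched the expectation


def refine_loop(
    generate: Callable[[str, Optional[Set[str]], Optional[str]], Set[str]],
    run_ui_tests: Callable[[Set[str]], List[TestResult]],
    write_feedback: Callable[[List[TestResult]], str],
    task: str,
    max_rounds: int = 3,
) -> Tuple[Set[str], List[List[TestResult]]]:
    """Generate, test, turn failures into feedback, and regenerate."""
    project: Optional[Set[str]] = None
    feedback: Optional[str] = None
    history: List[List[TestResult]] = []
    for _ in range(max_rounds):
        project = generate(task, project, feedback)      # Code Agent
        results = run_ui_tests(project)                  # UI Agent
        history.append(results)
        failed = [r for r in results if not r.passed]
        if not failed:
            break                                        # every case passed
        feedback = write_feedback(failed)                # User LLM
    return project, history


# Toy stand-ins (assumptions): a "project" is just the set of working features.
REQUIRED = ["layout", "login", "search"]


def toy_generate(task, project, feedback):
    features = set(project) if project else {"layout"}
    if feedback:
        # The toy agent repairs exactly the one feature named in the feedback.
        features.add(feedback.split(": ", 1)[1])
    return features


def toy_run_ui_tests(project):
    return [TestResult(case, case in project) for case in REQUIRED]


def toy_write_feedback(failed):
    return "Please fix: " + failed[0].case


project, history = refine_loop(
    toy_generate, toy_run_ui_tests, toy_write_feedback, task="build a shop page"
)
passes_per_round = [sum(r.passed for r in results) for results in history]
```

In this toy run the agent repairs one feature per round, so `passes_per_round` climbs across the three rounds. A real harness would replace the three callables with calls to an agent framework, a browser driver, and an LLM.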

In practice

Employ DAG-aware evaluation for focused feedback.
Test agents with underspecified prompts.
Focus on robustness and functionality tasks.
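
One possible reading of "DAG-aware evaluation" is sketched below, as an assumption rather than the benchmark's documented procedure: criteria form a dependency graph, a criterion is checked only when all of its prerequisites pass, and feedback reports only the root failures instead of every downstream symptom. The criterion names and the `evaluate_dag` helper are hypothetical. `graphlib.TopologicalSorter` is part of the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List


def evaluate_dag(
    deps: Dict[str, List[str]],
    check: Callable[[str], bool],
) -> Dict[str, str]:
    """Check criteria in dependency order.

    deps maps each criterion to the criteria it depends on.
    A criterion whose prerequisite did not pass is marked "blocked"
    and is not checked at all.
    """
    status: Dict[str, str] = {}
    for criterion in TopologicalSorter(deps).static_order():
        prerequisites = deps.get(criterion, [])
        if any(status[p] != "pass" for p in prerequisites):
            status[criterion] = "blocked"
        else:
            status[criterion] = "pass" if check(criterion) else "fail"
    return status


# Hypothetical criteria for a small web task.
deps = {
    "page_loads": [],
    "login_form": ["page_loads"],
    "login_submit": ["login_form"],
    "search_box": ["page_loads"],
}

# Hypothetical observations a UI agent might record in the browser.
observed = {
    "page_loads": True,
    "login_form": False,
    "login_submit": True,
    "search_box": True,
}

status = evaluate_dag(deps, lambda criterion: observed[criterion])

# Feedback targets root failures only; blocked criteria are left out.
root_failures = [name for name, state in status.items() if state == "fail"]
```

Here `login_submit` is marked blocked because `login_form` failed, so the feedback names only `login_form`. That keeps the message to the Code Agent focused on the one defect that is worth repairing first.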

Topics

Code Agents
LLM Benchmarking
Web Development
Iterative Refinement
Underspecified User Intent
Automated Evaluation

Code references

coffeegrind123/gemini-for-claude-code

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.