Software Delegation Contracts: Measuring Reviewability in AI Coding-Agent Work

Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

A pilot study conducted in June 2026 investigated how "software delegation contracts" affect the reviewability of AI coding-agent work. The researchers built an instrumented TypeScript API with seeded defects and documentation gaps, then executed 64 coding-agent runs across two model tiers, Sonnet 4.6 and Haiku 4.5. Agents operated under three conditions: a realistic issue-style prompt, an explicit delegation contract, and a contract requiring an evidence bundle. All 64 runs passed the hidden acceptance checks with zero scope violations, so objective outcomes were saturated and could not distinguish the conditions. Reviewability did differ: under explicit contracts, evidence sufficiency rose in 22 of 30 paired comparisons (+0.83 on a 5-point scale, p < 0.0001), and reviewer ambiguity decreased (p = 0.035). Structured work-package elements appeared almost exclusively under contracts: changed-file lists went from 7% to 93%, and known-limitations sections from 0% to 80%. This reviewability cost +13% agent tokens and +38% wall-clock time, and the weaker model tier, Haiku 4.5, benefited more. The study concludes that, for small, well-specified tasks, delegation contracts enhance reviewability rather than correctness.
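To make the contract and evidence-bundle conditions concrete, here is a minimal TypeScript sketch of what such a contract and a reviewer-side completeness check might look like. Every name in it (`DelegationContract`, `WorkPackage`, `missingEvidence`, `scopeViolations`, and all field names) is hypothetical and chosen for illustration; this summary does not describe the study's actual contract format.

```typescript
// Hypothetical shapes: all field names are illustrative assumptions,
// not the format used in the study.
interface DelegationContract {
  task: string;
  allowedPaths: string[];          // path prefixes the agent may modify
  acceptanceCriteria: string[];
  requireEvidenceBundle: boolean;  // third condition: contract plus evidence bundle
}

interface WorkPackage {
  summary: string;
  changedFiles?: string[];         // changed-file list
  knownLimitations?: string[];     // may be an explicit empty list ("none known")
  testEvidence?: string[];         // e.g. commands run and their results
}

// Names the evidence elements a reviewer would find missing.
function missingEvidence(contract: DelegationContract, pkg: WorkPackage): string[] {
  if (!contract.requireEvidenceBundle) return [];
  const missing: string[] = [];
  if (!pkg.changedFiles || pkg.changedFiles.length === 0) missing.push("changedFiles");
  // An empty array counts as present: the agent stated there are no known limitations.
  if (pkg.knownLimitations === undefined) missing.push("knownLimitations");
  if (!pkg.testEvidence || pkg.testEvidence.length === 0) missing.push("testEvidence");
  return missing;
}

// Lists changed files that fall outside every allowed path prefix.
function scopeViolations(contract: DelegationContract, pkg: WorkPackage): string[] {
  return (pkg.changedFiles ?? []).filter(
    (file) => !contract.allowedPaths.some((prefix) => file.startsWith(prefix))
  );
}
```

A check like this only verifies that the structural elements exist. Whether the evidence is actually sufficient remains a reviewer judgment, which is what the 5-point rating in the study captures.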

Key takeaway

For AI engineers building or integrating coding agents: if the goal is reviewable work, use explicit delegation contracts. They significantly improve evidence sufficiency and reduce reviewer ambiguity, even when the agent's objective correctness is already high. Expect roughly 13% more agent tokens and 38% more wall-clock time. That overhead buys review surface, especially with weaker models.
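As a rough budgeting aid, the reported overheads can be applied to a baseline run. The sketch below assumes the pilot's averages (+13% tokens, +38% wall-clock time) transfer to your workload, which the study does not claim; the `RunCost` type, the function name, and the baseline figures in the usage note are hypothetical.

```typescript
// Overheads reported in the summary; treating them as constants is an assumption.
const TOKEN_OVERHEAD = 0.13;
const TIME_OVERHEAD = 0.38;

// Hypothetical cost record for a single agent run.
interface RunCost {
  tokens: number;
  seconds: number;
}

// Projects the cost of the same run under an explicit delegation contract.
function withContractOverhead(baseline: RunCost): RunCost {
  return {
    tokens: Math.round(baseline.tokens * (1 + TOKEN_OVERHEAD)),
    seconds: Math.round(baseline.seconds * (1 + TIME_OVERHEAD)),
  };
}
```

For a made-up baseline of 100,000 tokens and 200 seconds, `withContractOverhead` projects 113,000 tokens and 276 seconds.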

Key insights

Explicit delegation contracts for AI coding agents significantly improve work-package reviewability, not objective correctness, and they do so at a measurable cost in tokens and wall-clock time.

Method

The study used a controlled pilot design with an instrumented TypeScript API, 10 seeded tasks, and 64 agent runs under varying contract explicitness. Outcomes were measured mechanically and by blinded model-based reviewers.
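A paired design like this compares each task's reviewer score under one condition with its score under another. The sketch below tallies such pairs and computes the mean difference. It is illustrative only: the `PairedScore` shape and `summarizePairs` name are hypothetical, and this summary does not name the statistical test behind the reported p-values, so none is implemented here.

```typescript
// Hypothetical record: one task scored by reviewers under two conditions (5-point scale).
interface PairedScore {
  taskId: string;
  baseline: number;  // score under the issue-style prompt
  contract: number;  // score under the explicit contract
}

interface PairSummary {
  improved: number;
  tied: number;
  worsened: number;
  meanDelta: number; // mean of (contract - baseline)
}

// Counts the direction of change per pair and averages the differences.
function summarizePairs(pairs: PairedScore[]): PairSummary {
  let improved = 0;
  let tied = 0;
  let worsened = 0;
  let totalDelta = 0;
  for (const pair of pairs) {
    const delta = pair.contract - pair.baseline;
    totalDelta += delta;
    if (delta > 0) improved++;
    else if (delta < 0) worsened++;
    else tied++;
  }
  const meanDelta = pairs.length > 0 ? totalDelta / pairs.length : 0;
  return { improved, tied, worsened, meanDelta };
}
```

In these terms, a result such as "rose in 22 of 30 paired comparisons (+0.83)" corresponds to an `improved` count of 22 over 30 pairs and a `meanDelta` of 0.83.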

Best for: Research Scientist, AI Architect, AI Product Manager, AI Scientist, AI Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.