Software Delegation Contracts: Measuring Reviewability in AI Coding-Agent Work
Summary
A pilot study conducted in June 2026 examined how "software delegation contracts" affect the reviewability of work produced by AI coding agents. Researchers built an instrumented TypeScript API with seeded defects and documentation gaps, then executed 64 coding-agent runs across two model tiers, Sonnet 4.6 and Haiku 4.5. Agents operated under three conditions: a realistic issue-style prompt, an explicit delegation contract, and a contract requiring an evidence bundle. All 64 runs passed hidden acceptance checks with zero scope violations, so objective outcomes were saturated and could not separate the conditions. Reviewability did differ: under explicit contracts, evidence sufficiency rose in 22 of 30 paired comparisons (+0.83 on a 5-point scale, p < 0.0001), and reviewer ambiguity decreased (p = 0.035). Structured work-package elements such as changed-file lists (7% to 93%) and known-limitations sections (0% to 80%) appeared almost exclusively under contracts. The improvement cost +13% agent tokens and +38% wall-clock time, and the weaker Haiku 4.5 model tier benefited more. The study concludes that, for small, well-specified tasks, delegation contracts enhance reviewability rather than correctness.
Key takeaway
If you build or integrate coding agents and need their work to be reviewable, use explicit delegation contracts. They improve evidence sufficiency and reduce reviewer ambiguity even when the agent's objective correctness is already high. Budget for roughly 13% more agent tokens and 38% more wall-clock time. That overhead buys material a reviewer can check, and the gain is largest with weaker models.
Key insights
Explicit delegation contracts for AI coding agents significantly improve work package reviewability, not objective correctness, at a measurable cost.
Principles
- Delegation contracts enhance reviewability.
- Evidence provision is demand-elastic: agents supply structured evidence mostly when the contract asks for it.
- Weaker models benefit more from contracts.
Method
The study used a controlled pilot design with an instrumented TypeScript API, 10 seeded tasks, and 64 agent runs under varying contract explicitness. Outcomes were measured mechanically and by blinded model-based reviewers.
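A paired design like this one compares the same task under two conditions. The sketch below shows one way to summarize such pairs descriptively: it counts how many pairs improved and computes the mean score difference. The type names, the function `summarizePairs`, and the sample scores are illustrative assumptions; they are not the study's data or its analysis code, and the sketch does not reproduce the significance test behind the reported p-values.

```typescript
// Hypothetical record for one paired comparison: the same task reviewed
// under the issue-style prompt (baseline) and under an explicit contract.
interface PairedScore {
  taskId: string;
  baseline: number; // reviewer score on a 1-5 scale
  contract: number; // reviewer score on a 1-5 scale
}

interface PairSummary {
  improved: number;
  tied: number;
  worsened: number;
  meanDifference: number;
}

// Counts the direction of each paired difference and averages the differences.
function summarizePairs(pairs: PairedScore[]): PairSummary {
  let improved = 0;
  let tied = 0;
  let worsened = 0;
  let total = 0;
  for (const pair of pairs) {
    const diff = pair.contract - pair.baseline;
    total += diff;
    if (diff > 0) {
      improved += 1;
    } else if (diff < 0) {
      worsened += 1;
    } else {
      tied += 1;
    }
  }
  // Guard against division by zero for an empty input.
  const meanDifference = pairs.length === 0 ? 0 : total / pairs.length;
  return { improved, tied, worsened, meanDifference };
}

// Made-up scores for illustration only.
const samplePairs: PairedScore[] = [
  { taskId: "task-1", baseline: 3, contract: 4 },
  { taskId: "task-2", baseline: 4, contract: 4 },
  { taskId: "task-3", baseline: 2, contract: 4 },
  { taskId: "task-4", baseline: 5, contract: 4 },
];

const sampleSummary = summarizePairs(samplePairs);
// sampleSummary is { improved: 2, tied: 1, worsened: 1, meanDifference: 0.5 }
```

Reporting the improved, tied, and worsened counts alongside the mean keeps a few large gains from hiding pairs that got worse.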
In practice
- Require changed-files-with-reasons lists.
- Demand known-limitations and residual-risk sections.
- Account for ~13% more agent tokens and ~38% more wall-clock time.
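The first two practices above can be enforced mechanically before a human reviews anything. The following is a minimal sketch, assuming the agent returns its work package as structured data. The `WorkPackage` shape, its field names, and `findMissingEvidence` are hypothetical and are not taken from the study.

```typescript
// Hypothetical shape of the work package an agent returns under a contract.
interface ChangedFile {
  path: string;
  reason: string;
}

interface WorkPackage {
  summary: string;
  changedFiles: ChangedFile[];
  knownLimitations: string[];
  residualRisks: string[];
}

// Returns human-readable problems; an empty array means every required
// structural element is present. This checks presence only, not quality.
function findMissingEvidence(pkg: WorkPackage): string[] {
  const problems: string[] = [];
  if (pkg.changedFiles.length === 0) {
    problems.push("no changed-file list");
  }
  for (const file of pkg.changedFiles) {
    if (file.reason.trim() === "") {
      problems.push(`changed file ${file.path} has no reason`);
    }
  }
  // An agent with nothing to report is expected to say so explicitly,
  // for example with a single "none identified" entry.
  if (pkg.knownLimitations.length === 0) {
    problems.push("no known-limitations section");
  }
  if (pkg.residualRisks.length === 0) {
    problems.push("no residual-risk section");
  }
  return problems;
}

// Illustrative input with two gaps: a missing reason and no limitations.
const examplePackage: WorkPackage = {
  summary: "Fix pagination off-by-one in the list endpoint",
  changedFiles: [
    { path: "src/routes/list.ts", reason: "correct the page offset calculation" },
    { path: "README.md", reason: "" },
  ],
  knownLimitations: [],
  residualRisks: ["cursor-based pagination is untested"],
};

const exampleProblems = findMissingEvidence(examplePackage);
// exampleProblems is:
// ["changed file README.md has no reason", "no known-limitations section"]
```

A check like this only confirms that the sections exist. Judging whether the stated reasons and limitations are adequate still falls to the reviewer.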
Topics
- AI Coding Agents
- Software Delegation Contracts
- Reviewability
- LLM Evaluation
- MLOps
- Anthropic Claude
Best for: Research Scientist, AI Architect, AI Product Manager, AI Scientist, AI Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.