Harness design for long-running application development
Summary
Anthropic's latest research, published March 24, 2026, details advancements in harness design for long-running agentic coding, specifically for frontend design and full-stack application development using Claude. The work introduces a multi-agent architecture, inspired by Generative Adversarial Networks (GANs), featuring generator and evaluator agents to overcome limitations in single-agent performance, particularly in subjective tasks like design and complex, multi-hour coding sessions. Key innovations include developing concrete grading criteria for subjective design judgments, implementing context resets for Claude Sonnet 4.5 to mitigate "context anxiety," and evolving to a three-agent system (planner, generator, evaluator) for full-stack development. This approach significantly improved output quality, as demonstrated by a retro game maker application, despite increasing costs from $9 for a solo run to $200 for a full harness run over 6 hours.
Key takeaway
For AI Architects and Research Scientists developing long-running agentic systems, consider adopting a multi-agent harness design with specialized roles. Your team should prioritize explicit evaluation criteria and iterative feedback loops to enhance output quality and manage complexity, especially for tasks involving subjective judgment or extended execution. Be prepared for increased computational costs, but recognize the substantial gains in application robustness and feature richness.
Key insights
Multi-agent harnesses with distinct generator and evaluator roles significantly improve AI agent performance on complex, subjective, and long-running tasks.
Principles
- Decompose complex tasks into tractable chunks.
- Separate generation from evaluation for objective feedback.
- Continuously re-examine harness components as models improve.
Method
A multi-agent system (planner, generator, evaluator) iteratively refines outputs. The planner expands prompts, the generator builds in sprints, and the evaluator provides structured feedback against explicit criteria, often using tools like Playwright for interactive testing.
In practice
- Use explicit grading criteria for subjective tasks.
- Implement context resets for models exhibiting "context anxiety."
- Employ a sprint contract for generator-evaluator agreement.
Topics
- Agentic AI
- Multi-Agent Systems
- AI Engineering
- Autonomous Software Development
- Frontend Design
Code references
Best for: AI Architect, AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, Software Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by Anthropic Engineering Blog.