Harness design for long-running application development

Source: Anthropic Engineering Blog · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering, Robotics & Autonomous Systems) · Depth: Advanced, extended

Summary

Anthropic's research, published March 24, 2026, describes harness design for long-running agentic coding with Claude, covering frontend design and full-stack application development. The work introduces a multi-agent architecture inspired by Generative Adversarial Networks (GANs), in which separate generator and evaluator agents address the limits of single-agent performance, particularly on subjective tasks such as design and on complex, multi-hour coding sessions. Key innovations include concrete grading criteria for subjective design judgments, context resets for Claude Sonnet 4.5 to mitigate "context anxiety," and a three-agent system (planner, generator, evaluator) for full-stack development. The approach significantly improved output quality, as demonstrated by a retro game maker application, but at a higher cost: $200 for a 6-hour full harness run versus $9 for a solo run.

Key takeaway

If you are an AI architect or research scientist developing long-running agentic systems, consider adopting a multi-agent harness design with specialized roles. Prioritize explicit evaluation criteria and iterative feedback loops to raise output quality and manage complexity, especially for tasks that involve subjective judgment or extended execution. Expect higher computational costs, and weigh them against the substantial gains in application robustness and feature richness.

Key insights

Multi-agent harnesses with distinct generator and evaluator roles significantly improve AI agent performance on complex, subjective, and long-running tasks.

Method

A multi-agent system (planner, generator, evaluator) iteratively refines outputs. The planner expands prompts, the generator builds in sprints, and the evaluator provides structured feedback against explicit criteria, often using tools like Playwright for interactive testing.
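
The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Anthropic's implementation: the names `run_harness`, `Feedback`, the stub agents, the criterion names, the 0.8 pass threshold, and the sprint cap are all hypothetical. In a real harness each stub would be a model call, and the evaluator might drive the application through a browser automation tool such as Playwright before scoring it.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Feedback:
    """Structured evaluator output: one score in [0, 1] per criterion."""
    scores: Dict[str, float]
    notes: List[str] = field(default_factory=list)

    def passed(self, threshold: float) -> bool:
        # Every explicit criterion must meet the threshold.
        return all(score >= threshold for score in self.scores.values())


def run_harness(
    prompt: str,
    planner: Callable[[str], str],
    generator: Callable[[str, Optional[dict], Optional[Feedback]], dict],
    evaluator: Callable[[dict, List[str]], Feedback],
    criteria: List[str],
    max_sprints: int = 5,      # hypothetical cap on iterations
    threshold: float = 0.8,    # hypothetical pass mark
) -> Tuple[Optional[dict], List[Feedback]]:
    spec = planner(prompt)                 # planner expands the short prompt
    artifact: Optional[dict] = None
    feedback: Optional[Feedback] = None
    history: List[Feedback] = []
    for _ in range(max_sprints):
        # Generator builds one sprint, seeing the prior artifact and feedback.
        artifact = generator(spec, artifact, feedback)
        # Evaluator grades the result against the explicit criteria.
        feedback = evaluator(artifact, criteria)
        history.append(feedback)
        if feedback.passed(threshold):
            break
    return artifact, history


# Stub agents so the sketch runs without any model or browser.
def stub_planner(prompt: str) -> str:
    return f"SPEC: {prompt} (expanded into features and sprints)"


def stub_generator(spec: str, previous: Optional[dict],
                   feedback: Optional[Feedback]) -> dict:
    version = 1 if previous is None else previous["version"] + 1
    return {"spec": spec, "version": version}


def stub_evaluator(artifact: dict, criteria: List[str]) -> Feedback:
    # Pretend quality rises by 0.25 per sprint, capped at 1.0.
    score = min(1.0, artifact["version"] / 4)
    return Feedback(scores={c: score for c in criteria},
                    notes=[f"version {artifact['version']} reviewed"])


if __name__ == "__main__":
    final, log = run_harness(
        "retro game maker",
        stub_planner, stub_generator, stub_evaluator,
        criteria=["design_quality", "functionality"],  # hypothetical names
    )
    print(final["version"], len(log))
```

Keeping the evaluator separate from the generator, and having it return per-criterion scores rather than free text, is what lets the loop decide mechanically whether to run another sprint.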

Best for: AI Architect, AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Anthropic Engineering Blog.