Build Agents That Run for Hours (Without Losing the Plot) — Ash Prabaker & Andrew Wilson, Anthropic
Summary
Anthropic engineers Ash Prabaker and Andrew Wilson discussed how long-running AI agents have evolved and presented advanced patterns for building agents that operate for hours or even days. They highlighted three core challenges: finite context windows, which lead to "context rot" and "context anxiety"; poor out-of-the-box planning; and models' inability to judge their own output accurately. Solutions involve both model improvements, such as the progression from Opus 3.7 (1-hour task completion) to Opus 4.6 (12-hour task completion) on minimal scaffolds, and significant harness changes. Key harness developments include the Agent SDK, sub-agents, skills with progressive disclosure, and server-side compaction. A notable advanced pattern is the Generator-Evaluator pattern, inspired by GANs, in which a generator agent builds and a separate, tuned evaluator agent critiques the result, often using tools like Playwright for live testing. Combined with a planner role and negotiation between agents, this adversarial approach significantly improves the quality and autonomy of long-running agent-built applications, as demonstrated by a "Retro Game Maker" example.
Key takeaway
For AI Engineers building complex, long-running applications, adopting an adversarial Generator-Evaluator agent pattern can dramatically improve output quality and autonomy. Your team should focus on tuning a separate, harsh critic agent and establishing clear negotiation protocols between builder and evaluator agents to define "done." This approach, combined with continuous trace analysis to refine prompts and harness components, will enable more robust and self-correcting AI systems, moving beyond single-session limitations.
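One way to picture the negotiation over "done" is as a short exchange in which the builder proposes acceptance criteria and the evaluator objects until it has nothing left to object to. The sketch below is illustrative only: `FeatureContract`, `negotiate`, and the two stub functions are hypothetical names, and in a real harness the `propose` and `review` callables would wrap model calls rather than fixed rules.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureContract:
    """Hypothetical record of what builder and evaluator agree 'done' means."""
    feature: str
    criteria: list[str] = field(default_factory=list)
    agreed: bool = False


def negotiate(feature, propose, review, max_rounds=3):
    """Alternate proposals and objections until the evaluator accepts.

    propose(feature, objections) -> list of acceptance criteria (builder side)
    review(criteria)             -> list of objections; empty means accepted
    """
    objections: list[str] = []
    criteria: list[str] = []
    for _ in range(max_rounds):
        criteria = propose(feature, objections)
        objections = review(criteria)
        if not objections:
            return FeatureContract(feature, criteria, agreed=True)
    # No agreement within the round budget: return the last proposal, unagreed.
    return FeatureContract(feature, criteria, agreed=False)


# Stand-ins for the two agents (assumptions, not real model calls).
def stub_propose(feature, objections):
    criteria = [f"{feature} renders without errors"]
    # Add one criterion for each objection the evaluator raised last round.
    criteria += [f"addresses: {o}" for o in objections]
    return criteria


def stub_review(criteria):
    if not any("keyboard" in c for c in criteria):
        return ["no keyboard controls specified"]
    return []


contract = negotiate("level editor", stub_propose, stub_review)
```

With these stubs the evaluator rejects the first proposal and accepts the second, so the contract ends up agreed with two criteria. Capping the number of rounds keeps a disagreement from stalling the run.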
Key insights
Adversarial agent patterns and evolving harnesses enable AI agents to perform complex, long-duration tasks autonomously.
Principles
- Separate generation from evaluation for robust agent performance.
- Harness design must co-evolve with model capabilities.
- Granular, explicit rubrics improve agent self-correction.
Method
Implement a Generator-Evaluator pattern: a generator agent builds, while a separate, tuned evaluator agent critiques using tools like Playwright. Add a planner for high-level task decomposition, and have the generator and evaluator negotiate feature contracts.
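The control flow of this method can be sketched as a planner that splits the goal into tasks, followed by a build-and-critique loop per task. This is a minimal sketch under stated assumptions, not the speakers' harness: `run_pipeline`, `Verdict`, and the stub agents are hypothetical names, and in practice `plan`, `generate`, and `evaluate` would each wrap a separate model session, with the evaluator driving a tool such as Playwright.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    passed: bool
    feedback: str


def run_pipeline(goal, plan, generate, evaluate, max_rounds=5):
    """Run a planner, then a generate/evaluate loop for each planned task.

    plan(goal)               -> list of task descriptions
    generate(task, feedback) -> artifact (here just a string)
    evaluate(task, artifact) -> Verdict from a separate critic
    Returns {task: (artifact, passed, rounds_used)}.
    """
    results = {}
    for task in plan(goal):
        feedback = ""
        artifact, passed, rounds_used = None, False, 0
        for rounds_used in range(1, max_rounds + 1):
            artifact = generate(task, feedback)
            verdict = evaluate(task, artifact)
            passed = verdict.passed
            if passed:
                break
            # Feed the critique back to the generator for the next attempt.
            feedback = verdict.feedback
        results[task] = (artifact, passed, rounds_used)
    return results


# Stand-ins for the three roles (assumptions, not real model calls).
def stub_plan(goal):
    return [f"{goal}: sprite editor", f"{goal}: level editor"]


def stub_generate(task, feedback):
    return "draft + fix" if feedback else "draft"


def stub_evaluate(task, artifact):
    if "fix" in artifact:
        return Verdict(True, "")
    return Verdict(False, "buttons do nothing when clicked")


results = run_pipeline("retro game maker", stub_plan, stub_generate, stub_evaluate)
```

Here each task fails its first review and passes on the second attempt. The key design choice is that the generator never grades its own work: only the evaluator's verdict can end the loop early.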
In practice
- Use the Agent SDK for building custom agent harnesses.
- Employ Playwright or the Claude for Chrome MCP for web app evaluation.
- Define detailed rubrics for design, originality, craft, and functionality.
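A rubric over the four criteria named above could be encoded as per-criterion minimum scores, so the evaluator's verdict is a checklist rather than a single overall impression. The 1 to 5 scale, the thresholds, and the `grade` helper below are assumptions made for illustration; the source does not specify them.

```python
# Hypothetical minimum score (on an assumed 1-5 scale) for each criterion.
RUBRIC = {
    "design": 3,
    "originality": 3,
    "craft": 4,
    "functionality": 4,
}


def grade(scores, rubric=RUBRIC):
    """Compare evaluator scores against per-criterion minimums.

    Returns (passed, failures), where failures maps each failing
    criterion to a (score, minimum) pair for use as targeted feedback.
    """
    missing = set(rubric) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    failures = {
        name: (scores[name], minimum)
        for name, minimum in rubric.items()
        if scores[name] < minimum
    }
    return (not failures, failures)


passed, failures = grade(
    {"design": 4, "originality": 2, "craft": 4, "functionality": 5}
)
```

In this example the build fails on originality alone, and `failures` tells the generator exactly which criterion to revisit and by how much it missed.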
Topics
- Long-Running AI Agents
- Agent Harness Design
- Generator-Evaluator Pattern
- Anthropic Claude
- Multi-Agent Systems
Best for: AI Engineer, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by AI Engineer.