Build Agents That Run for Hours (Without Losing the Plot) — Ash Prabaker & Andrew Wilson, Anthropic

Source: AI Engineer · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering, Robotics & Autonomous Systems) · Depth: Advanced, extended

Summary

Anthropic engineers Ash Prabaker and Andrew Wilson discussed how long-running AI agents have evolved and which advanced patterns let them operate for hours or even days. They highlighted three core challenges: finite context windows, which lead to "context rot" and "context anxiety"; poor out-of-the-box planning; and models' inability to judge their own output accurately. The solutions combine model improvements, such as the progression from Opus 3.7 (1-hour task completion) to Opus 4.6 (12-hour task completion) on minimal scaffolds, with significant harness changes. Key harness developments include the Agent SDK, sub-agents, skills with progressive disclosure, and server-side compaction. A notable advanced pattern is the Generator-Evaluator model, inspired by GANs (generative adversarial networks): a generator agent builds, and a separate, tuned evaluator agent critiques the result, often using tools like Playwright for live testing. Combined with a planner role and negotiation between agents, this adversarial approach significantly improves the quality and autonomy of long-running, agent-built applications, as demonstrated by a "Retro Game Maker" example.
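To make the idea of compaction concrete, here is a minimal client-side sketch. It is not Anthropic's server-side compaction, whose behavior the summary does not describe. The function name `compact`, the `summarize` callback, and the use of a word count in place of real token counting are all assumptions made for illustration.

```python
from typing import Callable, List


def compact(
    history: List[str],
    budget: int,
    summarize: Callable[[List[str]], str],
    keep_recent: int = 2,
) -> List[str]:
    """Fold older messages into one summary once the history exceeds the budget."""
    # Assumption: a crude word count stands in for a real tokenizer.
    size = sum(len(message.split()) for message in history)
    if size <= budget or len(history) <= keep_recent:
        return history
    split = len(history) - keep_recent
    old, recent = history[:split], history[split:]
    # The summary replaces everything except the most recent messages.
    return [summarize(old)] + recent


# Hypothetical usage: a stub summarizer instead of a model call.
history = [
    "plan the game maker",
    "built sprite editor with palette",
    "fixed save bug",
    "now adding level editor",
]
compacted = compact(
    history,
    budget=8,
    summarize=lambda msgs: f"[summary of {len(msgs)} earlier messages]",
)
```

In a real harness the `summarize` callback would be a model call, and the quality of that summary decides how much of the plan survives each compaction.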

Key takeaway

For AI Engineers building complex, long-running applications, adopting an adversarial Generator-Evaluator agent pattern can dramatically improve output quality and autonomy. Your team should focus on tuning a separate, harsh critic agent and establishing clear negotiation protocols between builder and evaluator agents to define "done." This approach, combined with continuous trace analysis to refine prompts and harness components, will enable more robust and self-correcting AI systems, moving beyond single-session limitations.
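One way to picture a negotiation protocol for "done" is sketched below. The summary does not specify the protocol used in the talk, so everything here is hypothetical: the `Criterion` type, the rule that the evaluator rejects any criterion without a concrete check, and the `revise` callback that stands in for the builder agent.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Criterion:
    feature: str
    check: str  # how the evaluator will verify the feature; empty means untestable


def untestable(proposal: List[Criterion]) -> List[Criterion]:
    """Evaluator side: reject any criterion that has no concrete check."""
    return [c for c in proposal if not c.check.strip()]


def negotiate(
    proposal: List[Criterion],
    revise: Callable[[Criterion], Criterion],
    max_rounds: int = 3,
) -> List[Criterion]:
    """Builder revises rejected criteria until the evaluator accepts them all."""
    for _ in range(max_rounds):
        rejected = untestable(proposal)
        if not rejected:
            return proposal  # agreed contract
        proposal = [revise(c) if c in rejected else c for c in proposal]
    if untestable(proposal):
        raise RuntimeError("no agreement on what 'done' means")
    return proposal


# Hypothetical usage with a stub builder that adds a browser-based check.
proposal = [
    Criterion("sprite editor", "draw a pixel and reload the page"),
    Criterion("play mode", ""),
]
contract = negotiate(
    proposal,
    revise=lambda c: Criterion(c.feature, f"open {c.feature} in a browser and complete one action"),
)
```

The point of the sketch is that the contract is fixed before building starts, so the evaluator later judges the work against agreed checks rather than its own shifting standard.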

Key insights

Adversarial agent patterns and evolving harnesses enable AI agents to perform complex, long-duration tasks autonomously.

Method

Implement a Generator-Evaluator pattern: a generator agent builds, while a separate, tuned evaluator agent critiques the result using tools like Playwright. Add a planner for high-level task decomposition, and have the agents negotiate feature contracts that define when each feature is done.
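The control flow of this pattern can be sketched as a simple loop. This is a toy under stated assumptions, not the speakers' implementation: `run_loop`, `Review`, `toy_generate`, `toy_evaluate`, and the `REQUIRED` feature list are hypothetical names, and plain functions stand in for the two agents. In the talk the evaluator tested the live application with tools like Playwright; here it only checks a list of feature names.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical toy spec; a real evaluator would exercise the running app instead.
REQUIRED = ["sprite editor", "level editor", "play mode"]


@dataclass
class Review:
    passed: bool
    issues: List[str] = field(default_factory=list)


def run_loop(
    generate: Callable[[List[str], List[str]], List[str]],
    evaluate: Callable[[List[str]], Review],
    max_rounds: int = 5,
) -> Tuple[List[str], Review, int]:
    """Alternate building and critiquing until the evaluator passes the work."""
    artifact: List[str] = []
    review = Review(passed=False)
    for round_no in range(1, max_rounds + 1):
        # The generator sees the previous artifact plus the critic's issues.
        artifact = generate(artifact, review.issues)
        # The evaluator is a separate function and never grades its own work.
        review = evaluate(artifact)
        if review.passed:
            return artifact, review, round_no
    return artifact, review, max_rounds


def toy_generate(artifact: List[str], issues: List[str]) -> List[str]:
    """Stub builder: starts with one feature, then fixes one issue per round."""
    if not artifact:
        return ["sprite editor"]
    return artifact + issues[:1]


def toy_evaluate(artifact: List[str]) -> Review:
    """Stub critic: fails the build while any required feature is missing."""
    missing = [f for f in REQUIRED if f not in artifact]
    return Review(passed=not missing, issues=missing)


artifact, review, rounds = run_loop(toy_generate, toy_evaluate)
```

Keeping the evaluator behind its own interface is the design choice that matters: it can be prompted, tuned, and given tools independently of the generator, and the `max_rounds` cap keeps a disagreement between the two from running forever.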

Best for: AI Engineer, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by AI Engineer.