Build Agents That Run for Hours (Without Losing the Plot) — Ash Prabaker & Andrew Wilson, Anthropic

2026-05-18 · Source: AI Engineer · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Robotics & Autonomous Systems · Depth: Advanced, extended

Summary

Anthropic engineers Ash Prabaker and Andrew Wilson discussed how long-running AI agents, capable of operating for hours or even days, have evolved, along with advanced patterns for building them. They highlighted three core challenges: finite context windows, which lead to "context rot" and "context anxiety"; poor out-of-the-box planning; and models' inability to judge their own output accurately. Solutions involve both model improvements, such as the progression from Opus 3.7 (1-hour task completion) to Opus 4.6 (12-hour task completion) on minimal scaffolds, and significant harness changes. Key harness developments include the Agent SDK, sub-agents, skills with progressive disclosure, and server-side compaction. A notable advanced pattern is the Generator-Evaluator pattern, inspired by GANs, in which a generator agent builds and a separate, tuned evaluator agent critiques the result, often using tools like Playwright for live testing. Combined with a planner role and negotiation between agents, this adversarial approach significantly improves the quality and autonomy of long-running agent-built applications, as demonstrated by a "Retro Game Maker" example.

Key takeaway

For AI Engineers building complex, long-running applications, adopting an adversarial Generator-Evaluator agent pattern can dramatically improve output quality and autonomy. Your team should focus on tuning a separate, harsh critic agent and establishing clear negotiation protocols between builder and evaluator agents to define "done." This approach, combined with continuous trace analysis to refine prompts and harness components, will enable more robust and self-correcting AI systems, moving beyond single-session limitations.
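
One way to picture the negotiated definition of "done" is as a small contract that both agents sign off on before any building starts. The sketch below is illustrative only: `FeatureContract`, `negotiate`, and `is_done` are hypothetical names, not part of any Anthropic SDK, and the merge rule (the evaluator's required checks are always added to the builder's proposal) is an assumption rather than the protocol described in the talk.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class FeatureContract:
    """Hypothetical record of what builder and evaluator agreed to ship."""
    feature: str
    acceptance_checks: List[str] = field(default_factory=list)


def negotiate(
    feature: str,
    proposed_by_builder: Iterable[str],
    required_by_evaluator: Iterable[str],
) -> FeatureContract:
    """Merge both sides' checks, keeping order and dropping duplicates.

    Assumed rule: the evaluator's required checks always make it in.
    """
    checks: List[str] = []
    for check in list(proposed_by_builder) + list(required_by_evaluator):
        if check not in checks:
            checks.append(check)
    return FeatureContract(feature, checks)


def is_done(contract: FeatureContract, passed_checks: Set[str]) -> bool:
    """The feature counts as done only when every agreed check has passed."""
    return all(check in passed_checks for check in contract.acceptance_checks)


contract = negotiate(
    "level editor",
    proposed_by_builder=["can place tiles"],
    required_by_evaluator=["can place tiles", "level can be saved and reloaded"],
)
partially_done = is_done(contract, {"can place tiles"})
fully_done = is_done(
    contract, {"can place tiles", "level can be saved and reloaded"}
)
```

Writing the checks down up front gives the evaluator something concrete to test against, so "done" is not left to the builder's own judgment.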

Key insights

Adversarial agent patterns and evolving harnesses enable AI agents to perform complex, long-duration tasks autonomously.

Principles

Separate generation from evaluation for robust agent performance.
Harness design must co-evolve with model capabilities.
Granular, explicit rubrics improve agent self-correction.

Method

Implement a Generator-Evaluator pattern: a generator agent builds, while a separate, tuned evaluator agent critiques using tools like Playwright. Add a planner for high-level task decomposition and agent negotiation for feature contracts.
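
The control flow of the pattern can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the speakers' implementation: `generate` and `evaluate` are placeholder callables standing in for real model calls (for example, agents built on the Agent SDK, with the evaluator driving a browser through Playwright), and the `Review` type, the stub agents, and the round budget are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Review:
    passed: bool          # evaluator's verdict on the current artifact
    feedback: List[str]   # concrete issues for the generator to fix


def run_generator_evaluator(
    task: str,
    generate: Callable[[str, List[str]], str],
    evaluate: Callable[[str, str], Review],
    max_rounds: int = 5,
) -> Tuple[str, List[Review]]:
    """Alternate build and critique until the evaluator passes the work
    or the round budget runs out."""
    feedback: List[str] = []
    history: List[Review] = []
    artifact = ""
    for _ in range(max_rounds):
        artifact = generate(task, feedback)   # builder sees the prior critique
        review = evaluate(task, artifact)     # critic judges in a separate call
        history.append(review)
        if review.passed:
            break
        feedback = review.feedback
    return artifact, history


# Stub agents standing in for real model calls (hypothetical behaviour).
def stub_generate(task: str, feedback: List[str]) -> str:
    return f"{task} v{len(feedback) + 1}"


def stub_evaluate(task: str, artifact: str) -> Review:
    if artifact.endswith("v1"):
        return Review(False, ["jump button does nothing"])
    return Review(True, [])


final_artifact, reviews = run_generator_evaluator(
    "retro game maker", stub_generate, stub_evaluate
)
```

Keeping the evaluator behind its own function boundary mirrors the point of the pattern: the critic never shares the builder's context, so it cannot inherit the builder's optimism about its own work.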

In practice

Use the Agent SDK for building custom agent harnesses.
Employ Playwright or Claude for Chrome MCP for web app evaluation.
Define detailed rubrics for design, originality, craft, and functionality.
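
A rubric of this kind can be expressed as plain data that the evaluator scores against. The sketch below is an assumption-laden illustration: the four criterion names come from the summary above, but the descriptions, the 1-5 scale, and the thresholds are invented for the example, and `Criterion` and `failing_criteria` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Criterion:
    name: str
    description: str   # what the evaluator should look for
    threshold: int     # minimum acceptable score on an assumed 1-5 scale


RUBRIC: List[Criterion] = [
    Criterion("design", "Layout and visual hierarchy are coherent", 4),
    Criterion("originality", "Avoids generic, template-like choices", 3),
    Criterion("craft", "Polish: spacing, states, error handling", 4),
    Criterion("functionality", "Every advertised feature works when exercised", 5),
]


def failing_criteria(scores: Dict[str, int], rubric: List[Criterion] = RUBRIC) -> List[str]:
    """Return the names of criteria whose score is missing or below threshold."""
    return [c.name for c in rubric if scores.get(c.name, 0) < c.threshold]


example_scores = {"design": 4, "originality": 2, "craft": 5, "functionality": 5}
needs_work = failing_criteria(example_scores)
```

Scoring each criterion separately gives the generator targeted feedback ("originality is below the bar") rather than a single pass/fail verdict.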

Topics

Long-Running AI Agents
Agent Harness Design
Generator-Evaluator Pattern
Anthropic Claude
Multi-Agent Systems

Best for: AI Engineer, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by AI Engineer.