Karpathy’s Autoresearch for Agent Engineering
Summary
Kevin Gu has open-sourced AutoAgent, a library that applies Karpathy's autoresearch concept to agent engineering: a meta-agent autonomously improves a task agent's prompts, tools, and orchestration logic by running thousands of parallel sandboxed experiments. AutoAgent achieved top scores on SpreadsheetBench (96.5%) and TerminalBench (55.1%), surpassing hand-engineered solutions. A key finding is that same-model pairings (e.g., Claude meta + Claude task) perform better, an effect attributed to "model empathy." The system showed emergent behaviors such as inventing spot-checking, building verification loops, writing unit tests, and spinning up sub-agents. Crucially, giving the meta-agent full reasoning trajectories, rather than only benchmark scores, is what enabled targeted harness edits. Separately, Anthropic has ended subscription coverage for third-party tools like OpenClaw, so users must switch to an Anthropic API key with pay-as-you-go billing.
Key takeaway
For AI Architects designing and deploying agentic systems, AutoAgent presents a compelling approach to automate agent optimization. You should investigate integrating AutoAgent-like methodologies to continuously refine task-specific agents across your organization, especially for complex domains where manual tuning is impractical. Be aware of the recent change in Anthropic's subscription policy for third-party tools like OpenClaw, and ensure your agent workflows are configured with pay-as-you-go API keys to avoid service disruption.
Key insights
AutoAgent autonomously optimizes AI agent harnesses through meta-agent experimentation and "model empathy."
Principles
- Autoresearch applies to agent engineering.
- Same-model pairings enhance meta-agent performance.
- Full reasoning traces are vital for targeted optimization.
Method
A meta-agent runs parallel sandboxed experiments, editing a task agent's harness (prompts, tools, orchestration) and hill-climbing on benchmark scores, using full reasoning trajectories for targeted improvements.
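The loop described above can be sketched roughly as follows. This is a toy illustration, not AutoAgent's actual code or API: `Harness`, `RunResult`, `run_benchmark`, `propose_edits`, and `REWARDED_TOOLS` are hypothetical names, the benchmark and the meta-agent are replaced by deterministic stand-ins, and candidates are evaluated sequentially rather than in parallel sandboxes. It only shows the shape of the idea: read the trajectory, propose harness edits, keep an edit only if the score improves.

```python
from dataclasses import dataclass, replace

# Hypothetical behaviours the toy benchmark rewards; not taken from AutoAgent.
REWARDED_TOOLS = ("spot_check", "verification_loop", "unit_tests")


@dataclass(frozen=True)
class Harness:
    prompt: str
    tools: tuple = ()


@dataclass(frozen=True)
class RunResult:
    score: float
    trajectory: tuple  # stand-in for the task agent's full reasoning trace


def run_benchmark(harness):
    """Toy stand-in for one sandboxed benchmark run."""
    missing = [t for t in REWARDED_TOOLS if t not in harness.tools]
    score = 1.0 - len(missing) / len(REWARDED_TOOLS)
    trajectory = tuple(f"failed: no {t}" for t in missing)
    return RunResult(score, trajectory)


def propose_edits(harness, result):
    """Toy stand-in for the meta-agent: one candidate harness per failure in the trace."""
    prefix = "failed: no "
    return [
        replace(harness, tools=harness.tools + (line[len(prefix):],))
        for line in result.trajectory
        if line.startswith(prefix)
    ]


def hill_climb(harness, rounds=10):
    """Keep the best-scoring candidate each round; stop when nothing improves."""
    best, best_result = harness, run_benchmark(harness)
    for _ in range(rounds):
        candidates = propose_edits(best, best_result)
        if not candidates:
            break
        # A real system would run these candidates in parallel sandboxes.
        scored = [(run_benchmark(c), c) for c in candidates]
        top_result, top = max(scored, key=lambda pair: pair[0].score)
        if top_result.score <= best_result.score:
            break
        best, best_result = top, top_result
    return best, best_result


if __name__ == "__main__":
    final, result = hill_climb(Harness(prompt="Solve the task."))
    print(final.tools, result.score)
```

Note that `propose_edits` reads the trajectory, not the score. With only a scalar score the stand-in meta-agent would have nothing to tell it which tool is missing, which mirrors the point that full reasoning traces make targeted edits possible.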
In practice
- Explore AutoAgent for domain-specific agent optimization.
- Consider same-model LLM pairings for meta-agent tasks.
- Shift OpenClaw workflows to Anthropic API keys.
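For the API key switch, a small preflight check can catch a missing key before any agent run starts. The sketch below assumes the workflow reads its key from the `ANTHROPIC_API_KEY` environment variable, which is the variable the official Anthropic SDKs read by default; how OpenClaw itself is configured is not covered by the source, and `require_api_key` is a hypothetical helper name.

```python
import os


def require_api_key(env=None, var="ANTHROPIC_API_KEY"):
    """Return the key from the environment, or raise before any agent run starts."""
    env = os.environ if env is None else env
    key = env.get(var, "").strip()
    if not key:
        raise RuntimeError(f"{var} is not set; configure a pay-as-you-go API key first")
    return key
```

Calling `require_api_key()` at the top of a workflow script fails fast with a clear message instead of surfacing an authentication error partway through a run.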
Topics
- AutoAgent
- Agent Engineering
- Google Gemma 4
- gstack
- OpenClaw Dreaming
Best for: AI Architect, AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by unwind ai.