ClawBench: Can AI Agents Complete Everyday Online Tasks?
Summary
ClawBench is a new evaluation framework comprising 153 everyday online tasks across 144 live platforms and 15 categories, designed to test AI agents' ability to automate routine life and work activities. Unlike existing benchmarks that use static, offline sandboxes, ClawBench operates on production websites, capturing the full complexity and dynamic nature of real-world web interaction. Tasks range from completing purchases and booking appointments to submitting job applications, requiring capabilities like extracting information from user documents, navigating multi-step workflows, and filling detailed forms. A lightweight interception layer ensures safe evaluation by blocking only final submission requests. Initial evaluations of 7 frontier models, including Claude Sonnet 4.6, show that both proprietary and open-source models complete only a small fraction of these tasks; for instance, Claude Sonnet 4.6 achieved 33.3%.
Key takeaway
Research scientists developing AI agents should prioritize two capabilities: navigating dynamic, multi-step online workflows, and accurately handling write-heavy operations on live production websites. The low success rates on ClawBench indicate that current models are far from reliable general-purpose assistants, and they point to where development effort is most needed.
Key insights
ClawBench evaluates AI agents on 153 real-world online tasks across 144 live platforms, and even frontier models complete only a small fraction of them.
Principles
- Real-world web interaction is complex.
- Dynamic environments challenge AI agents.
- Safe evaluation requires submission interception.
Method
ClawBench runs agents on 153 tasks across 144 live production websites. A lightweight interception layer blocks only the final submission request, so an agent can work through the full workflow while the evaluation causes no real-world side effects.
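The idea of blocking only the final submission can be illustrated with a minimal sketch. This is not ClawBench's actual implementation, which the summary does not describe. The `SubmissionRule` class, the `should_block` function, and the `shop.example.com` host and path are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from urllib.parse import urlparse
import re


@dataclass(frozen=True)
class SubmissionRule:
    """Hypothetical rule identifying one task's final, irreversible request."""
    method: str        # HTTP method of the final request, e.g. "POST"
    host: str          # host name the final request is sent to
    path_pattern: str  # regular expression matched against the whole URL path


def should_block(method: str, url: str, rules: list[SubmissionRule]) -> bool:
    """Return True only when a request matches a final-submission rule.

    Every other request (page loads, search, add-to-cart, form autosave)
    is allowed through, so the agent still interacts with the live site.
    """
    parsed = urlparse(url)
    for rule in rules:
        if (method.upper() == rule.method.upper()
                and parsed.hostname == rule.host
                and re.fullmatch(rule.path_pattern, parsed.path)):
            return True
    return False


# Assumed example: the order-confirmation endpoint of a made-up shop.
rules = [SubmissionRule("POST", "shop.example.com", r"/checkout/confirm")]

print(should_block("GET", "https://shop.example.com/cart", rules))
print(should_block("POST", "https://shop.example.com/cart/add", rules))
print(should_block("POST", "https://shop.example.com/checkout/confirm", rules))
```

In a real harness, a predicate like this would presumably be called from a browser automation tool's request-interception hook, which would abort matching requests and record their payloads for grading. Keeping the rule narrow (method, host, and exact path) reflects the summary's point that only the final submission is blocked, not the intermediate steps.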
In practice
- Test agents on multi-step workflows.
- Incorporate document-based information retrieval.
- Design for dynamic web page changes.
Topics
- ClawBench
- AI Agents
- Online Task Automation
- Web Interaction
- Evaluation Frameworks
Best for: Research Scientist, AI Scientist, AI Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.