ClawBench: Can AI Agents Complete Everyday Online Tasks?

2026-04-09 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

ClawBench is a new evaluation framework comprising 153 everyday online tasks across 144 live platforms and 15 categories, designed to test AI agents' ability to automate routine life and work activities. Unlike existing benchmarks that use static, offline sandboxes, ClawBench operates on production websites, capturing the full complexity and dynamic nature of real-world web interaction. Tasks range from completing purchases and booking appointments to submitting job applications, requiring capabilities like extracting information from user documents, navigating multi-step workflows, and filling detailed forms. A lightweight interception layer ensures safe evaluation by blocking only final submission requests. Initial evaluations of 7 frontier models, including Claude Sonnet 4.6, show that both proprietary and open-source models complete only a small fraction of these tasks; for instance, Claude Sonnet 4.6 achieved 33.3%.
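To put the headline number in concrete terms, here is a minimal sketch of how a completion rate is tallied. It assumes the 33.3% figure is a plain fraction of completed tasks over all 153 tasks, in which case it corresponds to 51 completed tasks. The summary does not state the raw count or the scoring rule, so both the function name `completion_rate` and the outcome list are illustrative assumptions.

```python
def completion_rate(outcomes):
    """Return the percentage of tasks marked as completed.

    `outcomes` is a list of booleans, one per task (hypothetical format).
    """
    if not outcomes:
        raise ValueError("at least one task outcome is required")
    # True counts as 1 and False as 0 when summed.
    return 100.0 * sum(outcomes) / len(outcomes)


# Assumed illustration: 51 completed tasks out of 153.
outcomes = [True] * 51 + [False] * 102
print(f"{completion_rate(outcomes):.1f}%")  # prints "33.3%"
```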

Key takeaway

Research scientists developing AI agents should prioritize two capabilities: navigating dynamic, multi-step online workflows and accurately handling write-heavy operations on live production websites. The low success rates on ClawBench indicate that current models are far from reliable general-purpose assistants and mark these capabilities as critical areas for future development.

Key insights

ClawBench evaluates AI agents on 153 real-world online tasks across 144 live platforms. The 7 frontier models tested complete only a small fraction of them; Claude Sonnet 4.6, for instance, achieved 33.3%.

Principles

Real-world web interaction is complex.
Dynamic environments challenge AI agents.
Safe evaluation requires submission interception.

Method

ClawBench runs 153 tasks on 144 live production websites. A lightweight interception layer blocks only the final submission request, so agents can be evaluated without real-world side effects.
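The idea of blocking only the final submission can be sketched as a small decision rule. This is not ClawBench's actual implementation, which the summary does not describe. The `Request` type, the `decide` function, the set of write methods, and the URL patterns are all hypothetical names chosen for illustration. The sketch assumes each task ships a list of patterns identifying its final submission endpoint.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Assumption: only state-changing HTTP methods can be final submissions.
WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


@dataclass(frozen=True)
class Request:
    """Hypothetical minimal view of an outgoing browser request."""
    method: str
    url: str


def is_final_submission(request, submit_patterns):
    """True if the request is a write to a known submission endpoint."""
    if request.method.upper() not in WRITE_METHODS:
        return False
    parts = urlsplit(request.url)
    # Match on host plus path; the query string and fragment are ignored.
    target = parts.netloc + parts.path
    return any(fnmatchcase(target, pattern) for pattern in submit_patterns)


def decide(request, submit_patterns):
    """Block the final submission; let every other request through."""
    if is_final_submission(request, submit_patterns):
        return "block"
    return "allow"


# Hypothetical per-task configuration.
patterns = ["shop.example.com/checkout/submit"]
print(decide(Request("POST", "https://shop.example.com/cart/add"), patterns))
print(decide(Request("POST", "https://shop.example.com/checkout/submit"), patterns))
```

Under these assumptions, the agent can browse, add items to a cart, and fill forms as usual, and only the last, irreversible request is stopped. In a real harness a rule like this would sit inside the browser automation tool's request-routing hook.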

In practice

Test agents on multi-step workflows.
Incorporate document-based information retrieval.
Design for dynamic web page changes.

Topics

ClawBench
AI Agents
Online Task Automation
Web Interaction
Evaluation Frameworks

Best for: Research Scientist, AI Scientist, AI Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.