RealClawBench: Live OpenClaw Benchmarks from Real Developer-Agent Sessions

Source: cs.CL updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering, Data Science & Analytics) · Depth: Expert, extended

Summary

RealClawBench is a live benchmark framework for evaluating developer-facing AI agents on tasks derived from real OpenClaw user sessions. It addresses the "realism gap" in existing benchmarks by capturing the actual distribution, diversity, and difficulty of deployed agent use. The framework converts 76,155 real sessions into 281 reproducible, automatically scored tasks by reconstructing execution environments and building deterministic verifiers, while keeping the maximum Jensen-Shannon divergence from the source distribution at 0.0448. In an evaluation of 14 contemporary models, including Claude Opus 4.7 and GPT-5.5, the best system, Claude Opus 4.7, achieved only 65.8% success at a cost of $46.54 for a full benchmark run. MiMo V2.5 Pro reached a competitive 63.0% subtask average at $11.87. These results indicate significant headroom for agents on realistic developer workloads. The framework also supports live, versioned releases to combat staleness.
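To make the distribution-fidelity figure concrete, here is a minimal sketch of how a maximum Jensen-Shannon divergence between source sessions and benchmark tasks could be computed. The summary does not say which dimensions are compared or which logarithm base is used, so the dimension names, the counts, and the base-2 choice (which bounds the divergence to [0, 1]) are all assumptions for illustration, not the paper's data or procedure.

```python
import math


def js_divergence(p, q, base=2.0):
    """Jensen-Shannon divergence between two discrete distributions.

    p and q are sequences of non-negative counts or probabilities over
    the same categories; both are normalized before comparison.
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same categories")
    total_p, total_q = sum(p), sum(q)
    p = [x / total_p for x in p]
    q = [x / total_q for x in q]
    # Mixture distribution M = (P + Q) / 2
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a_i == 0 contribute nothing to KL(a || b).
        return sum(x * math.log(x / y, base) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def max_js_divergence(dimensions):
    """Largest JSD over several dimensions of {name: (source, benchmark)}."""
    return max(js_divergence(src, bench) for src, bench in dimensions.values())


# Hypothetical category counts; these are NOT figures from the paper.
example_dimensions = {
    "task_type": ([500, 300, 200], [140, 85, 56]),
    "difficulty": ([600, 400], [165, 116]),
}
worst_case = max_js_divergence(example_dimensions)
```

A small maximum means that no single compared dimension of the benchmark drifts far from the deployed-usage distribution, which is the property the 0.0448 figure is reporting.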

Key takeaway

For AI Scientists and ML Engineers developing agent systems, RealClawBench highlights the need to evaluate models against real-world, deployed user interactions rather than solely authored tasks. An agent's performance on traditional benchmarks may not reflect its true capability or cost-efficiency in production. Consider integrating real-session-derived benchmarks into your evaluation pipeline to identify actual performance headroom and to optimize for realistic developer workloads. Pay particular attention to cost-performance tradeoffs: MiMo V2.5 Pro scores within a few points of the top system at roughly a quarter of the cost per run.
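The cost-performance tradeoff can be worked through with the two data points the summary reports. This is illustrative arithmetic only: the summary describes one score as "success" and the other as a "subtask average", and it is an assumption here that the two are comparable enough to set side by side.

```python
# Scores (percent) and full-run costs (USD) as reported in the summary.
# Assumption: the two scores are treated as comparable for illustration.
reported = {
    "Claude Opus 4.7": {"score_pct": 65.8, "cost_usd": 46.54},
    "MiMo V2.5 Pro": {"score_pct": 63.0, "cost_usd": 11.87},
}


def cost_per_point(score_pct, cost_usd):
    """Dollars spent per percentage point of benchmark score."""
    return cost_usd / score_pct


per_point = {
    name: cost_per_point(r["score_pct"], r["cost_usd"])
    for name, r in reported.items()
}

# How many times more the top system costs per run, and the score gap.
cost_ratio = (
    reported["Claude Opus 4.7"]["cost_usd"] / reported["MiMo V2.5 Pro"]["cost_usd"]
)
score_gap = (
    reported["Claude Opus 4.7"]["score_pct"] - reported["MiMo V2.5 Pro"]["score_pct"]
)

for name, value in per_point.items():
    print(f"{name}: ${value:.3f} per point")
print(f"cost ratio: {cost_ratio:.2f}x for a {score_gap:.1f}-point gap")
```

Under that assumption, the top system costs about 3.9 times as much per run for a 2.8-point higher score.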

Key insights

Agent benchmarks must reflect real-world deployed usage to accurately measure capability.

Principles

Method

RealClawBench samples deployed OpenClaw sessions, filters for quality, reconstructs execution environments, rewrites requests into standalone instructions, and builds deterministic verifiers to create reproducible tasks.
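As a rough illustration of what a reproducible task with a deterministic verifier might look like, the sketch below pairs an environment-setup step with a standalone instruction and a pass/fail check. Every name here (`Task`, `run_task`, the config-file example, the toy agents) is hypothetical and chosen for illustration; none of it is taken from the RealClawBench codebase.

```python
import configparser
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass(frozen=True)
class Task:
    """Hypothetical task record: environment + instruction + verifier."""
    task_id: str
    instruction: str                 # standalone rewrite of the user request
    setup: Callable[[Path], None]    # reconstructs the execution environment
    verify: Callable[[Path], bool]   # deterministic pass/fail check


def run_task(task, agent):
    """Run an agent on a task in a fresh directory and score the result."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        task.setup(workdir)
        agent(task.instruction, workdir)
        return task.verify(workdir)


def _setup(workdir):
    (workdir / "config.ini").write_text("[server]\nport = 8080\n")


def _verify(workdir):
    # Checks final state only, so the same outcome always scores the same.
    parser = configparser.ConfigParser()
    parser.read(workdir / "config.ini")
    return parser.get("server", "port", fallback=None) == "9090"


example_task = Task(
    task_id="demo-001",
    instruction="Change the server port in config.ini from 8080 to 9090.",
    setup=_setup,
    verify=_verify,
)


def editing_agent(instruction, workdir):
    """Toy stand-in for a real agent: performs the requested edit."""
    (workdir / "config.ini").write_text("[server]\nport = 9090\n")


def noop_agent(instruction, workdir):
    """Toy stand-in for an agent that does nothing."""
```

Checking the final state of the working directory, rather than the agent's transcript, is one way to keep scoring automatic and repeatable across runs and models.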

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.