Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks

Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering) · Depth: Expert, medium

Summary

Claw-SWE-Bench is a multilingual benchmark and adapter protocol for fairly evaluating OpenClaw-style agent harnesses on coding tasks. It addresses limitations of the original SWE-bench when applied to generic agents by holding the evaluation settings constant, including a fixed prompt, runtime budget, and workspace contract. The full benchmark comprises 350 GitHub issue-resolution instances spanning 8 languages and 43 repositories; a smaller 80-instance subset, Claw-SWE-Bench Lite, supports quicker validation. Initial evaluations show that adapter design is crucial: OpenClaw achieves only 19.1% Pass@1 with a minimal adapter but reaches 73.4% with a full adapter using the GLM 5.1 backbone, a difference of 54.3 percentage points. The benchmark also shows that model choice shifts Pass@1 by 29.4 percentage points and harness choice by 27.4 percentage points, and that systems with similar accuracy can differ substantially in total API cost.

Key takeaway

For AI engineers evaluating or deploying autonomous coding agents, the underlying language model is only part of the picture. This research shows that adapter design significantly influences agent performance: a full adapter raises OpenClaw's Pass@1 from 19.1% to 73.4%. Test different harness configurations rigorously and treat cost accounting as a primary metric, since systems with similar accuracy can vary widely in API expenses. Benchmarks such as Claw-SWE-Bench help keep these evaluations fair and comprehensive.
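As a rough illustration of treating cost as a first-class metric, the sketch below reports Pass@1 alongside API cost per resolved instance. The `RunSummary` type, the system names, and the resolved counts and dollar figures are made up for illustration and are not results from the benchmark; only the 19.1% and 73.4% figures come from the summary above.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    """Aggregate result of one harness/model run (hypothetical structure)."""
    name: str
    resolved: int        # instances whose patch passed evaluation
    attempted: int       # instances attempted, one try each
    api_cost_usd: float  # total API spend for the run

    @property
    def pass_at_1(self) -> float:
        return self.resolved / self.attempted

    @property
    def cost_per_resolved(self) -> float:
        # Infinite cost if nothing was resolved, so such runs sort last.
        return self.api_cost_usd / self.resolved if self.resolved else float("inf")


# Illustrative numbers only: two systems with identical accuracy, different spend.
runs = [
    RunSummary("system-a", resolved=40, attempted=80, api_cost_usd=20.0),
    RunSummary("system-b", resolved=40, attempted=80, api_cost_usd=60.0),
]

for run in sorted(runs, key=lambda r: r.cost_per_resolved):
    print(f"{run.name}: Pass@1={run.pass_at_1:.1%}, "
          f"cost per resolved=${run.cost_per_resolved:.2f}")

# Adapter effect reported in the summary, in percentage points.
adapter_gain_pp = round(73.4 - 19.1, 1)
print(f"full vs. minimal adapter: +{adapter_gain_pp} pp")
```

Ranking by cost per resolved instance is one possible choice; total cost at a fixed accuracy threshold would serve equally well, depending on what a deployment is optimizing for.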

Key insights

Adapter design and cost accounting are critical, often overlooked factors in evaluating coding agents.

Method

Claw-SWE-Bench proposes a protocol that fixes the prompt, runtime budget, workspace contract, patch extraction, and evaluator so that agents are compared on equal terms. It also includes a Lite subset, selected by a cost-aware, rank-aware procedure.
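One way to picture such a protocol is as a contract of fixed settings plus a thin adapter interface per harness. The sketch below is a minimal illustration under that assumption, not the benchmark's actual API: `EvalContract`, `HarnessAdapter`, `pass_at_1`, `CannedAdapter`, the field names, and the toy patch format are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Protocol, Sequence


@dataclass(frozen=True)
class EvalContract:
    """Settings held constant across harnesses (hypothetical field names)."""
    prompt_template: str   # fixed prompt, filled in with the issue text
    runtime_budget_s: int  # per-instance wall-clock budget
    workspace_dir: str     # where the repository checkout is expected


class HarnessAdapter(Protocol):
    """What each harness must implement to take part in the comparison."""
    def run(self, prompt: str, contract: EvalContract) -> str: ...


def pass_at_1(
    adapter: HarnessAdapter,
    contract: EvalContract,
    instances: Sequence[dict],
    extract_patch: Callable[[str], str],
    evaluator: Callable[[dict, str], bool],
) -> float:
    """Run every instance once and return the fraction resolved."""
    if not instances:
        raise ValueError("no instances to evaluate")
    resolved = 0
    for inst in instances:
        prompt = contract.prompt_template.format(issue=inst["issue"])
        raw_output = adapter.run(prompt, contract)  # harness-specific step
        patch = extract_patch(raw_output)           # shared patch extraction
        if evaluator(inst, patch):                  # shared evaluator
            resolved += 1
    return resolved / len(instances)


class CannedAdapter:
    """Toy stand-in for a real harness: always proposes the same patch."""
    def run(self, prompt: str, contract: EvalContract) -> str:
        return "notes from the agent\nPATCH:fix-a"


contract = EvalContract("Resolve this issue:\n{issue}", 1800, "/workspace/repo")
instances = [
    {"issue": "bug A", "expected": "fix-a"},
    {"issue": "bug B", "expected": "fix-b"},
]
score = pass_at_1(
    CannedAdapter(), contract, instances,
    extract_patch=lambda out: out.split("PATCH:", 1)[-1],
    evaluator=lambda inst, patch: patch == inst["expected"],
)
print(f"Pass@1 = {score:.1%}")
```

Keeping patch extraction and evaluation outside the adapter means only the `run` step varies between harnesses, which is the property that makes scores comparable.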

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.