TensorBench: Benchmarking Coding Agents on a Compiler-Based Tensor Framework

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

TensorBench is a new benchmark of 199 feature-addition and refactoring tasks for Scorch, an open-source, compiler-based tensor framework that extends PyTorch with first-class dense and sparse tensor support. It targets the trade-off in repository-level coding benchmarks between task difficulty and reliable grading, with tasks that cover new sparse formats, dense optimization passes, and IR transformations. The benchmark grades an agent by applying its patch and running the framework's test suite, which includes pre-existing randomized regression tests and agent-added checks. Seven coding agents, spanning three frontier model families and one open-weight model, achieved pass rates from 22.1% to 64.8%. The strongest agent, Claude 4.7, reached 64.8%, a 22.1-point improvement over its predecessor. Inter-agent agreement was low: pairwise Cohen's κ ranged from -0.07 to 0.43, indicating that agents pass different subsets of tasks.
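
To make the agreement statistic concrete, here is a minimal sketch of pairwise Cohen's κ over per-task pass/fail outcomes. The agent names and outcome vectors are invented for illustration and are not taken from the benchmark. The function names are hypothetical, and this is not the paper's evaluation code.

```python
import math
from itertools import combinations


def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length pass/fail sequences (True = pass)."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    # Observed agreement: fraction of tasks where both agents got the same outcome.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement, computed from each agent's marginal pass rate.
    pa = sum(a) / n
    pb = sum(b) / n
    p_e = pa * pb + (1 - pa) * (1 - pb)
    if p_e == 1.0:
        # Both agents are constant with the same label; kappa is undefined.
        return math.nan
    return (p_o - p_e) / (1 - p_e)


def pairwise_kappa(results):
    """Kappa for every unordered pair of agents in a {name: outcomes} mapping."""
    return {
        (m, n): cohens_kappa(results[m], results[n])
        for m, n in combinations(sorted(results), 2)
    }


# Illustrative data only: three hypothetical agents on four tasks.
example = {
    "agent_a": [True, True, False, False],
    "agent_b": [True, False, True, False],
    "agent_c": [True, True, False, False],
}
for pair, kappa in pairwise_kappa(example).items():
    print(pair, round(kappa, 2))
```

A κ near 0 means two agents agree no more often than their individual pass rates would predict by chance, and a negative κ means they agree less often than chance. This is why the low values reported above point to agents solving different subsets of tasks.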

Key takeaway

AI engineers developing or evaluating coding agents should prioritize benchmarks that, like TensorBench, combine challenging repository-level tasks with rigorous, behavior-preserving evaluation criteria. The low inter-agent agreement suggests that combining outputs from multiple agents could raise overall task completion rates. Also favor agents that perform well on tasks requiring coordinated changes across compiler components, as these represent a current frontier for capability gains.
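
One way to gauge how much headroom multi-agent combination offers is to compare each agent's pass rate with the fraction of tasks solved by at least one agent. The sketch below uses invented outcome vectors and hypothetical function names. The union rate is an optimistic upper bound, since it assumes an oracle that always picks a passing patch when one exists.

```python
def pass_rate(outcomes):
    """Fraction of tasks passed by a single agent."""
    return sum(outcomes) / len(outcomes)


def oracle_union_rate(results):
    """Fraction of tasks passed by at least one agent (an upper bound)."""
    lengths = {len(v) for v in results.values()}
    if len(lengths) != 1:
        raise ValueError("all agents must be scored on the same tasks")
    # zip(*...) regroups per-agent vectors into per-task tuples.
    solved = [any(task) for task in zip(*results.values())]
    return sum(solved) / len(solved)


# Illustrative data only: three hypothetical agents on five tasks.
results = {
    "agent_a": [True, False, False, True, False],
    "agent_b": [False, True, False, True, False],
    "agent_c": [False, False, False, True, True],
}
best_single = max(pass_rate(v) for v in results.values())
print(f"best single agent: {best_single:.0%}")
print(f"oracle union:      {oracle_union_rate(results):.0%}")
```

The gap between the two numbers is largest when agents disagree, which is the situation low pairwise κ describes. A practical ensemble would recover only part of that gap, because it has to choose among patches without access to the grading outcome.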

Key insights

Repository-level coding benchmarks require difficult tasks with robust, behavior-preserving evaluation to separate capability from noise.

Method

TensorBench grades an agent by applying its patch to the repository, rebuilding the C++ runtime, and executing the full test suite, which includes both pre-existing randomized regression tests and agent-added feature tests.
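
The apply, rebuild, and test sequence can be sketched as a small harness. This is a minimal sketch under stated assumptions, not TensorBench's actual grader. The use of `git apply`, the function name `grade_patch`, and the build and test commands in the usage note are all assumptions, since the source does not specify them.

```python
import subprocess


def grade_patch(repo_dir, patch_file, build_cmd, test_cmd, run=subprocess.run):
    """Apply a patch, rebuild, and run the tests; report the first failing stage.

    Returns "pass" if every stage exits with code 0, otherwise "fail:<stage>".
    `patch_file` should be absolute or relative to `repo_dir`, because every
    command runs with `repo_dir` as its working directory.
    `run` is injectable so the harness can be exercised without a real repo.
    """
    stages = [
        # Assumption: patches are unified diffs that `git apply` accepts.
        ("apply", ["git", "apply", str(patch_file)]),
        # Assumption: the caller supplies whatever rebuilds the C++ runtime.
        ("build", list(build_cmd)),
        # Assumption: the caller supplies the command for the full test suite.
        ("test", list(test_cmd)),
    ]
    for name, cmd in stages:
        proc = run(cmd, cwd=str(repo_dir))
        if proc.returncode != 0:
            return f"fail:{name}"
    return "pass"
```

A call might look like `grade_patch("/path/to/repo", "/tmp/agent.diff", ["make"], ["pytest"])`, where both commands are placeholders. Stopping at the first failing stage keeps a patch that does not apply or does not build from being counted as a test failure, which helps when analyzing why agents fail.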

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.