TensorBench: Benchmarking Coding Agents on a Compiler-Based Tensor Framework

2026-06-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

TensorBench is a new benchmark comprising 199 feature-addition and refactoring tasks for Scorch, an open-source compiler-based tensor framework that extends PyTorch with first-class dense and sparse tensor support. It targets a trade-off in repository-level coding benchmarks by pairing difficult tasks with robust, behavior-preserving evaluation; the tasks cover new sparse formats, dense optimization passes, and IR transformations. The benchmark grades an agent by applying its patch and running the framework's test suite, which includes pre-existing randomized regression tests and agent-added checks. Seven coding agents, spanning three frontier model families and one open-weight model, achieved pass rates from 22.1% to 64.8%. The strongest agent, Claude 4.7, reached 64.8%, a 22.1-point improvement over its predecessor. Inter-agent agreement was low, with pairwise Cohen's κ ranging from -0.07 to 0.43, indicating that the agents pass different subsets of tasks.
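To make the agreement statistic concrete, here is a minimal sketch of Cohen's κ for two agents' per-task pass/fail outcomes, using the standard definition κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The function name and the toy outcome vectors are illustrative assumptions, not data or code from the benchmark.

```python
from typing import Sequence


def cohen_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Cohen's kappa for two binary (pass/fail) outcome vectors over the same tasks."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("outcome vectors must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of tasks where both agents had the same outcome.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each agent's marginal pass rate.
    pass_a = sum(bool(x) for x in a) / n
    pass_b = sum(bool(y) for y in b) / n
    p_e = pass_a * pass_b + (1 - pass_a) * (1 - pass_b)
    if p_e == 1.0:
        # Both agents have the same constant outcome; kappa is undefined here.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


# Toy outcomes (made up for illustration): the agents agree on 2 of 4 tasks,
# which is exactly what chance predicts at 50% pass rates, so kappa is 0.
agent_x = [True, True, False, False]
agent_y = [True, False, True, False]
kappa_xy = cohen_kappa(agent_x, agent_y)
```

A κ near 0 means two agents agree no more often than their individual pass rates would predict by chance, and a negative κ means they agree less often than chance.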

Key takeaway

AI engineers developing or evaluating coding agents should prioritize benchmarks that, like TensorBench, combine challenging repository-level tasks with rigorous, behavior-preserving evaluation. The low inter-agent agreement suggests that combining outputs from multiple agents could substantially increase overall task completion rates. Agents that perform well on tasks requiring coordinated changes across compiler components also deserve attention, as these tasks represent a current frontier for capability gains.
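One way to reason about the potential gain from combining agents is to compute the union pass rate: the fraction of tasks that at least one agent solves. The sketch below assumes a hypothetical mapping from agent name to per-task pass/fail results; the names and values are invented for illustration. The union is an upper bound, since reaching it in practice would require a selector that always picks a passing patch when one exists.

```python
from typing import Dict, List


def union_pass_rate(results: Dict[str, List[bool]]) -> float:
    """Fraction of tasks passed by at least one agent (an oracle upper bound)."""
    if not results:
        raise ValueError("results must contain at least one agent")
    lengths = {len(outcomes) for outcomes in results.values()}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("every agent needs outcomes for the same non-empty task list")
    n_tasks = lengths.pop()
    solved_by_any = [
        any(outcomes[i] for outcomes in results.values()) for i in range(n_tasks)
    ]
    return sum(solved_by_any) / n_tasks


# Hypothetical results on four tasks: each agent passes 25% on its own,
# but they pass different tasks, so the union covers 50%.
toy_results = {
    "agent_a": [True, False, False, False],
    "agent_b": [False, True, False, False],
}
combined = union_pass_rate(toy_results)
```

When agents pass overlapping tasks, the union barely exceeds the best single agent; when they pass different tasks, as low κ values suggest, the gap can be large.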

Key insights

Repository-level coding benchmarks require difficult tasks with robust, behavior-preserving evaluation to separate capability from noise.

Principles

Repository-level benchmarks need difficult tasks with strong evaluation signals.
Inherited regression suites are crucial for detecting mistakes in shared compiler machinery.
Agent-added tests primarily constrain new-feature behavior.

Method

TensorBench grades by applying an agent's patch to a repository, rebuilding the C++ runtime, and executing the full test suite, including both pre-existing randomized regression tests and agent-added feature tests.
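The grading flow described above (apply patch, rebuild, run tests) can be sketched as a short pipeline that stops at the first failing step. This is an illustrative sketch only: the specific commands, the `build` and `tests` directory names, and the use of CMake and pytest are assumptions, since the summary does not specify how Scorch is built or tested.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Step = Tuple[str, Sequence[str]]


@dataclass
class StepResult:
    name: str
    ok: bool


def run_steps(steps: Sequence[Step], cwd: str = ".") -> List[StepResult]:
    """Run named commands in order, stopping at the first one that fails."""
    results: List[StepResult] = []
    for name, cmd in steps:
        proc = subprocess.run(list(cmd), cwd=cwd, capture_output=True, text=True)
        ok = proc.returncode == 0
        results.append(StepResult(name, ok))
        if not ok:
            break
    return results


def task_passes(results: Sequence[StepResult], n_steps: int) -> bool:
    """A task passes only if every step ran and every step succeeded."""
    return len(results) == n_steps and all(r.ok for r in results)


def grading_steps(patch_path: str) -> List[Step]:
    """Hypothetical command list; the real build and test commands are assumptions."""
    return [
        ("apply_patch", ["git", "apply", patch_path]),
        ("rebuild_runtime", ["cmake", "--build", "build"]),
        ("run_test_suite", [sys.executable, "-m", "pytest", "tests"]),
    ]
```

Calling `run_steps(grading_steps("agent.patch"), cwd=repo_dir)` and then `task_passes(...)` would yield a single pass/fail verdict per task. Running the whole suite, not just the new tests, is what lets inherited regression tests catch breakage in shared compiler machinery.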

In practice

Extend user-facing APIs for tensor operations.
Implement new sparse tensor formats like CSC or octree.
Add scheduler optimization passes for loop transformations.
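For readers unfamiliar with the sparse formats mentioned above, the following pure-Python sketch shows what compressed sparse column (CSC) storage records: the nonzero values in column-major order, their row indices, and a column pointer array marking where each column starts. It illustrates the general format only and is not Scorch's implementation; the function names are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def dense_to_csc(matrix: Matrix) -> Tuple[List[float], List[int], List[int]]:
    """Convert a dense row-major matrix to (values, row_indices, col_ptr)."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if n_rows else 0
    values: List[float] = []
    row_indices: List[int] = []
    col_ptr: List[int] = [0]
    for j in range(n_cols):
        for i in range(n_rows):
            v = matrix[i][j]
            if v != 0:
                values.append(v)
                row_indices.append(i)
        # col_ptr[j + 1] marks the end of column j's entries in `values`.
        col_ptr.append(len(values))
    return values, row_indices, col_ptr


def csc_to_dense(
    values: List[float], row_indices: List[int], col_ptr: List[int], n_rows: int
) -> Matrix:
    """Rebuild the dense matrix from its CSC arrays."""
    n_cols = len(col_ptr) - 1
    dense: Matrix = [[0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        for k in range(col_ptr[j], col_ptr[j + 1]):
            dense[row_indices[k]][j] = values[k]
    return dense


example = [
    [1, 0, 2],
    [0, 0, 3],
    [4, 0, 0],
]
vals, rows, ptr = dense_to_csc(example)
# vals == [1, 4, 2, 3], rows == [0, 2, 0, 1], ptr == [0, 2, 2, 4]
```

The empty middle column appears as two equal consecutive entries in `ptr`. Round-trip checks like `csc_to_dense(vals, rows, ptr, 3) == example` are similar in spirit to the behavior-preserving tests a new format would need to satisfy.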

Topics

Coding Agents
LLM Benchmarking
Tensor Frameworks
Compiler Optimization
Sparse Tensors
Code Generation

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.