TensorBench: Benchmarking Coding Agents on a Compiler-Based Tensor Framework
Summary
TensorBench is a new benchmark comprising 199 feature-addition and refactoring tasks for Scorch, an open-source compiler-based tensor framework extending PyTorch with first-class dense and sparse tensor support. It addresses the trade-off in repository-level coding benchmarks between task difficulty and reliable evaluation: its tasks cover new sparse formats, dense optimization passes, and IR transformations. The benchmark grades an agent by applying its patch and running the framework's test suite, which includes pre-existing randomized regression tests and agent-added feature tests. Seven coding agents, spanning three frontier model families and one open-weight model, achieved pass rates from 22.1% to 64.8%. The strongest agent, Claude 4.7, reached 64.8%, a 22.1-point improvement over its predecessor. Inter-agent agreement was low, with pairwise Cohen's κ ranging from -0.07 to 0.43, indicating that different agents pass different subsets of tasks.
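To make the agreement statistic concrete, the following is a minimal sketch of pairwise Cohen's κ over per-task pass/fail outcomes. The function names (`cohens_kappa`, `pairwise_kappa`) and the toy outcome vectors are illustrative assumptions, not the benchmark's code or data.

```python
from itertools import combinations


def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of pass/fail booleans."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    # Fraction of tasks on which both agents have the same outcome.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, given each agent's own pass rate.
    pass_a = sum(a) / n
    pass_b = sum(b) / n
    expected = pass_a * pass_b + (1 - pass_a) * (1 - pass_b)
    if expected == 1.0:
        # Both agents pass everything or fail everything: kappa is undefined.
        return float("nan")
    return (observed - expected) / (1 - expected)


def pairwise_kappa(results):
    """Map each unordered pair of agent names to its kappa value."""
    names = sorted(results)
    return {
        (m, n): cohens_kappa(results[m], results[n])
        for m, n in combinations(names, 2)
    }


if __name__ == "__main__":
    # Toy outcomes for three hypothetical agents on four tasks.
    toy = {
        "agent_a": [True, True, False, False],
        "agent_b": [True, False, True, False],
        "agent_c": [True, True, False, False],
    }
    for pair, kappa in pairwise_kappa(toy).items():
        print(pair, round(kappa, 2))
```

A κ near 0 means two agents agree about as often as their individual pass rates would predict by chance; a negative value means they agree less often than that.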
Key takeaway
AI engineers developing or evaluating coding agents should prioritize benchmarks that, like TensorBench, combine challenging repository-level tasks with rigorous, behavior-preserving evaluation criteria. The low inter-agent agreement suggests that combining outputs from multiple agents could significantly increase overall task completion rates. Furthermore, focus on agents that perform well on tasks requiring coordinated changes across compiler components, as these represent a current frontier for capability gains.
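As a rough illustration of why low agreement matters, the sketch below compares the best single agent with the union of tasks solved by any agent. The helper names and toy data are assumptions for illustration only. The union figure is an upper bound, since it presumes a perfect way of choosing the passing patch for each task.

```python
def pass_rate(outcomes):
    """Fraction of tasks passed by one agent."""
    return sum(outcomes) / len(outcomes)


def union_pass_rate(results):
    """Fraction of tasks passed by at least one agent (oracle upper bound)."""
    per_agent = list(results.values())
    task_count = len(per_agent[0])
    if any(len(outcomes) != task_count for outcomes in per_agent):
        raise ValueError("all agents must be scored on the same tasks")
    solved = [any(outcomes[i] for outcomes in per_agent) for i in range(task_count)]
    return sum(solved) / task_count


if __name__ == "__main__":
    # Toy outcomes for two hypothetical agents on four tasks.
    toy = {
        "agent_a": [True, False, False, False],
        "agent_b": [False, True, False, False],
    }
    best_single = max(pass_rate(outcomes) for outcomes in toy.values())
    print("best single agent:", best_single)
    print("union of agents:  ", union_pass_rate(toy))
```

When agents pass largely different tasks, the gap between the two numbers is wide; when they pass the same tasks, the union adds little over the best single agent.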
Key insights
Repository-level coding benchmarks require difficult tasks with robust, behavior-preserving evaluation to separate capability from noise.
Principles
- Repository-level benchmarks need difficult tasks with strong evaluation signals.
- Inherited regression suites are crucial for detecting mistakes in shared compiler machinery.
- Agent-added tests primarily constrain new-feature behavior.
Method
TensorBench grades an agent by applying its patch to the repository, rebuilding the C++ runtime, and executing the full test suite, which includes both pre-existing randomized regression tests and agent-added feature tests.
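The apply, rebuild, and test sequence can be sketched as a small staged pipeline. This is a sketch under stated assumptions, not the benchmark's harness: `grade_patch`, `GradeResult`, and the build and test commands passed in are hypothetical, and only `git apply` is a known real command here.

```python
import subprocess
from dataclasses import dataclass
from typing import Optional


@dataclass
class GradeResult:
    passed: bool
    failed_stage: Optional[str]  # "apply", "build", "test", or None on success


def grade_patch(repo_dir, patch_path, build_cmd, test_cmd, run=subprocess.run):
    """Apply a patch, rebuild, and run the test suite; stop at the first failure.

    patch_path should be absolute, because every command runs inside repo_dir.
    build_cmd and test_cmd are argument lists supplied by the caller; the
    source does not state the actual commands, so none are hard-coded here.
    """
    stages = [
        ("apply", ["git", "apply", patch_path]),
        ("build", build_cmd),
        ("test", test_cmd),
    ]
    for name, cmd in stages:
        completed = run(cmd, cwd=repo_dir)
        if completed.returncode != 0:
            return GradeResult(passed=False, failed_stage=name)
    return GradeResult(passed=True, failed_stage=None)
```

A caller might pass something like `["cmake", "--build", "build"]` and `["pytest", "-q"]` as the build and test commands, but those are placeholders. The `run` parameter exists so the pipeline can be exercised with a stub instead of real subprocesses.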
In practice
- Extend user-facing APIs for tensor operations.
- Implement new sparse tensor formats like CSC or octree.
- Add scheduler optimization passes for loop transformations.
Topics
- Coding Agents
- LLM Benchmarking
- Tensor Frameworks
- Compiler Optimization
- Sparse Tensors
- Code Generation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.