TensorBench: Benchmarking Coding Agents on a Compiler-Based Tensor Framework
Summary
TensorBench is a new benchmark for evaluating coding agents on a compiler-based tensor framework that extends PyTorch with first-class support for dense and sparse tensors. It comprises 199 feature-addition and refactoring tasks covering new sparse formats, dense optimization passes, IR transformations, scheduler changes, runtime components, and high-level numerical operators. To evaluate a task, the agent's code patch is applied and the framework's full test suite is run, which includes both the pre-existing randomized regression tests and any tests added by the agent. For feature-addition tasks, a pass requires preserving existing behavior and satisfying the agent-added checks. Initial evaluations of seven coding agents, spanning three frontier model families and one open-weight model, produced pass rates from 22.1% for the weakest agent to 64.8% for the strongest. The agents also differed in which tasks they solved: pairwise Cohen's κ values ranged from -0.07 to 0.43, and κ was only 0.05 for the two strongest agents, meaning their successes overlapped barely more than chance agreement would predict.
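To make the agreement statistic concrete, here is a minimal sketch of how pairwise Cohen's κ could be computed from per-task pass/fail outcomes. The function names and the toy data are illustrative assumptions, not taken from the original article, and the paper's exact computation may differ.

```python
from itertools import combinations


def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length lists of binary pass/fail outcomes."""
    if len(a) != len(b) or not a:
        raise ValueError("outcome lists must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of tasks where both agents had the same outcome.
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement, computed from each agent's marginal pass rate.
    rate_a = sum(a) / n
    rate_b = sum(b) / n
    p_chance = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if p_chance == 1.0:
        # Both agents passed everything or failed everything: kappa is undefined.
        return float("nan")
    return (p_observed - p_chance) / (1 - p_chance)


def pairwise_kappa(results):
    """Map each unordered pair of agent names to its kappa value."""
    return {
        (m, n): cohens_kappa(results[m], results[n])
        for m, n in combinations(sorted(results), 2)
    }


# Hypothetical outcomes on four tasks (1 = pass, 0 = fail); not real benchmark data.
toy_results = {
    "agent_a": [1, 1, 0, 0],
    "agent_b": [1, 0, 1, 0],
    "agent_c": [1, 1, 0, 0],
}
kappas = pairwise_kappa(toy_results)
```

In this toy data, `agent_a` and `agent_b` have the same pass rate yet agree only at chance level (κ = 0), which is the pattern the summary describes for the two strongest agents.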
Key takeaway
For AI engineers evaluating or developing coding agents, TensorBench offers a test-suite-based way to assess performance on complex compiler framework tasks. Consider using this benchmark to compare agent capabilities, especially for tasks involving sparse formats or IR transformations. The wide spread in pass rates, and the low agreement between agents on which tasks they solve, underscores the importance of comprehensive testing and suggests that no single agent currently excels across all the task types in the benchmark.
Key insights
TensorBench provides a 199-task, test-suite-based benchmark for coding agents working on a complex compiler-based tensor framework.
Principles
- Comprehensive test suites enhance coding agent evaluation reliability.
- Low pairwise agreement between agents suggests they succeed on different subsets of tasks.
Method
TensorBench evaluates an agent by applying its patch to the PyTorch-based framework and running the full test suite, including any agent-added tests, to decide whether each feature-addition or refactoring task passes.
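The apply-then-test procedure can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's actual harness: it assumes the framework lives in a git checkout, that patches apply with `git apply`, and that the test suite runs under `pytest`. The names `TaskResult`, `evaluate_patch`, and `pass_rate` are hypothetical.

```python
import subprocess
from dataclasses import dataclass
from pathlib import Path


@dataclass
class TaskResult:
    patch_applied: bool
    tests_passed: bool

    @property
    def passed(self) -> bool:
        # A task passes only if the patch applies and the full suite succeeds.
        return self.patch_applied and self.tests_passed


def _succeeds(cmd, cwd: Path) -> bool:
    """Run a command in cwd and report whether it exited with status 0."""
    return subprocess.run(cmd, cwd=cwd, capture_output=True, text=True).returncode == 0


def evaluate_patch(repo: Path, patch: Path,
                   test_cmd=("python", "-m", "pytest", "-q")) -> TaskResult:
    """Apply an agent's patch to the checkout, then run the whole test suite.

    Assumption: tests added by the agent are part of the patch, so the
    test runner picks them up together with the pre-existing tests.
    """
    if not _succeeds(["git", "apply", str(patch.resolve())], repo):
        return TaskResult(patch_applied=False, tests_passed=False)
    return TaskResult(patch_applied=True, tests_passed=_succeeds(list(test_cmd), repo))


def pass_rate(results) -> float:
    """Percentage of tasks that passed, rounded to one decimal place."""
    results = list(results)
    if not results:
        raise ValueError("no results to aggregate")
    return round(100 * sum(r.passed for r in results) / len(results), 1)
```

A real harness would also need a clean checkout per task, timeouts, and fixed seeds for the randomized regression tests; those are omitted here for brevity.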
In practice
- Integrate agent-added tests for robust feature validation.
- Benchmark coding agents on compiler-based framework tasks.
Topics
- Coding Agents
- TensorBench
- Compiler Frameworks
- PyTorch Extension
- Sparse Tensors
- Benchmark Evaluation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.