TensorBench: Benchmarking Coding Agents on a Compiler-Based Tensor Framework

2026-06-04 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, quick

Summary

TensorBench is a new benchmark for evaluating coding agents on a compiler-based tensor framework that extends PyTorch with first-class support for dense and sparse tensors. It comprises 199 feature-addition and refactoring tasks covering new sparse formats, dense optimization passes, IR transformations, scheduler changes, runtime components, and high-level numerical operators. Each task is evaluated by applying the agent's code patch and running the framework's full test suite, which includes both the pre-existing randomized regression tests and any tests the agent added. For a feature-addition task, a pass requires that existing behavior is preserved and that the agent-added checks succeed. Initial evaluations of seven coding agents, spanning three frontier model families and one open-weight model, produced pass rates from 22.1% for the weakest agent to 64.8% for the strongest. The agents also differed in which tasks they solved: pairwise agreement on task outcomes, measured by Cohen's κ, ranged from -0.07 to 0.43, and the two strongest agents agreed barely more than chance would predict (κ = 0.05). This indicates that agents succeed on largely different subsets of tasks.
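To make the agreement statistic above concrete, here is a minimal sketch of Cohen's κ for two agents' pass/fail outcomes on the same tasks. It assumes each agent's results are available as a list of booleans in the same task order; the function name and the example vectors are hypothetical and are not taken from the TensorBench paper.

```python
from typing import Sequence


def cohens_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Cohen's kappa for two binary pass/fail vectors over the same tasks.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each agent's own
    pass rate.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("outcome vectors must be non-empty and equal length")
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    rate_a = sum(a) / n
    rate_b = sum(b) / n
    # Chance agreement: both pass or both fail, if outcomes were independent.
    p_e = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if p_e == 1.0:
        # Both agents pass everything or fail everything: kappa is undefined.
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Hypothetical outcomes on four tasks, for illustration only.
    agent_a = [True, True, False, False]
    agent_b = [True, False, True, False]
    print(cohens_kappa(agent_a, agent_b))
```

A κ near 0, as reported for the two strongest agents, means that knowing one agent solved a task says almost nothing about whether the other did. Negative values mean the agents agree less often than chance would predict.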

Key takeaway

For AI engineers evaluating or developing coding agents, TensorBench offers a test-suite-graded way to assess performance on complex, real-world compiler framework tasks. Consider using it to compare agent capabilities, especially on tasks involving sparse formats or IR transformations. The low agreement between agents on which tasks they solve suggests that no single agent currently dominates across the benchmark's task types, so a single aggregate pass rate gives an incomplete comparison.

Key insights

TensorBench benchmarks coding agents on 199 feature-addition and refactoring tasks in a compiler-based tensor framework, graded by the framework's own test suite plus any tests the agent adds.

Principles

Comprehensive test suites enhance coding agent evaluation reliability.
Diverse agent performance suggests varied underlying capabilities.
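One way to examine the second principle is to count how many tasks each agent solves that the other does not. The sketch below assumes per-task boolean outcomes in a shared task order; the function name, the dictionary keys, and the example data are hypothetical illustrations, not results from the benchmark.

```python
from typing import Dict, Sequence


def complementarity(a: Sequence[bool], b: Sequence[bool]) -> Dict[str, float]:
    """Summarize the overlap between two agents' solved tasks."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("outcome vectors must be non-empty and equal length")
    only_a = sum(x and not y for x, y in zip(a, b))
    only_b = sum(y and not x for x, y in zip(a, b))
    both = sum(x and y for x, y in zip(a, b))
    return {
        "only_a": only_a,
        "only_b": only_b,
        "both": both,
        # Share of tasks solved by at least one of the two agents.
        "union_rate": (only_a + only_b + both) / len(a),
    }


if __name__ == "__main__":
    # Hypothetical outcomes on four tasks, for illustration only.
    print(complementarity([True, True, False, False], [True, False, True, False]))
```

When the "only" counts are large relative to "both", the two agents have complementary strengths, which fits the low κ values the summary reports.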

Method

TensorBench evaluates an agent by applying its patch to the framework, which extends PyTorch, and running the full test suite, including any agent-added tests, to decide whether the feature-addition or refactoring task passes.
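The verdict step for a feature-addition task can be sketched as follows. This models only the pass decision, not patch application or test execution, and it assumes test results have already been collected as mappings from test identifier to pass/fail. The `SuiteResult` type and all function names are hypothetical. The summary does not say how a patch that adds no tests is treated, so this sketch treats an empty set of agent-added tests as trivially satisfied; that choice is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, Sequence


@dataclass(frozen=True)
class SuiteResult:
    """Outcome of one test-suite run after applying an agent's patch."""
    pre_existing: Dict[str, bool]  # regression tests already in the repo
    agent_added: Dict[str, bool] = field(default_factory=dict)  # tests from the patch


def feature_task_passes(result: SuiteResult) -> bool:
    """Pass only if existing behavior is preserved and agent-added checks succeed.

    Assumption: all() over an empty mapping is True, so a patch with no
    added tests is judged on the pre-existing tests alone.
    """
    preserved = all(result.pre_existing.values())
    added_ok = all(result.agent_added.values())
    return preserved and added_ok


def pass_rate(verdicts: Sequence[bool]) -> float:
    """Fraction of tasks that passed, as reported per agent."""
    if not verdicts:
        raise ValueError("no task verdicts supplied")
    return sum(verdicts) / len(verdicts)
```

Keeping the verdict separate from the test runner makes the pass criterion easy to audit. A different rule for refactoring tasks, which the summary does not describe, could then be added without touching the execution code.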

In practice

Integrate agent-added tests for robust feature validation.
Benchmark coding agents on compiler-based framework tasks.

Topics

Coding Agents
TensorBench
Compiler Frameworks
PyTorch Extension
Sparse Tensors
Benchmark Evaluation

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.