AgentKernelArena: Generalization-Aware Benchmarking of GPU Kernel Optimization Agents

· Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

AgentKernelArena is an open-source benchmark designed to evaluate AI coding agents on GPU kernel optimization tasks, addressing limitations of prior benchmarks that focused on single LLM calls or lacked generalization testing. The benchmark features 196 tasks across three categories: HIP-to-HIP optimization, Triton-to-Triton optimization, and PyTorch-to-HIP translation. It evaluates complete agent workflows in isolated workspaces, employing gated compilation, correctness, and performance checks. A key innovation is an unseen-configuration generalization protocol, which tests whether optimizations transfer to input configurations the agent never observed. Experiments with production agents like Cursor Agent, Claude Code, and Codex Agent on AMD Instinct MI300X GPUs show high compilation and correctness rates. The strongest configurations achieved mean speedups up to 6.89x on PyTorch-to-HIP, 6.69x on HIP-to-HIP, and 2.13x on Triton-to-Triton tasks. However, PyTorch-to-HIP tasks exhibited substantial correctness drops on unseen configurations, indicating agents often hardcode shape-specific assumptions when generating kernels from scratch.

Key takeaway

For AI Engineers developing or deploying GPU kernel optimization agents, AgentKernelArena highlights the critical need to evaluate beyond seen input configurations. Your agents may achieve impressive speedups on known shapes, but the benchmark reveals significant correctness regressions (up to 40% for PyTorch-to-HIP tasks) when agents generate kernels from scratch and hardcode shape-specific assumptions. Prioritize unseen-configuration generalization testing to ensure the robustness and reliability of agent-generated kernels in production environments, especially for tasks involving code generation from high-level specifications.

Key insights

AgentKernelArena benchmarks AI agents on GPU kernel optimization, revealing strong performance but critical generalization gaps on unseen input configurations.

Principles

Method

AgentKernelArena evaluates agents in isolated workspaces through a gated compile-correctness-performance pipeline, using a scoring function and an unseen-configuration generalization protocol to measure robustness.

In practice

Topics

Code references

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Hardware Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.