The Correctness Illusion in LLM-Generated GPU Kernels
Summary
Existing benchmarks for LLM-generated GPU kernels, such as KernelBench, TritonBench, and GEAK, rely on fixed-shape, small-sample "allclose"-style checks, which can create a "correctness illusion": a kernel is certified as correct even though it contains a bug. The researchers tested this oracle empirically on a controlled corpus of 24 Triton and CPU kernels: 15 correct controls and 9 LLM-style buggy variants with documented transcription errors. They re-evaluated each kernel with op-schema-aware seeded fuzzing, a high-precision fp64 CPU reference, and per-(op, dtype) absolute tolerances. The seeded oracle flagged all 9 buggy kernels and passed all 15 correct controls, so the added sensitivity came at no cost in precision. Extending the corpus to 26 operations, including a flash-attention pair, and re-running the protocol on five GPU classes (RTX 3060, A10, L40S, A100 SXM4, H100 NVL) gave identical verdicts on every class: 10 of 10 illusions caught and 16 of 16 controls clean. The findings show how "allclose-on-one-shape" oracles can certify LLM-style transcription bugs as correct.
Key takeaway
If you evaluate LLM-generated GPU kernels as an AI scientist or machine learning engineer, be aware that standard "allclose"-style benchmarks can mask transcription errors, so correctness results obtained that way may reflect a "correctness illusion." Integrate op-schema-aware seeded fuzzing, a high-precision CPU reference, and per-operation, per-dtype absolute tolerances into your validation pipeline to check functional correctness across diverse inputs and hardware.
Key insights
Existing benchmarks for LLM-generated GPU kernels create a "correctness illusion": their limited, fixed-shape testing fails to detect common transcription errors.
Principles
- Fuzzing over varied inputs is essential for establishing kernel correctness.
- High-precision references validate generated kernels.
- Fixed-shape tests miss LLM transcription bugs.
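To make the last principle concrete, here is a minimal sketch of how a single-shape check can pass a buggy kernel. It uses plain Python as a stand-in for a GPU kernel; `buggy_sum`, `BLOCK`, the tolerance, and the test sizes are all hypothetical and are not taken from the original study. The bug drops any tail that does not fill a whole block, which is invisible when the one tested length is a multiple of the block size.

```python
import math
import random

BLOCK = 128  # hypothetical tile size assumed by the buggy kernel


def reference_sum(xs):
    """High-precision reference: correctly rounded float64 sum."""
    return math.fsum(xs)


def buggy_sum(xs):
    """Stand-in for a transcription bug: the partial tail block is skipped."""
    full = (len(xs) // BLOCK) * BLOCK
    return math.fsum(xs[:full])


def check(kernel, n, rng, atol=1e-9):
    """Compare the kernel against the reference on one random input of length n."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return abs(kernel(xs) - reference_sum(xs)) <= atol


def fixed_shape_oracle(kernel):
    """'allclose on one shape': a single length that is a multiple of BLOCK."""
    return check(kernel, 1024, random.Random(0))


def fuzzed_oracle(kernel, seed=0, trials=50):
    """Seeded fuzzing: a few edge-case lengths plus reproducible random lengths."""
    rng = random.Random(seed)
    lengths = [1, BLOCK - 1, BLOCK + 1]
    lengths += [rng.randint(1, 2000) for _ in range(trials)]
    return all(check(kernel, n, rng) for n in lengths)
```

In this sketch the fixed-shape oracle accepts `buggy_sum`, because 1024 is a multiple of 128 and no tail exists, while the fuzzed oracle rejects it and still accepts `reference_sum`.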
Method
Evaluate LLM-generated GPU kernels using op-schema-aware seeded fuzzing, a high-precision fp64 CPU reference, and per-(op, dtype) absolute tolerances to detect transcription errors.
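The comparison step of this method might look roughly like the sketch below. It is an illustration under stated assumptions, not the study's implementation: the tolerance values in `ATOL` are made up, fp32 arithmetic is emulated on the CPU by rounding through `struct`, and `softmax_fp32` and `softmax_buggy` are hypothetical candidates standing in for generated kernels.

```python
import math
import struct

# Hypothetical per-(op, dtype) absolute tolerances; illustrative values only.
ATOL = {
    ("softmax", "fp32"): 1e-5,
    ("softmax", "fp64"): 1e-12,
}


def to_fp32(x):
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def softmax_ref(xs):
    """fp64 CPU reference with max-subtraction for numerical stability."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = math.fsum(es)
    return [e / s for e in es]


def softmax_fp32(xs):
    """Correct candidate whose intermediates are rounded to fp32."""
    m = max(xs)
    es = [to_fp32(math.exp(to_fp32(x - m))) for x in xs]
    s = 0.0
    for e in es:
        s = to_fp32(s + e)
    return [to_fp32(e / s) for e in es]


def softmax_buggy(xs):
    """Buggy candidate: the normalizer leaves out the last element."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = math.fsum(es[:-1])
    return [e / s for e in es]


def max_abs_error(got, ref):
    """Largest elementwise absolute difference between output and reference."""
    return max(abs(g - r) for g, r in zip(got, ref))


def passes(op, dtype, got, ref):
    """Accept the output only if it is within the tolerance for this (op, dtype)."""
    return max_abs_error(got, ref) <= ATOL[(op, dtype)]
```

For an input such as `[0.5, -1.25, 3.0, 2.0]`, `passes("softmax", "fp32", softmax_fp32(xs), softmax_ref(xs))` accepts the correct candidate, while the same call with `softmax_buggy(xs)` rejects it. Keying the tolerance on both the operation and the dtype lets a low-precision kernel pass on rounding error alone without loosening the check enough to hide a logic bug.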
In practice
- Implement op-schema-aware seeded fuzzing.
- Use fp64 CPU references for kernel validation.
- Diversify test inputs beyond fixed shapes.
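One way to diversify inputs while keeping runs reproducible is to sample shapes from a per-operation schema with a seeded generator, as sketched below. The `SCHEMAS` table, its dimension names, and the size ranges are hypothetical; a real schema would also cover dtypes, strides, and value ranges.

```python
import random

# Hypothetical op schemas: named dimensions with inclusive size ranges, and
# each operand's shape written in terms of those names. Sharing the name "K"
# between both matmul operands keeps the sampled shapes compatible.
SCHEMAS = {
    "matmul": {
        "dims": {"M": (1, 257), "K": (1, 257), "N": (1, 257)},
        "inputs": [("M", "K"), ("K", "N")],
    },
    "softmax": {
        "dims": {"rows": (1, 65), "cols": (1, 1025)},
        "inputs": [("rows", "cols")],
    },
}


def sample_case(op, rng):
    """Draw one valid set of operand shapes for the given op."""
    schema = SCHEMAS[op]
    sizes = {name: rng.randint(lo, hi) for name, (lo, hi) in schema["dims"].items()}
    return [tuple(sizes[d] for d in operand) for operand in schema["inputs"]]


def fuzz_cases(op, seed, trials):
    """Return a reproducible list of shape cases: same seed, same cases."""
    rng = random.Random(seed)
    return [sample_case(op, rng) for _ in range(trials)]
```

Calling `fuzz_cases("matmul", seed=0, trials=20)` twice yields the same list, so a failing case can be replayed from its seed, and the ranges deliberately include sizes that are not powers of two.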
Topics
- LLM-Generated Code
- GPU Kernels
- Code Correctness
- Fuzzing
- Triton
- Benchmarking
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.