The Correctness Illusion in LLM-Generated GPU Kernels
Summary
Existing benchmarks for LLM-generated GPU kernels (KernelBench, TritonBench, GEAK) score correctness using fixed-shape, small-sample "allclose"-style checks, which this research empirically shows to be systematically optimistic. The researchers constructed a controlled corpus of 24 Triton and CPU stand-in kernels (26 in the extended evaluation): 15 correct controls (later 16) and 9 LLM-style buggy variants (later 10), each seeded with a documented transcription error such as "missing 0.5×" in GELU or "missing 1/√D" in attention. A new op-schema-aware seeded fuzzing oracle, using an fp64 CPU reference and per-(op, dtype) absolute tolerances, flagged all 9 buggy kernels and passed all 15 correct controls on an RTX 3060 GPU. The extended evaluation on 26 ops across five GPU classes (RTX 3060, A10, L40S, A100 SXM4, H100 NVL) yielded identical verdicts on every class: 10 of 10 illusions caught and 16 of 16 controls clean.
Key takeaway
For Machine Learning Engineers validating GPU kernels, relying solely on fixed-shape, small-sample "allclose"-style benchmarks creates a "correctness illusion": a kernel can pass the benchmark while still harboring bugs, especially shape-dependent or mixed-precision errors. Integrate op-schema-aware fuzzing with a high-precision fp64 reference and per-(op, dtype) absolute tolerances into your validation pipeline so that numerical correctness is checked across diverse shapes, dtypes, and input values. In the study, this approach caught every seeded bug without flagging any correct control.
Key insights
Existing benchmarks for LLM-generated GPU kernels create a "correctness illusion": because they test only fixed shapes with small samples, they fail to detect common transcription bugs.
Principles
- Fixed-shape, small-sample "allclose" checks are insufficient for GPU kernel correctness.
- Op-schema-aware fuzzing with boundary values improves bug detection.
- High-precision references and per-(op, dtype) absolute tolerances are crucial.
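To make the first principle concrete, here is a minimal sketch in plain Python, not taken from the paper. It assumes a hypothetical blocked sum kernel (`sum_buggy_blocked`) that forgets to handle the tail when the input length is not a multiple of the block width; the names, the block width `BLOCK`, and the tolerance are all illustrative assumptions.

```python
import math
import random

BLOCK = 64  # hypothetical tile width used by the kernel under test


def sum_reference(xs):
    """High-precision reference: exactly rounded fp64 sum."""
    return math.fsum(xs)


def sum_buggy_blocked(xs):
    """Hypothetical buggy kernel: sums full blocks only and silently
    drops the tail when len(xs) is not a multiple of BLOCK."""
    full_blocks = len(xs) // BLOCK
    total = 0.0
    for b in range(full_blocks):
        total += math.fsum(xs[b * BLOCK:(b + 1) * BLOCK])
    return total


def check(kernel, n, seed=0, atol=1e-9):
    """Compare the kernel against the reference on one seeded input of length n."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return abs(kernel(xs) - sum_reference(xs)) <= atol


# A single "benchmark-friendly" shape: the bug is invisible.
fixed_shape_ok = check(sum_buggy_blocked, 1024)

# A small sweep over sizes around the block boundary exposes it.
fuzzed_ok = all(check(sum_buggy_blocked, n) for n in (1, 63, 64, 65, 1000, 1024))

print(fixed_shape_ok, fuzzed_ok)
```

The fixed-shape check passes because 1024 is a multiple of the block width, while the sweep fails as soon as a size leaves a partial block.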
Method
The proposed method uses op-schema-aware seeded fuzzing with an fp64 CPU reference and per-(op, dtype) absolute tolerances to validate GPU kernels, covering diverse shapes and dtypes.
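The shape of such an oracle can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the paper's implementation: the `SCHEMA` and `ATOL` tables, their values, and the function names are hypothetical, fp32 is emulated by rounding through `struct`, and the buggy kernel reproduces the "missing 0.5×" GELU error named in the summary. Because GELU is elementwise, the "shape" sweep here reduces to a sweep over input sizes.

```python
import math
import random
import struct


def to_fp32(x):
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def gelu_ref(x):
    """fp64 reference: exact (erf-based) GELU."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_correct(x):
    """Stand-in for a correct fp32 kernel."""
    return to_fp32(gelu_ref(x))


def gelu_missing_half(x):
    """Stand-in for the 'missing 0.5x' transcription bug."""
    return to_fp32(x * (1.0 + math.erf(x / math.sqrt(2.0))))


# Hypothetical op schema: sizes to sweep and boundary values to always include.
SCHEMA = {"gelu": {"sizes": (1, 7, 64, 1000),
                   "boundary": (0.0, -1e-3, 1e-3, -8.0, 8.0)}}

# Hypothetical per-(op, dtype) absolute tolerance (illustrative value only).
ATOL = {("gelu", "fp32"): 1e-5}


def fuzz(op, kernel, reference, dtype="fp32", seed=0):
    """Return the inputs on which `kernel` deviates from the fp64 reference
    by more than the tolerance for (op, dtype). Inputs are always rounded
    to fp32 in this sketch; `dtype` only selects the tolerance."""
    rng = random.Random(seed)
    atol = ATOL[(op, dtype)]
    failures = []
    for n in SCHEMA[op]["sizes"]:
        xs = [to_fp32(rng.uniform(-4.0, 4.0)) for _ in range(n)]
        xs += [to_fp32(b) for b in SCHEMA[op]["boundary"]]
        failures += [x for x in xs if abs(kernel(x) - reference(x)) > atol]
    return failures


correct_failures = fuzz("gelu", gelu_correct, gelu_ref)
buggy_failures = fuzz("gelu", gelu_missing_half, gelu_ref)
print(len(correct_failures), len(buggy_failures))
```

Seeding the generator keeps every verdict reproducible, and returning the failing inputs rather than a bare boolean makes a flagged kernel easier to debug.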
In practice
- Implement op-schema-aware fuzzing for kernel validation.
- Use fp64 references for high-precision correctness checks.
- Calibrate absolute tolerances per operation and data type.
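One possible way to approach the calibration step is sketched below. The summary does not describe how the paper derives its tolerances, so this is an assumed procedure: measure the worst absolute error that pure dtype rounding introduces relative to the fp64 reference, then multiply by a safety margin. The `margin`, the sample count, the input range, and all names are hypothetical, and fp16/fp32 are emulated with the `struct` module.

```python
import math
import random
import struct

# Emulate rounding an fp64 value to a narrower dtype via struct.
ROUND = {
    "fp16": lambda x: struct.unpack("e", struct.pack("e", x))[0],
    "fp32": lambda x: struct.unpack("f", struct.pack("f", x))[0],
}


def gelu_fp64(x):
    """fp64 reference: exact (erf-based) GELU."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def calibrate_atol(op_ref, dtype, samples=2000, margin=4.0, seed=0):
    """Assumed calibration: worst rounding error of an ideal kernel in
    `dtype` over seeded inputs, scaled by a safety margin."""
    rng = random.Random(seed)
    rnd = ROUND[dtype]
    worst = 0.0
    for _ in range(samples):
        x = rnd(rng.uniform(-8.0, 8.0))   # input representable in dtype
        exact = op_ref(x)                 # fp64 reference output
        worst = max(worst, abs(rnd(exact) - exact))
    return margin * worst


# One tolerance per (op, dtype) pair rather than a single global value.
ATOL = {("gelu", d): calibrate_atol(gelu_fp64, d) for d in ("fp16", "fp32")}
print(ATOL)
```

Under these assumptions the fp16 tolerance comes out several orders of magnitude looser than the fp32 one, which is the reason a single global threshold is either too strict for low-precision dtypes or too lax for high-precision ones.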
Topics
- LLM-Generated Kernels
- GPU Kernel Testing
- Fuzzing
- Numerical Accuracy
- Triton
- Benchmarking
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.