The Correctness Illusion in LLM-Generated GPU Kernels

2026-06-19 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, long

Summary

Existing benchmarks for LLM-generated GPU kernels (KernelBench, TritonBench, GEAK) score correctness using fixed-shape, small-sample "allclose"-style checks, which this research empirically demonstrates are systematically optimistic. Researchers constructed a controlled corpus of 24 (later 26) Triton and CPU stand-in kernels, comprising 15 (later 16) correct controls and 9 (later 10) LLM-style buggy variants, seeded with documented transcription errors like "missing 0.5×" in GELU or "missing 1/√D" in attention. A new op-schema-aware seeded fuzzing oracle, utilizing an fp64 CPU reference and per-(op, dtype) absolute tolerances, successfully flagged all 9 buggy kernels and passed all 15 correct controls on an RTX 3060 GPU. An extended evaluation on 26 ops across five GPU classes (RTX 3060, A10, L40S, A100 SXM4, H100 NVL) yielded identical verdicts: 10 of 10 illusions caught and 16 of 16 controls clean.

Key takeaway

For Machine Learning Engineers validating GPU kernels, relying solely on fixed-shape, small-sample "allclose"-style benchmarks creates a "correctness illusion": kernels can pass while still harboring bugs, especially shape-dependent or mixed-precision errors. Integrate op-schema-aware fuzzing with an fp64 reference and per-(op, dtype) absolute tolerances into your validation pipeline so that numerical correctness is checked across diverse shapes and dtypes. In the study, this approach flagged every seeded bug (10 of 10) and raised no false alarms on the 16 correct controls.

Key insights

Existing LLM GPU kernel benchmarks create a "correctness illusion": because they test only fixed shapes with a small number of samples, they let common transcription bugs pass as correct.

Principles

Fixed-shape, small-sample "allclose" checks are insufficient for GPU kernel correctness.
Op-schema-aware fuzzing with boundary values improves bug detection.
High-precision references and per-(op, dtype) absolute tolerances are crucial.
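
To illustrate why a single fixed shape can hide a bug, here is a minimal sketch that uses NumPy as a CPU stand-in for a tiled kernel. The row-sum op, the `BLOCK` width, and all function names are hypothetical and are not taken from the paper. The "buggy" version emulates a kernel that forgets its masked tail, which is one plausible kind of shape-dependent error.

```python
import numpy as np

BLOCK = 128  # hypothetical tile width, chosen only for illustration


def row_sum_reference(x):
    # High-precision reference: accumulate in fp64.
    return x.astype(np.float64).sum(axis=-1)


def row_sum_buggy(x):
    # Emulates a tiled kernel that drops the partial tail block:
    # only columns covered by full blocks are summed.
    n_full = (x.shape[-1] // BLOCK) * BLOCK
    return x[..., :n_full].sum(axis=-1)


def check(shape, seed=0, atol=1e-3):
    # Seeded input so every verdict is reproducible.
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape).astype(np.float32)
    return bool(np.allclose(row_sum_buggy(x), row_sum_reference(x),
                            rtol=0.0, atol=atol))


# A benchmark-style check at one "nice" shape: width is a multiple of BLOCK.
fixed_shape_pass = check((4, 1024))

# Widths around the tile boundary, as a shape-aware fuzzer might pick them.
boundary_widths = [1, BLOCK - 1, BLOCK, BLOCK + 1, 1000]
fuzz_results = {n: check((4, n)) for n in boundary_widths}

print("fixed shape passes:", fixed_shape_pass)
print("per-width verdicts:", fuzz_results)
```

The fixed-shape check passes because 1024 is a multiple of the tile width, while widths just off the boundary expose the dropped tail.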

Method

The proposed method uses op-schema-aware seeded fuzzing with an fp64 CPU reference and per-(op, dtype) absolute tolerances to validate GPU kernels, covering diverse shapes and dtypes.
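
A rough sketch of how such an oracle could be structured follows. It is not the authors' implementation: the `SCHEMA` layout, the tolerance values in `ATOL`, the `SPECIALS` boundary inputs, and the NumPy stand-in kernels are all assumptions made for illustration. The buggy variant reproduces the "missing 0.5×" GELU transcription error named in the summary.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf, otypes=[np.float64])


def gelu_ref(x):
    # fp64 CPU reference: 0.5 * x * (1 + erf(x / sqrt(2)))
    x = x.astype(np.float64)
    return 0.5 * x * (1.0 + _erf(x / math.sqrt(2.0)))


def gelu_correct(x):
    # Stand-in for a correct kernel: result rounded to the input dtype.
    return gelu_ref(x).astype(x.dtype)


def gelu_missing_half(x):
    # Stand-in for the "missing 0.5x" transcription bug.
    x64 = x.astype(np.float64)
    return (x64 * (1.0 + _erf(x64 / math.sqrt(2.0)))).astype(x.dtype)


# Hypothetical op schema: which ranks, dimension sizes, and dtypes to draw from.
SCHEMA = {
    "gelu": {
        "ranks": (1, 2),
        "dims": (1, 2, 31, 32, 33, 65),
        "dtypes": (np.float16, np.float32),
        "ref": gelu_ref,
    }
}

# Assumed per-(op, dtype) absolute tolerances; real values need calibration.
ATOL = {("gelu", np.float16): 1e-2, ("gelu", np.float32): 1e-5}

# Boundary values written into the first elements of every input.
SPECIALS = (8.0, -8.0, 0.0)


def fuzz(op, kernel, trials=32, seed=0):
    spec = SCHEMA[op]
    rng = np.random.default_rng(seed)  # seeded, so failures are reproducible
    failures = []
    for _ in range(trials):
        rank = int(rng.choice(spec["ranks"]))
        shape = tuple(int(d) for d in rng.choice(spec["dims"], size=rank))
        dtype = spec["dtypes"][int(rng.integers(len(spec["dtypes"])))]
        x = rng.uniform(-8.0, 8.0, size=shape).astype(dtype)
        k = min(x.size, len(SPECIALS))
        x.flat[:k] = SPECIALS[:k]
        err = np.abs(kernel(x).astype(np.float64) - spec["ref"](x)).max()
        if not err <= ATOL[(op, dtype)]:  # "not <=" also flags NaN
            failures.append((shape, np.dtype(dtype).name, float(err)))
    return failures


clean = fuzz("gelu", gelu_correct)
caught = fuzz("gelu", gelu_missing_half)
print("correct kernel failures:", len(clean))
print("buggy kernel failures:", len(caught))
```

Comparing against a purely absolute tolerance keyed by (op, dtype) avoids a relative term that grows with the output magnitude, so an error proportional to the output, such as a dropped constant factor, cannot scale the allowed error along with it.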

In practice

Implement op-schema-aware fuzzing for kernel validation.
Use fp64 references for high-precision correctness checks.
Calibrate absolute tolerances per operation and data type.
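
One possible way to calibrate per-(op, dtype) tolerances is sketched below. The summary does not describe the paper's calibration procedure, so this is an assumption: measure the worst absolute error of a trusted implementation against the fp64 reference over seeded inputs, then apply a safety margin. The softmax op, the `margin` factor, and the function names are illustrative only.

```python
import numpy as np


def softmax_ref(x):
    # fp64 reference with the usual max-subtraction for stability.
    x = x.astype(np.float64)
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)


def softmax_lowp(x):
    # Trusted implementation that computes in the input dtype.
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)


def calibrate_atol(dtype, trials=50, seed=0, margin=4.0):
    # Worst observed absolute error over seeded shapes, times a margin.
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(trials):
        rows = int(rng.integers(1, 9))
        cols = int(rng.integers(1, 130))
        x = rng.standard_normal((rows, cols)).astype(dtype)
        err = np.abs(softmax_lowp(x).astype(np.float64) - softmax_ref(x)).max()
        worst = max(worst, float(err))
    return margin * worst


ATOL = {
    ("softmax", np.dtype(dt).name): calibrate_atol(dt)
    for dt in (np.float16, np.float32)
}
print(ATOL)
```

The calibrated fp16 tolerance comes out much looser than the fp32 one, which is the reason a single global threshold tends to be either too strict for half precision or too lenient for single precision.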

Topics

LLM-Generated Kernels
GPU Kernel Testing
Fuzzing
Numerical Accuracy
Triton
Benchmarking

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.