The Correctness Illusion in LLM-Generated GPU Kernels

Source: Machine Learning · Field: Technology & Digital (Software Development & Engineering, Artificial Intelligence & Machine Learning) · Depth: Expert, quick

Summary

Existing benchmarks for LLM-generated GPU kernels, such as KernelBench, TritonBench, and GEAK, rely on fixed-shape, small-sample "allclose"-style checks, which can create a "correctness illusion": a kernel passes the benchmark yet is wrong on inputs the benchmark never tries. The researchers tested this oracle empirically on a controlled corpus of 24 Triton and CPU kernels: 15 correct controls and 9 LLM-style buggy variants with documented transcription errors. They re-evaluated every kernel with op-schema-aware seeded fuzzing, a high-precision fp64 CPU reference, and per-(op, dtype) absolute tolerances. The seeded oracle flagged all 9 buggy kernels and passed all 15 correct controls, so the added sensitivity came at no cost in precision. Extending the corpus to 26 operations, including a flash-attention pair, and re-running the protocol on five GPU classes (RTX 3060, A10, L40S, A100 SXM4, H100 NVL) gave identical verdicts on every class: 10 of 10 illusions caught and 16 of 16 controls clean. The findings show how "allclose-on-one-shape" oracles can certify LLM-style transcription bugs as correct.

Key takeaway

If you evaluate LLM-generated GPU kernels, treat a pass from a standard "allclose"-style benchmark as weak evidence: such checks can mask critical transcription errors, so your current correctness assessments may be overstated. Integrate op-schema-aware seeded fuzzing, a high-precision fp64 CPU reference, and per-(op, dtype) absolute tolerances into your validation pipeline, and confirm functional correctness across diverse inputs and hardware.

Key insights

Existing benchmarks for LLM-generated GPU kernels create a "correctness illusion": their limited, fixed-shape testing fails to detect common transcription errors.
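To make the illusion concrete, here is a minimal sketch in NumPy, used as a CPU stand-in for a Triton kernel. Everything in it is an assumption made for illustration and is not taken from the original article: the row-sum operation, the tile width BLOCK, the dropped-tail bug, the two shapes, and the tolerances. It shows one way a single-shape allclose check can pass a kernel that is wrong on other shapes.

```python
import numpy as np

BLOCK = 64  # hypothetical tile width, standing in for a Triton BLOCK_SIZE


def reference_row_sum(x):
    # High-precision reference: accumulate each row in fp64.
    return x.astype(np.float64).sum(axis=-1)


def buggy_row_sum(x):
    # Illustrative transcription slip: only whole tiles are read, so the
    # ragged tail (the part a masked load would cover) is silently dropped.
    full = (x.shape[-1] // BLOCK) * BLOCK
    return x[:, :full].sum(axis=-1)


def allclose_on_shape(kernel, shape, seed=0):
    # One seeded input of one shape, compared with an allclose-style check.
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape).astype(np.float32)
    return bool(np.allclose(kernel(x), reference_row_sum(x), rtol=1e-3, atol=1e-3))


# 1024 is a multiple of BLOCK, so the dropped tail is empty and the bug hides.
passes_benchmark_shape = allclose_on_shape(buggy_row_sum, (8, 1024))
# 1000 leaves a 40-element tail that the buggy kernel never reads.
passes_ragged_shape = allclose_on_shape(buggy_row_sum, (8, 1000))
print(passes_benchmark_shape, passes_ragged_shape)
```

Under these assumptions the same kernel is certified on the tile-aligned shape and rejected on the ragged one, so the verdict depends on which single shape the check happened to use.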

Principles

Method

Evaluate LLM-generated GPU kernels using op-schema-aware seeded fuzzing, a high-precision fp64 CPU reference, and per-(op, dtype) absolute tolerances to detect transcription errors.
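The method above can be sketched as a small harness. This is an illustrative sketch, not the researchers' implementation: the SCHEMAS and ATOL tables, their values, the softmax example, and all function names are hypothetical, and NumPy again stands in for a GPU kernel. The idea is that a schema describes the legal input space for an op, a seeded generator samples shapes, dtypes, and value scales from it, and each output is compared against an fp64 CPU reference under an absolute tolerance chosen per (op, dtype).

```python
import numpy as np

# Hypothetical op schema: the input space the fuzzer may sample from.
SCHEMAS = {
    "softmax": {
        "rows": (1, 8),
        "cols": (1, 300),
        "scales": [1.0, 10.0, 100.0],
        "dtypes": ["float32", "float16"],
    },
}

# Hypothetical per-(op, dtype) absolute tolerances.
ATOL = {("softmax", "float32"): 1e-5, ("softmax", "float16"): 1e-3}


def softmax_ref_fp64(x):
    # High-precision CPU reference, computed entirely in fp64.
    x = x.astype(np.float64)
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)


def softmax_correct(x):
    # Control: compute in fp32, subtract the row max, cast back on store.
    xf = x.astype(np.float32)
    z = np.exp(xf - xf.max(axis=-1, keepdims=True))
    return (z / z.sum(axis=-1, keepdims=True)).astype(x.dtype)


def softmax_buggy(x):
    # Illustrative transcription slip: the row-max subtraction is missing,
    # so large inputs overflow exp() and produce NaNs.
    z = np.exp(x.astype(np.float32))
    return (z / z.sum(axis=-1, keepdims=True)).astype(x.dtype)


def fuzz(op, kernel, reference, n_cases=64, seed=0):
    """Return the sampled cases on which `kernel` exceeds its tolerance."""
    schema = SCHEMAS[op]
    rng = np.random.default_rng(seed)  # seeded, so failures are reproducible
    failures = []
    for case in range(n_cases):
        rows = int(rng.integers(schema["rows"][0], schema["rows"][1] + 1))
        cols = int(rng.integers(schema["cols"][0], schema["cols"][1] + 1))
        scale = float(rng.choice(schema["scales"]))
        dtype = str(rng.choice(schema["dtypes"]))
        x = (scale * rng.standard_normal((rows, cols))).astype(dtype)
        with np.errstate(over="ignore", invalid="ignore"):
            err = np.abs(kernel(x).astype(np.float64) - reference(x))
            # NaN compares as False, so non-finite outputs count as failures.
            ok = bool(np.all(err <= ATOL[(op, dtype)]))
        if not ok:
            failures.append((case, dtype, (rows, cols), scale))
    return failures


control_failures = fuzz("softmax", softmax_correct, softmax_ref_fp64)
buggy_failures = fuzz("softmax", softmax_buggy, softmax_ref_fp64)
print(len(control_failures), len(buggy_failures) > 0)
```

Using absolute rather than relative tolerances fits outputs like softmax that are bounded in [0, 1], and keying them by (op, dtype) lets fp16 results be judged against fp16 rounding instead of a single global threshold. Because the generator is seeded, any failing case can be replayed from its case index.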

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.