Optimizing CUDA like a Human: Micro-Profiling Tools as Expert Surrogates for LLM-Based GPU Kernel Optimization

2026-06-24 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, quick

Summary

KernelPro is a closed-loop multi-agent system that generates, profiles, and iteratively optimizes GPU kernel code by combining large language model (LLM) code generation with hardware profiler feedback and pluggable bottleneck detection tools. It introduces a semantic feedback operator that uses expert heuristics to translate raw hardware metrics into actionable natural-language guidance, and a two-stage tool invocation architecture that first classifies the bottleneck and then orchestrates specialized profiling and analysis tools such as ncu, SASS inspection, and nsys. KernelPro also uses a domain-adapted Monte Carlo Tree Search (MCTS) for cross-iteration learning and generates CuTe code directly at the source level. On KernelBench, it reports geometric mean speedups of 2.42x, 4.69x, and 5.30x on Levels 1, 2, and 3, respectively, presented as new state-of-the-art results. It also reports a 1.23x speedup over hand-tuned Triton on VeOmni's expert-optimized MoE training kernels and an 11.6% measured energy reduction at matched speed, described as a first for CUDA kernel coding agents.
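Because the headline numbers are geometric means, it helps to see how that aggregate differs from a plain average. The sketch below assumes each speedup is baseline time divided by optimized time; the function name and the sample values are illustrative and are not taken from the paper.

```python
import math


def geometric_mean_speedup(speedups):
    """Geometric mean of per-kernel speedups (baseline_time / optimized_time)."""
    if not speedups:
        raise ValueError("need at least one speedup")
    if any(s <= 0 for s in speedups):
        raise ValueError("speedups must be positive")
    # Average in log space, then exponentiate: exp(mean(log(s))).
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


# Hypothetical per-kernel speedups (one regression, three improvements).
example = [0.5, 2.0, 4.0, 4.0]
geo = geometric_mean_speedup(example)    # approximately 2.0
arith = sum(example) / len(example)      # 2.625
```

The geometric mean treats a 2x slowdown and a 2x speedup as cancelling out, so a few very large wins inflate it less than they would inflate an arithmetic mean.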

Key takeaway

For machine learning engineers optimizing GPU kernels, this research indicates that LLM-driven systems can outperform hand-tuned code in at least some cases (1.23x over hand-tuned Triton on VeOmni's MoE training kernels) while also reducing energy use (11.6% at matched speed). Consider integrating automated micro-profiling tools with LLM-based iterative refinement into your kernel development pipeline, especially for complex targets such as Hopper WGMMA kernels. The approach offers a path to optimization beyond what manual tuning alone achieves.

Key insights

LLMs can optimize GPU kernels when coupled with micro-profiling tools and iterative search, yielding measurable speedups and lower energy use.

Principles

Expert heuristics can guide LLM optimization.
Multi-stage profiling improves bottleneck detection.
Iterative search refines LLM-generated code.
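The first principle, expert heuristics turning raw metrics into guidance, can be sketched as a small rule table. This is a minimal illustration and not KernelPro's actual operator: the metric names, thresholds, and advice strings are all hypothetical stand-ins for what a profiler and a human expert would supply.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass
class Rule:
    name: str
    applies: Callable[[Metrics], bool]
    advice: str


# Hypothetical metric keys and thresholds, chosen only for illustration.
RULES: List[Rule] = [
    Rule(
        "low_occupancy",
        lambda m: m.get("achieved_occupancy_pct", 100.0) < 50.0,
        "Achieved occupancy is low; consider reducing registers per thread "
        "or shared memory per block.",
    ),
    Rule(
        "uncoalesced_loads",
        lambda m: m.get("sectors_per_request", 0.0) > 8.0,
        "Global loads look uncoalesced; make consecutive threads access "
        "consecutive addresses.",
    ),
    Rule(
        "bank_conflicts",
        lambda m: m.get("shared_bank_conflicts", 0.0) > 0.0,
        "Shared-memory bank conflicts detected; consider padding the "
        "shared array.",
    ),
]


def semantic_feedback(metrics: Metrics) -> List[str]:
    """Return one natural-language hint per rule that fires, in rule order."""
    return [f"[{r.name}] {r.advice}" for r in RULES if r.applies(metrics)]
```

The returned strings are the kind of text that could be appended to an LLM prompt for the next refinement round. Missing metrics default to values that keep a rule silent, so partial profiler output does not produce spurious advice.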

Method

KernelPro uses a closed-loop multi-agent system, combining LLM code generation with hardware profiler feedback, semantic feedback, and a domain-adapted MCTS for iterative GPU kernel optimization.
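The summary does not say how the MCTS is adapted to this domain, so the sketch below shows only the standard UCT selection and backpropagation steps that tree search of this kind commonly builds on. It assumes each node is a candidate kernel variant and that the reward is some normalized measure of speedup; the `Node` fields and labels are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str                      # e.g. a description of the code change
    visits: int = 0
    total_reward: float = 0.0
    children: List["Node"] = field(default_factory=list)


def uct_score(child: Node, parent_visits: int, c: float = math.sqrt(2)) -> float:
    """Standard UCT: mean reward plus an exploration bonus."""
    if child.visits == 0:
        return math.inf             # always try unvisited variants first
    exploit = child.total_reward / child.visits
    explore = c * math.sqrt(math.log(max(parent_visits, 1)) / child.visits)
    return exploit + explore


def select_child(parent: Node) -> Node:
    """Pick the child with the highest UCT score."""
    return max(parent.children, key=lambda ch: uct_score(ch, parent.visits))


def backpropagate(path: List[Node], reward: float) -> None:
    """Credit a measured reward to every node on the root-to-leaf path."""
    for node in path:
        node.visits += 1
        node.total_reward += reward
```

In a loop of this shape, `select_child` would choose which variant to refine next, the profiler run would produce the reward, and `backpropagate` would carry that result back up so later iterations benefit from earlier measurements.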

In practice

Integrate profiler feedback into LLM workflows.
Use roofline models for bottleneck classification.
Explore MCTS for code optimization search.
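As a starting point for the roofline suggestion above, a kernel can be labeled memory-bound or compute-bound by comparing its arithmetic intensity with the hardware ridge point. This is a minimal sketch of the textbook roofline model, not KernelPro's classifier; the function name and the peak figures in the usage note are hypothetical.

```python
def classify_bottleneck(flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s):
    """Classify a kernel with the basic roofline model.

    Returns (label, arithmetic_intensity, attainable_flops_per_s).
    """
    if bytes_moved <= 0 or flops < 0:
        raise ValueError("bytes_moved must be positive and flops non-negative")
    if peak_flops_per_s <= 0 or peak_bytes_per_s <= 0:
        raise ValueError("hardware peaks must be positive")

    intensity = flops / bytes_moved                 # FLOP per byte
    ridge = peak_flops_per_s / peak_bytes_per_s     # FLOP per byte at the knee
    # Attainable throughput is capped by whichever roof is lower.
    attainable = min(peak_flops_per_s, intensity * peak_bytes_per_s)
    label = "memory-bound" if intensity < ridge else "compute-bound"
    return label, intensity, attainable
```

For example, a float32 vector add performs 1 FLOP per element while moving 12 bytes (two loads and one store), so on a hypothetical device with 10 TFLOP/s compute and 1 TB/s bandwidth, `classify_bottleneck(1, 12, 10e12, 1e12)` labels it memory-bound. That label could then decide which profiling tool to invoke next.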

Topics

CUDA Kernel Optimization
Large Language Models
GPU Profiling
Monte Carlo Tree Search
Energy Efficiency
Code Generation

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.