Optimizing CUDA like a Human: Micro-Profiling Tools as Expert Surrogates for LLM-Based GPU Kernel Optimization
Summary
KernelPro is a closed-loop multi-agent system that automatically generates, profiles, and iteratively optimizes GPU kernel code by combining large language model (LLM) code generation with hardware profiler feedback and pluggable bottleneck detection tools. The system introduces a semantic feedback operator that uses expert heuristics to translate raw hardware metrics into actionable natural language guidance, alongside a two-stage tool invocation architecture that first classifies the bottleneck and then orchestrates specialized tools such as ncu (Nsight Compute), SASS inspection, and nsys (Nsight Systems). KernelPro also features a domain-adapted Monte Carlo Tree Search (MCTS) for cross-iteration learning and direct CuTe source-level code generation. On KernelBench, it achieved geometric mean speedups of 2.42x, 4.69x, and 5.30x on Levels 1, 2, and 3, respectively, setting a new state of the art. It also delivered a 1.23x speedup over hand-tuned Triton on VeOmni's expert-optimized MoE training kernels and demonstrated an 11.6% measured energy reduction at matched speed, a first for CUDA kernel coding agents.
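The headline results are geometric means of per-kernel speedups. The snippet below is a minimal sketch of how such a figure is computed; the function name and the timings are illustrative assumptions, not values or code from the paper.

```python
import math

def geometric_mean_speedup(baseline_ms, optimized_ms):
    """Geometric mean of per-kernel speedups (baseline time / optimized time)."""
    if len(baseline_ms) != len(optimized_ms) or not baseline_ms:
        raise ValueError("need two non-empty lists of equal length")
    # Average the logs of the ratios, then exponentiate.
    log_sum = sum(math.log(b / o) for b, o in zip(baseline_ms, optimized_ms))
    return math.exp(log_sum / len(baseline_ms))

# Hypothetical timings in milliseconds (not taken from the paper).
baseline = [4.0, 9.0, 1.0]
optimized = [1.0, 1.0, 1.0]
print(geometric_mean_speedup(baseline, optimized))
```

A geometric mean is the usual choice for aggregating ratios because one very large speedup cannot dominate the result the way it would in an arithmetic mean.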
Key takeaway
For Machine Learning Engineers optimizing GPU kernel performance, this research indicates that LLM-driven systems can outperform hand-tuned code (a 1.23x speedup over hand-tuned Triton on VeOmni's MoE training kernels) and reduce energy use (an 11.6% measured reduction at matched speed). Consider integrating automated micro-profiling tools with LLM-based iterative refinement into your kernel development pipeline, especially for complex kernels such as those built on Hopper WGMMA instructions. This approach offers a path to optimization beyond what manual tuning alone achieves.
Key insights
LLMs can optimize GPU kernels when paired with micro-profiling tools and iterative search, reaching geometric mean speedups of 2.42x to 5.30x on KernelBench and an 11.6% energy reduction at matched speed.
Principles
- Expert heuristics can guide LLM optimization.
- Multi-stage profiling improves bottleneck detection.
- Iterative search refines LLM-generated code.
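The first two principles can be pictured as a two-stage routine: classify the bottleneck from coarse metrics, then pick a follow-up tool and turn the result into plain-language guidance. The sketch below is only an illustration under stated assumptions. The metric names, thresholds, bottleneck labels, and the mapping from label to tool are all hypothetical and are not KernelPro's actual heuristics.

```python
from dataclasses import dataclass

@dataclass
class CoarseMetrics:
    # Hypothetical, simplified metrics; real profilers report many more.
    dram_throughput_pct: float  # % of peak memory bandwidth in use
    sm_throughput_pct: float    # % of peak compute throughput in use
    gpu_idle_pct: float         # % of wall time with the GPU idle

def classify_bottleneck(m: CoarseMetrics) -> str:
    """Stage 1: coarse classification with illustrative thresholds."""
    if m.gpu_idle_pct > 30.0:
        return "launch_or_sync_bound"
    if m.dram_throughput_pct > 70.0 and m.dram_throughput_pct >= m.sm_throughput_pct:
        return "memory_bound"
    if m.sm_throughput_pct > 70.0:
        return "compute_bound"
    return "latency_bound"

# Stage 2: assumed mapping from bottleneck class to (follow-up tool, hint).
FOLLOW_UP = {
    "launch_or_sync_bound": ("nsys", "inspect the timeline for launch gaps and synchronization"),
    "memory_bound": ("ncu", "check coalescing and consider shared-memory tiling"),
    "compute_bound": ("SASS", "inspect the instruction mix for missed tensor-core use"),
    "latency_bound": ("ncu", "look at stall reasons and occupancy"),
}

def advise(m: CoarseMetrics) -> str:
    """Turn raw numbers into a short natural-language message for the LLM."""
    label = classify_bottleneck(m)
    tool, hint = FOLLOW_UP[label]
    return f"Bottleneck: {label}. Next tool: {tool}. Suggestion: {hint}."

print(advise(CoarseMetrics(dram_throughput_pct=85.0, sm_throughput_pct=40.0, gpu_idle_pct=5.0)))
```

Splitting classification from tool selection keeps the expensive, specialized profiling step targeted at the one bottleneck class that was actually detected.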
Method
KernelPro uses a closed-loop multi-agent system, combining LLM code generation with hardware profiler feedback, semantic feedback, and a domain-adapted MCTS for iterative GPU kernel optimization.
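The summary does not spell out how KernelPro adapts MCTS to this domain, so the following is a sketch of the textbook selection and backpropagation steps (UCB1) applied to a tree of kernel variants. The `Node` class, the reward values, and the exploration constant are assumptions made for illustration, not the paper's design.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(eq=False)
class Node:
    kernel_variant: str                 # hypothetical label for a candidate kernel
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    total_reward: float = 0.0           # e.g. accumulated measured speedup

def add_child(parent: Node, variant: str) -> Node:
    child = Node(kernel_variant=variant, parent=parent)
    parent.children.append(child)
    return child

def ucb1(child: Node, parent_visits: int, c: float = 1.4) -> float:
    """Standard UCB1 score: average reward plus an exploration bonus."""
    if child.visits == 0:
        return math.inf                 # always try unvisited variants first
    exploit = child.total_reward / child.visits
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore

def select_child(node: Node) -> Node:
    return max(node.children, key=lambda ch: ucb1(ch, node.visits))

def backpropagate(node: Optional[Node], reward: float) -> None:
    """Push a measured reward from a leaf back up to the root."""
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        node = node.parent

root = Node("baseline")
tiled = add_child(root, "shared-memory tiling")
vectorized = add_child(root, "vectorized loads")
backpropagate(tiled, 2.0)       # hypothetical speedups used as rewards
backpropagate(vectorized, 1.0)
print(select_child(root).kernel_variant)
```

In a closed loop of the kind described above, the reward passed to `backpropagate` would come from profiling the generated kernel, so later iterations favor optimization paths that previously paid off.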
In practice
- Integrate profiler feedback into LLM workflows.
- Use roofline models for bottleneck classification.
- Explore MCTS for code optimization search.
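As an illustration of the roofline suggestion above, a kernel can be labeled memory-bound or compute-bound by comparing its arithmetic intensity (FLOPs per byte moved) with the hardware ridge point (peak FLOP/s divided by peak bytes/s). This is a minimal sketch; the peak figures below are hypothetical round numbers, not the specifications of any particular GPU.

```python
def roofline_classify(flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s):
    """Return (label, arithmetic intensity, attainable FLOP/s) under the roofline model."""
    intensity = flops / bytes_moved                 # FLOPs per byte
    ridge = peak_flops_per_s / peak_bytes_per_s     # intensity where the two roofs meet
    attainable = min(peak_flops_per_s, intensity * peak_bytes_per_s)
    label = "memory-bound" if intensity < ridge else "compute-bound"
    return label, intensity, attainable

# Hypothetical device: 10 TFLOP/s peak compute, 1 TB/s peak memory bandwidth.
PEAK_FLOPS = 10e12
PEAK_BW = 1e12

# FP32 vector add over n elements: 1 FLOP and 12 bytes (two loads, one store) per element.
n = 1 << 20
label, intensity, attainable = roofline_classify(n * 1.0, n * 12.0, PEAK_FLOPS, PEAK_BW)
print(label, intensity, attainable)
```

A memory-bound label points toward reducing or restructuring data movement, while a compute-bound label points toward instruction-level changes, which is why the classification is a useful first step before choosing what to ask the LLM to change.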
Topics
- CUDA Kernel Optimization
- Large Language Models
- GPU Profiling
- Monte Carlo Tree Search
- Energy Efficiency
- Code Generation
Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Hardware Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.