OptiML: An End-to-End Framework for Program Synthesis and CUDA Kernel Optimization
Summary
OptiML is an end-to-end framework for generating and optimizing high-performance CUDA kernels from either natural-language descriptions or existing CUDA code. It targets a common gap: kernels, especially those synthesized by Large Language Models (LLMs), are often functionally correct but not performance-competitive. OptiML frames optimization as a "search under verification" problem and splits the work into two decoupled stages. OptiML-G, a Mixture-of-Thoughts generator, acts as a proposal policy that produces initial executable kernel strategies when the input is natural language. OptiML-X, a search-based optimizer, refines kernels with Monte Carlo Tree Search (MCTS), guided by LLM-driven edits and a hardware-aware reward derived from Nsight Compute profiler feedback. Each candidate transformation is compiled, verified, and profiled, then scored by a composite objective that combines runtime with hardware-bottleneck proxies and guardrails. On an NVIDIA A100 80GB GPU, OptiML consistently finds verified performance improvements over strong LLM baselines and produces interpretable optimization trajectories.
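The summary describes the composite objective only as "runtime plus hardware-bottleneck proxies plus guardrails". The sketch below shows one plausible shape for such a score, assuming a log-speedup runtime term, relative reductions in two proxies (memory traffic and instruction count), and a fixed penalty for candidates that fail verification or regress badly. The weights, thresholds, field names, and the Measurement class are all hypothetical, not OptiML's actual formula, and the fields are not real Nsight Compute metric IDs.

```python
import math
from dataclasses import dataclass


@dataclass
class Measurement:
    """Hypothetical per-kernel measurements; names are illustrative, not Nsight Compute metric IDs."""
    runtime_ms: float
    dram_bytes: float      # proxy for memory traffic (assumed)
    instructions: float    # proxy for instruction footprint (assumed)
    compiled: bool = True
    verified: bool = True


def composite_reward(base: Measurement, cand: Measurement,
                     w_mem: float = 0.2, w_inst: float = 0.1,
                     max_slowdown: float = 1.5, penalty: float = -1.0) -> float:
    """Score a candidate against the baseline kernel (hypothetical weights and guardrails)."""
    # Guardrail: a candidate that fails to compile or verify earns a fixed penalty.
    if not (cand.compiled and cand.verified):
        return penalty
    # Guardrail: reject large runtime regressions relative to the baseline.
    if cand.runtime_ms > max_slowdown * base.runtime_ms:
        return penalty
    # Runtime term: log-speedup, so 0.0 means "no change" and 2x faster gives log(2).
    runtime_term = math.log(base.runtime_ms / cand.runtime_ms)
    # Bottleneck proxies: relative reduction vs. the baseline (positive is better).
    mem_term = 1.0 - cand.dram_bytes / base.dram_bytes
    inst_term = 1.0 - cand.instructions / base.instructions
    return runtime_term + w_mem * mem_term + w_inst * inst_term
```

With these assumed weights, a candidate that runs twice as fast and halves DRAM traffic scores log(2) + 0.2 × 0.5 ≈ 0.793, while an unverified candidate scores -1.0 regardless of its speed. Gating on verification before rewarding speed mirrors the "search under verification" framing.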
Key takeaway
For AI Scientists developing or optimizing CUDA kernels, OptiML shows that pairing LLM-based code generation with hardware-aware, search-driven optimization improves both performance and reliability. Consider adding profiling-guided search to your kernel development workflow to move beyond functionally correct code to performance-optimized implementations, especially when starting from LLM-generated code. Grounding each step in profiler feedback helps identify and relieve specific hardware bottlenecks, yielding faster and more robust kernels.
Key insights
OptiML unifies LLM-driven code generation with hardware-aware search to optimize CUDA kernels for performance.
Principles
- Combine LLM generation with search-based optimization.
- Ground optimization in hardware profiling feedback.
- Use Monte Carlo Tree Search for multi-step transformations.
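The summary names MCTS but not its selection rule. A common textbook choice is UCT (the UCB1 bound applied to trees), which the sketch below assumes. Each node stands for one kernel version and each child for the result of one edit. The Node class, its fields, and the exploration constant c are hypothetical illustrations, not OptiML's implementation.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity equality avoids recursive comparisons through parent links
class Node:
    """One kernel version in the search tree (hypothetical structure)."""
    kernel_src: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    total_reward: float = 0.0


def uct_score(child: Node, parent_visits: int, c: float = 1.41) -> float:
    """UCB1 applied to trees: mean reward plus an exploration bonus."""
    if child.visits == 0:
        return math.inf  # always try an unvisited edit first
    exploit = child.total_reward / child.visits
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore


def select_child(node: Node, c: float = 1.41) -> Node:
    """Pick the child with the highest UCT score."""
    return max(node.children, key=lambda ch: uct_score(ch, node.visits, c))


def backpropagate(node: Node, reward: float) -> None:
    """Add a rollout's reward to every node on the path back to the root."""
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        node = node.parent
```

The exploration bonus lets a multi-step transformation whose first edit scores poorly still be revisited. A purely greedy search would prune it after one evaluation.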
Method
OptiML-G synthesizes initial kernels from natural-language descriptions via a Mixture-of-Thoughts model. OptiML-X then refines these kernels, or user-provided ones, using MCTS over LLM-driven edits, scored by a composite reward built from runtime, Nsight Compute metrics, and an LLM-as-a-Judge.
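Each edit proposed during the search passes through compilation, verification, and profiling before it can be rewarded. The sketch below shows that gating order with the three stages injected as callables, so profiling, the expensive step, runs only on kernels that have already compiled and verified. The function names and the EvalResult type are hypothetical stand-ins. In a real pipeline the callables would wrap nvcc, a numerical comparison against a reference, and a profiler run.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class EvalResult:
    """Outcome of evaluating one candidate kernel (hypothetical type)."""
    stage: str                        # last stage reached: "compile", "verify", or "profile"
    ok: bool
    runtime_ms: Optional[float] = None


def evaluate_candidate(src: str,
                       compile_fn: Callable[[str], bool],
                       verify_fn: Callable[[str], bool],
                       profile_fn: Callable[[str], float]) -> EvalResult:
    """Run the compile -> verify -> profile gate, stopping at the first failure."""
    # Stage 1: reject sources that do not compile.
    if not compile_fn(src):
        return EvalResult(stage="compile", ok=False)
    # Stage 2: reject kernels whose outputs disagree with the reference.
    if not verify_fn(src):
        return EvalResult(stage="verify", ok=False)
    # Stage 3: only verified kernels reach the (expensive) profiler.
    return EvalResult(stage="profile", ok=True, runtime_ms=profile_fn(src))
```

Recording the stage at which a candidate failed makes search trajectories easier to inspect. For example, it distinguishes edits that broke compilation from edits that compiled but changed the numerical results.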
In practice
- Target specific hardware bottlenecks (e.g., memory traffic, instruction footprint).
- Employ multi-level correctness testing (L0/L1/L2).
- Use an LLM-as-a-Judge to evaluate proposed code edits.
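The summary mentions L0/L1/L2 correctness levels without defining them. The sketch below assumes one plausible ladder for an elementwise kernel, modeled as a Python function on lists. L0 means the kernel runs and returns finite values of the right length. L1 means it matches a reference on a small fixed input. L2 means it also matches on several random inputs of growing size. These definitions, the tolerances, and the helper names are hypothetical, not taken from the paper.

```python
import math
import random
from typing import Callable, List, Optional

Vec = List[float]


def _safe_call(fn: Callable[[Vec], Vec], x: Vec) -> Optional[Vec]:
    """Run the candidate, treating any exception as a failed run."""
    try:
        return fn(x)
    except Exception:
        return None


def allclose(a: Vec, b: Vec, rtol: float = 1e-5, atol: float = 1e-6) -> bool:
    """Element-wise tolerance check (lengths must match)."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rtol, abs_tol=atol) for x, y in zip(a, b))


def check_levels(candidate: Callable[[Vec], Vec], reference: Callable[[Vec], Vec],
                 n_random: int = 8, seed: int = 0) -> str:
    """Return the highest level passed: 'none', 'L0', 'L1', or 'L2' (assumed definitions)."""
    probe = [1.0, -2.0, 3.5, 0.0]
    out = _safe_call(candidate, probe)
    # L0 (assumed): runs, returns the right length, and produces only finite values.
    if out is None or len(out) != len(probe) or not all(math.isfinite(v) for v in out):
        return "none"
    # L1 (assumed): matches the reference on a small fixed input.
    if not allclose(out, reference(probe)):
        return "L0"
    # L2 (assumed): matches on several random inputs of increasing size.
    rng = random.Random(seed)
    for i in range(n_random):
        x = [rng.uniform(-10.0, 10.0) for _ in range(16 * (i + 1))]
        y = _safe_call(candidate, x)
        if y is None or not allclose(y, reference(x)):
            return "L1"
    return "L2"
```

The ladder is ordered from cheapest to most expensive check, so a search loop can discard most broken edits before running the random-input sweep. Catching a kernel that is correct only at small sizes (reported here as "L1") is the kind of failure a single fixed test would miss.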
Topics
- CUDA Kernel Optimization
- Program Synthesis
- Large Language Models
- Monte Carlo Tree Search
- Hardware-Aware Optimization
Best for: AI Scientist, AI Engineer, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.MA updates on arXiv.org.