OptiML: An End-to-End Framework for Program Synthesis and CUDA Kernel Optimization

2026-02-16 · Source: cs.MA updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

OptiML is an end-to-end framework that generates and optimizes high-performance CUDA kernels from either natural-language descriptions or existing CUDA code. It targets the difficulty of reaching competitive performance in CUDA kernels, particularly those synthesized by Large Language Models (LLMs), by formulating optimization as a "search under verification" problem. The framework has two decoupled stages. OptiML-G, a Mixture-of-Thoughts generator, serves as a proposal policy that produces initial executable kernel strategies when the input is natural language. OptiML-X, a search-based optimizer, then refines kernels using Monte Carlo Tree Search (MCTS) guided by LLM-driven edits and a hardware-aware reward derived from Nsight Compute profiler feedback. Each candidate transformation is compiled, verified, and profiled, then scored by a composite objective that combines runtime with hardware bottleneck proxies and guardrails. On an NVIDIA A100 80GB GPU, OptiML consistently discovers verified performance improvements over strong LLM baselines and produces interpretable optimization trajectories.
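The compile, verify, profile sequence can be pictured as a gate that rejects a candidate at the first failing stage, so that only verified kernels are ever timed. The sketch below is a minimal illustration under stated assumptions, not OptiML's implementation: `Evaluation`, `evaluate_candidate`, and the three `fake_*` stand-ins are hypothetical names, and a real pipeline would call a compiler and a profiler where the stand-ins are.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Evaluation:
    stage: str                    # "compile_failed", "verify_failed", or "profiled"
    runtime_ms: Optional[float]   # None when the candidate was rejected early


def evaluate_candidate(
    source: str,
    compile_fn: Callable[[str], Optional[str]],
    verify_fn: Callable[[str], bool],
    profile_fn: Callable[[str], float],
) -> Evaluation:
    """Run the stages in order and stop at the first failure."""
    binary = compile_fn(source)
    if binary is None:
        return Evaluation("compile_failed", None)
    if not verify_fn(binary):
        return Evaluation("verify_failed", None)
    # Only candidates that compiled and passed verification are profiled.
    return Evaluation("profiled", profile_fn(binary))


# Hypothetical stand-ins so the sketch runs without a GPU toolchain.
def fake_compile(src: str) -> Optional[str]:
    return None if "syntax error" in src else src.upper()


def fake_verify(binary: str) -> bool:
    return "WRONG" not in binary


def fake_profile(binary: str) -> float:
    return float(len(binary))


if __name__ == "__main__":
    for src in ["kernel_a", "syntax error", "wrong result"]:
        print(src, evaluate_candidate(src, fake_compile, fake_verify, fake_profile))
```

Ordering the stages from cheapest to most expensive keeps profiling time for candidates that are already known to be correct.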

Key takeaway

For AI scientists who write or tune CUDA kernels, OptiML shows that pairing LLM-based code generation with hardware-aware, search-driven optimization yields verified performance improvements over strong LLM baselines. Consider adding profiling-guided search to your kernel development workflow so that you move from functionally correct code to performance-optimized implementations, especially when starting from LLM-generated code. Profiler feedback helps the search identify and relieve specific hardware bottlenecks, which leads to more efficient and robust kernels.

Key insights

OptiML unifies LLM-driven code generation with hardware-aware search to optimize CUDA kernels for performance.

Principles

Combine LLM generation with search-based optimization.
Ground optimization in hardware profiling feedback.
Use Monte Carlo Tree Search for multi-step transformations.
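To make the MCTS principle concrete, here is a minimal sketch of the selection and backpropagation steps using the standard UCT rule. It is an illustration only: the summary does not say which selection rule OptiML-X uses, and `Node`, `uct_score`, `select_leaf`, `backpropagate`, and the edit labels are hypothetical names chosen for this example.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    edit: str                                  # hypothetical label for the applied transformation
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    total_reward: float = 0.0


def uct_score(child: Node, parent_visits: int, c: float = 1.4) -> float:
    """Mean reward plus an exploration bonus that shrinks with visits."""
    if child.visits == 0:
        return math.inf  # unvisited edits are always tried first
    exploit = child.total_reward / child.visits
    explore = c * math.sqrt(math.log(max(parent_visits, 1)) / child.visits)
    return exploit + explore


def select_leaf(root: Node) -> Node:
    """Descend from the root, taking the highest-UCT child at each level."""
    node = root
    while node.children:
        parent_visits = node.visits
        node = max(node.children, key=lambda ch: uct_score(ch, parent_visits))
    return node


def backpropagate(node: Optional[Node], reward: float) -> None:
    """Add the reward to the evaluated node and every ancestor."""
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        node = node.parent


if __name__ == "__main__":
    root = Node("baseline")
    tile = Node("tile", parent=root)
    vec = Node("vectorize", parent=root)
    root.children = [tile, vec]
    backpropagate(tile, 0.5)
    backpropagate(select_leaf(root), 0.9)  # picks the unvisited "vectorize" edit
    print(select_leaf(root).edit)
```

Because each node stands for a kernel reached by a chain of edits, a path from the root to a leaf reads as a multi-step transformation sequence.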

Method

OptiML-G synthesizes initial kernels via a Mixture-of-Thoughts model. OptiML-X then refines these or user-provided kernels using MCTS, LLM-driven edits, and a composite reward from runtime, Nsight Compute metrics, and an LLM-as-a-Judge.
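One way to picture a composite reward of this kind is a weighted sum with a hard guardrail for candidates that fail verification. The sketch below is an assumption-laden illustration: the summary gives neither the formula nor the weights, so the log-speedup term, the weights `w_speed`, `w_hw`, `w_judge`, the `-1.0` guardrail value, and the function name are all hypothetical.

```python
import math


def composite_reward(
    baseline_ms: float,
    candidate_ms: float,
    bottleneck_penalty: float,   # assumed: 0 (no bottleneck) to 1 (severe), from profiler metrics
    judge_score: float,          # assumed: 0 to 1, from an LLM-as-a-Judge
    passed_verification: bool,
    w_speed: float = 1.0,        # hypothetical weights, not taken from the source
    w_hw: float = 0.3,
    w_judge: float = 0.2,
) -> float:
    """Combine runtime, a hardware proxy, and a judge score into one scalar."""
    # Guardrail: an incorrect or unmeasurable kernel never earns a positive reward.
    if not passed_verification or candidate_ms <= 0 or baseline_ms <= 0:
        return -1.0
    # Log speedup is symmetric: a 2x speedup and a 2x slowdown cancel out.
    speed_term = math.log(baseline_ms / candidate_ms)
    return w_speed * speed_term - w_hw * bottleneck_penalty + w_judge * judge_score


if __name__ == "__main__":
    print(composite_reward(2.0, 1.0, 0.2, 0.8, True))
    print(composite_reward(2.0, 0.5, 0.0, 1.0, False))
```

Keeping the guardrail outside the weighted sum means no amount of speedup can compensate for a failed correctness check.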

In practice

Target specific bottlenecks (e.g., memory traffic, instruction footprint).
Employ multi-level correctness testing (L0/L1/L2).
Utilize an LLM-as-a-Judge for evaluating code edits.
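The summary does not define the L0/L1/L2 levels. The sketch below assumes, purely for illustration, that they escalate from a fixed smoke test (L0) to randomized inputs (L1) to edge-case inputs (L2), and it uses plain Python functions as stand-ins for kernels. `matches`, `run_tiers`, and the tier contents are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

Kernel = Callable[[Sequence[float]], List[float]]


def matches(candidate: Kernel, reference: Kernel, xs: Sequence[float], tol: float = 1e-6) -> bool:
    """Compare candidate and reference outputs element by element."""
    got, want = candidate(xs), reference(xs)
    return len(got) == len(want) and all(abs(g - w) <= tol for g, w in zip(got, want))


def run_tiers(candidate: Kernel, reference: Kernel, seed: int = 0) -> Tuple[bool, str]:
    """Run tiers cheapest first; return (passed, name of the first failing tier or 'all')."""
    rng = random.Random(seed)
    tiers = [
        ("L0", [[1.0, 2.0, 3.0]]),                                                # assumed: smoke test
        ("L1", [[rng.uniform(-1e3, 1e3) for _ in range(64)] for _ in range(8)]),  # assumed: random inputs
        ("L2", [[], [0.0], [1e30, -1e30]]),                                       # assumed: edge cases
    ]
    for name, inputs in tiers:
        if not all(matches(candidate, reference, xs) for xs in inputs):
            return False, name
    return True, "all"


def reference(xs: Sequence[float]) -> List[float]:
    return [2.0 * x for x in xs]


def good(xs: Sequence[float]) -> List[float]:
    return [x + x for x in xs]


def breaks_on_empty(xs: Sequence[float]) -> List[float]:
    return [2.0 * x for x in xs] or [0.0]  # wrong output length for empty input


if __name__ == "__main__":
    print(run_tiers(good, reference))
    print(run_tiers(breaks_on_empty, reference))
```

Reporting the first failing tier gives the search a cheap rejection signal and tells the developer which class of input exposed the bug.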

Topics

CUDA Kernel Optimization
Program Synthesis
Large Language Models
Monte Carlo Tree Search
Hardware-Aware Optimization

Best for: AI Scientist, AI Engineer, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.MA updates on arXiv.org.