StitchCUDA: An Automated Multi-Agents End-to-End GPU Programing Framework with Rubric-based Agentic Reinforcement Learning
Summary
StitchCUDA is a multi-agent framework that automates end-to-end GPU program generation, addressing a limitation of existing LLM-based methods, which focus primarily on single-kernel optimization. It features three specialized agents: a Planner that orchestrates the system design, a Coder that implements it step by step, and a Verifier that checks correctness and profiles performance using Nsys/NCU. To strengthen the Coder, StitchCUDA integrates rubric-based agentic reinforcement learning, which combines rubric rewards from an advanced LLM with rule-based rewards from real executions. This combination prevents reward hacking and encourages advanced CUDA techniques such as custom kernel fusion. Experiments on KernelBench show that StitchCUDA achieves a nearly 100% success rate on end-to-end GPU programming tasks, with 1.72x better speedup than multi-agent baselines and 2.73x better than RL model baselines.
Key takeaway
For AI Scientists and Research Scientists developing high-performance GPU applications, StitchCUDA offers a framework for end-to-end GPU program generation rather than single-kernel tuning. Consider pairing a multi-agent system with reinforcement learning and rubric-based rewards: the combination guards against reward hacking and yields better-optimized CUDA code for complex workloads.
Key insights
StitchCUDA automates end-to-end GPU programming using a multi-agent framework with rubric-based reinforcement learning to enhance code optimization.
Principles
- End-to-end GPU program performance requires global coordination.
- Rubric-based rewards mitigate reward hacking and encourage advanced optimization.
- Decomposing agentic RL into atomic skills reduces training overhead.
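To make the second principle concrete, here is a minimal Python sketch of how a rule-based reward from real execution might be combined with a rubric score from an LLM judge. The summary does not give the paper's actual reward formula, so the names (`ExecutionResult`, `rule_based_reward`, `combined_reward`), the speedup cap, and the linear weighting are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class ExecutionResult:
    """Outcome of actually compiling and running a candidate program (hypothetical type)."""
    compiled: bool
    correct: bool
    speedup: float  # baseline_time / candidate_time


def rule_based_reward(result: ExecutionResult, speedup_cap: float = 4.0) -> float:
    """Map measured speedup to [0, 1]; failed or incorrect programs earn nothing."""
    if not (result.compiled and result.correct):
        return 0.0
    clipped = min(max(result.speedup, 0.0), speedup_cap)
    return clipped / speedup_cap


def combined_reward(result: ExecutionResult, rubric_score: float,
                    rubric_weight: float = 0.5) -> float:
    """Blend execution and rubric signals (assumed linear mix, not the paper's formula).

    The rubric score is only counted when the program really runs correctly,
    so a candidate cannot collect reward by merely looking optimized.
    """
    if not 0.0 <= rubric_score <= 1.0:
        raise ValueError("rubric_score must lie in [0, 1]")
    if not (result.compiled and result.correct):
        return 0.0
    return (1.0 - rubric_weight) * rule_based_reward(result) + rubric_weight * rubric_score
```

Gating the rubric term on correctness is one plausible way to express the anti-reward-hacking idea: the judge can reward techniques such as kernel fusion, but only for code that passes real execution.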
Method
StitchCUDA employs a "plan–code–profile–refine" loop with Planner, Coder, and Verifier agents. The Coder is trained via rubric-based agentic RL on two atomic skills: from-scratch generation and feedback-driven optimization, using combined rule-based and expert-aligned rubric rewards.
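The "plan–code–profile–refine" loop can be sketched in Python as follows. This is a hedged illustration, not StitchCUDA's implementation: the agents are modeled as plain callables, and `Report`, `plan_code_profile_refine`, `max_rounds`, and `target_speedup` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Report:
    """Verifier output: correctness, measured speedup, and profiler feedback (hypothetical type)."""
    correct: bool
    speedup: float
    feedback: str


Planner = Callable[[str, Optional[Report]], str]
Coder = Callable[[str, Optional[str], Optional[Report]], str]
Verifier = Callable[[str], Report]


def plan_code_profile_refine(
    task: str,
    planner: Planner,
    coder: Coder,
    verifier: Verifier,
    max_rounds: int = 3,
    target_speedup: float = 1.0,
) -> Tuple[Optional[str], Optional[Report]]:
    """Run the loop and return the best correct program found, with its report."""
    plan = planner(task, None)          # initial system-level plan
    code: Optional[str] = None
    report: Optional[Report] = None
    best_code: Optional[str] = None
    best_report: Optional[Report] = None

    for _ in range(max_rounds):
        # First call is from-scratch generation; later calls are feedback-driven.
        code = coder(plan, code, report)
        report = verifier(code)         # correctness check plus profiling

        if report.correct and (best_report is None or report.speedup > best_report.speedup):
            best_code, best_report = code, report
        if report.correct and report.speedup >= target_speedup:
            break
        plan = planner(task, report)    # refine the plan using verifier feedback

    return best_code, best_report
```

The two Coder call modes in the loop mirror the two atomic skills named above: the first iteration exercises from-scratch generation, and every later iteration exercises feedback-driven optimization.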
In practice
- Use Nsys/NCU for detailed GPU performance profiling.
- Implement custom kernel fusion and cuBLAS epilogues for speedup.
- Employ mixed-precision computing for performance and stability.
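For the profiling step, a Verifier-style script might assemble Nsight Systems (`nsys`) and Nsight Compute (`ncu`) command lines as in the Python sketch below. The helper names and report file names are assumptions made for illustration; the commands are only built, not executed, and the exact flags StitchCUDA uses are not stated in this summary.

```python
import shutil
from typing import List


def nsys_command(app: List[str], report: str = "timeline") -> List[str]:
    """Build an Nsight Systems invocation: whole-program timeline plus summary stats."""
    return ["nsys", "profile", "--stats=true", "-o", report, *app]


def ncu_command(app: List[str], report: str = "kernels") -> List[str]:
    """Build an Nsight Compute invocation: detailed per-kernel metrics."""
    return ["ncu", "--set", "full", "-o", report, *app]


def tool_available(tool: str) -> bool:
    """Check whether a profiler binary is on PATH before trying to run it."""
    return shutil.which(tool) is not None
```

A caller could pass the resulting list to `subprocess.run` once `tool_available` confirms the profiler is installed. Using `nsys` first to find where time goes across the whole program, then `ncu` on the hot kernels, matches the end-to-end focus described above.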
Topics
- GPU Programming Automation
- Multi-Agent Systems
- Rubric-based Reinforcement Learning
- CUDA Kernel Optimization
- End-to-End GPU Performance
Best for: AI Scientist, Research Scientist, AI Researcher, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.MA updates on arXiv.org.