StitchCUDA: An Automated Multi-Agents End-to-End GPU Programing Framework with Rubric-based Agentic Reinforcement Learning

2026-03-04 · Source: cs.MA updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

StitchCUDA is a multi-agent framework that automates end-to-end GPU program generation, addressing a limitation of existing LLM-based methods, which focus primarily on single-kernel optimization. It uses three specialized agents: a Planner that orchestrates the overall system design, a Coder that implements the design step by step, and a Verifier that checks correctness and profiles performance with Nsight Systems (Nsys) and Nsight Compute (NCU). To strengthen the Coder, StitchCUDA applies rubric-based agentic reinforcement learning, which combines rubric rewards from an advanced LLM with rule-based rewards from real executions. This combination prevents reward hacking and encourages advanced CUDA techniques such as custom kernel fusion. Experiments on KernelBench show that StitchCUDA achieves a nearly 100% success rate on end-to-end GPU programming tasks, with 1.72x better speedup than multi-agent baselines and 2.73x better speedup than RL model baselines.

Key takeaway

For AI Scientists and Research Scientists developing high-performance GPU applications, StitchCUDA offers a framework for the challenges of end-to-end GPU program generation. Consider pairing a multi-agent system with reinforcement learning and rubric-based rewards: the rubric rewards guard against reward hacking, and the combined approach yields better-optimized CUDA code for complex workloads.

Key insights

StitchCUDA automates end-to-end GPU programming using a multi-agent framework with rubric-based reinforcement learning to enhance code optimization.

Principles

End-to-end GPU program performance requires global coordination.
Rubric-based rewards mitigate reward hacking and encourage advanced optimization.
Decomposing agentic RL into atomic skills reduces training overhead.
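
The second principle can be made concrete with a minimal sketch of one plausible way to combine a rule-based reward from real execution with a rubric reward from an LLM judge. The names (`ExecutionResult`, `rule_reward`, `rubric_reward`, `combined_reward`), the weights, the speedup cap, and the decision to zero the reward for incorrect programs are all illustrative assumptions, not the formulation used by StitchCUDA.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass
class ExecutionResult:
    """Outcome of actually compiling and running a candidate program (hypothetical type)."""
    compiled: bool
    correct: bool
    speedup: float  # baseline runtime divided by candidate runtime


def rule_reward(result: ExecutionResult, speedup_cap: float = 4.0) -> float:
    """Rule-based reward: 0 unless the program compiles and is correct.

    Assumed shape: 0.5 for correctness plus up to 0.5 for capped speedup.
    """
    if not (result.compiled and result.correct):
        return 0.0
    bounded = min(max(result.speedup, 0.0), speedup_cap) / speedup_cap
    return 0.5 + 0.5 * bounded


def rubric_reward(scores: Mapping[str, float]) -> float:
    """Average of per-criterion rubric scores, each clipped to [0, 1]."""
    if not scores:
        return 0.0
    clipped = [min(max(s, 0.0), 1.0) for s in scores.values()]
    return sum(clipped) / len(clipped)


def combined_reward(
    result: ExecutionResult,
    scores: Mapping[str, float],
    rubric_weight: float = 0.3,
) -> float:
    """Blend both signals; the rubric cannot rescue a program that fails execution.

    Gating on real execution is the assumed guard against reward hacking:
    code that merely looks optimized to the judge earns nothing if it is wrong.
    """
    rule = rule_reward(result)
    if rule == 0.0:
        return 0.0
    return (1.0 - rubric_weight) * rule + rubric_weight * rubric_reward(scores)
```

In this sketch, a program that impresses the rubric judge but fails the correctness check receives zero reward, while a correct program earns extra credit for rubric criteria such as kernel fusion.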

Method

StitchCUDA employs a "plan–code–profile–refine" loop with Planner, Coder, and Verifier agents. The Coder is trained via rubric-based agentic RL on two atomic skills: from-scratch generation and feedback-driven optimization, using combined rule-based and expert-aligned rubric rewards.
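
The control flow of such a loop can be sketched as follows. This is a minimal illustration under stated assumptions: the three agents are modeled as plain callables, and the `Report` type, the `run_loop` function, the round limit, and the way feedback is routed back to the Planner and Coder are all hypothetical rather than taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Report:
    """Verifier output (hypothetical): correctness, measured runtime, textual feedback."""
    correct: bool
    runtime_ms: float
    feedback: str


Planner = Callable[[str, Optional[Report]], str]   # (task, last report) -> plan
Coder = Callable[[str, Optional[str]], str]        # (plan, feedback) -> code
Verifier = Callable[[str], Report]                 # code -> report


def run_loop(
    task: str,
    planner: Planner,
    coder: Coder,
    verifier: Verifier,
    max_rounds: int = 3,
) -> Tuple[Optional[str], Optional[Report]]:
    """Plan, code, profile, refine; return the fastest correct candidate seen."""
    plan = planner(task, None)
    feedback: Optional[str] = None
    best_code: Optional[str] = None
    best_report: Optional[Report] = None

    for _ in range(max_rounds):
        code = coder(plan, feedback)   # code
        report = verifier(code)        # check correctness and profile
        is_better = best_report is None or report.runtime_ms < best_report.runtime_ms
        if report.correct and is_better:
            best_code, best_report = code, report
        feedback = report.feedback     # refine: Coder sees the profiling feedback
        plan = planner(task, report)   # refine: Planner may revise the plan

    return best_code, best_report
```

Keeping only the fastest correct candidate means a later round that regresses or breaks correctness cannot overwrite an earlier good result.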

In practice

Use Nsys/NCU for detailed GPU performance profiling.
Implement custom kernel fusion and cuBLAS epilogues for speedup.
Employ mixed-precision computing for performance and stability.
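
As a starting point for the profiling step, the sketch below assembles typical command lines for the two NVIDIA profilers: `nsys profile` for a whole-application timeline and `ncu` for per-kernel metrics. The helper names, output file names, and the application path `./my_gpu_app` are placeholders chosen for illustration; the script only builds and prints the commands, so nothing is executed.

```python
import shlex
from typing import List


def nsys_command(app: List[str], output: str = "timeline") -> List[str]:
    """Nsight Systems: system-wide timeline, with summary statistics printed after the run."""
    return ["nsys", "profile", "--stats=true", "-o", output, *app]


def ncu_command(app: List[str], output: str = "kernels") -> List[str]:
    """Nsight Compute: detailed per-kernel metrics using the 'full' section set."""
    return ["ncu", "--set", "full", "-o", output, *app]


if __name__ == "__main__":
    # Placeholder application and arguments; replace with the real binary.
    app = ["./my_gpu_app", "--batch", "32"]
    print(shlex.join(nsys_command(app)))
    print(shlex.join(ncu_command(app)))
```

A common workflow is to use the Nsys timeline first to find where end-to-end time goes (kernels, memory copies, launch gaps), then run NCU on the dominant kernels.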

Topics

GPU Programming Automation
Multi-Agent Systems
Rubric-based Reinforcement Learning
CUDA Kernel Optimization
End-to-End GPU Performance

Best for: AI Scientist, Research Scientist, AI Researcher, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.MA updates on arXiv.org.