Speeding up GPU kernels by 38% with a multi-agent system
Summary
A multi-agent system autonomously optimized 235 CUDA kernels for NVIDIA Blackwell 200 GPUs, achieving a 38% geometric mean speedup over baselines in just three weeks. The system, developed in collaboration with NVIDIA, addressed complex kernel optimization problems that typically require months or years of work from highly experienced engineers. The multi-agent harness operated autonomously, building and optimizing kernels down to the assembly level. It used NVIDIA's SOL-ExecBench to generate real-world optimization problems from more than 124 production open-source models such as DeepSeek and Gemma, and to benchmark solutions on 27 Blackwell 200 GPUs. The system outperformed baselines on 149 of 235 problems (63%), and 19% of optimizations exceeded a 2x improvement, which suggests it can explore a broader solution space than manual simplifications allow.
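To make the headline metric concrete, here is a minimal sketch of how a geometric mean speedup can be computed. It assumes speedup is defined per problem as baseline time divided by optimized time, so a 38% gain corresponds to a ratio of 1.38. The function name and the timings are hypothetical and are not taken from the article; only the 149 and 235 counts come from the summary above.

```python
import math

def geometric_mean_speedup(baseline_ms, optimized_ms):
    """Geometric mean of per-problem speedups (baseline time / optimized time)."""
    if len(baseline_ms) != len(optimized_ms) or not baseline_ms:
        raise ValueError("need two non-empty lists of equal length")
    # Averaging in log space is equivalent to taking the n-th root of the product.
    log_sum = sum(math.log(b / o) for b, o in zip(baseline_ms, optimized_ms))
    return math.exp(log_sum / len(baseline_ms))

# Hypothetical timings in milliseconds, NOT results from the article.
baseline = [4.0, 2.0, 9.0]
optimized = [2.0, 2.0, 4.5]
speedup = geometric_mean_speedup(baseline, optimized)
print(f"geomean speedup: {speedup:.3f}x")

# Win rate using the counts reported in the summary.
win_rate = 149 / 235
print(f"win rate: {win_rate:.0%}")
```

The geometric mean is the usual choice for averaging ratios because one very large speedup cannot dominate the result the way it would in an arithmetic mean.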
Key takeaway
For AI Engineers and MLOps professionals focused on GPU performance, this multi-agent system demonstrates a significant shift in kernel optimization. You should consider exploring multi-agent architectures for tackling long-tail optimization problems that are impractical with traditional manual approaches, potentially reducing latency and cost per token for your AI model training and inference workloads on NVIDIA GPUs.
Key insights
Multi-agent systems can autonomously optimize complex software, achieving significant performance gains in weeks.
Principles
- Open-ended optimization problems are a useful way to evaluate long-running multi-agent systems.
- Multi-agent systems can learn novel APIs from documentation.
- Performance gains are possible by optimizing across entire systems simultaneously.
Method
A planner agent distributes and rebalances work across autonomous workers based on performance metrics. The system continuously tests, debugs, and optimizes kernels without developer intervention, and it coordinates through a single markdown file.
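The planner loop described above can be sketched roughly as follows. The summary does not say which rebalancing policy the planner uses or how the markdown file is laid out, so everything here is an assumption for illustration: the `Problem` class, the "least-improved first" policy, the round-robin assignment, the worker and problem names, and the file format are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Problem:
    name: str
    best_speedup: float = 1.0  # baseline time / best time so far; 1.0 = no gain yet

def rebalance(problems, workers):
    """Hypothetical policy: hand out the least-improved problems first, round-robin."""
    if not workers:
        raise ValueError("at least one worker is required")
    ranked = sorted(problems, key=lambda p: p.best_speedup)
    plan = {w: [] for w in workers}
    for i, problem in enumerate(ranked):
        plan[workers[i % len(workers)]].append(problem.name)
    return plan

def render_plan(plan, problems):
    """Render the plan as markdown text that could live in one shared file."""
    speedups = {p.name: p.best_speedup for p in problems}
    lines = ["# Plan"]
    for worker, names in plan.items():
        lines.append(f"## {worker}")
        lines.extend(f"- {n} (best speedup {speedups[n]:.2f}x)" for n in names)
    return "\n".join(lines)

# Hypothetical problems and metrics, not data from the article.
problems = [
    Problem("rmsnorm", 1.0),
    Problem("gemm_small_m", 2.5),
    Problem("nvfp4_quant", 0.9),
]
plan = rebalance(problems, ["worker-0", "worker-1"])
print(render_plan(plan, problems))
```

Calling `rebalance` again after each benchmark run would move attention toward whichever problems still lag their baselines, which is one simple reading of "rebalances work based on performance metrics".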
In practice
- Optimize LLM inference stacks for longer contexts and higher throughput.
- Fuse scale calculation and rounding for NVFP4 quantization bottlenecks.
- Generate specialized GEMM kernels for small-M test cases.
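As an illustration of the NVFP4 item above, the sketch below computes a block scale and rounds the values in the same function, so no intermediate scaled tensor is stored between the two steps. It is a plain-Python reference under stated assumptions, not the kernel from the article: it assumes 16-element blocks and the FP4 E2M1 magnitude set, keeps the scale as an ordinary Python float instead of encoding it in FP8, omits any per-tensor scale, and breaks rounding ties toward the smaller magnitude. All function names are hypothetical.

```python
# Magnitudes representable in FP4 E2M1 (sign is handled separately).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # assumed block size for the per-block scale

def quantize_block(block):
    """Compute the block scale and round every value in one call."""
    amax = max(abs(x) for x in block)
    # Map the largest magnitude onto 6.0, the largest E2M1 value.
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for x in block:
        # Nearest representable magnitude; ties go to the smaller value.
        mag = min(E2M1_VALUES, key=lambda v: abs(abs(x) / scale - v))
        codes.append(-mag if x < 0 else mag)
    return scale, codes

def quantize(values):
    """Quantize a flat list block by block; the last block may be shorter."""
    return [quantize_block(values[i:i + BLOCK]) for i in range(0, len(values), BLOCK)]

def dequantize(blocks):
    """Rebuild approximate values from (scale, codes) pairs."""
    return [scale * c for scale, codes in blocks for c in codes]

# Hypothetical input: one block of 16 values.
data = [6.0, -3.0, 0.4, 0.0] + [0.0] * 12
blocks = quantize(data)
print(blocks[0][0], blocks[0][1][:4])
print(dequantize(blocks)[:4])
```

On a GPU, the benefit of fusing comes from keeping the block in registers or shared memory while both the maximum and the rounded values are computed, instead of writing scaled values back to global memory between two kernels.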
Topics
- Multi-Agent Systems
- GPU Kernel Optimization
- NVIDIA Blackwell 200 GPUs
- CUDA Kernels
- AI Model Inference
Best for: MLOps Engineer, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Hardware Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Cursor Blog.