Speeding up GPU kernels by 38% with a multi-agent system

2026-04-14 · Source: Cursor Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Software Development & Engineering · Depth: Expert, medium

Summary

A multi-agent system autonomously optimized 235 CUDA kernels for NVIDIA Blackwell 200 GPUs, achieving a 38% geometric mean speedup over baselines in three weeks. Developed in collaboration with NVIDIA, the system took on kernel optimization problems that typically require months or years of work from highly experienced engineers. The multi-agent harness ran without developer intervention, building and optimizing kernels down to the assembly level. It used NVIDIA's SOL-ExecBench to generate real-world optimization problems from over 124 production open-source models such as DeepSeek and Gemma, and to benchmark solutions on 27 Blackwell 200 GPUs. The system outperformed baselines on 149 of 235 problems (63%), and 19% of optimizations delivered more than a 2x speedup, which suggests it can explore a broader solution space than manual simplifications allow.
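To make the headline metric concrete, here is a minimal sketch of how a geometric mean speedup and a win count can be computed from per-problem timings. The function name and the timings are hypothetical and chosen only for illustration; they are not figures from the article or from SOL-ExecBench.

```python
import math

def geomean_speedup(baseline_ms, optimized_ms):
    """Geometric mean of per-problem speedups (baseline time / optimized time)."""
    if len(baseline_ms) != len(optimized_ms) or not baseline_ms:
        raise ValueError("need two equal-length, non-empty timing lists")
    # Average the logs of the ratios, then exponentiate.
    log_sum = sum(math.log(b / o) for b, o in zip(baseline_ms, optimized_ms))
    return math.exp(log_sum / len(baseline_ms))

# Hypothetical timings in milliseconds, not taken from the article.
baseline_ms = [4.0, 2.0, 1.0, 8.0]
optimized_ms = [2.0, 2.0, 1.25, 4.0]

speedup = geomean_speedup(baseline_ms, optimized_ms)
# A problem counts as a win only if the optimized kernel is strictly faster.
wins = sum(o < b for b, o in zip(baseline_ms, optimized_ms))
print(f"geomean speedup: {speedup:.3f}x, wins: {wins}/{len(baseline_ms)}")
```

A geometric mean is the usual choice for averaging ratios like these because one very large speedup cannot dominate the result the way it would in an arithmetic mean.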

Key takeaway

For AI engineers and MLOps professionals focused on GPU performance, this multi-agent system marks a significant shift in how kernel optimization can be done. Consider exploring multi-agent architectures for long-tail optimization problems that are impractical to tackle manually; doing so could reduce latency and cost per token for your model training and inference workloads on NVIDIA GPUs.

Key insights

Multi-agent systems can autonomously optimize complex software: here, a 38% geometric mean speedup across 235 GPU kernels in three weeks.

Principles

Open-ended optimization problems are a useful way to evaluate long-running multi-agent systems.
Multi-agent systems can learn novel APIs from documentation.
Performance gains are possible by optimizing across entire systems simultaneously.

Method

A planner agent distributes and rebalances work across autonomous workers based on performance metrics. The system continuously tests, debugs, and optimizes kernels without developer intervention, and it coordinates through a single markdown file.
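The coordination pattern described above can be sketched roughly as follows. The summary does not specify the planner's rebalancing policy or the layout of the markdown file, so everything here is an assumption made for illustration: the `KernelTask` type, the "send workers to the kernels with the least progress" rule, the table format, and the kernel and worker names.

```python
from dataclasses import dataclass

@dataclass
class KernelTask:
    name: str
    speedup: float    # latest measured speedup over baseline (1.0 = no gain)
    worker: str = ""  # worker currently assigned, empty if none

def rebalance(tasks, workers):
    """Assumed policy: assign workers to the kernels with the lowest speedup so far."""
    for task in tasks:
        task.worker = ""
    neediest = sorted(tasks, key=lambda t: t.speedup)
    # zip stops at the shorter list, so surplus tasks stay unassigned.
    for worker, task in zip(workers, neediest):
        task.worker = worker
    return tasks

def render_board(tasks):
    """Render the shared coordination state as a markdown table (hypothetical layout)."""
    lines = ["| kernel | speedup | worker |", "| --- | --- | --- |"]
    for t in tasks:
        lines.append(f"| {t.name} | {t.speedup:.2f}x | {t.worker or '-'} |")
    return "\n".join(lines)

# Hypothetical kernels and measurements, not results from the article.
tasks = [
    KernelTask("attention_decode", 1.90),
    KernelTask("nvfp4_quantize", 0.95),
    KernelTask("gemm_small_m", 1.10),
]
rebalance(tasks, workers=["worker-1", "worker-2"])
board = render_board(tasks)
print(board)
```

In a real harness the board would be written to the shared file and each worker would re-read it before picking up its next task; this sketch keeps it in memory so the logic is easy to follow.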

In practice

Optimize LLM inference stacks for longer contexts and higher throughput.
Fuse scale calculation and rounding for NVFP4 quantization bottlenecks.
Generate specialized GEMM kernels for small-M test cases.
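As a rough illustration of the NVFP4 item above, the sketch below computes a block's scale and rounds its values to the 4-bit grid in the same function, rather than producing scales in one step and quantizing in another. It is a simplified reference model and not the kernel the system generated. It assumes one scale per 16 elements and E2M1 element values, keeps the scale as a plain Python float (real NVFP4 stores block scales in FP8), and breaks rounding ties toward the smaller magnitude. All function names are hypothetical.

```python
# Magnitudes representable by a 4-bit E2M1 float.
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # elements that share one scale

def quantize_block(block):
    """Derive the block scale, then round each value to the E2M1 grid, in one call."""
    amax = max(abs(x) for x in block)
    # Map the largest magnitude onto the largest representable value (6.0).
    scale = amax / E2M1_VALUES[-1] if amax > 0 else 1.0
    codes = []
    for x in block:
        mag = abs(x) / scale
        # Nearest grid point; ties go to the smaller magnitude (a simplification).
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - mag))
        codes.append(-nearest if x < 0 else nearest)
    return scale, codes

def quantize(values):
    """Quantize a flat list block by block, returning (scale, codes) pairs."""
    return [quantize_block(values[i:i + BLOCK]) for i in range(0, len(values), BLOCK)]

def dequantize(blocks):
    """Reconstruct approximate values from (scale, codes) pairs."""
    return [scale * c for scale, codes in blocks for c in codes]

# Hypothetical input chosen so every value lands exactly on the grid.
values = [0.0, 3.0, -6.0, 12.0] + [0.0] * 12
blocks = quantize(values)
restored = dequantize(blocks)
```

On a GPU the benefit of fusing comes from keeping each block in registers between the maximum reduction and the rounding step, which avoids a second trip through memory; the Python version only mirrors the arithmetic.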

Topics

Multi-Agent Systems
GPU Kernel Optimization
NVIDIA Blackwell 200 GPUs
CUDA Kernels
AI Model Inference

Code references

anysphere/kernel-optimization-results

Best for: MLOps Engineer, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Cursor Blog.