daVinci-kernel: Co-Evolving Skill Selection, Summarization, and Utilization via RL for GPU Kernel Optimization
Summary
daVinci-kernel is a reinforcement learning framework for GPU kernel optimization that focuses on execution efficiency while assuming functional correctness. The system co-evolves skill selection, summarization, and utilization through a dynamically evolving skill library. It integrates three agents that share a single LLM backbone: a Skill Selection Agent that retrieves relevant techniques using BM25 and LLM reranking, a Policy Agent that generates CUDA/Triton kernels over multiple turns based on the selected skills, and a Skill Summary Agent that distills successful rollouts into reusable skills. Candidate skills are incorporated only after execution-based verification confirms reproducible speedups. The agents are initialized via a structured SFT cold start on diversity-filtered data, then jointly optimized end-to-end with multi-turn REINFORCE and per-agent advantage estimation. On KernelBench, daVinci-kernel-14B achieved 37.2%, 70.6%, and 32.2% on Level 1, Level 2, and Level 3, respectively, under the Fast$_1$ threshold, surpassing Dr.Kernel-14B.
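To make the Fast$_1$ numbers concrete, here is a minimal sketch of how a Fast$_p$ score is typically computed, assuming the usual KernelBench definition (the fraction of tasks whose generated kernel is both correct and more than p times faster than the baseline). The summary does not restate that definition, and the `KernelResult` type and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class KernelResult:
    # Hypothetical record for one benchmark task.
    correct: bool    # passed the functional-correctness check
    speedup: float   # baseline runtime / generated-kernel runtime


def fast_p(results, p=1.0):
    """Fraction of tasks that are correct AND have speedup greater than p."""
    if not results:
        return 0.0
    hits = sum(1 for r in results if r.correct and r.speedup > p)
    return hits / len(results)


# Illustrative values only, not results from the paper.
results = [
    KernelResult(correct=True, speedup=1.8),   # counts toward Fast_1
    KernelResult(correct=True, speedup=0.9),   # correct but slower
    KernelResult(correct=False, speedup=2.5),  # fast but incorrect
    KernelResult(correct=True, speedup=1.2),   # counts toward Fast_1
]
print(fast_p(results, p=1.0))  # 0.5
```

Under this reading, an incorrect kernel never counts, however fast it runs.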
Key takeaway
For machine learning engineers and AI hardware engineers working on GPU kernel optimization, daVinci-kernel shows that skill discovery and skill use can be trained together with RL rather than handled as separate stages. Because a candidate skill enters the library only after execution confirms a reproducible speedup, the library stays grounded in measured gains. On KernelBench, the 14B model surpasses the RL-trained Dr.Kernel-14B under the Fast$_1$ threshold. The multi-agent design built on a shared LLM backbone is worth studying for performance-critical kernel work.
Key insights
A reinforcement learning framework co-evolves skill selection, summarization, and utilization for GPU kernel optimization.
Principles
- Couple skill discovery with exploitation.
- Admit new skills only after execution confirms a reproducible speedup.
- Jointly train specialized agents on a shared LLM backbone.
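The verification principle above can be sketched as a simple gate in front of the skill library. This is an illustration under stated assumptions, not the paper's implementation: the function names, the 1.05 speedup threshold, and the three-trial minimum are all hypothetical, and the speedups are assumed to have been measured elsewhere by running the kernel.

```python
def is_reproducible_speedup(trial_speedups, min_speedup=1.05, min_trials=3):
    """True only if enough trials ran and every one beat the threshold.

    Both thresholds are hypothetical defaults chosen for illustration.
    """
    if len(trial_speedups) < min_trials:
        return False
    return all(s >= min_speedup for s in trial_speedups)


def maybe_add_skill(library, name, description, trial_speedups):
    """Add the candidate skill to the library only if verification passes."""
    if is_reproducible_speedup(trial_speedups):
        library[name] = description
        return True
    return False


library = {}
accepted = maybe_add_skill(
    library, "smem_tiling",
    "Tile matmul operands through shared memory.",
    [1.31, 1.28, 1.35],  # consistent gains across trials
)
rejected = maybe_add_skill(
    library, "lucky_unroll",
    "Unroll the inner loop by 8.",
    [1.40, 0.97, 1.02],  # one good run, not reproducible
)
print(accepted, rejected, sorted(library))  # True False ['smem_tiling']
```

Requiring every trial to clear the threshold is a deliberately strict choice here; a median or pass-rate criterion would be a reasonable alternative.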
Method
Initialize three LLM-backed agents via SFT cold start, then jointly optimize with multi-turn REINFORCE, adding skills only after verified speedups.
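The summary does not spell out how per-agent advantage estimation works. One plausible reading, sketched here as an assumption, is that each agent's rewards are centered on that agent's own mean before entering the REINFORCE loss, so one agent's reward scale does not distort the others' gradients. The agent labels, reward values, and function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def per_agent_advantages(samples):
    """samples: list of (agent_name, reward) pairs.

    Returns advantages aligned with samples, each computed as
    reward minus the mean reward of the same agent (an assumed baseline).
    """
    by_agent = defaultdict(list)
    for agent, reward in samples:
        by_agent[agent].append(reward)
    baselines = {agent: mean(rs) for agent, rs in by_agent.items()}
    return [reward - baselines[agent] for agent, reward in samples]


def reinforce_loss(log_probs, advantages):
    """Standard REINFORCE surrogate: -mean(advantage * log_prob)."""
    if len(log_probs) != len(advantages):
        raise ValueError("log_probs and advantages must have equal length")
    total = sum(lp * a for lp, a in zip(log_probs, advantages))
    return -total / len(log_probs)


# Illustrative rewards only.
samples = [
    ("selector", 1.0), ("selector", 0.0),
    ("policy", 2.0), ("policy", 1.0), ("policy", 0.0),
    ("summarizer", 0.5),
]
advantages = per_agent_advantages(samples)
print(advantages)  # [0.5, -0.5, 1.0, 0.0, -1.0, 0.0]
```

In a real training loop the log-probabilities would come from the shared LLM backbone and the loss would be differentiated by an autodiff framework; plain floats are used here only to keep the arithmetic visible.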
In practice
- Generate optimized CUDA/Triton kernels.
- Use LLM reranking for technique retrieval.
- Apply RL for code optimization tasks.
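The retrieve-then-rerank step mentioned above can be sketched with a small from-scratch Okapi BM25 scorer. This is a minimal illustration, not the paper's retriever: the whitespace tokenizer, the `k1` and `b` defaults, the sample skills, and the `rerank` hook (a stand-in for the LLM reranker, which is not implemented here) are all assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real system would use something richer.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of every document against the query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(t) for t in tokenized) / n_docs
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))  # count each term once per document
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            n_t = doc_freq[term]
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def select_skills(query, docs, top_k=2, rerank=None):
    """BM25 shortlist, then an optional reranker (hypothetical LLM hook)."""
    scores = bm25_scores(query, docs)
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    shortlist = [docs[i] for i in order[:top_k]]
    return rerank(query, shortlist) if rerank else shortlist


skills = [
    "use shared memory tiling for matrix multiply",
    "fuse elementwise ops into one kernel",
    "coalesce global memory loads",
]
top = select_skills("matrix multiply tiling", skills, top_k=1)
print(top)  # ['use shared memory tiling for matrix multiply']
```

BM25 keeps the shortlist cheap to compute over a growing library, which leaves the more expensive LLM call to judge only a handful of candidates.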
Topics
- GPU Kernel Optimization
- Reinforcement Learning
- Large Language Models
- Code Generation
- Performance Optimization
- Skill Discovery
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Hardware Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.