CuTeGen: An LLM-Based Agentic Framework for Generation and Optimization of High-Performance GPU Kernels using CuTe

2026-06-05 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

CuTeGen is an agentic framework that automates the generation and optimization of high-performance GPU kernels, addressing the difficulty of writing efficient implementations by hand. It follows a structured generate–test–refine workflow that progressively refines a single kernel through execution-based validation, debugging, and staged optimization. A key design choice is to generate kernels in CuTe, the abstraction layer from NVIDIA's CUTLASS library, which gives the model a stable representation for iterative modification while exposing performance-critical structures such as tiling and data movement. CuTeGen also uses workload-aware optimization prompts and delays the integration of NVIDIA Nsight Compute profiling feedback. In experiments on an NVIDIA GeForce RTX 4090 GPU with 12 matrix multiplication and 14 activation kernels from KernelBench, the generated activation kernels achieve an average speedup of 1.70x over the PyTorch reference implementations. For matrix multiplication, CuTeGen outperforms cuBLAS on two benchmark cases, indicating that it can produce kernels that are both functionally correct and competitive in performance.

Key takeaway

For AI Scientists and Machine Learning Engineers optimizing GPU workloads, CuTeGen demonstrates that agentic LLM frameworks can significantly reduce manual kernel engineering. You should consider adopting structured abstraction layers like CuTe to provide a more stable foundation for iterative refinement, especially when dealing with complex operations like matrix multiplication. Furthermore, strategically delaying performance profiling until a kernel's structural design is sound can prevent premature optimization and lead to more effective performance gains.

Key insights

CuTeGen iteratively refines GPU kernels using CuTe abstraction and delayed profiling, achieving competitive performance.

Principles

Iterative refinement via execution feedback improves kernel quality.
CuTe abstraction layer provides stable, low-level control for LLMs.
Delaying profiling for complex kernels avoids premature local optima.

Method

CuTeGen synthesizes CuTe-based kernels, validates correctness via compile/execute/test, debugs with structured patches, then optimizes using prompt-guided transformations and delayed profiling feedback.
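
The workflow above can be pictured as a small control loop. The following is a minimal sketch, not CuTeGen's actual implementation: the names `RunResult`, `refine_kernel`, `generate`, `run_tests`, `debug`, and `optimize` are hypothetical, the callables stand in for LLM calls and for a compile/execute/test harness, and the rule of keeping a candidate only if it is correct and faster is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunResult:
    ok: bool           # compiled, ran, and matched the reference output
    log: str           # compiler or runtime diagnostics to feed back to the model
    runtime_ms: float  # measured runtime; only meaningful when ok is True


def refine_kernel(
    generate: Callable[[], str],
    run_tests: Callable[[str], RunResult],
    debug: Callable[[str, str], str],
    optimize: Callable[[str, int], str],
    max_debug_rounds: int = 3,
    optimization_stages: int = 2,
) -> Optional[str]:
    """Return the fastest correct kernel source found, or None if none is correct."""
    kernel = generate()
    result = run_tests(kernel)

    # Debugging phase: patch the kernel using diagnostics until it passes.
    for _ in range(max_debug_rounds):
        if result.ok:
            break
        kernel = debug(kernel, result.log)
        result = run_tests(kernel)
    if not result.ok:
        return None

    # Optimization phase: one transformation per stage, re-validated each time.
    # Assumption: a candidate replaces the current best only if it is both
    # correct and faster.
    best, best_ms = kernel, result.runtime_ms
    for stage in range(optimization_stages):
        candidate = optimize(best, stage)
        candidate_result = run_tests(candidate)
        if candidate_result.ok and candidate_result.runtime_ms < best_ms:
            best, best_ms = candidate, candidate_result.runtime_ms
    return best
```

Because every optimized candidate passes through the same `run_tests` gate as the initial kernel, an optimization step in this sketch can never replace a correct kernel with an incorrect one.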

In practice

Adopt CuTe for GPU kernel generation and refinement.
Delay profiling for structurally complex kernel optimizations.
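
One way to picture the delayed-profiling advice is as a gate on what goes into the optimization prompt. The sketch below rests on assumptions: `build_optimization_prompt`, `stage`, and `profile_from_stage` are hypothetical names, the prompt wording is invented for illustration, and `profiler_report` is any text summary of profiler output obtained separately. It does not reproduce CuTeGen's prompts or its integration with Nsight Compute.

```python
from typing import Optional


def build_optimization_prompt(
    kernel_source: str,
    stage: int,
    profile_from_stage: int,
    profiler_report: Optional[str] = None,
) -> str:
    """Compose the optimization prompt for one stage.

    Profiler output is included only once `stage` reaches
    `profile_from_stage`; earlier stages see just the kernel, so structural
    changes (tiling, data movement) are not steered by early measurements.
    """
    parts = [
        "Optimize the following CuTe kernel while preserving correctness.",
        kernel_source,
    ]
    if profiler_report is not None and stage >= profile_from_stage:
        parts.append("Profiler feedback:\n" + profiler_report)
    return "\n\n".join(parts)
```

Under this scheme a structurally complex kernel would use a larger `profile_from_stage`, while a simple one could set it to 0 and receive profiler feedback from the first stage.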

Topics

GPU Kernel Optimization
Large Language Models
CuTe Abstraction Layer
Agentic AI Systems
Performance Profiling
Matrix Multiplication
Deep Learning Workloads

Code references

NVIDIA/cutlass

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.