Custom Kernels for All from Codex and Claude
Summary
Hugging Face has developed an agent skill that lets coding agents such as Claude and Codex write production-grade CUDA kernels autonomously. Published on February 13, 2026, the skill targets the hard part of custom kernel work: integrating kernels with the `transformers` and `diffusers` libraries, which typically requires hardware-specific optimizations and PyTorch bindings. Using it, the agents generated working RMSNorm kernels for the LTX-Video `diffusers` pipeline and the Qwen3-8B `transformers` model, complete with PyTorch bindings and benchmark scripts. On an H100 80GB GPU, the isolated RMSNorm kernels averaged speedups of 1.88x for LTX-Video and 1.94x for Qwen3-8B, while end-to-end LTX-Video generation improved by 1.06x. The skill also supports publishing the kernels to the Hugging Face Kernel Hub for compilation-free distribution.
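The gap between the isolated and end-to-end figures is what Amdahl's law predicts when only part of the runtime gets faster. As a rough sketch, assuming RMSNorm is the only operation that changed and everything else runs at its original speed (an assumption for illustration, not something the benchmark states), the reported LTX-Video numbers imply that RMSNorm accounted for roughly 12% of the original runtime. The function names below are hypothetical.

```python
def accelerated_fraction(kernel_speedup: float, total_speedup: float) -> float:
    """Invert Amdahl's law, 1/S = (1 - p) + p/s, to solve for p.

    p is the share of the original runtime spent in the accelerated part,
    s is the speedup of that part, S is the resulting overall speedup.
    """
    return (1.0 - 1.0 / total_speedup) / (1.0 - 1.0 / kernel_speedup)


def amdahl_speedup(fraction: float, kernel_speedup: float) -> float:
    """Overall speedup when `fraction` of the runtime is sped up by `kernel_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


# Figures reported for LTX-Video on an H100: 1.88x isolated, 1.06x end to end.
p = accelerated_fraction(kernel_speedup=1.88, total_speedup=1.06)
print(f"implied RMSNorm share of runtime: {p:.1%}")
```

Under the same assumption, even an infinitely fast RMSNorm kernel would cap the end-to-end gain at `1 / (1 - p)`, which is why isolated kernel speedups and pipeline speedups are reported separately.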
Key takeaway
For NLP and computer vision engineers looking to optimize model performance, this agent skill offers a streamlined path to custom CUDA kernel development. You can use coding agents to generate, benchmark, and integrate optimized kernels for `transformers` and `diffusers` models. In the reported H100 benchmarks, that meant isolated RMSNorm speedups of roughly 1.9x and a 1.06x end-to-end gain for LTX-Video. The approach reduces the manual effort and specialized knowledge traditionally required, letting you focus on higher-level model development while the agent handles low-level optimization.
Key insights
Coding agents can autonomously generate optimized CUDA kernels and integrate them into complex ML frameworks.
Principles
- Agent skills package domain expertise for complex tasks.
- Hardware-specific optimizations are critical for kernel performance.
Method
Install the `cuda-kernels` skill into a coding agent, then prompt it to generate and benchmark kernels for specific models or pipelines, leveraging its structured guidance and reference scripts.
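To make the benchmarking step concrete, here is a minimal sketch of how a speedup figure can be measured: warm up, time several runs, and compare medians. It is not the skill's reference script. The helper names and the CPU stand-in workloads are hypothetical, and a real GPU benchmark would also need to synchronize the device (for example with `torch.cuda.synchronize()`) before reading the clock.

```python
import statistics
import time
from typing import Callable


def median_seconds(fn: Callable[[], object], warmup: int = 3, iters: int = 20) -> float:
    """Median wall-clock time of fn() after a few untimed warmup calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline: Callable[[], object], candidate: Callable[[], object]) -> float:
    """How many times faster the candidate runs than the baseline."""
    return median_seconds(baseline) / median_seconds(candidate)


# Stand-in workloads (assumption): the candidate does a tenth of the work.
def slow() -> int:
    return sum(i * i for i in range(20_000))


def fast() -> int:
    return sum(i * i for i in range(2_000))


ratio = speedup(slow, fast)
print(f"speedup: {ratio:.2f}x")
```

Using the median rather than the mean keeps a single slow run, such as one hit by a scheduler hiccup, from skewing the result.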
In practice
- Use `kernels skills add cuda-kernels` to install the skill.
- Prompt agents for H100-optimized RMSNorm kernels.
- Publish generated kernels to the Hugging Face Kernel Hub.
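Before trusting or publishing a generated RMSNorm kernel, it helps to compare its output against a plain reference. The sketch below implements the standard RMSNorm formula, `y_i = x_i / sqrt(mean(x^2) + eps) * weight_i`, over a single row in pure Python. The function names and the default `eps` are assumptions for illustration. Each model defines its own epsilon, and this is not the code the agents produced.

```python
import math
from typing import List, Sequence


def rms_norm(x: Sequence[float], weight: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Reference RMSNorm over one row: scale by the inverse root mean square, then by weight."""
    if len(x) != len(weight):
        raise ValueError("x and weight must have the same length")
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest elementwise difference, useful for checking a kernel against the reference."""
    return max(abs(p - q) for p, q in zip(a, b))


row = [3.0, 4.0]
out = rms_norm(row, weight=[1.0, 1.0], eps=0.0)
print(out)
```

A custom kernel's output for the same row can be passed to `max_abs_diff` together with `out`, using a tolerance suited to the kernel's precision (looser for half-precision formats than for float32).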
Topics
- CUDA Kernels
- AI Agents
- Hugging Face Diffusers
- Hugging Face Transformers
- Performance Optimization
Best for: NLP Engineer, Computer Vision Engineer, AI Engineer, Machine Learning Engineer, Deep Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Hugging Face - Blog.