Custom Kernels for All from Codex and Claude

2026-02-16 · Source: Hugging Face - Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, long

Summary

Hugging Face has developed an agent skill that enables coding agents like Claude and Codex to autonomously write production-grade CUDA kernels. The skill, published on February 13, 2026, addresses the complexity of integrating custom kernels with the `transformers` and `diffusers` libraries, which typically involves intricate hardware-specific optimizations and PyTorch bindings. The agents generated working RMSNorm kernels for the LTX-Video `diffusers` pipeline and the Qwen3-8B `transformers` model, complete with PyTorch bindings and benchmark scripts. Benchmarks on an H100 80GB GPU showed isolated RMSNorm kernel speedups averaging 1.88x for LTX-Video and 1.94x for Qwen3-8B, and an end-to-end video generation speedup of 1.06x for LTX-Video. The skill also supports publishing these kernels to the Hugging Face Kernel Hub, so that users can load them without compiling anything.
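One way to reconcile the 1.88x isolated speedup with the 1.06x end-to-end figure is Amdahl's law. The sketch below is a back-of-the-envelope estimate, not a measurement from the original article: it assumes RMSNorm is the only part of the LTX-Video pipeline that got faster, and both function names are illustrative.

```python
def end_to_end_speedup(fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: overall speedup when `fraction` of runtime is accelerated."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


def implied_fraction(overall: float, kernel_speedup: float) -> float:
    """Invert Amdahl's law: the runtime share the accelerated part must have had."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / kernel_speedup)


# Figures reported for LTX-Video on an H100 80GB GPU.
share = implied_fraction(overall=1.06, kernel_speedup=1.88)
print(f"implied RMSNorm share of runtime: {share:.1%}")

# Sanity check: plugging the share back in recovers the end-to-end figure.
print(f"recovered end-to-end speedup: {end_to_end_speedup(share, 1.88):.2f}x")
```

Under that assumption, RMSNorm would account for roughly 12% of the original runtime, which is why a near-2x kernel yields only a few percent end to end.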

Key takeaway

For NLP and computer vision engineers optimizing model performance, this agent skill offers a shorter path to custom CUDA kernel development. You can use coding agents to generate, benchmark, and integrate kernels for `transformers` and `diffusers` models. Expect the gains to depend on how much runtime the kernel accounts for: the reported RMSNorm kernels ran about 1.9x faster in isolation, while end-to-end LTX-Video generation improved by 1.06x. The approach reduces the manual effort and specialized knowledge traditionally required, letting you focus on higher-level model development while the agent handles low-level optimization.

Key insights

Coding agents can autonomously generate optimized CUDA kernels and integrate them into complex ML frameworks.

Principles

Agent skills package domain expertise for complex tasks.
Hardware-specific optimizations are critical for kernel performance.

Method

Install the `cuda-kernels` skill into a coding agent, then prompt it to generate and benchmark kernels for specific models or pipelines, leveraging its structured guidance and reference scripts.
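The benchmarking step boils down to timing a baseline and a candidate under the same conditions and reporting the ratio. The following is a minimal sketch of such a harness, not the skill's own benchmark script; `median_seconds` and `speedup` are hypothetical helpers, and the warmup and repeat counts are arbitrary defaults.

```python
import statistics
import time
from typing import Callable


def median_seconds(fn: Callable[[], object], warmup: int = 3, repeats: int = 20) -> float:
    """Median wall-clock time of fn() after a few untimed warmup calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline: Callable[[], object], candidate: Callable[[], object]) -> float:
    """How many times faster `candidate` runs than `baseline`."""
    return median_seconds(baseline) / median_seconds(candidate)


# CPU-only stand-ins for "reference op" and "custom kernel".
slow = lambda: sum(range(200_000))
fast = lambda: sum(range(1_000))
print(f"speedup: {speedup(slow, fast):.1f}x")
```

When timing GPU kernels this way, each callable should block until the device has finished (in PyTorch, by calling `torch.cuda.synchronize()` before returning), because kernel launches are asynchronous and the clock would otherwise measure only the launch.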

In practice

Use `kernels skills add cuda-kernels` to install the skill.
Prompt agents for H100-optimized RMSNorm kernels.
Publish generated kernels to the Hugging Face Kernel Hub.
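Before publishing a generated RMSNorm kernel, it is worth checking its output against a plain reference. The sketch below is a pure-Python reference for a single vector, assuming the common formulation `x / sqrt(mean(x^2) + eps) * weight`; the `eps` default and the tolerances in `allclose` are illustrative choices, not values from the original article.

```python
import math
from typing import Sequence


def rms_norm(x: Sequence[float], weight: Sequence[float], eps: float = 1e-6) -> list[float]:
    """Reference RMSNorm over one vector: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def allclose(a: Sequence[float], b: Sequence[float], rtol: float = 1e-3, atol: float = 1e-5) -> bool:
    """Elementwise comparison with relative and absolute tolerance (assumed values)."""
    return len(a) == len(b) and all(
        abs(p - q) <= atol + rtol * abs(q) for p, q in zip(a, b)
    )


hidden = [0.5, -1.25, 2.0, 0.75]
gamma = [1.0, 1.0, 0.5, 2.0]
expected = rms_norm(hidden, gamma)

# `kernel_output` stands in for whatever the generated kernel returned.
kernel_output = list(expected)
print("kernel matches reference:", allclose(kernel_output, expected))
```

A real check would run the same comparison on tensors with the shapes and dtypes the model actually uses, since reduced-precision kernels need looser tolerances than float32 ones.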

Topics

CUDA Kernels
AI Agents
Hugging Face Diffusers
Hugging Face Transformers
Performance Optimization

Best for: NLP Engineer, Computer Vision Engineer, AI Engineer, Machine Learning Engineer, Deep Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Hugging Face - Blog.