Sakana AI and NVIDIA Introduce TwELL with CUDA Kernels for 20.5% Inference and 21.9% Training Speedup in LLMs

2026-05-11 · Source: Machine Learning ML & Generative AI News · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, quick

Summary

Sakana AI and NVIDIA have introduced TwELL, a new sparse format and set of CUDA kernels designed to exploit sparsity in large language models (LLMs) on modern GPUs. Feedforward layers account for over 80% of LLM compute, and much of that work is spent on activations whose value is zero. TwELL integrates directly into the matmul kernel epilogue, removing the overheads previously associated with sparse operations on NVIDIA GPUs. The release includes a fused inference kernel that consumes gate activations in TwELL format, so hidden states are never written to global memory, and a hybrid sparse format for training that dynamically routes rows. For a 2B model on H100 PCIe, TwELL delivered a 20.5% increase in inference throughput, a 21.9% increase in training step throughput, and a 17.0% reduction in energy per token, while accuracy fell from 49.1% to 48.8% (0.3 percentage points). The gains grow with model size: the average number of non-zero activations falls from 39 at 0.5B to 24 at 2B.
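To put the reported percentages on a common footing, the short Python sketch below converts a throughput gain into the equivalent reduction in time per token or per step. It is plain arithmetic on the figures quoted above; the helper name `time_reduction_pct` is made up for this example, and nothing here comes from the TwELL code base.

```python
def time_reduction_pct(throughput_gain_pct: float) -> float:
    """Convert a throughput gain (%) into the matching reduction in time per unit (%).

    If throughput rises by g, each token or step takes 1 / (1 + g) of the
    original time, so the time saved is 1 - 1 / (1 + g).
    """
    gain = throughput_gain_pct / 100.0
    return 100.0 * (1.0 - 1.0 / (1.0 + gain))


# Figures quoted in the summary for the 2B model on H100 PCIe.
inference_saving = time_reduction_pct(20.5)  # time per token
training_saving = time_reduction_pct(21.9)   # time per training step

# Relative drop in average non-zero activations between the 0.5B and 2B models.
nonzero_drop_pct = 100.0 * (1.0 - 24 / 39)

print(f"inference time per token: -{inference_saving:.1f}%")
print(f"training time per step:   -{training_saving:.1f}%")
print(f"average non-zeros:        -{nonzero_drop_pct:.1f}%")
```

A 20.5% throughput gain works out to about 17.0% less time per token, the same size as the reported energy-per-token reduction. That would be consistent with roughly unchanged average power draw, but this is an inference from the numbers, not something the source states.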

Key takeaway

For AI engineers optimizing LLM performance on NVIDIA GPUs, the TwELL kernels from Sakana AI and NVIDIA offer roughly 20% higher inference and training throughput and 17% lower energy per token in the reported 2B-model, H100 PCIe setup. Your teams should evaluate these open-source kernels, especially for larger models, where the efficiency gains are more pronounced. Weigh the 0.3-point accuracy drop against the speed and energy benefits.

Key insights

TwELL and new CUDA kernels exploit LLM sparsity for significant speedups and energy savings on NVIDIA GPUs.

Principles

Sparsity in LLMs is exploitable for efficiency.
Fused kernel operations reduce memory overhead.
Dynamic sparse formats improve training robustness.

Method

The approach combines TwELL, a tile-wise ELLPACK format integrated into the matmul kernel epilogue, with a fused inference kernel and a hybrid sparse format for training, alongside a modified training recipe (ReLU activation, L1 regularization).
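For readers unfamiliar with ELLPACK, the Python sketch below shows the general idea of applying it tile by tile: each row is cut into fixed-width column tiles, and each tile stores a fixed number of (value, column index) slots, padded when the tile holds fewer non-zeros. This is only a conceptual illustration. The function names `pack_tilewise_ell` and `ell_matmul`, the padding marker, the tile width, the slot count, and the choice to raise an error on overflow are all assumptions made for this example; the actual TwELL layout and CUDA kernels may differ.

```python
PAD = -1  # assumed marker for an unused slot (not taken from the TwELL code)


def pack_tilewise_ell(dense, tile_width, slots):
    """Pack dense rows into per-tile ELLPACK-style arrays.

    Returns (values, cols), both indexed as [row][tile][slot]. Column
    indices are global; unused slots hold value 0.0 and column PAD.
    """
    values, cols = [], []
    for row in dense:
        row_vals, row_cols = [], []
        for start in range(0, len(row), tile_width):
            tile = row[start:start + tile_width]
            nonzeros = [(v, start + j) for j, v in enumerate(tile) if v != 0.0]
            if len(nonzeros) > slots:
                # A hybrid scheme could route such rows elsewhere; here we just fail.
                raise ValueError("tile holds more non-zeros than available slots")
            nonzeros += [(0.0, PAD)] * (slots - len(nonzeros))
            row_vals.append([v for v, _ in nonzeros])
            row_cols.append([c for _, c in nonzeros])
        values.append(row_vals)
        cols.append(row_cols)
    return values, cols


def ell_matmul(values, cols, weight):
    """Compute activations @ weight while touching only the stored non-zeros."""
    out_dim = len(weight[0])
    out = []
    for row_vals, row_cols in zip(values, cols):
        acc = [0.0] * out_dim
        for tile_vals, tile_cols in zip(row_vals, row_cols):
            for v, c in zip(tile_vals, tile_cols):
                if c == PAD:
                    continue  # padded slot, nothing to accumulate
                for k in range(out_dim):
                    acc[k] += v * weight[c][k]
        out.append(acc)
    return out


# Two activation rows of width 8, packed with tiles of 4 columns and 2 slots each.
activations = [
    [0.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0],
]
weight = [[float(i), float(i + 1)] for i in range(8)]  # toy 8x2 weight matrix
packed_values, packed_cols = pack_tilewise_ell(activations, tile_width=4, slots=2)
print(ell_matmul(packed_values, packed_cols, weight))
```

The fixed slot count per tile is what makes the layout regular enough for GPU-friendly access, and the fewer non-zeros each row has, the less work the multiply loop performs.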

In practice

Replace SiLU with ReLU in LLM training.
Add L1 regularization at coefficient 2×10⁻⁵.
Utilize open-sourced TwELL kernels for LLM optimization.
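The first two recommendations can be illustrated with a small standard-library Python sketch. It shows why ReLU produces exact zeros where SiLU does not, and how an L1 term with the quoted coefficient would be computed. The digest does not say which tensor the L1 penalty applies to or how it is reduced; applying it to the feedforward hidden activations and summing absolute values are assumptions made for this example, as are the helper names.

```python
import math

L1_COEFF = 2e-5  # coefficient quoted in the digest


def relu(x: float) -> float:
    """ReLU: negative inputs become exactly zero."""
    return max(0.0, x)


def silu(x: float) -> float:
    """SiLU (x * sigmoid(x)), written to avoid overflow for large |x|."""
    if x >= 0.0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def zero_fraction(acts) -> float:
    """Fraction of activations that are exactly zero."""
    return sum(1 for a in acts if a == 0.0) / len(acts)


def l1_penalty(acts, coeff: float = L1_COEFF) -> float:
    """Assumed form of the regularizer: coeff * sum(|activation|)."""
    return coeff * sum(abs(a) for a in acts)


pre_activations = [-2.0, -0.5, 0.0, 1.0, 3.0]
relu_acts = [relu(x) for x in pre_activations]
silu_acts = [silu(x) for x in pre_activations]

print("ReLU zero fraction:", zero_fraction(relu_acts))
print("SiLU zero fraction:", zero_fraction(silu_acts))
print("L1 penalty on ReLU activations:", l1_penalty(relu_acts))
```

With SiLU, negative inputs yield small negative outputs instead of zeros, so there is nothing for a sparse kernel to skip. With ReLU, every negative input becomes an exact zero, and the L1 term adds pressure to shrink the remaining positive activations during training.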

Topics

Sakana AI
NVIDIA
TwELL
CUDA Kernels
LLM Inference

Code references

SakanaAI/sparser-faster-llms

Best for: MLOps Engineer, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.