Sakana AI and NVIDIA Introduce TwELL with CUDA Kernels for 20.5% Inference and 21.9% Training Speedup in LLMs
Summary
Sakana AI and NVIDIA have introduced TwELL, a new sparse format and set of CUDA kernels designed to exploit activation sparsity in large language models (LLMs) on modern GPUs. Feedforward layers account for over 80% of LLM compute, and much of that work is spent on activations that are zero. TwELL is produced directly in the matmul kernel epilogue, which removes the conversion overhead that previously made sparse operations unattractive on NVIDIA GPUs. The release includes a fused inference kernel that consumes gate activations in TwELL format, so hidden states are never written to global memory, and a hybrid sparse format for training that dynamically routes rows. For a 2B model on an H100 PCIe, TwELL delivered a 20.5% increase in inference throughput, a 21.9% increase in training step throughput, and a 17.0% reduction in energy per token, while accuracy dropped from 49.1% to 48.8%. The gains grow with model size, as the average number of non-zero activations falls from 39 (0.5B) to 24 (2B).
Key takeaway
For AI engineers optimizing LLM performance on NVIDIA GPUs, the TwELL kernels from Sakana AI and NVIDIA can raise inference and training throughput by roughly 20% on the reported 2B setup while reducing energy per token. Teams should evaluate integrating these open-source kernels, especially for larger models, where the efficiency gains are more pronounced. Weigh the small accuracy drop (49.1% to 48.8%) against the speed and energy benefits.
Key insights
TwELL and new CUDA kernels exploit LLM sparsity for significant speedups and energy savings on NVIDIA GPUs.
Principles
- Sparsity in LLMs is exploitable for efficiency.
- Fused kernel operations reduce memory overhead.
- Dynamic sparse formats improve training robustness.
Method
TwELL uses a tile-wise ELLPACK format within the matmul kernel epilogue, a fused inference kernel, and a hybrid sparse format for training, alongside a modified training recipe (ReLU, L1 regularization).
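To make the "tile-wise ELLPACK" idea concrete, here is a minimal pure-Python sketch of a generic ELLPACK-style layout applied per column tile. It is not the TwELL implementation: the function names, the `tile_width` and `slots_per_tile` parameters, and the `-1` padding convention are all assumptions chosen for illustration, and the actual kernels build the format on the GPU inside the matmul epilogue.

```python
def pack_tilewise_ell(rows, tile_width, slots_per_tile):
    """Pack a dense row-major matrix into a per-tile ELLPACK-style layout.

    Hypothetical layout: each row is split into column tiles of `tile_width`;
    every tile keeps exactly `slots_per_tile` (value, column) slots, padded
    with value 0.0 and column index -1.
    """
    packed_vals, packed_idx = [], []
    for row in rows:
        row_vals, row_idx = [], []
        for start in range(0, len(row), tile_width):
            tile = row[start:start + tile_width]
            nonzeros = [(start + j, v) for j, v in enumerate(tile) if v != 0.0]
            if len(nonzeros) > slots_per_tile:
                # Assumption: this sketch simply rejects overflowing tiles.
                raise ValueError("tile has more non-zeros than slots")
            pad = slots_per_tile - len(nonzeros)
            row_idx.append([c for c, _ in nonzeros] + [-1] * pad)
            row_vals.append([v for _, v in nonzeros] + [0.0] * pad)
        packed_vals.append(row_vals)
        packed_idx.append(row_idx)
    return packed_vals, packed_idx


def sparse_row_times_matrix(row_vals, row_idx, weight):
    """Compute (sparse activation row) @ weight using only stored non-zeros."""
    out = [0.0] * len(weight[0])
    for tile_vals, tile_idx in zip(row_vals, row_idx):
        for value, col in zip(tile_vals, tile_idx):
            if col < 0:  # padding slot, nothing to do
                continue
            for k, w in enumerate(weight[col]):
                out[k] += value * w
    return out


# Usage: one activation row with 2 non-zeros out of 8 entries.
activations = [[0.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0]]
vals, idx = pack_tilewise_ell(activations, tile_width=4, slots_per_tile=2)
weight = [[float(c), 1.0] for c in range(8)]  # toy 8x2 down-projection
result = sparse_row_times_matrix(vals[0], idx[0], weight)
```

The fixed number of slots per tile is what gives ELLPACK-style formats regular, predictable memory access. The sketch raises an error when a tile overflows; how the hybrid training format handles such rows is not detailed in this summary.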
In practice
- Replace SiLU with ReLU in LLM training.
- Add L1 regularization at coefficient 2×10⁻⁵.
- Utilize open-sourced TwELL kernels for LLM optimization.
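The first two recipe changes can be sketched in a few lines of pure Python. This is an illustration under stated assumptions, not the authors' training code: it assumes a gated feedforward block, assumes the L1 term is the mean absolute value of the gate activations, and uses hypothetical names (`gated_ffn`, `l1_penalty`). Only the ReLU swap and the 2×10⁻⁵ coefficient come from the summary above.

```python
L1_COEFF = 2e-5  # coefficient quoted in the recipe above


def relu(x):
    return x if x > 0.0 else 0.0


def matvec(x, w):
    """Multiply a vector x (length d_in) by a d_in x d_out matrix w."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]


def gated_ffn(x, w_gate, w_up, w_down):
    """Gated feedforward block with ReLU in place of SiLU on the gate.

    Returns the block output and the gate activations, so the caller can
    add a sparsity penalty on them.
    """
    gate = [relu(g) for g in matvec(x, w_gate)]  # exact zeros, unlike SiLU
    up = matvec(x, w_up)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(hidden, w_down), gate


def l1_penalty(activations, coeff=L1_COEFF):
    """Assumed form of the regularizer: coeff * mean(|activation|)."""
    return coeff * sum(abs(a) for a in activations) / len(activations)


# Usage: a toy 2 -> 3 -> 1 block; only one of three gate units fires.
x = [1.0, -2.0]
w_gate = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
w_up = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]]
w_down = [[1.0], [1.0], [1.0]]
output, gate = gated_ffn(x, w_gate, w_up, w_down)
task_loss = 0.0  # placeholder for the usual language-modeling loss
total_loss = task_loss + l1_penalty(gate)
```

ReLU matters here because it yields exact zeros that a sparse kernel can skip, and the L1 term pushes more gate activations to zero during training.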
Topics
- Sakana AI
- NVIDIA
- TwELL
- CUDA Kernels
- LLM Inference
Best for: MLOps Engineer, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Hardware Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.