Sakana AI and NVIDIA Introduce TwELL with CUDA Kernels for 20.5% Inference and 21.9% Training Speedup in LLMs

Source: Machine Learning ML & Generative AI News · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering) · Depth: Expert, quick

Summary

Sakana AI and NVIDIA have introduced TwELL, a sparse format with accompanying CUDA kernels designed to exploit activation sparsity in large language models (LLMs) on modern GPUs. Feedforward layers account for over 80% of LLM compute, and much of that compute is spent on zero-valued activations. TwELL is integrated directly into the matmul kernel epilogue, which eliminates the overheads previously associated with sparse operations on NVIDIA GPUs. The release includes a fused inference kernel that consumes gate activations in TwELL format, so hidden states are never written to global memory, and a hybrid sparse format for training that dynamically routes rows. For a 2B model on an H100 PCIe, TwELL delivered a 20.5% increase in inference throughput, a 21.9% increase in training step throughput, and a 17.0% reduction in energy per token, while accuracy dropped slightly from 49.1% to 48.8%. The gains grow with model size, as the average number of non-zero activations falls from 39 (0.5B) to 24 (2B).
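
To put the reported figures in context, the short Python sketch below converts the inference throughput gain into time per token and compares it with the reported energy saving. The constant-power assumption is ours and is not stated in the source; the variable names are illustrative only.

```python
# Reported figures from the summary (2B model, H100 PCIe).
inference_throughput_gain = 0.205   # +20.5% inference throughput
reported_energy_reduction = 0.170   # -17.0% energy per token
nonzeros_small_model = 39           # average non-zero activations, 0.5B model
nonzeros_large_model = 24           # average non-zero activations, 2B model

# Time per token is the reciprocal of throughput.
time_per_token_ratio = 1.0 / (1.0 + inference_throughput_gain)

# ASSUMPTION (not from the source): average power draw is unchanged,
# so energy per token scales with time per token.
implied_energy_reduction = 1.0 - time_per_token_ratio

# Relative drop in average non-zero activations from the 0.5B to the 2B model.
nonzero_drop = 1.0 - nonzeros_large_model / nonzeros_small_model

print(f"time per token vs. baseline: {time_per_token_ratio:.3f}x")
print(f"implied energy reduction at constant power: {implied_energy_reduction:.1%}")
print(f"reported energy reduction: {reported_energy_reduction:.1%}")
print(f"fewer non-zero activations at 2B vs. 0.5B: {nonzero_drop:.1%}")
```

Under that assumption the implied saving comes to about 17.0%, in line with the reported energy figure, and the 2B model has roughly 38.5% fewer non-zero activations on average than the 0.5B model.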

Key takeaway

For AI engineers optimizing LLM performance on NVIDIA GPUs, the TwELL kernels from Sakana AI and NVIDIA offer higher inference and training throughput and lower energy per token. Teams should evaluate integrating these open-source kernels, especially for larger models, where the reported efficiency gains are more pronounced. Weigh the small accuracy drop (49.1% to 48.8% on the 2B model) against the speed and energy benefits.

Key insights

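TwELL and its CUDA kernels exploit activation sparsity in LLM feedforward layers, with reported gains of 20.5% in inference throughput, 21.9% in training step throughput, and 17.0% lower energy per token for a 2B model on NVIDIA GPUs.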

Method

TwELL combines three components: a tile-wise ELLPACK format produced within the matmul kernel epilogue, a fused inference kernel, and a hybrid sparse format for training. These are paired with a modified training recipe (ReLU and L1 regularization).
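
As a rough illustration of what a tile-wise ELLPACK layout means, the pure-Python sketch below packs one row of ReLU activations into fixed-size slots per column tile and then multiplies by a down-projection matrix while visiting only the stored non-zeros. This is a minimal sketch under our own assumptions, not the actual TwELL layout or kernels: the names `pack_tilewise_ell`, `sparse_down_projection`, the `PAD` marker, the tile width, and the slot count are all hypothetical.

```python
from typing import List, Tuple

PAD = -1  # hypothetical marker for an unused slot


def relu(x: float) -> float:
    """ReLU produces exact zeros, which is what makes the row sparse."""
    return x if x > 0.0 else 0.0


def pack_tilewise_ell(
    row: List[float], tile_width: int, slots: int
) -> Tuple[List[List[float]], List[List[int]]]:
    """Pack one row into per-tile slots of non-zero values and column indices.

    Each tile of `tile_width` columns gets exactly `slots` entries (ELLPACK
    style); unused slots are padded. Overflow handling is NOT modeled here.
    """
    values: List[List[float]] = []
    cols: List[List[int]] = []
    for start in range(0, len(row), tile_width):
        tile = row[start:start + tile_width]
        nonzeros = [(v, start + j) for j, v in enumerate(tile) if v != 0.0]
        if len(nonzeros) > slots:
            raise ValueError("tile has more non-zeros than available slots")
        padding = slots - len(nonzeros)
        values.append([v for v, _ in nonzeros] + [0.0] * padding)
        cols.append([c for _, c in nonzeros] + [PAD] * padding)
    return values, cols


def sparse_down_projection(
    values: List[List[float]], cols: List[List[int]], w_down: List[List[float]]
) -> List[float]:
    """Compute h @ w_down using only stored non-zeros (w_down is [hidden][out])."""
    out = [0.0] * len(w_down[0])
    for tile_values, tile_cols in zip(values, cols):
        for v, c in zip(tile_values, tile_cols):
            if c == PAD:
                continue  # skip padded slots
            for k, w in enumerate(w_down[c]):
                out[k] += v * w
    return out


def dense_down_projection(h: List[float], w_down: List[List[float]]) -> List[float]:
    """Reference dense product for comparison."""
    return [
        sum(h[c] * w_down[c][k] for c in range(len(h)))
        for k in range(len(w_down[0]))
    ]


# Toy data: 8 hidden units, 2 output features.
pre_activations = [-1.0, 2.0, -0.5, 0.0, 3.0, -2.0, -4.0, 1.0]
h = [relu(x) for x in pre_activations]
w_down = [[float(i + 1), float(-i)] for i in range(8)]

values, cols = pack_tilewise_ell(h, tile_width=4, slots=2)
sparse_out = sparse_down_projection(values, cols, w_down)
dense_out = dense_down_projection(h, w_down)
print(values, cols)
print(sparse_out, dense_out)
```

The sparse path touches three rows of `w_down` instead of eight and gives the same result as the dense reference. A real GPU kernel would emit the packed tiles from the matmul epilogue and process them in parallel; the sketch also simply raises an error when a tile overflows its slots, whereas the source describes a hybrid format that routes rows dynamically during training.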

Best for: MLOps Engineer, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.