Moonshot AI Open-Sources FlashKDA: CUTLASS Kernels for Kimi Delta Attention with Variable-Length Batching and H20 Benchmarks

Source: Machine Learning ML & Generative AI News · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, quick

Summary

Moonshot AI has open-sourced FlashKDA, a set of CUTLASS kernels that accelerate Kimi Delta Attention (KDA), the core mechanism in its Kimi Linear hybrid model. On NVIDIA H20 GPUs, FlashKDA runs prefill 1.72x to 2.22x faster than the flash-linear-attention baseline. KDA itself reduces KV cache usage by up to 75% and boosts decoding throughput by up to 6x at 1M context length by replacing the growing KV cache of standard attention with a fixed-size recurrent state. FlashKDA supports variable-length batching via `cu_seqlens` and is auto-dispatched from `chunk_kda` in `flash-linear-attention`, so existing callers need no code changes. It is released under the MIT license and requires SM90+ GPUs, CUDA 12.9+, and PyTorch 2.4+.
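To make the fixed-size-state point concrete, the sketch below compares how per-layer memory scales with context length for a conventional KV cache versus a constant-size recurrent state. Every number in it is an illustrative assumption: the head count, head dimension, and 2-byte element size are not Kimi Linear's actual configuration, and the state shape (one `head_dim x head_dim` matrix per head) is the generic linear-attention form rather than a confirmed detail of KDA.

```python
def kv_cache_bytes(seq_len, num_heads, head_dim, bytes_per_elem=2):
    """Per-layer KV cache size: one key and one value vector per token per head."""
    return 2 * seq_len * num_heads * head_dim * bytes_per_elem


def recurrent_state_bytes(num_heads, head_dim, bytes_per_elem=2):
    """Per-layer recurrent state size: one head_dim x head_dim matrix per head.

    Note that seq_len does not appear: the state does not grow with context.
    """
    return num_heads * head_dim * head_dim * bytes_per_elem


# Hypothetical dimensions, chosen only for illustration.
NUM_HEADS = 32
HEAD_DIM = 128

for seq_len in (4_096, 131_072, 1_048_576):
    kv_mib = kv_cache_bytes(seq_len, NUM_HEADS, HEAD_DIM) / 2**20
    state_mib = recurrent_state_bytes(NUM_HEADS, HEAD_DIM) / 2**20
    print(f"seq_len={seq_len:>9,}  KV cache={kv_mib:>8.1f} MiB  state={state_mib:.1f} MiB")
```

The KV cache term grows linearly with `seq_len`, while the recurrent state stays constant, which is why the savings are most visible at very long contexts.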

Key takeaway

For AI engineers optimizing large language model inference, FlashKDA offers a substantial prefill speedup for Kimi Delta Attention. If your project runs Kimi Linear or another KDA-based model through `flash-linear-attention`, FlashKDA can cut prefill time, with reported gains of 1.72x to 2.22x on NVIDIA H20 GPUs. Because the kernels are MIT-licensed and dispatched automatically, adopting them should not require changes to calling code.

Key insights

FlashKDA accelerates Kimi Delta Attention prefill, enhancing Moonshot AI's Kimi Linear model performance.

Method

FlashKDA implements Kimi Delta Attention as CUTLASS kernels, supports variable-length batching through `cu_seqlens`, and plugs into existing `flash-linear-attention` code via automatic dispatch from `chunk_kda`.
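Variable-length batching with `cu_seqlens` typically follows the convention used by other varlen attention kernels: sequences of different lengths are concatenated along the token axis without padding, and an array of cumulative lengths marks where each sequence starts and ends. The pure-Python sketch below illustrates that layout only. The helper names (`build_cu_seqlens`, `pack`, `unpack`) are hypothetical and it does not call FlashKDA or `flash-linear-attention`.

```python
from itertools import accumulate


def build_cu_seqlens(seq_lens):
    """Cumulative sequence lengths with a leading 0: N lengths give N + 1 offsets."""
    return [0] + list(accumulate(seq_lens))


def pack(sequences):
    """Concatenate sequences into one flat token list plus its boundary offsets."""
    packed = [token for seq in sequences for token in seq]
    return packed, build_cu_seqlens([len(seq) for seq in sequences])


def unpack(packed, cu_seqlens):
    """Recover the original sequences by slicing between adjacent offsets."""
    return [packed[start:end] for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:])]


# Three sequences of lengths 3, 1, and 2, batched with no padding tokens.
batch = [[11, 12, 13], [21], [31, 32]]
packed, cu_seqlens = pack(batch)
print(packed)
print(cu_seqlens)
```

Sequence `i` occupies `packed[cu_seqlens[i]:cu_seqlens[i + 1]]`, so a kernel can process a ragged batch without spending compute on padding.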

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.