StreamKL: Fast and Memory-Efficient KL Divergence for Boosting Attention Distillation

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, quick

Summary

Attention distillation is widely used in knowledge distillation, model compression, continual learning, and sparse-attention LLM training, but matching attention distributions incurs $O(N_QN_K)$ memory and IO costs that become prohibitive at long context lengths. StreamKL is a fused GPU primitive that addresses this. It introduces an online formulation of the coupled two-distribution KL reduction, so a single one-pass forward kernel can stream query-key tiles through on-chip SRAM. In the backward pass, StreamKL recomputes attention probabilities tile by tile instead of storing quadratic intermediates. Together these reduce the extra HBM footprint of attention distillation from $O(N_QN_K)$ to $O(1)$, making long-context distillation feasible on a single GPU. The reported speedups over baseline methods reach $43\times$ in the forward pass and $14\times$ in the backward pass.

Key takeaway

For machine learning engineers training large language models or performing knowledge distillation, StreamKL removes the quadratic memory bottleneck of attention distillation. If $O(N_QN_K)$ memory costs are blocking your long-context runs, its fused GPU primitive lets you perform attention distillation on a single GPU, with a smaller HBM footprint and faster forward and backward passes. Consider evaluating StreamKL for long-context training that was previously infeasible.

Key insights

StreamKL enables memory-efficient, long-context attention distillation by eliminating quadratic materialization of attention distributions through an online KL reduction.
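
One way to see why no quadratic buffer is needed is sketched below. It is an illustration under stated assumptions, not necessarily the exact formulation used by StreamKL. For a single query, let $t_j$ and $s_j$ denote the teacher and student attention logits over keys $j$, with $p = \mathrm{softmax}(t)$ and $q = \mathrm{softmax}(s)$. The symbols $t$, $s$, $m$, $\ell_t$, $\ell_s$, $a$ and the direction $\mathrm{KL}(p\,\|\,q)$ are assumptions made for this sketch.

$$\mathrm{KL}(p\,\|\,q) = \sum_j p_j\,(t_j - s_j) \;-\; \log\sum_j e^{t_j} \;+\; \log\sum_j e^{s_j}$$

With running maxima $m_t$, $m_s$ and the running sums

$$\ell_t = \sum_j e^{t_j - m_t}, \qquad \ell_s = \sum_j e^{s_j - m_s}, \qquad a = \sum_j e^{t_j - m_t}\,(t_j - s_j),$$

the divergence becomes

$$\mathrm{KL}(p\,\|\,q) = \frac{a}{\ell_t} - \bigl(m_t + \log \ell_t\bigr) + \bigl(m_s + \log \ell_s\bigr).$$

When a new key tile raises a maximum from $m_t$ to $m_t'$, the sums accumulated so far are rescaled by $e^{m_t - m_t'}$ before the tile's contribution is added (and likewise for $m_s$). Each query therefore carries only five scalars across tiles, and neither probability matrix has to be materialized.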

Principles

Method

StreamKL uses an online formulation for coupled two-distribution KL reduction, executing a single one-pass forward kernel that streams query-key tiles via on-chip SRAM. The backward pass recomputes attention probabilities tile-by-tile.
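
The one-pass idea can be illustrated on the CPU with NumPy. This is a minimal sketch under stated assumptions, not the StreamKL kernel: the function names `dense_kl` and `streaming_kl`, the `tile` parameter, the $1/\sqrt{d}$ score scaling, and the direction KL(teacher || student) are all hypothetical choices made for illustration. The loop keeps only per-query running statistics (two maxima, two normalizers, and one weighted sum) while streaming over key tiles, and it is checked against a dense reference that materializes both score matrices.

```python
import numpy as np


def dense_kl(q_t, k_t, q_s, k_s):
    """Reference: materializes both full N_Q x N_K score matrices."""
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    log_p = log_softmax(q_t @ k_t.T / np.sqrt(q_t.shape[-1]))  # teacher
    log_q = log_softmax(q_s @ k_s.T / np.sqrt(q_s.shape[-1]))  # student
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)


def streaming_kl(q_t, k_t, q_s, k_s, tile=64):
    """Per-query KL(teacher || student) computed one key tile at a time.

    Only O(N_Q) running statistics persist across tiles; the largest
    temporary is an N_Q x tile block of scores.
    """
    n_q, n_k = q_t.shape[0], k_t.shape[0]
    scale_t = 1.0 / np.sqrt(q_t.shape[-1])
    scale_s = 1.0 / np.sqrt(q_s.shape[-1])

    m_t = np.full(n_q, -np.inf)  # running max of teacher logits
    l_t = np.zeros(n_q)          # running sum of exp(t - m_t)
    acc = np.zeros(n_q)          # running sum of exp(t - m_t) * (t - s)
    m_s = np.full(n_q, -np.inf)  # running max of student logits
    l_s = np.zeros(n_q)          # running sum of exp(s - m_s)

    for start in range(0, n_k, tile):
        stop = start + tile
        t = (q_t @ k_t[start:stop].T) * scale_t  # teacher score tile
        s = (q_s @ k_s[start:stop].T) * scale_s  # student score tile

        # Teacher side: rescale old sums to the new max, then add the tile.
        m_t_new = np.maximum(m_t, t.max(axis=1))
        rescale = np.exp(m_t - m_t_new)
        w = np.exp(t - m_t_new[:, None])
        l_t = l_t * rescale + w.sum(axis=1)
        acc = acc * rescale + (w * (t - s)).sum(axis=1)
        m_t = m_t_new

        # Student side: only the log-normalizer is needed.
        m_s_new = np.maximum(m_s, s.max(axis=1))
        l_s = l_s * np.exp(m_s - m_s_new) + np.exp(s - m_s_new[:, None]).sum(axis=1)
        m_s = m_s_new

    # KL = E_p[t - s] - logsumexp(t) + logsumexp(s)
    return acc / l_t - (m_t + np.log(l_t)) + (m_s + np.log(l_s))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q_t, k_t = rng.standard_normal((16, 32)), rng.standard_normal((200, 32))
    q_s, k_s = rng.standard_normal((16, 32)), rng.standard_normal((200, 32))
    diff = np.abs(streaming_kl(q_t, k_t, q_s, k_s) - dense_kl(q_t, k_t, q_s, k_s))
    print("max abs difference vs dense reference:", diff.max())
```

A fused GPU kernel would keep these statistics in registers or on-chip SRAM per query tile rather than in NumPy arrays. The sketch is meant only to show that the reduction is associative across key tiles, which is what makes a single streaming pass possible.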

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.