OjaKV: Context-Aware Online Low-Rank KV Cache Compression

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Software Development & Engineering · Depth: Expert, extended

Summary

OjaKV is a framework for reducing the memory footprint of the key-value (KV) cache in large language models (LLMs) during long-context autoregressive generation. For instance, a Llama-3.1-8B model processing a 32K-token prompt at a batch size of 4 requires approximately 16 GB for its KV cache, which rivals the size of the model's weights. Existing low-rank compression methods rely on static subspaces learned offline and perform poorly when the data distribution shifts. OjaKV instead pairs a hybrid storage policy with online subspace adaptation. It keeps the first and most recent tokens at full rank and applies low-rank compression to the intermediate tokens. The compression uses Oja's algorithm for online principal component analysis (PCA), adapting the projection basis during both prompt prefilling and decoding. OjaKV is compatible with modern attention modules such as FlashAttention and demonstrates superior zero-shot accuracy at high compression ratios, especially on complex, long-context reasoning benchmarks. Combined with token-selection methods, it yields compounded memory savings.
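The 16 GB figure above can be checked with a short back-of-the-envelope calculation. The sketch below assumes the publicly documented Llama-3.1-8B configuration (32 layers, 8 key-value heads under grouped-query attention, head dimension 128) and 16-bit storage; none of these values are stated in this summary, and the helper name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """Size of an uncompressed KV cache in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes fp16/bf16 storage (an assumption, not from the source).
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Assumed Llama-3.1-8B shape: 32 layers, 8 KV heads, head_dim 128.
total = kv_cache_bytes(
    n_layers=32, n_kv_heads=8, head_dim=128,
    seq_len=32 * 1024, batch_size=4,
)
print(total / 2**30)  # 16.0 (GiB)
```

Under these assumptions each token costs 128 KiB of cache, so the total grows linearly with both sequence length and batch size.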

Key takeaway

For NLP engineers and research scientists building long-context LLM applications, OjaKV offers a practical, plug-and-play way to reduce KV cache memory without fine-tuning. It delivers significant memory savings (e.g., from 16 GB to 11.6 GB for Llama-3.1-8B at 32K tokens, a reduction of roughly 27%) while maintaining or improving accuracy, particularly on dynamic, long-context reasoning tasks. Consider combining it with token-selection methods for even greater memory efficiency.

Key insights

OjaKV dynamically adapts KV cache compression using online PCA and a hybrid storage policy to maintain accuracy in long-context LLMs.
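As an illustration of the hybrid storage idea, here is a minimal NumPy sketch, not the authors' implementation. It keeps a few leading and trailing tokens at full rank and stores only low-rank coefficients for the tokens in between. The function names, the parameters `n_first` and `n_recent`, and all sizes are hypothetical, and the basis `W` is a random orthonormal placeholder rather than a basis learned online.

```python
import numpy as np


def hybrid_compress(K, W, n_first=4, n_recent=32):
    """Split a (T, d) key or value matrix into full-rank and low-rank parts.

    K: (T, d) cached vectors for one head.
    W: (d, r) projection basis with orthonormal columns.
    Returns (head, middle, tail): head and tail stay (., d), middle is (., r).
    """
    T = K.shape[0]
    if T <= n_first + n_recent:
        # Too short to have any intermediate tokens: keep everything full rank.
        return K, np.empty((0, W.shape[1]), dtype=K.dtype), K[:0]
    head = K[:n_first]                       # first tokens, full rank
    middle = K[n_first:T - n_recent] @ W     # intermediate tokens, rank-r coefficients
    tail = K[T - n_recent:]                  # most recent tokens, full rank
    return head, middle, tail


def hybrid_restore(head, middle, tail, W):
    """Rebuild an approximate (T, d) matrix from the stored parts."""
    return np.concatenate([head, middle @ W.T, tail], axis=0)


rng = np.random.default_rng(0)
T, d, r = 1024, 128, 32                      # illustrative sizes only
K = rng.standard_normal((T, d))
W, _ = np.linalg.qr(rng.standard_normal((d, r)))  # placeholder basis

head, middle, tail = hybrid_compress(K, W)
restored = hybrid_restore(head, middle, tail, W)
stored_fraction = (head.size + middle.size + tail.size) / K.size
print(f"stored fraction of the original cache: {stored_fraction:.3f}")
```

The stored fraction ignores the basis itself, which adds d × r values per head. The first and last tokens are recovered exactly, while the intermediate tokens are only as accurate as the basis allows, which is why adapting the basis to the current context matters.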

Principles

Method

OjaKV uses a hybrid KV cache policy: the first and most recent tokens are kept at full rank, while intermediate tokens are compressed into a low-rank subspace. The projection basis for that subspace is learned online with Oja's algorithm, with a comprehensive update during prefill and lightweight periodic updates during decoding.
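To make the online PCA step concrete, the following sketch shows a textbook batched form of Oja's subspace rule followed by QR re-orthonormalization. It is a generic illustration under stated assumptions, not the paper's exact update or schedule: the function names, learning rate, batch sizes, and synthetic data are all hypothetical.

```python
import numpy as np


def oja_update(W, X, lr):
    """One batched Oja step, then QR to restore orthonormal columns.

    W: (d, r) current basis with orthonormal columns.
    X: (n, d) batch of newly observed key or value vectors.
    """
    Y = X @ W                                    # (n, r) projections onto the basis
    grad = (X.T @ Y - W @ (Y.T @ Y)) / len(X)    # Oja's subspace rule, batch-averaged
    Q, _ = np.linalg.qr(W + lr * grad)           # re-orthonormalize the columns
    return Q


def rel_error(W, X):
    """Relative error of reconstructing X from its rank-r projection."""
    return np.linalg.norm(X - (X @ W) @ W.T) / np.linalg.norm(X)


rng = np.random.default_rng(0)
d, r = 64, 8
B, _ = np.linalg.qr(rng.standard_normal((d, r)))   # hidden subspace generating the data
W, _ = np.linalg.qr(rng.standard_normal((d, r)))   # random starting basis


def sample(n):
    # Synthetic vectors: mostly inside span(B), plus a little isotropic noise.
    return rng.standard_normal((n, r)) @ B.T + 0.01 * rng.standard_normal((n, d))


X_test = sample(256)
err_before = rel_error(W, X_test)
for _ in range(300):
    W = oja_update(W, sample(32), lr=0.05)         # stream of small batches
err_after = rel_error(W, X_test)
print(f"relative reconstruction error: {err_before:.3f} -> {err_after:.3f}")
```

One way to map this onto the description above would be to call `oja_update` on large batches during prefill and on small batches every few decoding steps. That mapping is an interpretation of the summary, not a confirmed detail of the method.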

Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.