Still: Amortized KV Cache Compaction in a Single Forward Pass
Summary
Still is an amortized KV cache compaction method for long-horizon language models, where the KV cache is a critical memory bottleneck. A lightweight per-layer Perceiver, trained once against a frozen base model, generates compact keys and values in a single forward pass. On Qwen and Gemma models, Still achieves a superior speed-quality trade-off across compression ratios from 8x to 200x and context lengths from 8k to 128k. It surpasses the strongest baseline on the RULER grid by 8-22 points and supports free-form summarization, preserving full-context gains on HELMET and outperforming KV-Distill on LongBench. Because compaction is a single forward pass, it can be applied iteratively, which makes long-context cache compaction tractable and useful even at extreme compression.
Key takeaway
Machine learning engineers optimizing long-horizon language models should evaluate Still for KV cache management. The method reduces the KV cache memory bottleneck while preserving model quality at context lengths from 8k to 128k, even at 200x compression. Because it runs in a single forward pass, it can be applied iteratively, which may open up long-context applications that were previously intractable, and the compacted cache still supports free-form summarization.
Key insights
Still offers efficient, high-quality KV cache compaction for long-context language models, using a per-layer Perceiver that runs in a single forward pass.
Principles
- KV cache compaction is crucial for long-horizon LM deployment.
- Amortized synthesis outperforms selection and per-context methods.
- Iterative compaction extends effective context length significantly.
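The last principle can be illustrated with a toy streaming loop. This is only a sketch of the control flow: `mean_pool_compactor` is a hypothetical stand-in that averages groups of entries, not Still's learned Perceiver, and the `ratio` and `budget` parameters are illustrative assumptions rather than values from the source.

```python
def mean_pool_compactor(entries, ratio):
    """Stand-in for a learned compactor: average each group of `ratio` entries."""
    compacted = []
    for start in range(0, len(entries), ratio):
        group = entries[start:start + ratio]
        dim = len(group[0])
        compacted.append([sum(e[j] for e in group) / len(group) for j in range(dim)])
    return compacted


def stream_with_compaction(chunks, ratio, budget):
    """Append chunks to a cache and compact whenever it exceeds `budget` entries."""
    cache = []
    peak = 0
    for chunk in chunks:
        cache.extend(chunk)
        peak = max(peak, len(cache))
        if len(cache) > budget:
            # Re-compact the whole cache, including entries compacted earlier.
            cache = mean_pool_compactor(cache, ratio)
    return cache, peak


# 16 chunks of 32 toy two-dimensional entries each (512 entries in total).
chunks = [[[float(t), 1.0] for t in range(c * 32, (c + 1) * 32)] for c in range(16)]
total = sum(len(chunk) for chunk in chunks)
cache, peak = stream_with_compaction(chunks, ratio=4, budget=48)
print(f"streamed {total} entries, peak cache {peak}, final cache {len(cache)}")
```

The point of the loop is that the cache never grows past the budget plus one chunk, no matter how many chunks are streamed, because earlier compacted entries are themselves compacted again.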
Method
Still employs a small per-layer Perceiver, trained once against a frozen base model, to produce compact keys and values in a single forward pass, enabling iterative application.
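As a rough illustration of the idea, a Perceiver-style compactor can be sketched as a small set of latent queries that cross-attend over the cached keys and values of one layer. Everything below is an assumption made for illustration (single head, random weights standing in for trained ones, the names `compact_layer`, `latents`, `w_k`, and `w_v`); the source does not specify Still's actual architecture at this level of detail.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(mat, vec):
    # Multiply a (rows x d) matrix by a length-d vector.
    return [sum(w * x for w, x in zip(row, vec)) for row in mat]


def compact_layer(keys, values, latents, w_k, w_v):
    """Cross-attend m latent queries over n cached (key, value) pairs.

    Hypothetical single-head sketch: returns m compact keys and m compact values.
    """
    d = len(keys[0])
    compact_keys, compact_values = [], []
    for q in latents:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        attn = softmax(scores)
        # Attention-weighted summaries of the cached keys and values.
        k_mix = [sum(a * k[j] for a, k in zip(attn, keys)) for j in range(d)]
        v_mix = [sum(a * v[j] for a, v in zip(attn, values)) for j in range(d)]
        # Separate output projections for the compact key and compact value.
        compact_keys.append(matvec(w_k, k_mix))
        compact_values.append(matvec(w_v, v_mix))
    return compact_keys, compact_values


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n, m, d = 64, 8, 4                    # 64 cached entries -> 8 compact entries (8x)
keys = random_matrix(n, d, rng)
values = random_matrix(n, d, rng)
latents = random_matrix(m, d, rng)    # would be learned during training
w_k = random_matrix(d, d, rng)        # would be learned during training
w_v = random_matrix(d, d, rng)        # would be learned during training
ck, cv = compact_layer(keys, values, latents, w_k, w_v)
print(f"{n} cached entries -> {len(ck)} compact entries of width {len(ck[0])}")
```

The number of latents fixes the size of the compacted cache, so the compression ratio is set by `n / m` rather than by the content of the context. The compactor's own weights are the only trained parameters in this sketch, which mirrors the description of training against a frozen base model.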
In practice
- Apply Still for 8x to 200x KV cache compression.
- Use Still at context lengths from 8k to 128k, the range evaluated.
- Integrate Still for improved free-form summarization.
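To make the compression ratios concrete, the short calculation below shows how many cache entries remain at the two ends of the reported range. It assumes that "8k" and "128k" mean 8,000 and 128,000 tokens and that the ratio applies uniformly to the token dimension of the cache; both are simplifying assumptions, and `compact_entries` is a hypothetical helper.

```python
def compact_entries(context_tokens, ratio):
    """Entries left after compaction at the given ratio (ceiling division)."""
    return -(-context_tokens // ratio)


for context_tokens in (8_000, 128_000):
    for ratio in (8, 200):
        kept = compact_entries(context_tokens, ratio)
        print(f"{context_tokens:>7} tokens at {ratio:>3}x -> {kept:>6} entries")
```

Under these assumptions, a 128,000-token cache at 200x keeps 640 entries, and an 8,000-token cache at 8x keeps 1,000.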
Topics
- KV Cache Compaction
- Language Models
- Perceiver Models
- Memory Optimization
- Long-Context NLP
- Qwen
- Gemma
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.