Still: Amortized KV Cache Compaction in a Single Forward Pass

· Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, quick

Summary

Still is an amortized KV cache compaction method for long-horizon language models, targeting the memory bottleneck created by the KV cache. It uses a lightweight per-layer Perceiver, trained once against a frozen base model, that generates compact keys and values in a single forward pass. On Qwen and Gemma models, Still achieves a superior speed-quality trade-off across compression ratios from 8x to 200x and context lengths from 8k to 128k. It surpasses the strongest baseline on the RULER grid by 8-22 points. It also supports free-form summarization, preserving full-context gains on HELMET and outperforming KV-Distill on LongBench. Because compaction is a single forward pass, it can be applied iteratively, which makes long-context cache compaction tractable and useful even at extreme compression.
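To give a rough sense of what the 8x and 200x compression ratios mean for memory, the arithmetic below estimates KV cache size for a 128k-token context. The model shape (32 layers, 8 KV heads, head dimension 128, 2 bytes per value) is a hypothetical configuration chosen only for illustration; it is not taken from the paper or from any specific Qwen or Gemma checkpoint.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Size of a KV cache in bytes; the leading 2 covers keys and values."""
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_value


# Hypothetical model shape, assumed for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
CONTEXT = 128 * 1024  # 128k tokens

full = kv_cache_bytes(CONTEXT, LAYERS, KV_HEADS, HEAD_DIM)
print(f"full cache: {full / 2**30:.1f} GiB")

for ratio in (8, 200):
    slots = CONTEXT // ratio  # compact cache entries kept per layer
    compact = kv_cache_bytes(slots, LAYERS, KV_HEADS, HEAD_DIM)
    print(f"{ratio}x: {slots} slots, {compact / 2**20:.1f} MiB")
```

Under these assumed dimensions the cache scales linearly with the number of retained slots, so the compression ratio translates directly into the memory saving.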

Key takeaway

Machine learning engineers optimizing long-horizon language models should evaluate Still for KV cache management. The method reduces the KV cache memory bottleneck at context lengths from 8k to 128k while preserving model quality, with compression ratios of up to 200x. Because it runs as a single forward pass, it can be applied iteratively, which makes long-context cache compaction tractable, and it supports free-form summarization.

Key insights

Still offers efficient, high-quality KV cache compaction for long-context language models, using a Perceiver that runs in a single forward pass.

Principles

Method

Still uses a small per-layer Perceiver, trained once against a frozen base model, to produce compact keys and values in a single forward pass. Because compaction is only a forward pass, it can be applied iteratively.
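The idea of a per-layer Perceiver compactor can be sketched as a set of latent queries that cross-attend to a layer's cached keys and values and emit a shorter cache. The code below is a minimal sketch under stated assumptions, not the paper's architecture: the class name `LayerCompactor`, its projection matrices, the single-head layout, and the random weights standing in for trained parameters are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # stabilize before exponentiating
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class LayerCompactor:
    """Hypothetical Perceiver-style compactor for one layer (single head)."""

    def __init__(self, num_latents, head_dim, rng):
        scale = head_dim ** -0.5
        # Random values stand in for parameters that would be trained
        # once against a frozen base model.
        self.latents = rng.normal(size=(num_latents, head_dim)) * scale
        self.w_q = rng.normal(size=(head_dim, head_dim)) * scale
        self.w_k = rng.normal(size=(head_dim, head_dim)) * scale
        self.w_out_k = rng.normal(size=(head_dim, head_dim)) * scale
        self.w_out_v = rng.normal(size=(head_dim, head_dim)) * scale

    def __call__(self, keys, values):
        # keys, values: (seq_len, head_dim)
        q = self.latents @ self.w_q                      # (num_latents, head_dim)
        k = keys @ self.w_k                              # (seq_len, head_dim)
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (num_latents, seq_len)
        compact_k = (attn @ keys) @ self.w_out_k         # (num_latents, head_dim)
        compact_v = (attn @ values) @ self.w_out_v
        return compact_k, compact_v


rng = np.random.default_rng(0)
compactor = LayerCompactor(num_latents=16, head_dim=64, rng=rng)

# One forward pass: 1024 cached positions -> 16 compact slots (64x).
keys, values = rng.normal(size=(1024, 64)), rng.normal(size=(1024, 64))
ck, cv = compactor(keys, values)

# Iterative application: append a new chunk to the compact cache, compact again.
new_k, new_v = rng.normal(size=(1024, 64)), rng.normal(size=(1024, 64))
ck2, cv2 = compactor(np.concatenate([ck, new_k]), np.concatenate([cv, new_v]))
print(ck.shape, ck2.shape)
```

The output length is fixed by the number of latents rather than by the input length, which is what lets the same module be re-applied to its own output plus new tokens.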

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.