Still: Amortized KV Cache Compaction in a Single Forward Pass

2026-06-05 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, quick

Summary

Still is an amortized KV cache compaction method for long-horizon language models, where the KV cache is a major memory bottleneck. A small per-layer Perceiver, trained once against a frozen base model, generates compact keys and values in a single forward pass. On Qwen and Gemma models, Still reports the best speed-quality trade-off across compression ratios from 8x to 200x and context lengths from 8k to 128k. It beats the strongest baseline on the RULER grid by 8 to 22 points, supports free-form summarization, preserves full-context gains on HELMET, and outperforms KV-Distill on LongBench. Because compaction is a single forward pass, it can be applied iteratively, which makes long-context cache compaction tractable and useful even at extreme compression.

Key takeaway

Machine learning engineers optimizing long-horizon language models should evaluate Still for KV cache management. The method reduces KV cache memory at context lengths from 8k to 128k and is reported to remain useful at compression ratios up to 200x. Because it runs in a single forward pass, it can be applied iteratively, which may make previously impractical long-context applications feasible and improve summarization performance.

Key insights

Still offers efficient, high-quality KV cache compaction for long-context language models, using a per-layer Perceiver that runs in a single forward pass.

Principles

KV cache compaction is crucial for long-horizon LM deployment.
Amortized synthesis (a compactor trained once and reused across contexts) outperforms selection-based and per-context methods.
Iterative compaction extends effective context length significantly.
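
The iterative idea can be sketched as a loop that compacts the running cache together with each new chunk, so the cache never grows past a fixed budget plus one chunk. This is a minimal illustration only: `stand_in_compact` is a hypothetical placeholder that averages contiguous groups of entries, not Still's learned compactor, and the chunk size and budget are made-up values.

```python
import numpy as np

def stand_in_compact(keys, values, budget):
    """Placeholder compactor (assumption): average contiguous groups of
    rows down to `budget` rows. This is NOT Still's learned Perceiver."""
    if len(keys) <= budget:
        return keys, values
    groups = np.array_split(np.arange(len(keys)), budget)
    compact_keys = np.stack([keys[g].mean(axis=0) for g in groups])
    compact_values = np.stack([values[g].mean(axis=0) for g in groups])
    return compact_keys, compact_values

def iterative_compaction(chunks, budget, compact=stand_in_compact):
    """Fold a stream of (keys, values) chunks into a bounded cache."""
    cache_keys = cache_values = None
    peak_len = 0
    for keys, values in chunks:
        if cache_keys is not None:
            # Re-compact the old compact cache together with the new chunk.
            keys = np.concatenate([cache_keys, keys])
            values = np.concatenate([cache_values, values])
        peak_len = max(peak_len, len(keys))
        cache_keys, cache_values = compact(keys, values, budget)
    return cache_keys, cache_values, peak_len

rng = np.random.default_rng(0)
# Eight hypothetical chunks of 512 tokens each, head dimension 16.
chunks = [(rng.normal(size=(512, 16)), rng.normal(size=(512, 16)))
          for _ in range(8)]
final_keys, final_values, peak_len = iterative_compaction(chunks, budget=64)
```

With these illustrative numbers, 4096 tokens are processed while the cache holds at most 576 entries (the 64-entry budget plus one 512-token chunk) at any point.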

Method

Still employs a small per-layer Perceiver, trained once against a frozen base model, to produce compact keys and values in a single forward pass, enabling iterative application.
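
As a rough illustration of the Perceiver idea, the sketch below uses a fixed set of latent queries that cross-attend to one layer's full keys and values and emit a shorter set of compact keys and values in one pass. Everything here is an assumption for illustration: the class name `LayerCompactor`, the projection layout, and the random (untrained) weights are hypothetical, and the summary does not specify Still's actual architecture or training objective.

```python
import numpy as np

def softmax(scores):
    # Numerically stable softmax over the last axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

class LayerCompactor:
    """Hypothetical Perceiver-style compactor for a single layer and head.
    Weights are random here; in a real system they would be trained once
    against the frozen base model."""

    def __init__(self, head_dim, num_latents, seed=0):
        rng = np.random.default_rng(seed)
        scale = head_dim ** -0.5
        self.head_dim = head_dim
        self.latents = rng.normal(size=(num_latents, head_dim))
        self.w_query = rng.normal(size=(head_dim, head_dim)) * scale
        self.w_key_out = rng.normal(size=(head_dim, head_dim)) * scale
        self.w_value_out = rng.normal(size=(head_dim, head_dim)) * scale

    def compact(self, keys, values):
        # keys, values: (seq_len, head_dim) for one layer and head.
        queries = self.latents @ self.w_query                 # (m, d)
        weights = softmax(queries @ keys.T / np.sqrt(self.head_dim))  # (m, n)
        compact_keys = (weights @ keys) @ self.w_key_out      # (m, d)
        compact_values = (weights @ values) @ self.w_value_out
        return compact_keys, compact_values

rng = np.random.default_rng(1)
full_keys = rng.normal(size=(1024, 64))
full_values = rng.normal(size=(1024, 64))
compactor = LayerCompactor(head_dim=64, num_latents=128)
small_keys, small_values = compactor.compact(full_keys, full_values)
```

Here 1024 cached positions are reduced to 128 synthesized entries, an 8x compression. The compression ratio is set by the number of latents, and the cost is one cross-attention pass rather than a per-context optimization.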

In practice

Apply Still for 8x to 200x KV cache compression.
Use Still at context lengths from 8k to 128k, the range covered in the evaluation.
Integrate Still for improved free-form summarization.
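
To get a feel for what 8x and 200x compression mean in memory terms, the arithmetic below estimates KV cache size before and after compaction. The model shape (32 layers, 8 KV heads, head dimension 128, 2 bytes per value) is a hypothetical configuration chosen for illustration, not one taken from the article, and the estimate ignores the compactor's own parameters and activations.

```python
import math

def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2):
    # Two tensors (keys and values) per layer, one row per cached position.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

def compacted_len(seq_len, ratio):
    # Number of cache entries left after compressing by `ratio`.
    return math.ceil(seq_len / ratio)

# Hypothetical model shape (assumption, not from the article).
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)
context = 128 * 1024  # 128k tokens

full_bytes = kv_cache_bytes(context, **shape)
bytes_at_8x = kv_cache_bytes(compacted_len(context, 8), **shape)
bytes_at_200x = kv_cache_bytes(compacted_len(context, 200), **shape)

print(f"full:  {full_bytes / 2**30:.1f} GiB")
print(f"8x:    {bytes_at_8x / 2**30:.1f} GiB")
print(f"200x:  {bytes_at_200x / 2**20:.1f} MiB")
```

Under these assumed dimensions, a 128k-token cache takes 16 GiB, which drops to 2 GiB at 8x and to 82 MiB (656 entries) at 200x.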

Topics

KV Cache Compaction
Language Models
Perceiver Models
Memory Optimization
Long-Context NLP
Qwen
Gemma

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.