Forget, Then Recall: Learnable Compression and Selective Unfolding via Gist Sparse Attention
Summary
Gist Sparse Attention (GSA) is a novel framework designed to scale large language models (LLMs) to long contexts by addressing the quadratic computational cost of full attention. GSA introduces interleaved "gist tokens" that serve as learnable summaries of raw token chunks and act as routing signals for sparse attention. The method involves compressing context into these gist tokens, selecting the most relevant gists based on query affinity, and then selectively "unfolding" the corresponding raw chunks for detailed attention. This coarse-to-fine mechanism combines compact global representations with targeted access to fine-grained evidence, integrated end-to-end into training without requiring architectural modifications or external retrieval modules. GSA also extends hierarchically via recursive "gist-of-gist" construction, enabling multi-resolution context access with logarithmic per-step decoding complexity. Empirical results on LongBench and RAG benchmarks show GSA consistently outperforms other compression and inference-time sparse attention baselines across compression ratios from 8x to 32x.
Key takeaway
For NLP engineers and research scientists building long-context LLMs, GSA offers an end-to-end trainable way to reduce the quadratic cost of full attention. Combining gist-based compression with selective unfolding is reported to outperform compression and inference-time sparse attention baselines on LongBench and RAG benchmarks, especially in multi-document settings, where GSA filters out distractors effectively. For extremely long sequences, consider the hierarchical variant, which brings per-step decoding cost down to logarithmic (log-linear over the full sequence) without architectural changes or external retrieval modules.
Key insights
GSA uses learnable gist tokens to compress context and selectively unfold relevant raw chunks, achieving efficient long-context LLM processing.
Principles
- Compression can serve as a routing signal for sparse attention.
- Hierarchical summarization (gist-of-gist) brings per-step decoding cost down to logarithmic, i.e. log-linear over the full sequence.
- Learned representations outperform fixed statistical summaries for selection.
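To make the hierarchical principle concrete, here is a toy cost model, not taken from the paper: it assumes every level compresses by the same ratio, that the top level is read in full, and that each descent unfolds `top_k` nodes into `ratio` children. The function name `per_step_cost` and all parameters are hypothetical.

```python
import math

def per_step_cost(n_tokens, ratio, top_k):
    """Toy count of entries attended in one decoding step (assumed model)."""
    # Level 0 holds the raw tokens; each higher level holds one gist
    # per `ratio` entries of the level below (assumption: uniform ratio).
    sizes = [n_tokens]
    while sizes[-1] > ratio:
        sizes.append(math.ceil(sizes[-1] / ratio))
    levels = len(sizes) - 1

    # Read the whole top level, then at every descent unfold
    # top_k selected nodes into their `ratio` children.
    cost = sizes[-1] + levels * top_k * ratio
    return levels, cost

levels, cost = per_step_cost(n_tokens=2**20, ratio=16, top_k=4)
print(levels, cost)
```

Under these assumptions, a context of 2^20 tokens with ratio 16 and top_k 4 gives 4 levels and 272 attended entries per step, against 1,048,576 for full attention. The number of levels grows with the logarithm of the context length, which is where the logarithmic per-step cost comes from.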
Method
GSA compresses context into interleaved gist tokens, scores each gist's relevance to the query, selects the top-k gists, and unfolds the raw tokens of their chunks, so that attention runs over both the compact gists and the selected raw tokens. The whole process is trainable end to end and can optionally be extended with hierarchical gist-of-gist construction.
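The compress, select, and unfold steps can be sketched for a single query and a single head in NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the gists here are mean-pooled chunks standing in for the learned gist tokens, the hybrid attention is assumed to cover all gists plus the unfolded raw tokens, and the name `gist_sparse_attention` and its parameters are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def gist_sparse_attention(query, keys, values, chunk_size, top_k):
    """Single-query sketch: compress -> select -> unfold -> attend."""
    n, d = keys.shape
    assert n % chunk_size == 0, "sketch assumes evenly sized chunks"
    num_chunks = n // chunk_size
    assert 1 <= top_k <= num_chunks

    # Compress: one gist per chunk. Mean pooling is a placeholder
    # (assumption); GSA learns its gist tokens end to end.
    gist_keys = keys.reshape(num_chunks, chunk_size, d).mean(axis=1)
    gist_values = values.reshape(num_chunks, chunk_size, d).mean(axis=1)

    # Select: score each gist against the query, keep the top-k chunks.
    scores = gist_keys @ query / np.sqrt(d)
    selected = np.sort(np.argsort(scores)[-top_k:])

    # Unfold: gather the raw token positions of the selected chunks.
    raw_idx = (selected[:, None] * chunk_size + np.arange(chunk_size)).ravel()

    # Attend over all gists plus the unfolded raw tokens (assumed layout).
    k = np.concatenate([gist_keys, keys[raw_idx]], axis=0)
    v = np.concatenate([gist_values, values[raw_idx]], axis=0)
    weights = softmax(k @ query / np.sqrt(d))
    return weights @ v, selected
```

With `num_chunks` gists and `top_k * chunk_size` unfolded tokens, the query attends to far fewer entries than the full context whenever `top_k` is small relative to the number of chunks.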
In practice
- Implement gist tokens as learned routing signals for sparse attention.
- Use adaptive top-k selection for stable performance across tasks.
- Consider hierarchical gist-of-gist for very high compression ratios.
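This summary does not say how the paper adapts k, so the following is only one plausible scheme, labeled as an assumption: keep the smallest set of gists whose softmax mass reaches a threshold, bounded by `k_min` and `k_max`. The function `adaptive_top_k` and its parameters are hypothetical.

```python
import numpy as np

def adaptive_top_k(scores, mass=0.9, k_min=1, k_max=8):
    """Pick gist indices by cumulative softmax mass (assumed scheme)."""
    scores = np.asarray(scores, dtype=float)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()

    # Rank gists from most to least relevant and accumulate their mass.
    order = np.argsort(-probs)
    cum = np.cumsum(probs[order])

    # Smallest k whose cumulative mass reaches the threshold,
    # clipped to [k_min, k_max] and to the number of gists.
    k = int(np.searchsorted(cum, mass)) + 1
    k = max(k_min, min(k, k_max, len(scores)))
    return np.sort(order[:k])
```

A peaked score distribution then unfolds a single chunk, while a flat one unfolds up to `k_max` chunks, so the attention budget follows how concentrated the evidence is.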
Topics
- Gist Sparse Attention
- Selective Unfolding
- Hierarchical Compression
- Long-Context LLMs
- Sparse Attention
Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.