Forget, Then Recall: Learnable Compression and Selective Unfolding via Gist Sparse Attention

2026-03-29 · Source: cs.LG updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

Gist Sparse Attention (GSA) is a novel framework designed to scale large language models (LLMs) to long contexts by addressing the quadratic computational cost of full attention. GSA introduces interleaved "gist tokens" that serve as learnable summaries of raw token chunks and act as routing signals for sparse attention. The method involves compressing context into these gist tokens, selecting the most relevant gists based on query affinity, and then selectively "unfolding" the corresponding raw chunks for detailed attention. This coarse-to-fine mechanism combines compact global representations with targeted access to fine-grained evidence, integrated end-to-end into training without requiring architectural modifications or external retrieval modules. GSA also extends hierarchically via recursive "gist-of-gist" construction, enabling multi-resolution context access with logarithmic per-step decoding complexity. Empirical results on LongBench and RAG benchmarks show GSA consistently outperforms other compression and inference-time sparse attention baselines across compression ratios from 8x to 32x.

Key takeaway

For NLP engineers and research scientists developing long-context LLMs, GSA offers an end-to-end trainable way to reduce the quadratic cost of full attention. By combining gist-based compression with selective unfolding, it outperforms the compression and inference-time sparse attention baselines it was compared against on LongBench and RAG benchmarks, especially in multi-document settings, where it filters out distractors. For extremely long sequences, consider the hierarchical variant, which keeps per-step decoding cost logarithmic in context length and requires no architectural changes.

Key insights

GSA uses learnable gist tokens to compress context and selectively unfold relevant raw chunks, achieving efficient long-context LLM processing.

Principles

Compression can serve as a routing signal for sparse attention.
Hierarchical summarization enables logarithmic per-step decoding complexity.
Learned representations outperform fixed statistical summaries for selection.
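
The complexity claim for hierarchical summarization can be made concrete with a rough count. The sketch below assumes a hypothetical gist tree in which every level summarizes `chunk_size` items into one gist and each decoding step unfolds `top_k` chunks per level on the way down. The function name, the tree layout, and the parameter values are illustrative assumptions, not details taken from the paper.

```python
import math

def tokens_per_step(context_len, chunk_size, top_k):
    """Count attended tokens per decoding step in a hypothetical gist tree."""
    levels = 0
    size = context_len
    # Each level summarizes chunk_size items into one gist,
    # repeating until the top level fits in a single chunk.
    while size > chunk_size:
        size = math.ceil(size / chunk_size)
        levels += 1
    # Attend to all top-level gists, then unfold top_k chunks
    # of chunk_size items at each level below (assumed scheme).
    return size + levels * top_k * chunk_size

for n in (4_096, 65_536, 1_048_576):
    print(n, tokens_per_step(n, chunk_size=16, top_k=4))
```

Under these assumptions the count grows by a fixed amount each time the context grows by a factor of `chunk_size`, which is the logarithmic behavior the principle refers to, whereas full attention would touch every one of the `n` tokens.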

Method

GSA compresses context into interleaved gist tokens, scores their relevance to a query, selects top-k gists, and unfolds their raw tokens for hybrid attention. This process is end-to-end trainable, optionally with hierarchical gist-of-gist construction.
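
The compress, select, unfold pipeline described above can be sketched for a single query and a single attention head. This is a minimal illustration, not the authors' implementation: mean pooling stands in for the learned gist tokens, and the function name `gist_sparse_attention` and its parameters are assumptions made for the example.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def gist_sparse_attention(query, keys, values, chunk_size, top_k):
    """Single-query, single-head sketch of compress -> select -> unfold."""
    n, d = keys.shape
    assert n % chunk_size == 0, "sketch assumes the context divides evenly into chunks"
    num_chunks = n // chunk_size

    # 1) Compress: one gist per chunk. Mean pooling is a stand-in (assumption);
    #    in GSA the gists are learned tokens trained end-to-end.
    gist_keys = keys.reshape(num_chunks, chunk_size, d).mean(axis=1)
    gist_values = values.reshape(num_chunks, chunk_size, d).mean(axis=1)

    # 2) Select: score each gist against the query and keep the top-k chunks.
    gist_scores = gist_keys @ query / np.sqrt(d)
    k = min(top_k, num_chunks)
    selected = np.sort(np.argsort(gist_scores)[-k:])

    # 3) Unfold: gather the raw token positions of the selected chunks.
    raw_idx = np.concatenate(
        [np.arange(c * chunk_size, (c + 1) * chunk_size) for c in selected]
    )

    # 4) Hybrid attention over all gists plus the unfolded raw tokens.
    attn_keys = np.concatenate([gist_keys, keys[raw_idx]], axis=0)
    attn_values = np.concatenate([gist_values, values[raw_idx]], axis=0)
    weights = softmax(attn_keys @ query / np.sqrt(d))
    return weights @ attn_values, selected
```

With a context of `n` tokens, the query attends to `n / chunk_size` gists plus `top_k * chunk_size` raw tokens instead of all `n`, which is where the savings come from in this simplified setting.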

In practice

Implement gist tokens as learned routing signals for sparse attention.
Use adaptive top-k selection for stable performance across tasks.
Consider hierarchical gist-of-gist for very high compression ratios.
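
This summary does not say how k is adapted, so the following is only one hypothetical rule for adaptive top-k selection: keep the fewest gists whose softmax probability mass reaches a threshold, clamped to a fixed range. The name `adaptive_top_k` and the parameters `mass`, `k_min`, and `k_max` are assumptions for illustration and may differ from what the paper or the repository does.

```python
import numpy as np

def adaptive_top_k(gist_scores, mass=0.9, k_min=1, k_max=8):
    """Pick the fewest gists whose softmax probability mass reaches `mass`."""
    scores = np.asarray(gist_scores, dtype=float)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    order = np.argsort(-probs)              # most relevant gist first
    cumulative = np.cumsum(probs[order])
    # First position where the running mass reaches the threshold.
    k = int(np.searchsorted(cumulative, mass) + 1)
    # Clamp to the allowed range and to the number of gists available.
    k = max(k_min, min(k, k_max, len(scores)))
    return order[:k]
```

A sharply peaked score distribution then unfolds a single chunk, while a flat one unfolds more, up to `k_max`. That is one way a selection budget could track how concentrated the query's evidence is.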

Topics

Gist Sparse Attention
Selective Unfolding
Hierarchical Compression
Long-Context LLMs
Sparse Attention

Code references

yuzhenmao/gist-sparse-attention

Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.