Memory Sparse Attention to Get to 100 Million Tokens in Context
Summary
Two recent papers examine how training and architecture choices affect large language model performance. The first, "Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?", investigates how self-distillation, particularly when the teacher model is guided by full solutions, can inadvertently suppress the student model's explicit uncertainty markers and shorten its reasoning traces. This suppression, observed in models such as DeepSeek-R1-Distill-Qwen-7B, can sharply degrade mathematical reasoning, with AIME24 scores dropping from 54.79 to 20.21.
The second paper, "MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens", introduces Memory Sparse Attention (MSA), a latent-memory transformer designed for extremely long contexts of up to 100M tokens. MSA replaces dense self-attention with document-level sparse attention and uses document-wise RoPE so that it scales efficiently. Built on a Qwen3-4B backbone, MSA maintains 94.84% accuracy at 1M tokens on RULER NIAH, compared with 24.69% for the Qwen3-4B baseline.
Key takeaway
If you are an AI engineer or research scientist developing or fine-tuning LLMs for complex reasoning tasks, treat explicit uncertainty markers as important for robust performance. When applying self-distillation, evaluate how teacher guidance changes the student's reasoning style, especially on mathematical problems. If your application must process contexts far beyond typical LLM limits, consider Memory Sparse Attention (MSA) as an architecture for scaling to 100M tokens while maintaining retrieval fidelity.
Key insights
Self-distillation can harm reasoning by suppressing uncertainty, while sparse attention enables extreme context scaling.
Principles
- Uncertainty expression is a valuable capability in LLM reasoning.
- Teacher guidance with full solutions can over-compress student reasoning.
- Document-wise positional encoding aids generalization to large memory banks.
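The last principle can be illustrated with a minimal sketch. It assumes that "document-wise" means position indices restart at zero for every document in the memory bank, so the largest index the model sees depends on document length rather than on total memory size. The function names, the head dimension, and the RoPE base of 10000 are illustrative assumptions, not details taken from the MSA paper.

```python
import numpy as np

def global_positions(doc_lengths):
    """Conventional scheme: one running index across the whole memory bank."""
    return np.arange(sum(doc_lengths))

def document_wise_positions(doc_lengths):
    """Assumed document-wise scheme: the index restarts at 0 for each document."""
    return np.concatenate([np.arange(n) for n in doc_lengths])

def rope_angles(positions, head_dim=8, base=10000.0):
    """Standard RoPE rotation angles, one row per token (illustrative sizes)."""
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    return np.outer(positions, inv_freq)

doc_lengths = [3, 2, 4]  # hypothetical memory bank of three short documents
global_pos = global_positions(doc_lengths)
doc_pos = document_wise_positions(doc_lengths)
angles = rope_angles(doc_pos)

print(global_pos)  # grows with the size of the memory bank
print(doc_pos)     # bounded by the longest single document
```

Under this assumption, adding more documents never pushes position indices past what the model saw in training, which is one plausible reason such an encoding would generalize to larger memory banks.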
Method
MSA uses document-level sparse attention and document-wise RoPE within a transformer. The model routes to the relevant documents and loads their compressed KV state, and the whole pipeline is trained end to end.
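The routing step can be sketched roughly as follows. This is a toy NumPy illustration of top-k document routing followed by attention over only the selected documents' compressed key/value slots. It is not the paper's implementation: the per-document routing keys, the dot-product scoring rule, the `top_k` value, and all function names are assumptions made for the example.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def route_documents(query, doc_keys, top_k):
    """Score every document against the query and keep the top_k indices."""
    scores = doc_keys @ query
    return np.argsort(scores)[::-1][:top_k]

def sparse_memory_attention(query, doc_keys, doc_kv, top_k=2):
    """Attend only over the compressed KV slots of the routed documents."""
    selected = route_documents(query, doc_keys, top_k)
    keys = np.concatenate([doc_kv[i][0] for i in selected], axis=0)
    values = np.concatenate([doc_kv[i][1] for i in selected], axis=0)
    weights = softmax(keys @ query / np.sqrt(query.shape[0]))
    return weights @ values, selected

# Hypothetical memory bank: 4 documents, 2 compressed KV slots each, dim 4.
rng = np.random.default_rng(0)
dim, slots = 4, 2
doc_keys = np.eye(4)  # one routing key per document (assumed)
doc_kv = [
    (rng.normal(size=(slots, dim)), np.full((slots, dim), float(i)))
    for i in range(4)
]

query = np.array([0.1, 0.9, 0.5, 0.0])
output, selected = sparse_memory_attention(query, doc_keys, doc_kv, top_k=2)
print(selected)  # indices of the documents that were routed to
print(output)    # mixture of values from the selected documents only
```

The point of the sketch is the cost profile: attention touches `top_k * slots` entries instead of every token in memory, so the per-query work stays small as the memory bank grows.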
In practice
- Preserve uncertainty markers in LLM training data.
- Consider MSA for applications that need contexts of up to 100M tokens.
- Avoid teacher guidance in self-distillation that pushes the student toward overly concise reasoning.
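One simple way to act on the first and third points is to compare reasoning traces before and after distillation. The sketch below counts uncertainty markers and words in a trace. The marker list is a hypothetical example chosen for illustration and is not the set used in the paper; a real audit would need a list tuned to the model's own style.

```python
import re

# Hypothetical marker list; adjust to the phrases your model actually uses.
UNCERTAINTY_MARKERS = ("wait", "hmm", "maybe", "perhaps", "let me check", "not sure")

def trace_stats(trace):
    """Return the word count and uncertainty-marker count of a reasoning trace."""
    text = trace.lower()
    markers = sum(
        len(re.findall(r"\b" + re.escape(marker) + r"\b", text))
        for marker in UNCERTAINTY_MARKERS
    )
    return {"words": len(text.split()), "markers": markers}

# Toy traces standing in for outputs before and after self-distillation.
before = "Wait, maybe the sum is 12. Let me check: 5 + 7 = 12. Hmm, yes."
after = "The sum is 12."

stats_before = trace_stats(before)
stats_after = trace_stats(after)
print(stats_before)
print(stats_after)
```

A sharp fall in both numbers after distillation would be a cue to re-check reasoning benchmarks before shipping the distilled model.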
Topics
- Self-Distillation
- Large Language Models
- Mathematical Reasoning
- Memory Sparse Attention
- Long-Context Transformers
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by The Salt - Curated AI.