Memory Sparse Attention to Get to 100 Million Tokens in Context

2024-03-06 · Source: The Salt - Curated AI · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Two recent papers examine reasoning quality and long-context scaling in large language models. The first, "Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?", finds that self-distillation, particularly when the teacher model is guided by full solutions, can suppress the student model's explicit uncertainty markers and shorten its reasoning traces. In models such as DeepSeek-R1-Distill-Qwen-7B, this suppression degrades mathematical reasoning, with AIME24 scores dropping from 54.79 to 20.21. The second paper, "MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens", introduces Memory Sparse Attention (MSA), a latent-memory transformer designed for contexts of up to 100M tokens. MSA replaces dense self-attention with document-level sparse attention and uses document-wise RoPE to scale efficiently. Built on a Qwen3-4B backbone, MSA maintains 94.84% accuracy at 1M tokens on RULER NIAH, compared with 24.69% for the Qwen3-4B baseline.
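To put the reported numbers in proportion, the short calculation below derives the absolute and relative size of the AIME24 drop and the accuracy gap on RULER NIAH. It uses only the four figures quoted in the summary; the helper name `drop_stats` is made up for this illustration.

```python
def drop_stats(before, after):
    """Return (absolute drop in points, relative drop as a percentage of `before`)."""
    absolute = before - after
    relative = 100.0 * absolute / before
    return round(absolute, 2), round(relative, 1)


# AIME24 scores quoted in the summary: 54.79 before, 20.21 after self-distillation.
aime_abs, aime_rel = drop_stats(54.79, 20.21)

# RULER NIAH accuracy at 1M tokens: MSA 94.84% versus the Qwen3-4B baseline 24.69%.
msa_gap = round(94.84 - 24.69, 2)

print(f"AIME24 drop: {aime_abs} points ({aime_rel}% relative)")
print(f"RULER NIAH gap at 1M tokens: {msa_gap} percentage points")
```

In other words, the distilled model loses 34.58 points, roughly 63% of its original AIME24 score, and MSA leads the baseline by 70.15 percentage points at 1M tokens.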

Key takeaway

AI engineers and research scientists developing or fine-tuning LLMs for complex reasoning tasks should treat explicit uncertainty markers as important to robust performance. When applying self-distillation, evaluate how teacher guidance changes the student's reasoning style, especially on mathematical problems. If your application needs contexts far beyond typical LLM limits, consider Memory Sparse Attention (MSA) as an architecture for scaling to 100M tokens while maintaining retrieval fidelity.

Key insights

Self-distillation can harm reasoning by suppressing uncertainty, while sparse attention enables extreme context scaling.

Principles

Uncertainty expression is a valuable capability in LLM reasoning.
Teacher guidance with full solutions can over-compress student reasoning.
Document-wise positional encoding aids generalization to large memory banks.
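The last principle can be made concrete with a small sketch. It assumes that "document-wise RoPE" means position indices restart at zero for each document, so the rotation a token receives depends on its offset inside its own document and not on how many documents precede it. That reading is an assumption made for illustration; the summary does not spell out the paper's exact scheme. The function names `document_position_ids` and `rope_rotate` are hypothetical, and the rotation itself is the standard rotary formula applied to consecutive pairs of dimensions.

```python
import math


def document_position_ids(doc_lengths):
    """Assumed scheme: positions restart at 0 at every document boundary."""
    ids = []
    for length in doc_lengths:
        ids.extend(range(length))
    return ids


def rope_rotate(vec, position, base=10000.0):
    """Apply a standard rotary embedding to one vector at a given position."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector dimension must be even")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # frequency for this pair of dims
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


# Two documents of lengths 3 and 2 stored in one memory bank.
positions = document_position_ids([3, 2])

# The same key vector at offset 1 of either document gets the same rotation.
key = [1.0, 2.0, 3.0, 4.0]
rotated_doc1 = rope_rotate(key, positions[1])
rotated_doc2 = rope_rotate(key, positions[4])
```

Under this assumption, the position values seen by the model never exceed the longest single document, whatever the size of the memory bank, which is one plausible reason such an encoding would generalize to larger banks.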

Method

MSA uses document-level sparse attention and document-wise RoPE within a transformer, routing to relevant documents and loading compressed KV state for end-to-end training.
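The route-then-attend idea can be sketched as a toy in plain Python. This is not MSA's implementation: it assumes each document is summarized by the mean of its key vectors, that routing picks the top-k documents by dot product with the query, and that attention then runs only over the tokens of the selected documents. The summary rule, the routing score, and all names (`route_top_k`, `sparse_attend`, the `keys`/`values` layout) are hypothetical choices for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors):
    """Stand-in for a compressed per-document state (assumption: mean of keys)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def route_top_k(query, doc_summaries, k):
    """Return indices of the k documents whose summaries best match the query."""
    scores = [dot(query, s) for s in doc_summaries]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def sparse_attend(query, docs, k):
    """Softmax attention restricted to the tokens of the routed documents."""
    summaries = [mean_vector(d["keys"]) for d in docs]
    selected = route_top_k(query, summaries, k)
    keys = [key for i in selected for key in docs[i]["keys"]]
    values = [val for i in selected for val in docs[i]["values"]]
    scale = math.sqrt(len(query))
    logits = [dot(query, key) / scale for key in keys]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]
    return selected, output


docs = [
    {"keys": [[1.0, 0.0], [0.9, 0.1]], "values": [[1.0, 0.0], [1.0, 0.0]]},
    {"keys": [[-1.0, 0.0]], "values": [[0.0, 1.0]]},
    {"keys": [[0.0, 1.0]], "values": [[0.0, 1.0]]},
]
selected, output = sparse_attend([1.0, 0.0], docs, k=1)
```

With `k=1` the query is routed to the first document only, so the tokens of the other two documents never enter the softmax. The cost of the attention step then scales with the size of the selected documents instead of the whole memory bank.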

In practice

Preserve uncertainty markers in LLM training data.
Consider MSA for applications that need contexts of up to 100M tokens.
Avoid teacher guidance that over-compresses the student's reasoning in self-distillation.
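One way to act on the first and third items is to compare reasoning traces before and after distillation. The sketch below counts uncertainty markers and words in each trace and reports the change. The marker list is an illustrative guess, not the set used in the paper, and the names `count_markers`, `trace_profile`, and `suppression_report` are hypothetical.

```python
import re

# Illustrative marker list (assumption); extend or replace it for your own data.
UNCERTAINTY_MARKERS = ("wait", "hmm", "maybe", "perhaps", "not sure", "let me check")


def count_markers(trace, markers=UNCERTAINTY_MARKERS):
    """Count whole-phrase, case-insensitive occurrences of each marker."""
    text = trace.lower()
    return sum(
        len(re.findall(r"\b" + re.escape(marker) + r"\b", text))
        for marker in markers
    )


def trace_profile(trace):
    return {"markers": count_markers(trace), "words": len(trace.split())}


def suppression_report(before, after):
    """Compare a reference trace with a distilled one."""
    b, a = trace_profile(before), trace_profile(after)
    ratio = a["words"] / b["words"] if b["words"] else None
    return {"marker_delta": a["markers"] - b["markers"], "length_ratio": ratio}


before = "Wait, maybe x = 3. Let me check: 3 + 2 = 5. Hmm, yes."
after = "x = 3, so 3 + 2 = 5."
report = suppression_report(before, after)
```

A strongly negative `marker_delta` together with a `length_ratio` well below 1 across many samples would suggest the distilled model has lost the hedging and trace length that the summary associates with stronger mathematical reasoning.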

Topics

Self-Distillation
Large Language Models
Mathematical Reasoning
Memory Sparse Attention
Long-Context Transformers

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by The Salt - Curated AI.