Do Language Models Need Sleep? Offline Recurrence for Improved Online Inference
Summary
Transformer-based large language models are increasingly used for long-horizon tasks; however, their attention mechanism scales poorly with context length. This research introduces a "sleep-like" consolidation mechanism where a model periodically converts recent context into persistent fast weights before clearing its key-value cache. During this "sleep," the model performs N offline recurrent passes over accumulated context, updating fast weights in its state-space model (SSM) blocks via a learned local rule. This approach shifts extra computation to the sleep phase, preserving the latency of wake-time prediction. The method was validated on controlled synthetic tasks, including cellular automata and multi-hop graph retrieval, and a realistic math reasoning task (GSM-Infinite), where standard transformers and SSM-attention hybrid models often fail. Increasing the sleep duration (N) consistently improved performance, with the most significant gains observed on examples requiring deeper reasoning.
Key takeaway
Machine Learning Engineers developing long-context LLMs should consider integrating a "sleep-like" offline recurrence mechanism. The model consolidates context into fast weights during a dedicated sleep phase before that context is evicted from the key-value cache, so it can still reason over the evicted material while wake-time prediction latency is preserved. In the reported experiments, increasing the number of recurrent passes (N) consistently improved accuracy, with the largest gains on examples requiring deeper reasoning.
Key insights
Offline recurrent passes during a "sleep" phase enable LLMs to consolidate evicted context into fast weights, improving deep reasoning without increasing inference latency.
Principles
- Deep reasoning requires scalable computation, not just memory capacity.
- Offline recurrence can transform transient context into useful internal state.
- Increasing "sleep duration" (N, the number of offline recurrent passes) improves reasoning performance, most of all on examples requiring deeper reasoning.
Method
When the context window fills, the model enters "sleep," performing N recurrent passes over accumulated context to update fast weights in SSM blocks via a learned local rule. The KV cache is then cleared.
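The consolidation step described above can be sketched as a toy example. This is an illustrative sketch only, not the paper's implementation: the summary does not specify the learned local rule, so a plain delta rule stands in for it, and the names `local_update`, `sleep`, `recall_error`, and the learning rate `lr` are all hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def zeros(d: int) -> Matrix:
    """Create a d x d fast-weight matrix initialised to zero."""
    return [[0.0] * d for _ in range(d)]


def matvec(W: Matrix, x: Vector) -> Vector:
    """Multiply the fast-weight matrix by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def local_update(W: Matrix, key: Vector, value: Vector, lr: float) -> None:
    """ASSUMPTION: a delta rule stands in for the learned local rule.

    It nudges W so that W @ key moves toward value, using only quantities
    local to this (key, value) pair.
    """
    pred = matvec(W, key)
    for i in range(len(W)):
        err = value[i] - pred[i]
        for j in range(len(key)):
            W[i][j] += lr * err * key[j]


def sleep(W: Matrix, kv_cache: List[Tuple[Vector, Vector]],
          n_passes: int, lr: float = 0.5) -> Matrix:
    """Run n_passes offline passes over the cached context, then clear the cache."""
    for _ in range(n_passes):
        for key, value in kv_cache:
            local_update(W, key, value, lr)
    kv_cache.clear()  # context now lives only in the fast weights
    return W


def recall_error(W: Matrix, pairs: List[Tuple[Vector, Vector]]) -> float:
    """Sum of squared errors when recalling each value from its key."""
    total = 0.0
    for key, value in pairs:
        pred = matvec(W, key)
        total += sum((v - p) ** 2 for v, p in zip(value, pred))
    return total
```

In this toy setting, running `sleep` with a larger `n_passes` on the same cache leaves a smaller `recall_error`, which mirrors the qualitative claim that longer sleep helps. It says nothing about the magnitude of the effect in a real model.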
In practice
- Implement N offline recurrent passes for memory consolidation.
- Apply to hybrid SSM-attention LLMs for long-context tasks.
- Fine-tune pre-trained models like Jet-Nemotron or Ouro with this mechanism.
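One way to picture the wake/sleep schedule behind these steps is a buffer that triggers consolidation whenever the context window fills. The sketch below is a hypothetical scaffold, not an API from the paper or from any library: `WakeSleepBuffer`, `max_context`, and the `consolidate` callback are assumed names, and the callback is where a real model would run its N offline recurrent passes.

```python
from typing import Callable, List


class WakeSleepBuffer:
    """Hypothetical scheduler: accumulate tokens while awake, sleep when full."""

    def __init__(self, max_context: int, n_passes: int,
                 consolidate: Callable[[List[str], int], None]) -> None:
        self.max_context = max_context   # assumed context-window size
        self.n_passes = n_passes         # N, the sleep duration
        self.consolidate = consolidate   # stand-in for the fast-weight update
        self.context: List[str] = []     # stand-in for the KV cache
        self.sleep_count = 0

    def observe(self, token: str) -> None:
        """Wake phase: append one token; enter sleep if the window is full."""
        self.context.append(token)
        if len(self.context) >= self.max_context:
            self._sleep()

    def _sleep(self) -> None:
        """Sleep phase: consolidate the accumulated context, then clear it."""
        self.consolidate(list(self.context), self.n_passes)
        self.context.clear()
        self.sleep_count += 1


# Usage: record what each sleep phase would consolidate.
log = []
buf = WakeSleepBuffer(max_context=3, n_passes=4,
                      consolidate=lambda ctx, n: log.append((ctx, n)))
for tok in ["a", "b", "c", "d", "e", "f", "g"]:
    buf.observe(tok)
```

The extra work scales with `n_passes` but happens only inside `_sleep`, so the per-token cost of `observe` between sleeps is unchanged. That is the design point the summary emphasizes.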
Topics
- Large Language Models
- Transformer Architecture
- State-Space Models
- Memory Consolidation
- Offline Recurrence
- Long-Context Processing
- Inference Optimization
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.