Do Language Models Need Sleep? Offline Recurrence for Improved Online Inference

2026-06-08 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, extended

Summary

Transformer-based large language models are increasingly used for long-horizon tasks, but the compute and key-value (KV) cache memory required by attention grow with context length. This research introduces a "sleep-like" consolidation mechanism in which a model periodically converts recent context into persistent fast weights before clearing its KV cache. During this "sleep," the model performs N offline recurrent passes over the accumulated context, updating fast weights in its state-space model (SSM) blocks via a learned local rule. The extra computation is thereby shifted to the sleep phase, preserving the latency of wake-time prediction. The method was validated on controlled synthetic tasks (cellular automata and multi-hop graph retrieval) and on a realistic math reasoning task (GSM-Infinite), settings where standard transformers and SSM-attention hybrid models often fail. Increasing the sleep duration (N) consistently improved performance, with the largest gains on examples requiring deeper reasoning.

Key takeaway

Machine learning engineers building long-context LLMs should consider a "sleep-like" offline recurrence mechanism. By consolidating context into fast weights during a dedicated phase, the model can still reason over context that has been evicted from the KV cache, while wake-time prediction latency is preserved. In the reported experiments, increasing the number of recurrent passes (N) consistently improved performance, most of all on examples requiring deeper reasoning.

Key insights

Offline recurrent passes during a "sleep" phase let LLMs consolidate context into fast weights before it is evicted from the KV cache, improving deep reasoning without adding to wake-time prediction latency.

Principles

Deep reasoning requires scalable computation, not just memory capacity.
Offline recurrence can transform transient context into useful internal state.
Increasing "sleep duration" (N) improves reasoning performance.
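
The principles above can be read as a rough cost accounting. This is a sketch under assumptions the summary does not state: the KV cache holds at most W tokens (W is a hypothetical symbol for the window size), sleep is triggered once per W tokens, and each recurrent pass costs time linear in W. Constants and model dimensions are ignored.

```latex
\begin{align*}
C_{\text{wake}} &= O(W)
  && \text{per-token attention over a cache of at most } W \text{ entries} \\
C_{\text{sleep}} &= O(N \cdot W)
  && N \text{ recurrent passes over } W \text{ cached tokens} \\
\frac{C_{\text{sleep}}}{W} &= O(N)
  && \text{sleep cost amortized per token}
\end{align*}
```

Under these assumptions the wake-time cost per token does not depend on N, so N acts as a dial for offline computation. Without eviction, by contrast, per-token attention cost at position t grows as O(t).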

Method

When the context window fills, the model enters "sleep," performing N recurrent passes over accumulated context to update fast weights in SSM blocks via a learned local rule. The KV cache is then cleared.
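
The control flow described above can be illustrated with a toy sketch. It is not the paper's implementation: the class `SleepyMemory`, its parameters, and the fixed delta-rule update are all hypothetical stand-ins (the paper uses a learned local rule inside SSM blocks), and the "context" here is just a list of key/value vectors.

```python
class SleepyMemory:
    """Toy wake/sleep loop: a bounded cache plus a persistent fast-weight matrix."""

    def __init__(self, dim, window, n_passes, lr=0.1):
        self.dim = dim
        self.window = window      # cache size that triggers sleep (assumed)
        self.n_passes = n_passes  # N: offline recurrent passes per sleep
        self.lr = lr              # step size of the stand-in update rule
        self.cache = []           # stands in for the KV cache
        self.fast_weights = [[0.0] * dim for _ in range(dim)]
        self.sleeps = 0

    def read(self, key):
        """Recall from fast weights: matrix-vector product W @ key."""
        return [sum(w * k for w, k in zip(row, key)) for row in self.fast_weights]

    def observe(self, key, value):
        """Wake phase: append to the cache, then sleep once the window is full."""
        self.cache.append((key, value))
        if len(self.cache) >= self.window:
            self.sleep()

    def sleep(self):
        """Run N passes over the cached context, then clear the cache."""
        for _ in range(self.n_passes):
            for key, value in self.cache:
                pred = self.read(key)
                # Delta rule (assumption, not the learned rule): nudge
                # W @ key toward value with an outer-product update.
                for i in range(self.dim):
                    err = value[i] - pred[i]
                    for j in range(self.dim):
                        self.fast_weights[i][j] += self.lr * err * key[j]
        self.cache.clear()
        self.sleeps += 1


def recall_error(n_passes):
    """Total absolute recall error after one sleep with the given N."""
    pairs = [([1.0, 0.0], [0.5, -1.0]), ([0.0, 1.0], [2.0, 1.0])]
    mem = SleepyMemory(dim=2, window=2, n_passes=n_passes)
    for key, value in pairs:
        mem.observe(key, value)
    # The cache is empty here, so recall relies on fast weights alone.
    return sum(
        abs(v - p)
        for key, value in pairs
        for v, p in zip(value, mem.read(key))
    )
```

In this toy, `recall_error(20)` is smaller than `recall_error(1)` because each extra pass moves the fast weights closer to the cached associations. That is a property of the stand-in rule and should not be read as evidence for the paper's results.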

In practice

Implement N offline recurrent passes for memory consolidation.
Apply to hybrid SSM-attention LLMs for long-context tasks.
Fine-tune pre-trained models like Jet-Nemotron or Ouro with this mechanism.

Topics

Large Language Models
Transformer Architecture
State-Space Models
Memory Consolidation
Offline Recurrence
Long-Context Processing
Inference Optimization

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.