Multi-Turn Reasoning When Context Arrives in Pieces: Scalable Sharding and Memory-Augmented RL
Summary
LLM accuracy can drop by up to 65% when users reveal task-critical information across multiple conversation turns, even with full context available. This "Lost in Conversation" degradation is significantly mitigated by training models to maintain a compact rolling memory instead of attending to a growing history. To enable scalable training, a low-cost sharding pipeline converts single-turn QA datasets into multi-turn fragmented-information episodes, eliminating manual annotation. Training solely on sharded GSM8K, the memory-augmented policy substantially improves multi-turn accuracy and generalizes zero-shot to harder math and out-of-domain long-context QA. These memory-trained models even outperform full-history baselines when given the full history at test time, indicating that learning compression fosters more robust incremental reasoning.
Key takeaway
If you build conversational AI and your LLMs struggle when task-critical context arrives incrementally across turns, consider memory-augmented policies. Training a model to maintain a compact rolling memory, with training episodes generated by a sharding pipeline, can substantially improve multi-turn accuracy and robustness. The approach fosters more effective incremental reasoning, outperforms full-history attention, and generalizes zero-shot to harder math and out-of-domain long-context QA.
Key insights
Training LLMs with compact rolling memory significantly improves multi-turn reasoning by mitigating "Lost in Conversation" degradation.
Principles
- Compact rolling memory mitigates "Lost in Conversation" degradation.
- Learning to compress induces more robust incremental reasoning.
- Memory-trained models can outperform full-history baselines.
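As a rough illustration of the rolling-memory idea, the sketch below processes a conversation one turn at a time while carrying only a bounded memory string forward. It is a minimal sketch under stated assumptions, not the paper's implementation: `run_episode`, `toy_updater`, and the `max_chars` budget are hypothetical names, and in a real system the update step would be a trained policy writing its own memory rather than string concatenation.

```python
from typing import Callable, List

# Hypothetical signature: (current_memory, new_turn) -> updated_memory.
MemoryUpdater = Callable[[str, str], str]


def run_episode(turns: List[str], update_memory: MemoryUpdater,
                max_chars: int = 200) -> str:
    """Fold a multi-turn conversation into a bounded rolling memory."""
    memory = ""
    for turn in turns:
        # The updater sees only the current memory and the newest turn,
        # never the full conversation history.
        memory = update_memory(memory, turn)
        # Assumed hard budget: keep at most max_chars characters.
        memory = memory[:max_chars]
    return memory


def toy_updater(memory: str, turn: str) -> str:
    """Stand-in for a learned policy: append the new turn to the memory."""
    return (memory + " " + turn).strip()


final_memory = run_episode(
    ["Tom has 3 apples.", "He buys 2 more.", "How many apples does he have?"],
    toy_updater,
)
```

The point of the structure is that the input to each step stays bounded regardless of conversation length; what the memory should contain is exactly what training is meant to teach.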
Method
A low-cost sharding pipeline converts single-turn QA datasets into multi-turn fragmented-information episodes, enabling scalable training without manual annotation for memory-augmented policies.
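To make the data format concrete, here is a minimal sketch of turning one single-turn QA item into a fragmented multi-turn episode. It assumes a naive sentence splitter as the sharding step, which is a stand-in chosen for illustration; the summary does not say how the actual pipeline produces its shards. `Episode` and `shard_question` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    turns: List[str]  # one fragment of the original question per turn
    answer: str       # gold answer, unchanged from the single-turn item


def shard_question(question: str, answer: str) -> Episode:
    """Split a single-turn question into per-turn fragments.

    Assumption: sentence boundaries are used as shard boundaries.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    parts = re.split(r"(?<=[.?!])\s+", question.strip())
    turns = [p.strip() for p in parts if p.strip()]
    return Episode(turns=turns, answer=answer)


episode = shard_question(
    "Tom has 3 apples. He buys 2 more. How many apples does he have?",
    "5",
)
```

Because the gold answer carries over unchanged, any single-turn dataset with verifiable answers can in principle supply a reward signal for multi-turn episodes without new labels.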
In practice
- Training on sharded GSM8K alone substantially improves multi-turn accuracy.
- The trained policy generalizes zero-shot to harder math problems.
- It also transfers to out-of-domain long-context QA.
Topics
- LLM Reasoning
- Multi-Turn Conversations
- Context Management
- Memory-Augmented RL
- Data Sharding
- GSM8K
- Zero-Shot Generalization
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.