Zipping the Thought: When and How Compressed Reasoning Data Works in LLM Post-Training
Summary
This research investigates the impact of compressed reasoning data on large language model (LLM) post-training, addressing the trade-off between performance and token cost. It introduces a taxonomy of Chain-of-Thought (CoT) reasoning: Explicit CoT (all operations), Composed CoT (combined operations), and Implicit CoT (omitted intermediates). Using a synthetic compositional reasoning task, experiments across various models revealed that coarser CoT requires more supervised fine-tuning (SFT) data. Composed and Implicit CoT benefit more from data scaling than Explicit CoT, with Composed CoT also gaining from data repetition, though Implicit CoT risks memorization. Notably, subsequent reinforcement learning with verifiable rewards (RLVR) can decompose compressed steps learned during SFT. Furthermore, unidirectional CoT ordering enhances generalization on longer sequential tasks, offering insights for CoT design under data constraints.
Key takeaway
For machine learning engineers optimizing LLM post-training with chain-of-thought data, match the level of CoT compression to your data budget. Coarser CoT requires more SFT data, so heavily compressed traces are a poor fit when data is limited. Prefer Composed CoT when you can scale or repeat data, and use Implicit CoT cautiously because it risks memorization. Consider running RLVR after SFT to refine and decompose the compressed reasoning steps learned during SFT, especially for complex tasks.
Key insights
The study shows how the level of CoT compression (Explicit, Composed, or Implicit) changes SFT data requirements and how models respond to data scaling, data repetition, and subsequent RLVR.
Principles
- Coarser CoT requires more SFT data.
- Composed CoT gains from both data scaling and data repetition.
- RLVR can decompose SFT-learned compressed steps.
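The last principle rests on how a verifiable reward is computed. The sketch below is a hypothetical illustration, not the paper's reward function: it assumes completions end with a line of the form `Answer: <integer>`, and the names `extract_final_answer` and `verifiable_reward` are invented here. It shows that such a reward checks only the final answer, so it places no constraint on how fine-grained the intermediate steps are.

```python
import re
from typing import Optional


def extract_final_answer(completion: str) -> Optional[int]:
    """Return the integer after a trailing 'Answer:' marker, or None.

    The 'Answer: <int>' output format is an assumption made for this sketch.
    """
    match = re.search(r"Answer:\s*(-?\d+)\s*$", completion.strip())
    return int(match.group(1)) if match else None


def verifiable_reward(completion: str, target: int) -> float:
    """Binary reward: 1.0 if the final answer matches the target, else 0.0.

    Intermediate reasoning lines are ignored entirely.
    """
    return 1.0 if extract_final_answer(completion) == target else 0.0


# A compressed trace and a fully decomposed trace earn the same reward.
compressed = "add3 then double: 2 -> 10\nAnswer: 10"
decomposed = "add3(2) = 5\ndouble(5) = 10\nAnswer: 10"
print(verifiable_reward(compressed, 10), verifiable_reward(decomposed, 10))
```

Because both traces score identically, a policy trained against this kind of reward is free to shift toward more explicit steps if doing so makes correct answers more likely. The sketch does not itself demonstrate that shift; it only shows the reward does not prevent it.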
Method
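The authors propose a taxonomy of CoT: Explicit (all operations shown), Composed (operations combined into coarser steps), and Implicit (intermediate steps omitted). A synthetic compositional reasoning task lets them vary both difficulty and compression level.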
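To make the three compression levels concrete, here is a minimal sketch on a toy task. The task (a chain of unary integer operations), the operation names, and the trace formats are all assumptions chosen for illustration; the paper's actual synthetic task may differ.

```python
from typing import Callable, List, Tuple

# Hypothetical toy task: apply a chain of named unary operations to an integer.
Op = Tuple[str, Callable[[int], int]]

OPS: List[Op] = [
    ("add3", lambda v: v + 3),
    ("double", lambda v: v * 2),
    ("sub1", lambda v: v - 1),
    ("square", lambda v: v * v),
]


def explicit_cot(x: int, ops: List[Op]) -> Tuple[List[str], int]:
    """Explicit CoT: one reasoning step per operation."""
    steps = []
    for name, fn in ops:
        y = fn(x)
        steps.append(f"{name}({x}) = {y}")
        x = y
    return steps, x


def composed_cot(x: int, ops: List[Op], group: int = 2) -> Tuple[List[str], int]:
    """Composed CoT: each step merges `group` consecutive operations."""
    steps = []
    for i in range(0, len(ops), group):
        chunk = ops[i:i + group]
        y = x
        for _, fn in chunk:
            y = fn(y)
        label = " then ".join(name for name, _ in chunk)
        steps.append(f"{label}: {x} -> {y}")
        x = y
    return steps, x


def implicit_cot(x: int, ops: List[Op]) -> Tuple[List[str], int]:
    """Implicit CoT: no intermediate steps, only the final answer."""
    for _, fn in ops:
        x = fn(x)
    return [], x


for build in (explicit_cot, composed_cot, implicit_cot):
    steps, answer = build(2, OPS)
    print(build.__name__, steps, "Answer:", answer)
```

All three builders reach the same final answer (81 for the input 2) while emitting four, two, and zero intermediate steps respectively, which is the token-cost side of the trade-off the study examines.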
In practice
- Prioritize Composed CoT for data scaling benefits.
- Consider RLVR to refine compressed SFT steps.
- Use unidirectional CoT ordering to improve generalization on longer sequential tasks.
Topics
- Chain-of-Thought
- LLM Post-Training
- Supervised Fine-Tuning
- Reinforcement Learning
- Reasoning Data Compression
- Model Generalization
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.