Zipping the Thought: When and How Compressed Reasoning Data Works in LLM Post-Training

2026-05-27 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, medium

Summary

This research investigates the impact of compressed reasoning data on large language model (LLM) post-training, addressing the trade-off between performance and token cost. It introduces a taxonomy of Chain-of-Thought (CoT) reasoning: Explicit CoT (all operations), Composed CoT (combined operations), and Implicit CoT (omitted intermediates). Using a synthetic compositional reasoning task, experiments across various models revealed that coarser CoT requires more supervised fine-tuning (SFT) data. Composed and Implicit CoT benefit more from data scaling than Explicit CoT, with Composed CoT also gaining from data repetition, though Implicit CoT risks memorization. Notably, subsequent reinforcement learning with verifiable rewards (RLVR) can decompose compressed steps learned during SFT. Furthermore, unidirectional CoT ordering enhances generalization on longer sequential tasks, offering insights for CoT design under data constraints.

Key takeaway

For machine learning engineers optimizing LLM post-training with chain-of-thought data, choose the level of CoT compression deliberately rather than by default. Coarser CoT requires more SFT data, so compression is harder to afford when data is limited. Prioritize Composed CoT, which benefits from both data scaling and data repetition, and use Implicit CoT cautiously because of its memorization risk. Consider running RLVR after SFT to refine and decompose the compressed reasoning steps the model has learned, especially for complex tasks.

Key insights

The study clarifies how different CoT compression types affect LLM post-training and data efficiency.

Principles

Coarser CoT requires more SFT data.
Composed CoT gains from both data scaling and data repetition.
RLVR can decompose SFT-learned compressed steps.
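
The difference between data scaling and data repetition can be sketched as two ways of spending the same number of training examples. This summary does not say how the paper configured its runs, so the sketch below is only one plausible setup: the function names `scaled_stream` and `repeated_stream`, the fixed budget of examples seen, and the per-epoch shuffling are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def scaled_stream(pool: Sequence[T], budget: int, seed: int = 0) -> List[T]:
    """Data scaling (assumed setup): `budget` distinct examples, each seen once."""
    if budget > len(pool):
        raise ValueError("pool is smaller than the requested budget")
    rng = random.Random(seed)
    return rng.sample(list(pool), budget)


def repeated_stream(pool: Sequence[T], budget: int, repeats: int, seed: int = 0) -> List[T]:
    """Data repetition (assumed setup): budget // repeats distinct examples,
    each seen `repeats` times, reshuffled on every pass."""
    if repeats < 1 or budget % repeats != 0:
        raise ValueError("budget must be a positive multiple of repeats")
    rng = random.Random(seed)
    unique = rng.sample(list(pool), budget // repeats)
    stream: List[T] = []
    for _ in range(repeats):
        epoch = unique[:]      # copy so the base subset stays untouched
        rng.shuffle(epoch)     # new order on each pass
        stream.extend(epoch)
    return stream
```

Both functions return the same number of training examples, so any gap in downstream accuracy between them would reflect the value of fresh examples versus repeated ones, which is the comparison the principles above refer to.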

Method

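The authors propose a taxonomy of CoT: Explicit (all operations written out), Composed (operations combined into coarser steps), and Implicit (intermediate steps omitted). They use a synthetic compositional reasoning task to vary difficulty and compression across several models.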
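
To make the three granularities concrete, here is a minimal sketch on a toy task. The task itself (a chain of modular add and multiply operations), the trace format, and every name in the code are illustrative assumptions; the paper's actual synthetic task is not described in this summary.

```python
from typing import List, Tuple

Op = Tuple[str, int]  # hypothetical encoding: ("add", k) or ("mul", k)
MOD = 97              # assumed modulus, only to keep values small


def apply_op(x: int, op: Op) -> int:
    """Apply a single operation modulo MOD."""
    name, k = op
    if name == "add":
        return (x + k) % MOD
    if name == "mul":
        return (x * k) % MOD
    raise ValueError(f"unknown op: {name}")


def explicit_cot(x: int, ops: List[Op]) -> List[str]:
    """Explicit CoT: one line per operation, every intermediate value shown."""
    steps = []
    for op in ops:
        y = apply_op(x, op)
        steps.append(f"{op[0]} {op[1]}: {x} -> {y}")
        x = y
    return steps + [f"answer: {x}"]


def composed_cot(x: int, ops: List[Op], group: int = 2) -> List[str]:
    """Composed CoT: one line per group of operations; values inside a group are hidden."""
    steps = []
    for i in range(0, len(ops), group):
        chunk = ops[i:i + group]
        y = x
        for op in chunk:
            y = apply_op(y, op)
        label = " then ".join(f"{name} {k}" for name, k in chunk)
        steps.append(f"{label}: {x} -> {y}")
        x = y
    return steps + [f"answer: {x}"]


def implicit_cot(x: int, ops: List[Op]) -> List[str]:
    """Implicit CoT: intermediates omitted, only the final answer is emitted."""
    for op in ops:
        x = apply_op(x, op)
    return [f"answer: {x}"]
```

For a four-operation chain, the explicit trace has four step lines, the composed trace with `group=2` has two, and the implicit trace has none, while all three end on the same answer line. That is the token-cost side of the trade-off the study examines.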

In practice

Prioritize Composed CoT for data scaling benefits.
Consider RLVR to refine compressed SFT steps.
Use unidirectional CoT ordering for longer sequential tasks.
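
For the RLVR recommendation above, a verifiable reward can be as simple as checking the final answer. The sketch below is a generic illustration, not the paper's reward: it assumes completions end with a line of the form `answer: <integer>`, and the names `extract_answer` and `verifiable_reward` are hypothetical.

```python
import re
from typing import Optional

# Assumed output convention: the last line reads "answer: <integer>".
ANSWER_RE = re.compile(r"answer:\s*(-?\d+)\s*$")


def extract_answer(completion: str) -> Optional[int]:
    """Return the integer on the final 'answer:' line, or None if it is missing."""
    lines = completion.strip().splitlines()
    if not lines:
        return None
    match = ANSWER_RE.search(lines[-1])
    return int(match.group(1)) if match else None


def verifiable_reward(completion: str, target: int) -> float:
    """Binary reward: 1.0 if the final answer matches the target, else 0.0."""
    predicted = extract_answer(completion)
    return 1.0 if predicted is not None and predicted == target else 0.0
```

Because a reward like this scores only the final answer, it places no constraint on how many intermediate steps the model writes. That leaves the policy free to expand steps it learned in compressed form during SFT, which is consistent with the decomposition effect the summary reports.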

Topics

Chain-of-Thought
LLM Post-Training
Supervised Fine-Tuning
Reinforcement Learning
Reasoning Data Compression
Model Generalization

Code references

dc-ai-projects/DC-Gen

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.