Where does output diversity collapse in post-training?
Summary
Post-trained large language models (LLMs) produce less diverse outputs than their base counterparts, a phenomenon termed "output diversity collapse." This study traces the collapse across three Olmo 3 post-training lineages: Think (chain-of-thought distillation), Instruct (broad multi-source data), and RL-Zero. It evaluates 13 models on 15 tasks using four text diversity metrics. The authors find that the timing and magnitude of the collapse are strongly linked to training data composition, not only to the post-training method. For instance, the Think lineage loses most of its semantic diversity during supervised fine-tuning (SFT), while the Instruct lineage experiences a larger drop during direct preference optimization (DPO). Suppressing chain-of-thought (CoT) reasoning at inference time in Think models did not recover diversity and significantly reduced accuracy on complex tasks, indicating that the collapse is embedded in the model weights rather than imposed by the generation format. The study also decomposes diversity loss into quality control and genuine narrowing, and finds that the split between the two depends on the task.
Key takeaway
For AI engineers and research scientists developing or fine-tuning LLMs: the composition of your training data, particularly for supervised fine-tuning (SFT), largely determines when and how sharply output diversity collapses. If output diversity is a priority, prefer multi-source data over single-teacher or dual-teacher distillation. Diversity loss is embedded in the model weights during training and cannot be recovered by inference-time adjustments such as suppressing chain-of-thought (CoT). Consider RL without KL penalties for a modest recovery in diversity, and always evaluate the impact on diversity against the requirements of the specific task.
Key insights
Output diversity collapse in LLMs is primarily driven by training data composition, not just post-training methods or generation format.
Principles
- Data composition dictates diversity collapse trajectory.
- CoT format does not impose diversity constraints.
- Diversity loss is task-dependent.
Method
The study traces output diversity through Olmo 3's SFT, DPO, and RL stages across Think, Instruct, and RL-Zero lineages, using EAD, SBERT, NLI, and Vendi Score metrics on 15 tasks.
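To make the lexical side of these metrics concrete, here is a minimal sketch of distinct-n and a unigram expectation-adjusted variant, assuming that EAD stands for expectation-adjusted distinct. The whitespace tokenization, the function names, and the `vocab_size` parameter are illustrative assumptions, not details taken from the study.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(samples, n=1):
    """Plain distinct-n: unique n-grams divided by total n-grams across samples.

    Tokenization is a simple whitespace split (an assumption for this sketch).
    """
    grams = [g for s in samples for g in ngrams(s.split(), n)]
    return len(set(grams)) / len(grams) if grams else 0.0


def expectation_adjusted_distinct(samples, vocab_size):
    """Unigram sketch: distinct tokens divided by the expected distinct count.

    Drawing C tokens uniformly from a vocabulary of size V yields, in
    expectation, V * (1 - ((V - 1) / V) ** C) distinct tokens. Dividing by
    this value reduces the length bias of plain distinct-n.
    vocab_size is a hypothetical parameter the caller must supply.
    """
    tokens = [t for s in samples for t in s.split()]
    if not tokens:
        return 0.0
    v, c = vocab_size, len(tokens)
    expected = v * (1 - ((v - 1) / v) ** c)
    return len(set(tokens)) / expected


# Toy samples: one prompt answered three times by two hypothetical models.
collapsed = ["the cat sat", "the cat sat", "the cat sat"]
varied = ["the cat sat", "a dog ran", "birds fly south"]

print(distinct_n(collapsed), distinct_n(varied))
print(expectation_adjusted_distinct(collapsed, vocab_size=50000),
      expectation_adjusted_distinct(varied, vocab_size=50000))
```

Both scores are higher for the varied set. In a real evaluation the samples would be multiple generations per prompt from each checkpoint, and the vocabulary size would come from the tokenizer in use.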
In practice
- Broaden SFT data sources to mitigate diversity collapse.
- RL without KL penalties can partially reverse DPO-induced narrowing.
- Assess diversity impact relative to task characteristics.
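One way to assess semantic diversity on your own task is to embed several generations for the same prompt and average their pairwise cosine distances. The sketch below assumes the embeddings are already computed (for example by a sentence encoder, which is not shown) and uses hypothetical toy vectors. It illustrates the general idea and is not the study's exact SBERT metric.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine_distance(embeddings):
    """Average of (1 - cosine similarity) over all unordered pairs.

    Returns 0.0 when fewer than two embeddings are given. Higher values
    mean the generations are more spread out in embedding space.
    """
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - cosine_similarity(u, v) for u, v in pairs) / len(pairs)


# Toy embeddings (hypothetical): near-identical outputs vs. spread-out outputs.
narrow = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
broad = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

print(mean_pairwise_cosine_distance(narrow))
print(mean_pairwise_cosine_distance(broad))
```

Computing this score per task for each checkpoint (base, SFT, DPO, RL) shows where in the pipeline the drop occurs for the workloads you care about.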
Topics
- Output Diversity Collapse
- LLM Post-training
- Supervised Fine-Tuning
- Direct Preference Optimization
- Chain-of-Thought Reasoning
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.