Where does output diversity collapse in post-training?
Summary
Post-trained language models exhibit reduced output diversity compared to their base models, which hurts inference-time scaling and creative tasks. A study of three Olmo 3 post-training lineages (Think, Instruct, and RL-Zero) traced this diversity collapse across 15 tasks and four text diversity metrics and found that the stage at which collapse occurs correlates with training data composition. Specifically, the Think lineage, which uses chain-of-thought distillation, loses most of its semantic diversity during supervised fine-tuning, while the impact of DPO is more pronounced in the Instruct lineage. The research indicates that the collapse is embedded in the model weights by the training data and is not merely a result of the generation format: suppressing chain-of-thought reasoning in Think models at inference did not alter answer-level diversity. Decomposing the diversity loss showed that Think models retain more correct-answer diversity than Instruct models despite greater aggregate collapse. Taken together, the results suggest that diversity collapse is primarily a training-time issue.
Key takeaway
AI engineers and research scientists developing or fine-tuning large language models should treat output diversity collapse as largely a product of training data composition rather than of inference-time methods. Prioritize careful curation and analysis of supervised fine-tuning and DPO datasets to mitigate diversity loss: in the study, suppressing chain-of-thought at inference did not change answer-level diversity, so post-hoc inference adjustments are unlikely to restore varied outputs. Focus on data-centric solutions during training to preserve model creativity and utility.
Key insights
Output diversity collapse in post-trained LMs is primarily determined by training data composition, not inference-time factors.
Principles
- The training stage at which diversity collapses correlates with data composition.
- DPO's effect on diversity is lineage-dependent.
Method
The study traced output diversity across Olmo 3 lineages (Think, Instruct, RL-Zero) using 15 tasks and four text diversity metrics, decomposing loss into quality-control and residual components.
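To make the idea of a diversity metric and a loss decomposition concrete, here is a minimal sketch in Python. It is not the study's method: this summary does not name the four metrics or give the decomposition formula. The sketch assumes distinct-n (the share of unique n-grams across samples) as a stand-in metric, and the split into a "quality-control" part (diversity that disappears from base-model samples once incorrect answers are filtered out) and a "residual" part (everything else) is one hypothetical reading of those terms. The function names and the `(text, is_correct)` sample format are illustrative only.

```python
def distinct_n(texts, n=2):
    """Fraction of unique n-grams among all n-grams across the samples.

    Stand-in metric (assumption): the source does not say which metrics it uses.
    """
    ngrams = []
    for text in texts:
        tokens = text.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def decompose_loss(base_samples, post_samples, n=2):
    """Split the diversity drop from base to post-trained model into two parts.

    Each sample is a (text, is_correct) pair. Hypothetical decomposition:
      total           = D(base, all) - D(post, all)
      quality_control = D(base, all) - D(base, correct only)
      residual        = total - quality_control
    """
    base_all = distinct_n([t for t, _ in base_samples], n)
    base_correct = distinct_n([t for t, ok in base_samples if ok], n)
    post_all = distinct_n([t for t, _ in post_samples], n)

    total = base_all - post_all
    quality_control = base_all - base_correct
    return {
        "total": total,
        "quality_control": quality_control,
        "residual": total - quality_control,
    }
```

With this kind of split, a large residual would indicate diversity loss that goes beyond merely filtering out wrong answers.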
In practice
- Analyze training data for diversity impact.
- Evaluate DPO's effect on specific model lineages.
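One way to act on the points above is to sample from each checkpoint in a lineage and see which training stage produces the largest drop in diversity. The sketch below is a hypothetical illustration, not the study's pipeline: it assumes mean pairwise Jaccard distance over token sets as a simple lexical diversity score, and the stage names and sample format are placeholders.

```python
from itertools import combinations


def mean_pairwise_jaccard_distance(texts):
    """Average Jaccard distance between token sets of every pair of samples.

    0.0 means all samples share the same vocabulary; 1.0 means no overlap.
    Assumed metric for illustration only.
    """
    token_sets = [set(t.lower().split()) for t in texts]
    distances = []
    for a, b in combinations(token_sets, 2):
        union = a | b
        distances.append(1 - len(a & b) / len(union) if union else 0.0)
    return sum(distances) / len(distances) if distances else 0.0


def largest_drop(stage_samples):
    """Find the consecutive pair of stages with the biggest diversity drop.

    stage_samples: ordered list of (stage_name, list_of_sampled_texts),
    e.g. base -> SFT -> DPO for one lineage and one prompt.
    Returns (scores, worst) where worst is (from_stage, to_stage, drop) or None.
    """
    scores = [(name, mean_pairwise_jaccard_distance(texts))
              for name, texts in stage_samples]
    drops = [(scores[i][0], scores[i + 1][0], scores[i][1] - scores[i + 1][1])
             for i in range(len(scores) - 1)]
    worst = max(drops, key=lambda d: d[2]) if drops else None
    return scores, worst
```

Running this per lineage and per task would show whether the largest drop falls at supervised fine-tuning or at DPO for a given model, which is the kind of lineage-specific check the bullets suggest.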
Topics
- Output Diversity Collapse
- Post-training Language Models
- Data Composition
- Supervised Fine-tuning
- Chain-of-Thought Reasoning
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.