Where does output diversity collapse in post-training?

2026-04-21 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

Post-trained large language models (LLMs) exhibit reduced output diversity compared to their base counterparts, a phenomenon termed "output diversity collapse." This study traces diversity collapse across three Olmo 3 post-training lineages: Think (chain-of-thought distillation), Instruct (broad multi-source data), and RL-Zero, evaluating 13 models on 15 tasks using four text diversity metrics. Researchers found that the timing and magnitude of diversity collapse are strongly linked to training data composition, not solely the post-training method. For instance, the Think lineage loses most semantic diversity during supervised fine-tuning (SFT), while the Instruct lineage experiences a larger drop during direct preference optimization (DPO). Suppressing chain-of-thought (CoT) reasoning at inference time in Think models did not recover diversity but significantly reduced accuracy on complex tasks, indicating the collapse is embedded in model weights. The study also decomposes diversity loss into quality control and genuine narrowing, revealing a task-dependent split.

Key takeaway

For AI engineers and research scientists developing or fine-tuning LLMs: training data composition, particularly for supervised fine-tuning (SFT), largely determines when and how sharply output diversity collapses. If output diversity is a priority, prefer multi-source data over single-teacher or dual-teacher distillation. Diversity loss is embedded in the model weights during training, and inference-time adjustments such as suppressing chain-of-thought (CoT) did not recover it. Consider RL without KL penalties for modest diversity recovery, and always evaluate diversity impact against the requirements of the specific task.

Key insights

Output diversity collapse in LLMs is primarily driven by training data composition, not just post-training methods or generation format.

Principles

Data composition dictates diversity collapse trajectory.
The CoT format itself is not what constrains diversity: suppressing it at inference time does not restore diversity.
Diversity loss is task-dependent.

Method

The study traces output diversity through Olmo 3's SFT, DPO, and RL stages across Think, Instruct, and RL-Zero lineages, using EAD, SBERT, NLI, and Vendi Score metrics on 15 tasks.
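
To make the lexical side of these metrics concrete, here is a minimal sketch of a distinct-n ratio and a unigram expectation-adjusted variant. It assumes that "EAD" refers to Expectation-Adjusted Distinct, which divides the observed number of distinct tokens by the number expected if the same count of tokens were drawn uniformly from a vocabulary of size V. The whitespace tokenizer, the function names, and the `vocab_size` value are illustrative assumptions, not the study's actual setup.

```python
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(samples: List[str], n: int = 1) -> float:
    """Fraction of n-grams, pooled over all samples, that are unique."""
    pooled = []
    for text in samples:
        # Assumption: lowercase whitespace tokenization, for illustration only.
        pooled.extend(ngrams(text.lower().split(), n))
    if not pooled:
        return 0.0
    return len(set(pooled)) / len(pooled)


def expectation_adjusted_distinct(samples: List[str], vocab_size: int) -> float:
    """Unigram EAD: observed distinct tokens over the expected distinct count.

    Expected count for C tokens drawn uniformly from V types:
        V * (1 - ((V - 1) / V) ** C)
    """
    if vocab_size < 1:
        raise ValueError("vocab_size must be at least 1")
    tokens = [tok for text in samples for tok in text.lower().split()]
    if not tokens:
        return 0.0
    observed = len(set(tokens))
    expected = vocab_size * (1.0 - ((vocab_size - 1) / vocab_size) ** len(tokens))
    return observed / expected


if __name__ == "__main__":
    # Hypothetical samples: repeated generations for the same prompt.
    collapsed = ["the cat sat", "the cat sat", "the cat sat"]
    varied = ["the cat sat", "a dog ran", "birds fly south"]
    for name, samples in [("collapsed", collapsed), ("varied", varied)]:
        print(name, distinct_n(samples, 1), expectation_adjusted_distinct(samples, 1000))
```

Unlike a plain distinct-n ratio, the expectation-adjusted form is less sensitive to how many tokens are sampled in total, which matters when comparing models that produce outputs of very different lengths. The semantic metrics (SBERT, NLI, Vendi Score) require embedding or entailment models and are not sketched here.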

In practice

Broaden SFT data sources to mitigate diversity collapse.
RL without KL penalties can partially reverse DPO-induced narrowing.
Assess diversity impact relative to task characteristics.
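
For readers unfamiliar with what "RL without KL penalties" removes, the commonly used KL-regularized RL objective can be written as below. The summary does not state the study's exact objective, so this is the standard textbook form and all symbols are assumptions: $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ a frozen reference model, $r$ the reward, $\mathcal{D}$ the prompt distribution, and $\beta$ the penalty weight.

```latex
J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[ r(x, y) \big]
  - \beta \, \mathbb{E}_{x \sim \mathcal{D}}\Big[ \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \Big]
```

Under this form, dropping the KL penalty corresponds to setting $\beta = 0$, so the policy is no longer pulled back toward the reference distribution.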

Topics

Output Diversity Collapse
LLM Post-training
Supervised Fine-Tuning
Direct Preference Optimization
Chain-of-Thought Reasoning

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.