Recovering Diversity Without Losing Alignment: A DPO Recipe for Post-Trained LLMs
Summary
REDIPO is an offline Direct Preference Optimization (DPO) data-construction pipeline designed to recover diverse valid answers in post-trained large language models (LLMs) while preserving their alignment benefits. The pipeline samples responses from both base and instruct models, rewrites the base-model responses using the instruct model, filters candidates for safety and instruction-following quality, and then builds preference pairs that favor marginally diverse responses among candidates with similar instruction-following reward. Across Qwen3-4B, OLMo-3-7B, and LLaMA-3.1-8B, REDIPO improved NoveltyBench distinct_k by 134%, 33%, and 44%, respectively, relative to the corresponding instruct checkpoints. It did so while largely maintaining MTBench, IFEval, and Arena-Hard performance and while reducing direct-category HarmBench attack success rates.
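To make the headline metric concrete, here is a minimal sketch of a distinct_k-style count: the k generations for one prompt are greedily partitioned into equivalence classes, and the number of classes is reported. NoveltyBench uses its own equivalence judgment; the normalized exact-match comparison below is a stand-in assumption, and `normalize`, `distinct_k`, and `relative_gain` are illustrative names, not the benchmark's API. `relative_gain` shows how a figure such as "+134%" reads if the reported numbers are relative increases over the instruct checkpoint; the sample values are made up.

```python
from typing import Callable, List


def normalize(text: str) -> str:
    # Stand-in equivalence key (assumption): lowercase and collapse whitespace.
    return " ".join(text.lower().split())


def same_by_normalized_text(a: str, b: str) -> bool:
    return normalize(a) == normalize(b)


def distinct_k(generations: List[str],
               same: Callable[[str, str], bool] = same_by_normalized_text) -> int:
    """Count equivalence classes among k generations via greedy partitioning."""
    representatives: List[str] = []
    for g in generations:
        # Open a new class only if g matches no existing class representative.
        if not any(same(g, r) for r in representatives):
            representatives.append(g)
    return len(representatives)


def relative_gain(new: float, old: float) -> float:
    """Relative change in percent: 150.0 means +150% over `old`."""
    return 100.0 * (new - old) / old


# Illustrative values only, not results from the paper.
samples = ["Paris.", "paris.", "Lyon is lovely.", "Marseille!", "Paris."]
print(distinct_k(samples))       # 3
print(relative_gain(5.0, 2.0))   # 150.0
```

Swapping in a model-based `same` function changes only the equivalence judgment; the counting logic stays the same.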
Key takeaway
For machine learning engineers and AI scientists who want to increase the output diversity of fine-tuned LLMs without compromising alignment, REDIPO offers an empirically validated approach. It shows that diverse valid answers from base-model generations can be reintroduced through carefully constructed preference data. To apply this DPO recipe in your own post-training workflow, especially for open-ended instruction tasks, start with the released code and data at https://github.com/vsamuel2003/RiDiPO.
Key insights
Post-trained LLMs can regain output diversity without losing alignment when their DPO preference data is carefully constructed.
Principles
- Post-training often narrows an LLM's output space.
- Pairing on marginal diversity drives the diversity gains.
- Filtering and quality-bounded pairing preserve alignment.
Method
REDIPO samples from base and instruct models, rewrites base responses with the instruct model, filters for safety and instruction-following, then builds preference pairs favoring marginally diverse responses among candidates with similar instruction-following reward.
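The pairing step can be sketched as follows, under stated assumptions: each candidate already carries an instruction-following reward, "marginal diversity" is approximated as one minus the candidate's highest word-level Jaccard similarity to any other candidate in the pool, and "similar reward" means the two rewards differ by at most `reward_tol`. These proxies, the thresholds, and all names (`Candidate`, `marginal_diversity`, `build_pair`) are hypothetical; the paper's actual diversity measure and reward bucketing may differ.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Optional, Tuple


@dataclass
class Candidate:
    text: str
    if_reward: float  # instruction-following reward, assumed precomputed


def jaccard(a: str, b: str) -> float:
    # Word-set Jaccard similarity (a simple stand-in similarity measure).
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def marginal_diversity(i: int, pool: List[Candidate]) -> float:
    """1 - max similarity to any other candidate in the pool (hypothetical proxy)."""
    sims = [jaccard(pool[i].text, c.text) for j, c in enumerate(pool) if j != i]
    return 1.0 - max(sims) if sims else 0.0


def build_pair(pool: List[Candidate], reward_tol: float = 0.1,
               min_gap: float = 0.05) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected): similar reward, largest diversity gap."""
    div = [marginal_diversity(i, pool) for i in range(len(pool))]
    best, best_gap = None, min_gap
    for i, j in combinations(range(len(pool)), 2):
        if abs(pool[i].if_reward - pool[j].if_reward) > reward_tol:
            continue  # quality bound: skip pairs whose rewards differ too much
        hi, lo = (i, j) if div[i] >= div[j] else (j, i)
        gap = div[hi] - div[lo]
        if gap > best_gap:
            best, best_gap = (pool[hi].text, pool[lo].text), gap
    return best


pool = [
    Candidate("a cat sat on the mat", 0.80),
    Candidate("a cat sat on the rug", 0.82),
    Candidate("quantum foxes juggle neon pancakes", 0.79),
    Candidate("totally different but low quality", 0.20),
]
print(build_pair(pool))
# ('quantum foxes juggle neon pancakes', 'a cat sat on the mat')
```

In this toy pool, the fourth candidate is as diverse as the third but is never paired because its reward falls outside the tolerance. That bound is what keeps the preference signal from rewarding diversity at the expense of instruction following.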
In practice
- Use the REDIPO pipeline to reintroduce diverse valid answers.
- Draw on base-model generations as a source of diversity.
- Filter candidates for safety and instruction-following quality.
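The candidate-pool and filtering steps might look like this for a single prompt. This is a sketch assuming three caller-supplied functions: `rewrite` (the instruct model rewriting a base response), `is_safe` (a safety classifier), and `if_reward` (an instruction-following reward model). All three, along with `build_candidate_pool` and the `min_reward` threshold, are hypothetical placeholders; the toy functions in the usage section exist only so the sketch runs.

```python
from typing import Callable, Dict, List


def build_candidate_pool(
    prompt: str,
    base_samples: List[str],
    instruct_samples: List[str],
    rewrite: Callable[[str, str], str],     # hypothetical: instruct model rewrites a base response
    is_safe: Callable[[str, str], bool],     # hypothetical safety classifier
    if_reward: Callable[[str, str], float],  # hypothetical instruction-following reward model
    min_reward: float = 0.5,
) -> List[Dict[str, object]]:
    """Rewrite base samples, then keep only safe, adequately scored candidates."""
    rewritten = [rewrite(prompt, r) for r in base_samples]
    pool: List[Dict[str, object]] = []
    for source, texts in (("base_rewritten", rewritten), ("instruct", instruct_samples)):
        for text in texts:
            if not is_safe(prompt, text):
                continue  # drop unsafe candidates
            score = if_reward(prompt, text)
            if score < min_reward:
                continue  # drop weak instruction followers
            pool.append({"text": text, "source": source, "if_reward": score})
    return pool


# Toy stand-ins so the sketch runs; real systems would call models here.
def toy_rewrite(prompt: str, response: str) -> str:
    return response.strip().capitalize()


def toy_is_safe(prompt: str, text: str) -> bool:
    return "unsafe" not in text.lower()


def toy_if_reward(prompt: str, text: str) -> float:
    return min(1.0, len(text.split()) / 5)


pool = build_candidate_pool(
    "Name a color.",
    ["  blue like the evening sea ", "unsafe advice goes here"],
    ["Red.", "Green is my pick today."],
    toy_rewrite, toy_is_safe, toy_if_reward,
)
print([(c["source"], c["text"]) for c in pool])
# [('base_rewritten', 'Blue like the evening sea'), ('instruct', 'Green is my pick today.')]
```

The order follows the summary: base responses are rewritten before filtering, so (presumably) both sources are judged in the instruct model's style by the same safety and reward checks.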
Topics
- DPO
- LLM Fine-tuning
- Output Diversity
- Model Alignment
- Preference Learning
- Generative AI
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.