Recovering Diversity Without Losing Alignment: A DPO Recipe for Post-Trained LLMs

2026-05-28 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

REDIPO is an offline DPO data-construction pipeline designed to recover diverse valid answers in post-trained Large Language Models (LLMs) while preserving their alignment benefits. The pipeline samples responses from both base and instruct models, rewrites base-model responses using the instruct model, filters candidates for safety and instruction-following quality, and then builds preference pairs that favor marginally diverse responses among candidates with similar instruction-following reward. Across Qwen3-4B, OLMo-3-7B, and LLaMA-3.1-8B, REDIPO improved NoveltyBench distinct_k by 134%, 33%, and 44% respectively, relative to the corresponding instruct checkpoints. The resulting models largely maintained MTBench, IFEval, and Arena-Hard performance and showed lower direct-category HarmBench attack success rates.

Key takeaway

For machine learning engineers and AI scientists who want more diverse outputs from fine-tuned LLMs without compromising alignment, REDIPO offers an approach evaluated on three model families. It shows that diverse valid answers from base-model generations can be reintroduced through carefully constructed preference data. The released code and data at https://github.com/vsamuel2003/RiDiPO are a starting point for applying this DPO recipe in post-training workflows, especially for open-ended instruction tasks.

Key insights

Post-trained LLMs can regain output diversity without losing alignment when DPO preference data is constructed carefully.

Principles

Post-training often narrows LLM output space.
Marginal diversity pairing drives diversity gains.
Filtering and quality-bounded pairing maintain alignment.
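
For context on how preference pairs shape the model, the standard per-pair DPO loss can be sketched as below. This is the generic DPO objective, not code from the REDIPO release; the function name, argument names, and the beta value of 0.1 are assumptions chosen for illustration.

```python
import math


def dpo_pair_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; tune per setup
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are summed sequence log-probabilities under the policy being
    trained and under the frozen reference model.
    """
    # How much more the policy favors each response than the reference does.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(x)) computed as softplus(-x) in a numerically stable way.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

When the policy matches the reference on both responses, the loss equals log 2; it falls as the policy shifts probability toward the chosen response. In a diversity-oriented recipe, the chosen response would be the more diverse candidate of the pair.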

Method

REDIPO samples from base and instruct models, rewrites base responses with the instruct model, filters for safety and instruction-following, then builds preference pairs favoring marginally diverse responses among candidates with similar instruction-following reward.
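
The pairing step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the released implementation: the Candidate type, the Jaccard word-overlap distance used as a diversity measure, the mean-distance definition of marginal diversity, and the reward_tol and min_gap thresholds are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Candidate:
    text: str
    reward: float  # instruction-following reward from some scorer (assumed)


def jaccard_distance(a: str, b: str) -> float:
    """Word-set distance in [0, 1]; a crude stand-in for a diversity metric."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def marginal_diversity(idx: int, candidates: list[Candidate]) -> float:
    """Mean distance from one candidate to all other candidates."""
    others = [c for j, c in enumerate(candidates) if j != idx]
    if not others:
        return 0.0
    total = sum(jaccard_distance(candidates[idx].text, o.text) for o in others)
    return total / len(others)


def build_pairs(
    prompt: str,
    candidates: list[Candidate],
    reward_tol: float = 0.05,  # assumed bound on the reward difference
    min_gap: float = 0.05,     # assumed minimum diversity difference
) -> list[dict]:
    """Pair candidates of similar reward, preferring the more diverse one."""
    div = [marginal_diversity(i, candidates) for i in range(len(candidates))]
    pairs = []
    for i, j in combinations(range(len(candidates)), 2):
        # Quality bound: only compare responses with similar reward.
        if abs(candidates[i].reward - candidates[j].reward) > reward_tol:
            continue
        # Skip pairs whose diversity scores are too close to rank.
        if abs(div[i] - div[j]) < min_gap:
            continue
        chosen, rejected = (i, j) if div[i] > div[j] else (j, i)
        pairs.append({
            "prompt": prompt,
            "chosen": candidates[chosen].text,
            "rejected": candidates[rejected].text,
        })
    return pairs
```

The reward bound is what keeps the preference signal about diversity rather than quality: a low-reward response never becomes "chosen" merely because it is unusual.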

In practice

Use the REDIPO pipeline to reintroduce diverse valid answers.
Leverage base-model generations for diversity.
Filter candidates for safety and instruction-following quality.
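
The filtering step in the list above might look like the sketch below. The safety checker and reward scorer are passed in as plain callables because the summary does not name the models used; filter_candidates, is_safe, if_reward, and the min_reward threshold are hypothetical names and values for illustration only.

```python
from typing import Callable


def filter_candidates(
    prompt: str,
    responses: list[str],
    is_safe: Callable[[str, str], bool],      # assumed safety classifier
    if_reward: Callable[[str, str], float],   # assumed instruction-following scorer
    min_reward: float = 0.5,                  # assumed quality threshold
) -> list[tuple[str, float]]:
    """Keep responses that pass the safety check and meet the reward floor."""
    kept = []
    for response in responses:
        # Drop unsafe responses before spending time on scoring.
        if not is_safe(prompt, response):
            continue
        score = if_reward(prompt, response)
        if score >= min_reward:
            kept.append((response, score))
    return kept
```

Returning the score alongside each surviving response lets a later pairing stage compare candidates by reward without rescoring them.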

Topics

DPO
LLM Fine-tuning
Output Diversity
Model Alignment
Preference Learning
Generative AI

Code references

vsamuel2003/RiDiPO

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.