Boosting Direct Preference Optimization with Penalization
Summary
Direct Preference Optimization with Penalization (DPOP) extends Direct Preference Optimization (DPO) for offline preference optimization. Standard DPO and its variants rely solely on chosen and rejected responses; DPOP adds a previously unused signal: the response the reference model itself generates for the same prompt (the reference-greedy response). The method augments the base preference loss with a gated penalty on reference-greedy responses, and the penalty is active only when the current policy assigns a lower likelihood to the preferred response than to the rejected one. On AlpacaEval 2.0, DPOP improves length-controlled win rates, with reported relative gains of 5.3% on Llama-3-8b-it and 4.4% on Gemma-2-9b-it compared with DPO, SimPO, and AlphaDPO. Ablation studies further indicate that a SimNPO-style length-normalized penalty performs better in this setting than either NPO or token-level unlikelihood.
Key takeaway
For machine learning engineers fine-tuning large language models on preference datasets, Direct Preference Optimization with Penalization (DPOP) is worth evaluating alongside standard DPO, SimPO, and AlphaDPO. By adding a gated penalty on reference-greedy responses, DPOP reports relative gains in length-controlled AlpacaEval 2.0 win rate of 5.3% on Llama-3-8b-it and 4.4% on Gemma-2-9b-it compared with those methods. The results cover two models on one benchmark, so confirm the gain on your own model and data before switching.
Key insights
DPOP extends DPO with a gated penalty on reference-greedy responses and reports higher length-controlled win rates on AlpacaEval 2.0 than DPO, SimPO, and AlphaDPO.
Principles
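- Offline preference optimization can use the reference model's own outputs as an additional training signal.
- Gating limits the penalty to examples where the policy still assigns the rejected response a higher likelihood than the preferred one.
- In the reported ablations, a SimNPO-style length-normalized penalty outperforms both NPO and token-level unlikelihood.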
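To make the three penalty forms named in the ablation concrete, here is a minimal Python sketch. The SimNPO and NPO expressions follow their commonly cited forms; the function names, the defaults `beta=2.0` and `gamma=0.0`, and the toy log-probabilities are illustrative assumptions and are not taken from the DPOP paper.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def simnpo_penalty(token_logps, beta=2.0, gamma=0.0):
    """SimNPO-style penalty: depends on the average (length-normalized) log-prob."""
    avg_logp = sum(token_logps) / len(token_logps)
    return -(2.0 / beta) * log_sigmoid(-beta * avg_logp - gamma)

def npo_penalty(token_logps, ref_token_logps, beta=2.0):
    """NPO-style penalty: depends on the summed policy/reference log-ratio."""
    log_ratio = sum(token_logps) - sum(ref_token_logps)
    return -(2.0 / beta) * log_sigmoid(-beta * log_ratio)

def unlikelihood_penalty(token_logps):
    """Token-level unlikelihood: -sum_t log(1 - p_t). Assumes every log-prob < 0."""
    return -sum(math.log1p(-math.exp(lp)) for lp in token_logps)

# Toy responses with the same per-token log-prob but different lengths.
short_resp = [-0.5] * 4
long_resp = [-0.5] * 40

print("SimNPO      :", simnpo_penalty(short_resp), simnpo_penalty(long_resp))
print("Unlikelihood:", unlikelihood_penalty(short_resp), unlikelihood_penalty(long_resp))
```

In this toy case the length-normalized penalty is identical for the short and long responses, while the token-level unlikelihood term grows with the number of tokens. That shows one way the variants differ; it does not by itself explain the ablation result.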
Method
DPOP extends DPO by adding a gated penalty on reference-greedy responses, activating when the policy favors the rejected response over the preferred one.
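The gating rule can be sketched for a single training example as follows. This is an illustration under assumptions, not the authors' implementation: the weight `lam`, the hard 0/1 gate on summed sequence log-likelihoods, and all function and parameter names are hypothetical, and `penalty` stands for whatever penalty value has been computed on the reference-greedy response.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss from sequence log-probs of the chosen (w) and rejected (l) responses."""
    margin = beta * ((pol_w - ref_w) - (pol_l - ref_l))
    return -log_sigmoid(margin)

def gated_loss(pol_w, pol_l, ref_w, ref_l, penalty, lam=1.0, beta=0.1):
    """DPO loss plus a penalty that is switched on only when the policy
    assigns a lower likelihood to the chosen response than to the rejected one."""
    gate = 1.0 if pol_w < pol_l else 0.0  # assumed hard gate on raw log-likelihoods
    return dpo_loss(pol_w, pol_l, ref_w, ref_l, beta) + lam * gate * penalty

# Policy already prefers the chosen response: gate closed, plain DPO loss.
loss_gate_off = gated_loss(-10.0, -12.0, -11.0, -11.0, penalty=0.5)

# Policy prefers the rejected response: gate open, penalty is added.
loss_gate_on = gated_loss(-12.0, -10.0, -11.0, -11.0, penalty=0.5)
```

With the gate closed the objective reduces to the base preference loss, so the extra term only affects examples the policy currently ranks incorrectly.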
In practice
- Consider DPOP for fine-tuning large language models.
- Implement gated penalties in preference optimization objectives.
- Evaluate SimNPO-style length normalization for penalties.
Topics
- Direct Preference Optimization
- Preference Learning
- Large Language Models
- Model Fine-tuning
- AlpacaEval 2.0
- Llama-3-8b-it
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.