Boosting Direct Preference Optimization with Penalization
Summary
Direct Preference Optimization with Penalization (DPOP) is an extension of Direct Preference Optimization (DPO) that uses the reference model's greedy response as an additional training signal. DPOP augments the standard pairwise preference loss with a gated penalty applied to these reference-greedy responses. The penalty activates only when the current policy still assigns a lower likelihood to the preferred response than to the rejected response, so it targets the specific instances where the policy struggles. Evaluated on AlpacaEval 2.0, DPOP significantly improves the length-controlled win rate (LC-WR) for both Llama-3-8b-it and Gemma-2-9b-it models. It achieves relative gains of 5.3% and 4.4% over strong baselines such as SimPO and AlphaDPO, respectively. Ablation studies find that a SimNPO-style length-normalized penalty is the most effective penalty form and that linear weighting of the penalty yields the strongest results.
Key takeaway
If you are a Machine Learning Engineer aligning large language models with human preferences using DPO or one of its variants, consider integrating Direct Preference Optimization with Penalization (DPOP). The method significantly improves length-controlled win rates by selectively penalizing the reference model's greedy responses on examples where your policy ranks the rejected response above the preferred one. Implement DPOP with a SimNPO-style penalty and linear weighting for the strongest gains, and tune hyperparameters such as penalty weight and temperature carefully for your specific model and dataset.
Key insights
DPOP improves preference optimization by selectively penalizing the reference model's greedy outputs when the policy ranks the rejected response above the chosen one.
Principles
- Reference-greedy responses provide useful signal.
- Penalties should be gated, not universal.
- Linear weighting of penalties is effective.
Method
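DPOP adds a penalty on the reference-greedy response, GreedyDecode(π_ref(·|x)), to the DPO base loss. Here r denotes the policy's margin for the chosen response over the rejected one. The penalty is gated by the indicator 1[r<0], so it applies only when the policy ranks the rejected response above the chosen one, and it is weighted by a function f(r).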
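The gating and weighting described above can be illustrated with a minimal per-example sketch in plain Python. Several details here are assumptions that the summary does not confirm: r is taken to be the standard DPO implicit-reward margin scaled by beta, the linear weight is taken to be f(r) = -r, and `penalty` is a precomputed scalar penalty on the reference-greedy response. The names `dpo_margin`, `dpop_loss`, and `lam` are hypothetical and chosen only for illustration.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z)) for a Python float."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_margin(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    """Assumed definition of r: the DPO implicit-reward margin.

    pol_* and ref_* are sequence log-probabilities of the chosen (w)
    and rejected (l) responses under the policy and reference model.
    """
    return beta * ((pol_w - ref_w) - (pol_l - ref_l))


def dpop_loss(pol_w, pol_l, ref_w, ref_l, penalty,
              beta=0.1, lam=1.0, weight=lambda r: -r):
    """DPO base loss plus a gated, weighted penalty (illustrative only).

    penalty: scalar penalty on the reference-greedy response.
    lam:     hypothetical penalty weight hyperparameter.
    weight:  f(r); the default -r is an assumed linear weighting.
    """
    r = dpo_margin(pol_w, pol_l, ref_w, ref_l, beta)
    base = -log_sigmoid(r)            # standard DPO loss term
    gate = 1.0 if r < 0 else 0.0      # indicator 1[r < 0]
    return base + lam * gate * weight(r) * penalty
```

With this shape, an example the policy already ranks correctly (r >= 0) contributes only the DPO term, while a misranked example adds a penalty that grows linearly with how negative the margin is.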
In practice
- Cache reference-greedy responses for training.
- Implement gated penalties for targeted correction.
- Use SimNPO-style length-normalized penalty.
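As a rough sketch of the first and third points, the snippet below caches reference-greedy responses per prompt and computes a length-normalized penalty from per-token log-probabilities. The penalty follows the SimNPO form -(2/beta) * log sigmoid(-(beta/|y|) * log pi(y|x) - gamma); whether DPOP uses exactly this form and these hyperparameters is an assumption. `ReferenceGreedyCache`, `simnpo_style_penalty`, and the `decode_fn` callable are hypothetical names, and `decode_fn` stands in for whatever greedy decoding routine your reference model provides.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z)) for a Python float."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


class ReferenceGreedyCache:
    """Decode each prompt once with the reference model, then reuse it."""

    def __init__(self, decode_fn):
        # decode_fn: hypothetical callable mapping prompt -> greedy response
        self._decode_fn = decode_fn
        self._store = {}

    def get(self, prompt):
        if prompt not in self._store:
            self._store[prompt] = self._decode_fn(prompt)
        return self._store[prompt]


def simnpo_style_penalty(token_logps, beta=2.0, gamma=0.0):
    """Length-normalized penalty on one response (assumed SimNPO form).

    token_logps: per-token log-probabilities of the reference-greedy
                 response under the current policy.
    The value shrinks as the policy's average token log-probability
    drops, so minimizing it pushes the policy away from the response.
    """
    avg_logp = sum(token_logps) / len(token_logps)  # divide by |y|
    return -(2.0 / beta) * log_sigmoid(-beta * avg_logp - gamma)
```

Because the reference model is frozen, its greedy responses do not change during training, which is why computing them once and caching them is sufficient. Dividing by the response length keeps long responses from dominating the penalty purely because they have more tokens.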
Topics
- Direct Preference Optimization
- LLM Alignment
- Penalization Techniques
- SimNPO
- AlpacaEval 2.0
- Llama-3-8B-Instruct
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.