Understanding post-training of LLMs: DPO
Summary
Direct Preference Optimization (DPO) is a post-training technique for Large Language Models (LLMs) that addresses a limitation of Supervised Fine-Tuning (SFT). SFT teaches format and compliance by training the model to imitate the example responses it sees, in effect averaging over them, so it has no way to express which of several acceptable answers is better. DPO instead teaches "taste" and subtle preference shifts: it trains on pairs of chosen and rejected answers to the same prompt, raising the probability of the chosen response and lowering that of the rejected one. Unlike PPO-based Reinforcement Learning from Human Feedback (RLHF), DPO needs neither a separate reward model nor a sampling loop during training, which simplifies optimization. The dataset is built by having human annotators rank multiple responses generated by a base model for a range of prompts, which yields (chosen, rejected) preference pairs. A critical part of DPO's mathematical formulation is the reference model: because improvements are measured relative to the original model's scores, the trainable model is discouraged from drifting wildly, collapsing diversity, or destroying previously learned good behavior.
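The summary describes the objective only in words. For reference, the DPO loss as it is usually written is reproduced below. The symbols are not from the summary and are introduced here for illustration: x is the prompt, y_w the chosen response, y_l the rejected response, \pi_\theta the trainable model, \pi_{\mathrm{ref}} the frozen reference model, \sigma the logistic sigmoid, and \beta a scaling hyperparameter that controls how strongly the model is held near the reference.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
\left[ \log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right) \right]
```

Both log-ratios are taken against the reference model, which is what makes every improvement relative to the original model's scores: the loss falls only when the chosen response gains probability relative to the reference by more than the rejected response does.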
Key takeaway
AI engineers who want to refine LLM output quality beyond basic compliance should consider implementing Direct Preference Optimization (DPO). The method lets a model learn subtle human preferences and "taste" more effectively than SFT, without the overhead of training a separate reward model or managing a complex RLHF pipeline. The resulting model can produce more nuanced, preferred responses while staying anchored to a reference model, which maintains stability and limits undesirable behavioral drift.
Key insights
DPO optimizes an LLM directly on human preference pairs, without a separate reward model or a reinforcement learning loop.
Principles
- SFT imitates, DPO teaches preference.
- Anchor probabilities to prevent model drift.
- Preference data drives taste and ranking.
Method
Generate multiple responses per prompt, have human annotators rank them to form (chosen, rejected) pairs, then train the model to raise the probability of the chosen response and lower that of the rejected one, with both measured relative to a frozen reference model.
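The training step described above can be sketched for a single preference pair as follows. This is a minimal illustration in plain Python, not the original article's code: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made here, and each argument is assumed to be the summed log-probability of a whole response under the trainable (policy) model or the frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities (hypothetical helper).

    beta is an assumed hyperparameter; larger values penalize movement
    away from the reference model more strongly.
    """
    # How much each response gained (or lost) relative to the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Positive margin: the chosen response improved more than the rejected one.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: margin is 0, so the loss is log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy raised the chosen response and lowered the rejected one: lower loss.
improved = dpo_loss(-9.0, -13.0, -10.0, -12.0)
```

Because only the differences from the reference enter the loss, a policy that has not moved from the reference sits at log(2) regardless of the absolute log-probabilities, and the loss decreases only as the chosen response gains ground relative to the rejected one. In a real training loop the same arithmetic would be applied to batched tensors so that gradients flow through the policy's log-probabilities.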
In practice
- Collect diverse preference datasets.
- Use DPO for nuanced response quality.
- Compare DPO to RLHF for efficiency.
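To make the data-collection step concrete, here is a small sketch of turning one annotator ranking into (chosen, rejected) pairs. The function name `ranking_to_pairs` and the `prompt` / `chosen` / `rejected` field names are assumptions for illustration. Emitting every ordered pair from a ranking is also just one possible choice; a pipeline could instead keep only the best-versus-worst pair.

```python
from itertools import combinations


def ranking_to_pairs(prompt, ranked_responses):
    """Expand a best-to-worst ranking into preference pairs (hypothetical helper).

    ranked_responses must be ordered from most to least preferred.
    """
    # combinations() preserves input order, so the first element of each
    # pair is always the one the annotator ranked higher.
    return [
        {"prompt": prompt, "chosen": better, "rejected": worse}
        for better, worse in combinations(ranked_responses, 2)
    ]


pairs = ranking_to_pairs(
    "Explain what a reference model is.",
    ["response A", "response B", "response C"],  # annotator order: A > B > C
)
```

A ranking of n responses yields n(n-1)/2 pairs under this scheme, so three ranked responses produce three pairs: (A, B), (A, C), and (B, C).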
Topics
- Direct Preference Optimization
- Large Language Models
- Supervised Fine-Tuning
- Reinforcement Learning from Human Feedback
- Preference Optimization
Best for: AI Engineer, Machine Learning Engineer, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by LLM on Medium.