Understanding post-training of LLMs: DPO

· Source: LLM on Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Advanced, short

Summary

Direct Preference Optimization (DPO) is a post-training technique for Large Language Models (LLMs) that addresses a limitation of Supervised Fine-Tuning (SFT). SFT teaches format and compliance by imitating example responses, which in effect averages over them, so it cannot express a preference among several acceptable answers. DPO instead teaches "taste" and subtle preference shifts: it trains on pairs of chosen and rejected answers to the same prompt, raising the probability of the chosen response and lowering that of the rejected one. Unlike Reinforcement Learning from Human Feedback (RLHF), DPO requires neither a separate reward model nor a PPO sampling loop, which simplifies optimization. The dataset is built by having human annotators rank multiple responses generated by a base model for each prompt, which yields (chosen, rejected) preference pairs. A critical part of DPO's mathematical formulation is a reference model that anchors the probabilities: improvements are measured relative to the original model's scores, which keeps the trainable model from drifting wildly, collapsing diversity, or destroying previously learned good behavior.
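
The summary describes the objective only in words. For reference, the loss is usually written as below. This is the standard form from the original DPO paper, not something quoted from the article, and the symbols are assumptions introduced here: \pi_\theta is the trainable model, \pi_{\mathrm{ref}} the reference model, y_w and y_l the chosen and rejected responses to prompt x, \sigma the logistic function, and \beta a positive scale factor.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Each log-ratio measures how far the trainable model has moved from the reference on one response, so the loss rewards only relative improvement on the chosen answer over the rejected one. A larger \beta penalizes that movement more strongly.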

Key takeaway

AI engineers who want to refine LLM output quality beyond basic compliance should consider Direct Preference Optimization (DPO). It lets models learn subtle human preferences and "taste" more effectively than SFT, without the overhead of training a separate reward model or managing a complex RLHF pipeline. Anchoring to a reference model keeps training stable and limits undesirable behavioral drift, so the resulting responses are more nuanced and closer to what annotators prefer.

Key insights

DPO optimizes an LLM directly on human preference pairs, without a separate reward model or a reinforcement learning loop.

Principles

Method

Generate multiple responses for each prompt, have humans rank them to create (chosen, rejected) pairs, then train the model to raise the probability of the chosen response and lower that of the rejected one, anchored by a reference model.
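
The method can be sketched in a few lines of plain Python. This is a minimal illustration, not the article's implementation: the helper names `pairs_from_ranking` and `dpo_loss`, the default `beta=0.1`, and the sample log-probabilities are all hypothetical, and the loss follows the standard DPO formulation rather than anything quoted from the source. The inputs to `dpo_loss` are assumed to be summed token log-probabilities of a whole response under the trainable (policy) model and the reference model.

```python
import math
from itertools import combinations


def pairs_from_ranking(prompt, ranked_responses):
    """Turn one human ranking (best first) into (prompt, chosen, rejected) triples."""
    # combinations() preserves input order, so the first element always outranks the second.
    return [
        (prompt, better, worse)
        for better, worse in combinations(ranked_responses, 2)
    ]


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities (hypothetical helper)."""
    # How far the policy has moved from the reference on each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), computed as softplus(-margin) in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


pairs = pairs_from_ranking(
    "Explain DPO briefly.", ["answer A", "answer B", "answer C"]
)

# Before any update the policy equals the reference, so the margin is zero.
loss_at_start = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Raising the chosen response's log-probability relative to the reference lowers the loss.
loss_after_gain = dpo_loss(-10.0, -15.0, -12.0, -15.0)
```

Because only differences from the reference enter the loss, a policy identical to the reference starts at log 2 for every pair, and the loss falls only when the chosen response gains probability relative to the rejected one. In a real training loop the log-probabilities would come from forward passes of the two models, and the reference model's weights would not be updated.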

Best for: AI Engineer, Machine Learning Engineer, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by LLM on Medium.