Beyond Uniform Forgetting: A Study of Sequential Direct Preference Optimization Across Preference Settings
Summary
A study of sequential Direct Preference Optimization (DPO) on Llama-3.1-8B-Instruct with LoRA adapters investigates how later training stages affect preferences learned in earlier ones. The researchers evaluated sequential DPO across four distinct preference settings: distributional conflict (HH-RLHF), multi-attribute interaction (HelpSteer2), strong safety signal (PKU-SafeRLHF), and compatible response-quality objectives (UltraFeedback). The findings show that sequential DPO does not cause uniform forgetting. Instead, outcomes range from partial degradation to stability, pair-level redistribution, or positive transfer, depending on how the objectives relate, how strong the preference signal is, and the order of training. Mechanistic diagnostics, including gradient cosine similarity and adapter movement, showed that Stage 2 gradients were near-orthogonal to those of the earlier objectives, suggesting that direct gradient opposition is not the primary driver of preference change. The study used 83,886,080 trainable LoRA parameters (1.03% of the full model), a learning rate of 5e-5, and beta=0.3.
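The trainable-parameter figure can be sanity-checked with a little arithmetic. The sketch below is an inference, not something the summary states: it assumes LoRA is applied to all seven linear projections in each of the 32 decoder layers, takes the layer shapes from the public Llama-3.1-8B configuration, and assumes a base parameter count of 8,030,261,248. Under those assumptions, rank 32 reproduces both reported numbers.

```python
# Layer shapes from the public Llama-3.1-8B config. The LoRA rank and the
# choice of target modules are ASSUMPTIONS; the summary does not state them.
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 1024            # 8 key/value heads x 128 head dimension
NUM_LAYERS = 32
BASE_PARAMS = 8_030_261_248  # assumed base-model parameter count

# (in_features, out_features) for each adapted projection
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank, projections=PROJECTIONS, num_layers=NUM_LAYERS):
    """Each adapted matrix adds rank * (in_features + out_features) weights."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in projections.values())
    return per_layer * num_layers


trainable = lora_param_count(rank=32)
share = 100 * trainable / (BASE_PARAMS + trainable)
print(trainable, round(share, 2))  # 83886080 1.03
```

Other rank and module combinations could in principle give the same total, so this is a consistency check rather than a reconstruction of the study's configuration.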
Key takeaway
ML engineers designing LLM alignment pipelines should account for objective compatibility and signal strength rather than assume uniform forgetting when adding new DPO objectives. Analyze preference changes at the pair level and consider training order to favor stability or positive transfer and to avoid unintended degradation of previously learned behaviors.
Key insights
The effect of sequential DPO on previously learned preferences depends on how the objectives relate to each other; forgetting is not uniform.
Principles
- Preference changes are heterogeneous, not uniform.
- Objective compatibility dictates transfer or degradation.
- Direct gradient opposition is not the primary driver of forgetting.
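To make the last principle concrete, here is a minimal sketch of the kind of gradient cosine similarity diagnostic the summary mentions. It assumes the gradients of two objectives have already been flattened into equal-length lists of floats; the function name and the toy vectors are illustrative, not taken from the study.

```python
import math


def gradient_cosine(grad_a, grad_b):
    """Cosine similarity between two flattened gradient vectors.

    Near -1: the objectives pull the parameters in opposite directions.
    Near  0: the updates are close to orthogonal (little direct conflict).
    Near +1: the objectives reinforce each other.
    """
    if len(grad_a) != len(grad_b):
        raise ValueError("gradients must have the same length")
    dot = sum(a * b for a, b in zip(grad_a, grad_b))
    norm_a = math.sqrt(sum(a * a for a in grad_a))
    norm_b = math.sqrt(sum(b * b for b in grad_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention: treat a zero gradient as no alignment
    return dot / (norm_a * norm_b)


# Toy vectors standing in for flattened LoRA gradients (hypothetical values).
stage1_grad = [1.0, 0.0, 2.0, 0.0]
stage2_grad = [0.0, 3.0, 0.0, -1.0]
print(gradient_cosine(stage1_grad, stage2_grad))  # 0.0, orthogonal
```

A near-zero value like this means the second objective's update neither directly undoes nor directly reinforces the first, which is why near-orthogonal Stage 2 gradients point away from gradient opposition as the main explanation.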
Method
The study trains Llama-3.1-8B-Instruct with LoRA adapters on two objectives in sequence. After each stage, all objectives are evaluated against a fixed base-model reference using pair-level margin analysis.
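As a rough illustration of a pair-level margin, the sketch below uses the standard DPO implicit-reward margin computed against a fixed reference model. That the study uses exactly this definition is an assumption; the function name and the log-probability values are hypothetical, and only beta=0.3 comes from the summary.

```python
def dpo_margin(policy_chosen_logp, policy_rejected_logp,
               ref_chosen_logp, ref_rejected_logp, beta=0.3):
    """Implicit-reward margin for one preference pair.

    Each argument is the summed token log-probability of a full response.
    The reference log-probs come from the frozen base model and stay the
    same at every stage, so margins from different stages are comparable.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return chosen_reward - rejected_reward


# Hypothetical log-probs for one pair, scored after Stage 1 and after Stage 2.
ref = {"chosen": -12.0, "rejected": -12.0}
after_stage1 = dpo_margin(-10.0, -14.0, ref["chosen"], ref["rejected"])
after_stage2 = dpo_margin(-11.0, -12.5, ref["chosen"], ref["rejected"])

# A positive margin means the policy still prefers the chosen response;
# the difference between stages is the pair-level change being analyzed.
print(after_stage1 > 0, after_stage2 > 0, after_stage2 - after_stage1 < 0)
```

In this toy case the pair keeps its preferred ordering after Stage 2 but with a smaller margin, which would count as partial degradation rather than a flipped preference.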
In practice
- Analyze preference changes at the pair level via quartile decomposition.
- Measure gradient cosine similarity for mechanistic insights.
- Consider objective signal strength in alignment pipelines.
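The first item above can be sketched as follows. The summary does not specify how the quartiles are formed, so this example assumes pairs are ranked by their margin after Stage 1 and split into four groups, then reports the mean margin change and the number of flipped preferences per group. All names and numbers are illustrative.

```python
from statistics import mean


def quartile_decomposition(stage1_margins, stage2_margins):
    """Group pairs by Stage 1 margin quartile and summarize the Stage 2 change."""
    if len(stage1_margins) != len(stage2_margins):
        raise ValueError("margin lists must have the same length")
    n = len(stage1_margins)
    if n < 4:
        raise ValueError("need at least four pairs to form quartiles")

    # Rank pairs from weakest to strongest Stage 1 preference.
    order = sorted(range(n), key=lambda i: stage1_margins[i])
    summary = []
    for q in range(4):
        idx = order[q * n // 4:(q + 1) * n // 4]
        deltas = [stage2_margins[i] - stage1_margins[i] for i in idx]
        # A flip is a pair whose preferred response changed between stages.
        flipped = sum(
            1 for i in idx
            if (stage1_margins[i] > 0) != (stage2_margins[i] > 0)
        )
        summary.append({
            "quartile": q + 1,
            "pairs": len(idx),
            "mean_delta": mean(deltas),
            "flipped": flipped,
        })
    return summary


# Hypothetical margins for eight pairs on the Stage 1 objective.
s1 = [-2.0, -1.0, 0.5, 1.0, 2.0, 3.0, 4.0, 5.0]
s2 = [-1.0, 0.0, 0.5, 1.5, 1.0, 2.0, 2.0, -1.0]
for row in quartile_decomposition(s1, s2):
    print(row)
```

An aggregate accuracy number would hide the pattern in this toy data: weakly held pairs gain margin while strongly held pairs lose it, which is the kind of pair-level redistribution a quartile view is meant to expose.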
Topics
- Direct Preference Optimization
- Sequential Alignment
- Large Language Models
- LoRA Adapters
- Preference Forgetting
- Gradient Conflict
- Human Feedback
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.