Beyond Uniform Forgetting: A Study of Sequential Direct Preference Optimization Across Preference Settings

2026-06-19 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

A study on sequential Direct Preference Optimization (DPO) using Llama-3.1-8B-Instruct with LoRA adapters investigates how later training stages affect preferences learned earlier. Researchers evaluated sequential DPO across four distinct preference settings: distributional conflict (HH-RLHF), multi-attribute interaction (HelpSteer2), strong safety signal (PKU-SafeRLHF), and compatible response-quality objectives (UltraFeedback). The findings reveal that sequential DPO does not cause uniform forgetting; instead, preference changes range from partial degradation to stability, pair-level redistribution, or positive transfer, depending on objective relationship, signal strength, and training order. Mechanistic diagnostics, including gradient cosine similarity and adapter movement, showed that Stage 2 gradients were near-orthogonal to previous objectives, suggesting that direct gradient opposition is not the primary driver of preference change. The study used 83,886,080 trainable LoRA parameters (1.03% of the full model) and a learning rate of 5e-5 with beta=0.3.
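
The trainable-parameter figure can be sanity-checked with a little arithmetic. The summary does not state the LoRA rank or the adapted modules, so the sketch below assumes rank 32 applied to all seven linear projections in each decoder layer, together with the published Llama-3.1-8B dimensions (hidden size 4096, MLP size 14336, 32 layers, 8 key/value heads of dimension 128, vocabulary 128256). That assumed configuration reproduces the reported count, though other configurations might as well.

```python
# Assumed configuration (not stated in the summary): LoRA rank 32 on every
# attention and MLP projection. Dimensions are those of Llama-3.1-8B.
HIDDEN, MLP, LAYERS, VOCAB = 4096, 14336, 32, 128256
KV_DIM = 8 * 128  # 8 key/value heads x head dimension 128
RANK = 32         # assumption, chosen for illustration

# (in_features, out_features) of each adapted projection in one decoder layer
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_params(rank):
    # Each adapted matrix gains A (rank x in) and B (out x rank).
    per_layer = sum(rank * (i + o) for i, o in PROJECTIONS.values())
    return per_layer * LAYERS


def base_params():
    # Projection weights plus the two RMSNorm weight vectors per layer.
    per_layer = sum(i * o for i, o in PROJECTIONS.values()) + 2 * HIDDEN
    # Untied input embedding and output head, plus the final norm.
    return per_layer * LAYERS + 2 * VOCAB * HIDDEN + HIDDEN


trainable = lora_params(RANK)
total = base_params() + trainable
share = 100 * trainable / total
print(trainable, round(share, 2))
```

Under these assumptions the script prints 83886080 and 1.03, which matches the reported figures when the share is taken over base plus adapter parameters.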

Key takeaway

ML engineers designing LLM alignment pipelines should account for objective compatibility and signal strength rather than assume uniform forgetting when adding new DPO objectives. Analyzing preference changes at the pair level and choosing the training order deliberately can favor stability or positive transfer, and helps avoid unintended degradation of previously learned behaviors.

Key insights

The effect of sequential DPO on previously learned preferences depends on how the objectives relate to each other; it is not uniform forgetting.

Principles

Preference changes are heterogeneous, not uniform.
Objective compatibility dictates transfer or degradation.
Gradient opposition is not the primary forgetting driver.
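
The point about gradient opposition can be made concrete with a small sketch. To first order, a gradient step on a new objective changes the old objective's loss by minus the learning rate times the dot product of the two gradients, so near-orthogonal gradients imply almost no direct first-order interference. The vectors below are toy values for illustration, not measurements from the study; in practice they would be flattened gradients over the LoRA parameters.

```python
import math


def dot(g1, g2):
    return sum(a * b for a, b in zip(g1, g2))


def cosine_similarity(g1, g2):
    # Cosine of the angle between two flattened gradient vectors.
    n1 = math.sqrt(dot(g1, g1))
    n2 = math.sqrt(dot(g2, g2))
    if n1 == 0.0 or n2 == 0.0:
        return 0.0
    return dot(g1, g2) / (n1 * n2)


def first_order_change(g_old, g_new, lr):
    # L_old(theta - lr * g_new) ~= L_old(theta) - lr * <g_old, g_new>
    return -lr * dot(g_old, g_new)


LR = 5e-5  # learning rate reported in the summary

g_stage1 = [1.0, 0.0, 2.0]        # toy gradient of the earlier objective
g_orthogonal = [0.0, 3.0, 0.0]    # toy Stage 2 gradient, orthogonal
g_opposed = [-1.0, 0.0, -2.0]     # toy Stage 2 gradient, directly opposed

cos_orth = cosine_similarity(g_stage1, g_orthogonal)
cos_opp = cosine_similarity(g_stage1, g_opposed)
print(cos_orth, first_order_change(g_stage1, g_orthogonal, LR))
print(cos_opp, first_order_change(g_stage1, g_opposed, LR))
```

A cosine near zero rules out direct opposition as the first-order cause of change; it does not rule out change accumulating through higher-order effects or through many steps of adapter movement, which is why the two diagnostics are read together.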

Method

Llama-3.1-8B-Instruct is trained with LoRA adapters on two objectives in sequence. All objectives are evaluated after each stage using a fixed base-model reference and pair-level margin analysis.
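
A pair-level margin against a fixed reference can be sketched as follows. This uses the standard DPO implicit-reward margin with the beta reported in the summary; the log-probabilities are hypothetical values for a single preference pair, and the function names are illustrative rather than taken from the paper.

```python
import math

BETA = 0.3  # value reported in the summary


def dpo_margin(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=BETA):
    # Implicit reward margin: beta times the difference between the policy's
    # log-ratio to the reference on the chosen and on the rejected response.
    return beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))


def dpo_loss(margin):
    # -log(sigmoid(margin)), written to avoid overflow for large |margin|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical sequence log-probabilities for one pair.
# The reference is the base model and stays fixed across both stages.
ref_chosen, ref_rejected = -42.0, -40.0

after_stage1 = dpo_margin(-38.0, -44.0, ref_chosen, ref_rejected)
after_stage2 = dpo_margin(-40.0, -43.0, ref_chosen, ref_rejected)

print("margin after Stage 1:", after_stage1)
print("margin after Stage 2:", after_stage2)
print("change:", after_stage2 - after_stage1)
```

Keeping the reference fixed at the base model means margins from different stages sit on the same scale, so the per-pair change is directly comparable. In this toy pair the margin shrinks but stays positive, so the preference weakens without flipping.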

In practice

Analyze preference changes at the pair level via quartile decomposition.
Measure gradient cosine similarity for mechanistic insights.
Consider objective signal strength in alignment pipelines.
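
One way to carry out the quartile decomposition mentioned above is sketched below. The summary does not say which quantity defines the quartiles, so this sketch assumes pairs are ranked by their margin after Stage 1 and then reports, per quartile, the mean margin change and the share of pairs whose preference sign flipped after Stage 2. The margins are toy values, not results from the study.

```python
from statistics import mean


def quartile_decomposition(before, after):
    # Rank pairs by their earlier margin, then split into four equal bins.
    order = sorted(range(len(before)), key=lambda i: before[i])
    n = len(order)
    report = []
    for q in range(4):
        idx = order[q * n // 4:(q + 1) * n // 4]
        if not idx:
            continue
        deltas = [after[i] - before[i] for i in idx]
        # A flip means the preferred response changed (margin crossed zero).
        flips = sum(1 for i in idx if (before[i] > 0) != (after[i] > 0))
        report.append({
            "quartile": q + 1,
            "n": len(idx),
            "mean_delta": mean(deltas),
            "flip_rate": flips / len(idx),
        })
    return report


# Toy margins for eight pairs, after Stage 1 and after Stage 2.
stage1 = [-1.0, -0.5, 0.2, 0.4, 0.8, 1.0, 2.0, 3.0]
stage2 = [-0.5, 0.5, 0.1, -0.2, 0.8, 1.2, 1.0, 2.0]

report = quartile_decomposition(stage1, stage2)
overall = mean(b - a for a, b in zip(stage1, stage2))
for row in report:
    print(row)
print("overall mean delta:", overall)
```

In this toy data the overall mean change is small and negative, while the quartiles move in different directions: the weakest pairs improve, the strongest lose margin, and flips occur only in the lower half. That is the kind of pair-level redistribution an aggregate score would hide.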

Topics

Direct Preference Optimization
Sequential Alignment
Large Language Models
LoRA Adapters
Preference Forgetting
Gradient Conflict
Human Feedback

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.