Transitivity Meets Cyclicity: Explicit Preference Decomposition for Dynamic Large Language Model Alignment

2026-05-19 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, extended

Summary

This paper introduces the Hybrid Reward-Cyclic (HRC) model and Dynamic Self-Play Preference Optimization (DSPPO) for aligning Large Language Models (LLMs) with complex human preferences. Standard Reinforcement Learning from Human Feedback (RLHF) often relies on a scalar reward, which is transitive by construction and therefore cannot represent cyclic preferences (A preferred to B, B to C, yet C to A). HRC addresses this by using a game-theoretic decomposition to split preferences explicitly into orthogonal transitive (scalar) and cyclic (vector) components. DSPPO complements HRC by treating alignment as a time-varying game and guiding the policy toward a Nash equilibrium. In experiments on synthetic data and RewardBench 2, HRC outperforms baselines such as Bradley-Terry (BT) and the General Preference Model (GPM), achieving higher accuracy (e.g., +1.23% on Gemma-2B-it) and faster convergence. Downstream evaluations on AlpacaEval 2.0, Arena-Hard-v0.1, and MT-Bench support the framework's efficacy: with Gemma-2B-it, HRC+DSPPO reaches a peak length-controlled win rate of 44.75% on AlpacaEval 2.0 and 46.8% on Arena-Hard-v0.1.

Key takeaway

For research scientists developing LLM alignment techniques, the Hybrid Reward-Cyclic (HRC) model and Dynamic Self-Play Preference Optimization (DSPPO) offer a way to improve model performance and robustness. RLHF pipelines that rely on purely transitive reward models may be missing cyclic preference dynamics. HRC captures both aspects explicitly, and DSPPO's dynamic scheduling is designed to give more stable and effective convergence to the Nash equilibrium, especially in complex, multi-turn reasoning tasks.

Key insights

Explicitly decomposing human preferences into transitive and cyclic components improves LLM alignment and robustness.

Principles

Human preferences contain both transitive and cyclic components.
Game theory can decompose complex preference dynamics.
Dynamic optimization schedules enhance alignment convergence.
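
As an illustration of the second principle, here is a minimal sketch of one standard way to split an antisymmetric matrix of pairwise preference margins into a transitive part and a cyclic part. It is not taken from the paper: the toy matrix A, the helper names decompose and frobenius_inner, and the choice of row means as scalar scores are all assumptions made for illustration.

```python
def decompose(A):
    """Split an antisymmetric margin matrix A into transitive and cyclic parts."""
    n = len(A)
    # Scalar score per item: its average margin against all items.
    r = [sum(row) / n for row in A]
    # Transitive part: everything explained by differences of scalar scores.
    T = [[r[i] - r[j] for j in range(n)] for i in range(n)]
    # Cyclic part: the residual that no scalar score can explain.
    C = [[A[i][j] - T[i][j] for j in range(n)] for i in range(n)]
    return r, T, C


def frobenius_inner(X, Y):
    """Entrywise inner product of two matrices given as nested lists."""
    return sum(x * y for row_x, row_y in zip(X, Y) for x, y in zip(row_x, row_y))


# Hypothetical toy margins: item 0 beats everything, while items 1, 2, 3
# form a cycle (1 beats 2, 2 beats 3, 3 beats 1).
A = [
    [0.0, 1.0, 1.0, 1.0],
    [-1.0, 0.0, 1.0, -1.0],
    [-1.0, -1.0, 0.0, 1.0],
    [-1.0, 1.0, -1.0, 0.0],
]

r, T, C = decompose(A)
print("scalar scores:", r)
print("inner product of T and C:", frobenius_inner(T, C))
```

In this toy case the scalar scores rank item 0 first and tie the rest, the cycle among items 1, 2, 3 lands entirely in C, and the two parts are orthogonal under the entrywise inner product.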

Method

The HRC model explicitly separates preferences into scalar transitive and vector cyclic components. DSPPO then optimizes the policy through a time-varying game, dynamically adjusting the influence of these components to guide convergence.
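
The separation described above can be sketched as a pairwise score with a scalar difference plus a skew-symmetric term on vectors. This is a hedged illustration, not the paper's parameterization: the functional form, the 2x2 block structure of the skew-symmetric operator, the weight lam, and all function names are assumptions.

```python
import math


def cyclic_term(v_a, v_b):
    """Skew-symmetric bilinear form built from 2-D blocks (assumed form)."""
    assert len(v_a) == len(v_b) and len(v_a) % 2 == 0
    total = 0.0
    for k in range(0, len(v_a), 2):
        # Each block is a 2-D cross product, so swapping a and b flips the sign.
        total += v_a[k] * v_b[k + 1] - v_a[k + 1] * v_b[k]
    return total


def hybrid_score(r_a, v_a, r_b, v_b, lam=1.0):
    """Transitive scalar difference plus lam times the cyclic term."""
    return (r_a - r_b) + lam * cyclic_term(v_a, v_b)


def preference_prob(r_a, v_a, r_b, v_b, lam=1.0):
    """Probability that a is preferred to b, via a logistic link."""
    return 1.0 / (1.0 + math.exp(-hybrid_score(r_a, v_a, r_b, v_b, lam)))


# Hypothetical rock-scissors-paper items: equal scalar rewards, with cyclic
# vectors spaced 120 degrees apart on the unit circle.
angles = [2.0 * math.pi * k / 3.0 for k in range(3)]
items = [(0.0, [math.cos(a), math.sin(a)]) for a in angles]


def p(i, j, lam=1.0):
    (r_i, v_i), (r_j, v_j) = items[i], items[j]
    return preference_prob(r_i, v_i, r_j, v_j, lam)


print(p(0, 1), p(1, 2), p(2, 0))  # each pair in the cycle is preferred
print(p(0, 1, lam=0.0))           # scalar rewards alone cannot separate them
```

With lam set to 0 the score reduces to a Bradley-Terry style difference of scalar rewards, which is why the three tied items become indistinguishable.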

In practice

Use HRC for robust preference modeling in LLM alignment.
Apply DSPPO for dynamic, curriculum-based policy optimization.
Consider $\lambda=1.0$ for balancing transitive and cyclic signals.
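
To make the idea of a time-varying game concrete, the toy below runs self-play with multiplicative-weights updates on a small game whose cyclic weight changes over time. It is not the DSPPO algorithm: the matrices T and C, the linear ramp in lam_schedule, the step size eta, and the update rule are all hypothetical choices made for illustration.

```python
import math

# Hypothetical toy game over 4 candidate responses.
# T: transitive margins (response 0 beats all others).
T = [
    [0.0, 1.0, 1.0, 1.0],
    [-1.0, 0.0, 0.0, 0.0],
    [-1.0, 0.0, 0.0, 0.0],
    [-1.0, 0.0, 0.0, 0.0],
]
# C: cyclic margins (1 beats 2, 2 beats 3, 3 beats 1).
C = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, -1.0],
    [0.0, -1.0, 0.0, 1.0],
    [0.0, 1.0, -1.0, 0.0],
]


def lam_schedule(t, steps, lam_max=1.0):
    """Assumed schedule: ramp the cyclic weight linearly from 0 to lam_max."""
    return lam_max * min(1.0, t / (0.5 * steps))


def self_play(steps=500, eta=0.1):
    """Multiplicative-weights self-play on the game T + lam_t * C."""
    n = len(T)
    policy = [1.0 / n] * n
    for t in range(steps):
        lam = lam_schedule(t, steps)
        # Expected margin of each response against the current policy.
        payoff = [
            sum((T[i][j] + lam * C[i][j]) * policy[j] for j in range(n))
            for i in range(n)
        ]
        weights = [policy[i] * math.exp(eta * payoff[i]) for i in range(n)]
        total = sum(weights)
        policy = [w / total for w in weights]
    return policy


policy = self_play()
print([round(x, 4) for x in policy])
```

In this toy game the Nash equilibrium is the pure policy on response 0, since it beats every alternative, and the self-play iterates concentrate on it. Setting lam_max=1.0 mirrors the $\lambda=1.0$ suggestion above only loosely, because the role of $\lambda$ here is assumed.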

Topics

LLM Alignment
Human Preferences
Hybrid Reward-Cyclic Model
Dynamic Self-Play Preference Optimization
Game-Theoretic Decomposition

Code references

lab-klc/Hybrid-Reward-Cyclic

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.