MDP-GRPO: Stabilized Group Relative Policy Optimization for Multi-Constraint Instruction Following
Summary
MDP-GRPO stabilizes Group Relative Policy Optimization (GRPO) for large language models (LLMs) trained on multi-constraint instruction following, where rewards are discrete and low-dispersion. It targets three pathologies of standard GRPO: low-variance amplification, mean-centering blindness, and zero-variance collapse. The approach combines multi-temperature sampling to increase reward dispersion, dual-anchor advantages to restore gradients in homogeneous groups, and prospect-theoretic shaping, based on Kahneman and Tversky's Prospect Theory, to bound updates and penalize constraint violations. It also applies asymmetric KL regularization. Evaluated on FollowBench, IFEval, and a custom multi-constraint dataset, MDP-GRPO improves strict constraint satisfaction by up to 5.0% on Llama-3.2-3B and converges stably with group sizes as small as G=4, while maintaining general capabilities on MMLU and ARC.
Key takeaway
ML engineers developing LLMs for multi-constraint instruction following, where strict compliance is critical, should consider integrating MDP-GRPO's stabilization techniques. Standard GRPO often struggles with discrete, low-dispersion rewards, which leads to unstable training. Multi-temperature sampling, dual-anchor advantages, and prospect-theoretic shaping can improve strict constraint satisfaction and training stability, even with reduced group sizes, without degrading general model capabilities.
Key insights
MDP-GRPO stabilizes multi-constraint instruction following in LLMs by mitigating GRPO's reward-related pathologies.
Principles
- Z-score normalization in GRPO fails with discrete, low-dispersion rewards because of three pathologies: low-variance amplification, mean-centering blindness, and zero-variance collapse.
- Loss aversion, inspired by Prospect Theory, can stabilize policy updates by penalizing negative outcomes more severely.
- Mixing exploratory and exploitative samples increases within-group reward dispersion, preventing homogeneous groups.
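The first principle can be illustrated with a minimal sketch of standard within-group z-score normalization. The function name, the `eps` value, and the reward groups are hypothetical, chosen only for illustration. "Mean-centering blindness" is read here as insensitivity to the absolute reward level, which is an assumption about the term rather than the paper's definition.

```python
import numpy as np

def grpo_zscore_advantages(rewards, eps=1e-6):
    """Z-score the rewards within one group, as in standard GRPO.

    eps is a hypothetical stabilizer value, not taken from the paper.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Low-variance amplification: a 0.01 reward gap is divided by a tiny
# standard deviation and becomes an advantage larger than 1.
near_tie = grpo_zscore_advantages([0.50, 0.50, 0.50, 0.51])

# Zero-variance collapse: identical rewards give all-zero advantages,
# so the group contributes no gradient.
all_fail = grpo_zscore_advantages([0.0, 0.0, 0.0, 0.0])

# Assumed reading of mean-centering blindness: a group where every sample
# passes is indistinguishable from a group where every sample fails.
all_pass = grpo_zscore_advantages([1.0, 1.0, 1.0, 1.0])

print(near_tie, all_fail, all_pass)
```

With small groups such as G=4 and binary or few-valued rewards, homogeneous groups like the last two are likely, which is the situation the third principle tries to avoid by mixing sampling temperatures.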
Method
MDP-GRPO combines four components: multi-temperature sampling to diversify each group, dual-anchor advantages (group-relative plus goal-aware) to restore the learning signal, prospect-theoretic shaping (a bounded, asymmetric tanh) for stable, loss-averse updates, and asymmetric KL regularization.
In practice
- Implement multi-temperature sampling (e.g., T=[0.1,0.4,0.7,1.0]) to increase reward diversity.
- Use dual-anchor advantages with a conservative goal-aware center (e.g., max(μ_group, 0.5)).
- Apply prospect-theoretic shaping with λ_- > λ_+ so that constraint violations are penalized more heavily than gains are rewarded.
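The three practices above can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's implementation: the function names, the round-robin temperature assignment, the mixing weight `alpha`, and the values `lam_pos=1.0` and `lam_neg=2.0` are all hypothetical. Only the temperature list, the `max(μ_group, 0.5)` center, the tanh form, and the ordering λ_- > λ_+ come from the text.

```python
import numpy as np

TEMPERATURES = [0.1, 0.4, 0.7, 1.0]  # example values from the text

def assign_temperatures(group_size, temperatures=TEMPERATURES):
    """Cycle through temperatures so each group mixes exploitative and
    exploratory samples (round-robin assignment is an assumption)."""
    return [temperatures[i % len(temperatures)] for i in range(group_size)]

def dual_anchor_advantages(rewards, goal_floor=0.5, alpha=0.5):
    """Blend a group-relative anchor with a goal-aware anchor.

    The goal-aware center max(mean, goal_floor) follows the text; the
    linear blend with weight alpha is a hypothetical choice.
    """
    r = np.asarray(rewards, dtype=float)
    mu = r.mean()
    group_relative = r - mu
    goal_aware = r - max(mu, goal_floor)
    return alpha * group_relative + (1.0 - alpha) * goal_aware

def prospect_shape(adv, lam_pos=1.0, lam_neg=2.0):
    """Bounded, asymmetric tanh: losses are scaled by lam_neg > lam_pos."""
    a = np.asarray(adv, dtype=float)
    return np.where(a >= 0, lam_pos * np.tanh(a), lam_neg * np.tanh(a))

# A homogeneous all-fail group: z-scoring would yield zeros, but the
# goal-aware anchor keeps a negative signal.
temps = assign_temperatures(4)
shaped = prospect_shape(dual_anchor_advantages([0.0, 0.0, 0.0, 0.0]))
print(temps, shaped)
```

In this sketch the shaped advantage is always bounded in magnitude by `lam_neg`, and a loss of a given size produces a larger update than a gain of the same size. A homogeneous all-pass group still yields zero advantage here, because the goal-aware center rises with the group mean.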
Topics
- Reinforcement Learning with Verifiable Rewards
- Group Relative Policy Optimization
- Multi-constraint Instruction Following
- Large Language Models
- Prospect Theory
- Policy Gradient Stabilization
- Reward Shaping
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.