MDP-GRPO: Stabilized Group Relative Policy Optimization for Multi-Constraint Instruction Following
Summary
MDP-GRPO is a reinforcement learning method designed to stabilize Group Relative Policy Optimization (GRPO) on multi-constraint instruction-following tasks with discrete, low-dispersion rewards. It addresses three pathologies of z-score group normalization in this regime: low-variance amplification, mean-centering blindness, and zero-variance collapse. The method has four components: multi-temperature sampling to increase reward dispersion, dual-anchor advantages to restore gradients in homogeneous groups, prospect-theoretic shaping (based on Kahneman and Tversky's prospect theory) to bound updates and penalize violations, and asymmetric KL regularization. Evaluated on FollowBench, IFEval, and a curated multi-constraint dataset, MDP-GRPO outperformed standard GRPO, raising strict constraint satisfaction by up to 5.0% on Llama-3.2-3B. It also converged stably with smaller group sizes while maintaining general capabilities on the MMLU and ARC benchmarks.
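To make the three pathologies concrete, here is a minimal sketch of standard z-score group normalization applied to toy reward groups. The function name, the `eps` value, and the reward groups are illustrative assumptions, and the comments give one plausible reading of each pathology name rather than the paper's formal definitions.

```python
from statistics import mean, pstdev

def zscore_advantages(rewards, eps=1e-6):
    """GRPO-style group normalization: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Zero-variance collapse: identical rewards yield all-zero advantages,
# so the group contributes no gradient signal.
all_fail = zscore_advantages([0.0, 0.0, 0.0, 0.0])
all_pass = zscore_advantages([1.0, 1.0, 1.0, 1.0])

# Mean-centering blindness: after centering, a group that fails every
# constraint is indistinguishable from one that satisfies every constraint.
same_signal = all_fail == all_pass

# Low-variance amplification: a 0.01 reward gap is scaled to nearly the
# same advantage magnitude as a full 1.0 gap.
tiny_gap = zscore_advantages([0.50, 0.50, 0.50, 0.51])
large_gap = zscore_advantages([0.0, 0.0, 0.0, 1.0])

print(all_fail, all_pass, same_signal)
print(tiny_gap[-1], large_gap[-1])
```

With discrete constraint-satisfaction rewards, groups like these are common, which is why the normalization step becomes the point of failure.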
Key takeaway
For machine learning engineers building instruction-following models that must satisfy several strict constraints at once: if standard GRPO is unstable or yields poor satisfaction rates, consider adopting MDP-GRPO's techniques. On Llama-3.2-3B, the reported gains are up to 5.0% in strict constraint satisfaction, with stable convergence at smaller group sizes and general capabilities preserved. The approach targets settings where rewards are discrete and low in dispersion.
Key insights
MDP-GRPO stabilizes GRPO for multi-constraint instruction following by targeting three pathologies of z-score group normalization: low-variance amplification, mean-centering blindness, and zero-variance collapse.
Principles
- Z-score normalization fails with low-dispersion rewards.
- Increase reward dispersion for stable RL optimization.
- Prospect theory can bound updates and penalize violations.
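As background for the last principle, the sketch below implements the classic Tversky and Kahneman (1992) value function with their median parameter estimates (alpha = beta = 0.88, lambda = 2.25). This summary does not state how MDP-GRPO applies the function to advantages or how it bounds updates, so treat this as an illustration of loss aversion and diminishing sensitivity only, not as the paper's shaping rule.

```python
def prospect_value(x, alpha=0.88, beta=0.88, lam=2.25):
    """Tversky-Kahneman (1992) value function.

    Gains are compressed (x ** alpha); losses are compressed and then
    weighted by the loss-aversion coefficient lam.
    """
    if x >= 0:
        return x ** alpha
    return -lam * ((-x) ** beta)

gain = prospect_value(1.0)        # unit gain
loss = prospect_value(-1.0)       # unit loss, weighted 2.25x
large_gain = prospect_value(4.0)  # compressed: less than 4.0

print(gain, loss, large_gain)
```

The asymmetry is the relevant property here: a violation of a given size is penalized more heavily than an equally sized improvement is rewarded.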
Method
MDP-GRPO stabilizes GRPO for multi-constraint instruction following through four components: multi-temperature sampling, dual-anchor advantages, prospect-theoretic shaping based on Kahneman and Tversky's prospect theory, and asymmetric KL regularization.
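This summary does not define the two anchors in "dual-anchor advantages". The following is a purely hypothetical reading, in which a group-relative term is blended with a term measured against a fixed absolute reference, so that homogeneous groups still receive a nonzero signal. The function name and the `absolute_anchor` and `mix` parameters are assumptions made for illustration and are not taken from the paper.

```python
from statistics import mean, pstdev

def dual_anchor_advantages(rewards, absolute_anchor=0.5, mix=0.5, eps=1e-6):
    """HYPOTHETICAL sketch: blend a group-relative and an absolute-anchor term."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Anchor 1 (assumed): the group mean, as in standard z-score normalization.
    relative = [(r - mu) / (sigma + eps) for r in rewards]
    # Anchor 2 (assumed): a fixed reference reward, independent of the group.
    absolute = [r - absolute_anchor for r in rewards]
    return [(1 - mix) * rel + mix * ab for rel, ab in zip(relative, absolute)]

# Homogeneous groups no longer collapse to zero under this reading.
failing = dual_anchor_advantages([0.0, 0.0, 0.0, 0.0])
passing = dual_anchor_advantages([1.0, 1.0, 1.0, 1.0])

print(failing, passing)
```

Under this assumption, an all-failing group is pushed down and an all-passing group is reinforced, whereas plain z-score normalization would give both groups zero advantage.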
In practice
- Apply multi-temperature sampling for reward dispersion.
- Use dual-anchor advantages in homogeneous reward groups.
- Implement prospect-theoretic shaping for constraint penalties.
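The first item can be illustrated with a toy sketch: each completion in a group is drawn at a different sampling temperature, so the group mixes conservative and exploratory samples. The temperature values, the toy logits, and the helper names are assumptions; in a real pipeline the temperatures would be passed to the model's generation call rather than to a hand-written sampler.

```python
import math
import random

def sample_with_temperature(logits, temperature, rng):
    """Sample one index from softmax(logits / temperature)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def multi_temperature_group(logits, temperatures, rng):
    """Draw one sample per temperature; returns (temperature, index) pairs."""
    return [(t, sample_with_temperature(logits, t, rng)) for t in temperatures]

rng = random.Random(0)                 # seeded for reproducibility
toy_logits = [2.0, 1.0, 0.5, 0.1]      # stand-in for a model's output logits
temperatures = [0.6, 0.8, 1.0, 1.2]    # illustrative values only
group = multi_temperature_group(toy_logits, temperatures, rng)

print(group)
```

Lower temperatures concentrate probability on the top-scoring option and higher ones flatten the distribution, which is the mechanism by which a mixed-temperature group can produce more varied rewards than a single-temperature group.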
Topics
- Reinforcement Learning
- Instruction Following
- Group Relative Policy Optimization
- Multi-Constraint Optimization
- Llama-3.2-3B
- Policy Optimization
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.