MDP-GRPO: Stabilized Group Relative Policy Optimization for Multi-Constraint Instruction Following

Source: cs.AI updates on arXiv.org · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, extended

Summary

MDP-GRPO is a method for stabilizing Group Relative Policy Optimization (GRPO) when training large language models (LLMs) on multi-constraint instruction following, where rewards are discrete and low-dispersion. It targets three pathologies of standard GRPO: low-variance amplification, mean-centering blindness, and zero-variance collapse. The approach combines multi-temperature sampling to increase reward dispersion, dual-anchor advantages to restore gradients in homogeneous groups, and prospect-theoretic shaping, which draws on Kahneman and Tversky's prospect theory to bound updates and penalize constraint violations. It also applies asymmetric KL regularization. Evaluated on FollowBench, IFEval, and a custom multi-constraint dataset, MDP-GRPO improves strict constraint satisfaction by up to 5.0% on Llama-3.2-3B and converges stably with small group sizes such as G=4, while maintaining general capabilities on MMLU and ARC.
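To make the three pathologies concrete, the sketch below applies the usual group-relative normalization, (r - mean) / (std + eps), to a few hand-picked reward groups. The function name `grpo_advantages`, the `eps` value, and the reading of each pathology in the comments are assumptions for illustration; the summary names the pathologies but does not define them formally.

```python
import statistics


def grpo_advantages(rewards, eps=1e-6):
    """Standard group-relative advantage: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Zero-variance collapse (assumed reading): every sample in the group
# earns the same discrete reward, so all advantages are exactly zero
# and the group contributes no gradient.
all_same = grpo_advantages([0.5, 0.5, 0.5, 0.5])

# Mean-centering blindness (assumed reading): a group that violates
# every constraint and a group that satisfies every constraint look
# identical after centering, so the absolute reward level is lost.
all_fail = grpo_advantages([0.0, 0.0, 0.0, 0.0])
all_pass = grpo_advantages([1.0, 1.0, 1.0, 1.0])

# Low-variance amplification (assumed reading): a 0.01 reward gap is
# rescaled to roughly the same advantage as a 1.0 reward gap.
near_tie = grpo_advantages([0.50, 0.50, 0.50, 0.51])
wide_gap = grpo_advantages([0.0, 0.0, 0.0, 1.0])

print(all_same)
print(all_fail == all_pass)
print(round(near_tie[-1], 2), round(wide_gap[-1], 2))  # both close to 1.73
```

With G=4 and rewards that take only a handful of discrete values, groups like these are common, which is why small groups are the hard case for standard GRPO.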

Key takeaway

ML engineers building LLMs for multi-constraint instruction following, where strict compliance is critical, should consider MDP-GRPO's stabilization techniques. Standard GRPO often struggles with discrete, low-dispersion rewards, which destabilizes training. According to the reported results, multi-temperature sampling, dual-anchor advantages, and prospect-theoretic shaping improve strict constraint satisfaction and training stability, even at reduced group sizes, without degrading general model capabilities.

Key insights

MDP-GRPO stabilizes multi-constraint instruction following in LLMs by mitigating GRPO's reward-related pathologies.

Principles

Method

MDP-GRPO uses multi-temperature sampling for diverse groups, dual-anchor advantages (group-relative + goal-aware) for signal restoration, and prospect-theoretic shaping (bounded, asymmetric tanh) for stable, loss-averse updates, combined with asymmetric KL regularization.
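A minimal sketch of how these pieces could fit together is shown below. It is not the paper's implementation: the function names, the linear temperature schedule, the `goal`, `mix`, `scale`, and `loss_aversion` parameters, and the exact way the two anchors are combined are all hypothetical choices made for illustration. Asymmetric KL regularization is left out because the summary does not say how it is defined.

```python
import math
import statistics


def temperature_schedule(group_size, t_min=0.6, t_max=1.2):
    """Hypothetical schedule: spread sampling temperatures evenly over a group."""
    if group_size == 1:
        return [t_min]
    step = (t_max - t_min) / (group_size - 1)
    return [t_min + i * step for i in range(group_size)]


def dual_anchor_advantages(rewards, goal=1.0, mix=0.5, eps=1e-6):
    """Blend a group-relative anchor with a goal-aware anchor (assumed form).

    The goal-aware term (r - goal) stays non-zero when every sample in
    the group has the same reward, unless the group already meets the goal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    relative = [(r - mean) / (std + eps) for r in rewards]
    goal_aware = [r - goal for r in rewards]
    return [(1 - mix) * a + mix * g for a, g in zip(relative, goal_aware)]


def prospect_shape(advantage, loss_aversion=2.25, scale=1.0):
    """Bounded, asymmetric tanh: losses are weighted more than gains.

    Output lies in [-loss_aversion, 1]. The default 2.25 is an
    illustrative value, not one taken from the paper.
    """
    shaped = math.tanh(advantage / scale)
    return shaped if advantage >= 0 else loss_aversion * shaped


# A homogeneous group where every sample satisfies 1 of 4 constraints.
temps = temperature_schedule(4)
raw = dual_anchor_advantages([0.25, 0.25, 0.25, 0.25])
shaped = [prospect_shape(a) for a in raw]
print(temps)
print(raw)     # non-zero despite zero reward variance
print(shaped)  # negative and bounded below by -loss_aversion
```

In this sketch the homogeneous failing group still yields a negative learning signal, while the tanh keeps any single update bounded no matter how small the group's standard deviation is.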

In practice

Topics

Code references

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer


Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.