MDP-GRPO: Stabilized Group Relative Policy Optimization for Multi-Constraint Instruction Following

2026-06-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, extended

Summary

MDP-GRPO stabilizes Group Relative Policy Optimization (GRPO) for large language models (LLMs) trained on multi-constraint instruction following, where rewards are discrete and show little dispersion within a group. It targets three pathologies of standard GRPO: low-variance amplification, mean-centering blindness, and zero-variance collapse. The approach combines multi-temperature sampling to increase reward dispersion, dual-anchor advantages to restore gradients in homogeneous groups, and prospect-theoretic shaping, drawn from Kahneman and Tversky's prospect theory, to bound updates and penalize constraint violations more heavily. It also applies asymmetric KL regularization. Evaluated on FollowBench, IFEval, and a custom multi-constraint dataset, MDP-GRPO improves strict constraint satisfaction by up to 5.0% on Llama-3.2-3B and converges stably with group sizes as small as G=4, while maintaining general capabilities on MMLU and ARC.

Key takeaway

ML engineers building LLMs for multi-constraint instruction following, where strict compliance is critical, should consider integrating MDP-GRPO's stabilization techniques. Standard GRPO often struggles with discrete, low-dispersion rewards, which leads to unstable training. Multi-temperature sampling, dual-anchor advantages, and prospect-theoretic shaping can improve strict constraint satisfaction and training stability, even with reduced group sizes, without degrading general model capabilities.

Key insights

MDP-GRPO stabilizes multi-constraint instruction following in LLMs by mitigating GRPO's reward-related pathologies.

Principles

Z-score normalization in GRPO fails with discrete, low-dispersion rewards because of three pathologies: low-variance amplification, mean-centering blindness, and zero-variance collapse.
Loss aversion, inspired by Prospect Theory, can stabilize policy updates by penalizing negative outcomes more severely.
Mixing exploratory and exploitative samples increases within-group reward dispersion, preventing homogeneous groups.
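
The three pathologies can be illustrated with a small, self-contained sketch of GRPO-style group normalization. The function name zscore_advantages, the eps value, and the toy reward groups are assumptions made for illustration, and the comments give one plausible reading of each pathology name rather than the paper's formal definitions.

```python
import statistics


def zscore_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps).

    Illustrative only; eps and the population std are assumptions.
    """
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Zero-variance collapse: every sample in the group gets the same reward,
# so every advantage is zero and the group contributes no gradient.
all_fail = zscore_advantages([0.0, 0.0, 0.0, 0.0])
all_pass = zscore_advantages([1.0, 1.0, 1.0, 1.0])

# Low-variance amplification: a 0.01 reward gap and a 1.0 reward gap are
# rescaled to nearly the same advantage magnitude.
near_tie = zscore_advantages([0.50, 0.50, 0.50, 0.51])
wide_gap = zscore_advantages([0.0, 0.0, 0.0, 1.0])

# Mean-centering blindness (one reading): a mostly failing group and a
# mostly passing group produce the same advantages, because only the
# offset from the group mean survives normalization.
mostly_fail = zscore_advantages([0.2, 0.2, 0.2, 0.4])
mostly_pass = zscore_advantages([0.8, 0.8, 0.8, 1.0])

if __name__ == "__main__":
    print("all fail:   ", all_fail)
    print("all pass:   ", all_pass)
    print("near tie:   ", [round(a, 3) for a in near_tie])
    print("wide gap:   ", [round(a, 3) for a in wide_gap])
    print("mostly fail:", [round(a, 3) for a in mostly_fail])
    print("mostly pass:", [round(a, 3) for a in mostly_pass])
```

Running the script shows that the all-fail and all-pass groups are indistinguishable to the optimizer, which is the situation the dual-anchor advantage is meant to repair.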

Method

MDP-GRPO uses multi-temperature sampling for diverse groups, dual-anchor advantages (group-relative + goal-aware) for signal restoration, and prospect-theoretic shaping (bounded, asymmetric tanh) for stable, loss-averse updates, combined with asymmetric KL regularization.
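
A minimal sketch of how a dual-anchor advantage and a bounded, asymmetric tanh shaping could fit together is given here. The summary does not state the paper's exact formulas, so everything in the code is an assumption: the function names, the equal mixing weight mix=0.5, the unnormalized goal-aware term, and the values lam_pos=1.0 and lam_neg=2.0. Only the goal-aware center max(μ_group, 0.5) and the loss-averse ordering of the two scales come from this summary.

```python
import math
import statistics


def dual_anchor_advantages(rewards, goal_floor=0.5, mix=0.5, eps=1e-6):
    """Blend a group-relative anchor with a goal-aware anchor.

    Hypothetical combination rule: the summary only says the two anchors
    are combined, not how. goal_floor=0.5 follows the example center
    max(mu_group, 0.5); mix is an assumed weight.
    """
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    goal_center = max(mu, goal_floor)
    advantages = []
    for r in rewards:
        group_relative = (r - mu) / (sigma + eps)  # standard z-score term
        goal_aware = r - goal_center               # distance to the goal anchor
        advantages.append(mix * group_relative + (1.0 - mix) * goal_aware)
    return advantages


def prospect_shape(advantage, lam_pos=1.0, lam_neg=2.0):
    """Bounded, asymmetric tanh shaping (assumed form).

    Gains saturate at +lam_pos and losses at -lam_neg; lam_neg > lam_pos
    makes the shaping loss-averse.
    """
    scale = lam_pos if advantage >= 0 else lam_neg
    return scale * math.tanh(advantage)


if __name__ == "__main__":
    # A homogeneous all-fail group: the z-score term is zero, but the
    # goal-aware term still yields a negative learning signal.
    raw = dual_anchor_advantages([0.0, 0.0, 0.0, 0.0])
    print("raw:   ", raw)
    print("shaped:", [round(prospect_shape(a), 4) for a in raw])
```

Under these assumptions a homogeneous all-pass group still receives zero advantage, since its mean already sits at or above the goal floor; that matches the conservative center in this sketch but may differ from the paper's behavior.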

In practice

Implement multi-temperature sampling (e.g., T=[0.1,0.4,0.7,1.0]) to increase reward diversity.
Use dual-anchor advantages with a conservative goal-aware center (e.g., max(μ_group, 0.5)).
Apply prospect-theoretic shaping with λ_- > λ_+ so that constraint violations are penalized more heavily than successes are rewarded.
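
The multi-temperature step can be sketched as a per-sample temperature schedule. The round-robin assignment, the helper names, and the toy logits are assumptions for illustration; the summary gives only the example temperature list and does not say how temperatures are distributed across a group.

```python
import math
from itertools import cycle, islice

# Example temperatures quoted in the summary.
TEMPERATURES = [0.1, 0.4, 0.7, 1.0]


def assign_temperatures(group_size, temperatures=TEMPERATURES):
    """Give each of the G rollouts in a group its own sampling temperature.

    Round-robin assignment is an assumption; with G=4 and four
    temperatures, each rollout gets a distinct value.
    """
    return list(islice(cycle(temperatures), group_size))


def token_probs(logits, temperature):
    """Softmax over logits / temperature, computed stably."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.0]  # hypothetical next-token logits
    for temp in assign_temperatures(4):
        probs = token_probs(toy_logits, temp)
        print(temp, [round(p, 3) for p in probs])
```

Low temperatures concentrate probability on the top token (exploitative rollouts), while higher ones spread it out (exploratory rollouts), so a mixed group is less likely to return identical rewards for every member.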

Topics

Reinforcement Learning with Verifiable Rewards
Group Relative Policy Optimization
Multi-constraint Instruction Following
Large Language Models
Prospect Theory
Policy Gradient Stabilization
Reward Shaping

Code references

m-salmani78/MDP-GRPO

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.