MDP-GRPO: Stabilized Group Relative Policy Optimization for Multi-Constraint Instruction Following

2026-06-04 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

MDP-GRPO is a reinforcement learning method that stabilizes Group Relative Policy Optimization (GRPO) on multi-constraint instruction-following tasks, where rewards are discrete and show little dispersion within a sampled group. It targets three pathologies of z-score group normalization in this regime: low-variance amplification, mean-centering blindness, and zero-variance collapse. The method has four components: multi-temperature sampling to increase reward dispersion, dual-anchor advantages to restore gradients in homogeneous groups, prospect-theoretic shaping (after Kahneman and Tversky) to bound updates and penalize violations, and asymmetric KL regularization. Evaluated on FollowBench, IFEval, and a curated multi-constraint dataset, MDP-GRPO outperformed standard GRPO, raising strict constraint satisfaction by up to 5.0% on Llama-3.2-3B. It also converged stably with smaller group sizes while maintaining general capabilities on the MMLU and ARC benchmarks.

Key takeaway

If you are training instruction-following models under strict multi-constraint requirements and see instability or poor satisfaction rates with standard GRPO, consider adopting MDP-GRPO's techniques. The reported gains reach up to 5.0% in strict constraint satisfaction on Llama-3.2-3B, with stable convergence at smaller group sizes and general capabilities preserved. The approach is aimed at settings where rewards are discrete and vary little within a group.

Key insights

MDP-GRPO stabilizes GRPO for multi-constraint instruction following by addressing z-score normalization pathologies with specific algorithmic enhancements.

Principles

Z-score normalization fails with low-dispersion rewards.
Increase reward dispersion for stable RL optimization.
Prospect theory can bound updates and penalize violations.
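
The three normalization pathologies named in the summary can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function name `zscore_advantages`, the `eps` value, and the example reward groups are assumptions chosen for the demonstration, not values from the paper.

```python
import statistics

def zscore_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps).

    This is the usual z-score form; eps is an assumed stabilizer.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Low-variance amplification: a single small reward difference in an
# otherwise uniform group is scaled up to a large advantage.
near_uniform = [0.75] * 7 + [1.0]
amplified = zscore_advantages(near_uniform)

# Mean-centering blindness: a group that mostly fails and a group that
# mostly succeeds yield the same advantages once the mean is subtracted.
mostly_bad = zscore_advantages([0.0, 0.0, 0.25, 0.25])
mostly_good = zscore_advantages([0.75, 0.75, 1.0, 1.0])

# Zero-variance collapse: identical rewards give all-zero advantages,
# so the group contributes no gradient.
collapsed = zscore_advantages([0.5, 0.5, 0.5, 0.5])

if __name__ == "__main__":
    print("amplified:", [round(a, 3) for a in amplified])
    print("mostly_bad:", [round(a, 3) for a in mostly_bad])
    print("mostly_good:", [round(a, 3) for a in mostly_good])
    print("collapsed:", collapsed)
```

With one outlier in a group of eight, the outlier's z-score approaches the square root of seven regardless of how small the raw reward gap is, which is why discrete, low-dispersion rewards are a poor fit for this normalization.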

Method

MDP-GRPO stabilizes GRPO via multi-temperature sampling, dual-anchor advantages, prospect-theoretic shaping based on Kahneman and Tversky's theory, and asymmetric KL regularization to manage multi-constraint instruction following.
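
The summary does not state the exact shaping function used in the paper. As a point of reference, the sketch below implements the value function from Tversky and Kahneman's cumulative prospect theory, with their published median parameter estimates (0.88 for curvature, 2.25 for loss aversion), and applies it to a raw advantage. The wrapper `shaped_advantage` and its `clip` bound are hypothetical, added only to illustrate how losses can be weighted more heavily than gains while the update stays bounded.

```python
def prospect_value(x, alpha=0.88, beta=0.88, lam=2.25):
    """Tversky-Kahneman value function.

    Gains are concave (x ** alpha); losses are scaled by the
    loss-aversion coefficient lam, so a loss outweighs an equal gain.
    """
    if x >= 0:
        return x ** alpha
    return -lam * ((-x) ** beta)

def shaped_advantage(raw_advantage, clip=3.0):
    """Hypothetical shaping step (an assumption, not the paper's formula).

    Passes the raw advantage through the value function, then clamps
    the result to [-clip, clip] so a single sample cannot dominate.
    """
    value = prospect_value(raw_advantage)
    return max(-clip, min(clip, value))

if __name__ == "__main__":
    for a in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(a, round(shaped_advantage(a), 4))
```

Under this reading, a constraint violation that produces a negative advantage is penalized about 2.25 times as strongly as a gain of the same size is rewarded.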

In practice

Apply multi-temperature sampling for reward dispersion.
Use dual-anchor advantages in homogeneous reward groups.
Implement prospect-theoretic shaping for constraint penalties.
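
One way to picture multi-temperature sampling is to give each rollout in a group its own sampling temperature, so the group covers both conservative and exploratory completions. The sketch below is a minimal illustration under that assumption. The temperature values, the round-robin assignment, and the helper names are hypothetical and are not taken from the paper.

```python
import math
import random

def assign_temperatures(group_size, temperatures=(0.7, 1.0, 1.3)):
    """Cycle through an assumed temperature set, one value per rollout."""
    return [temperatures[i % len(temperatures)] for i in range(group_size)]

def softmax_with_temperature(logits, temperature):
    """Numerically stable softmax over logits divided by temperature."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature, rng):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)
    logits = [2.0, 1.0, 0.5, 0.1]
    for temp in assign_temperatures(6):
        probs = softmax_with_temperature(logits, temp)
        token = sample_token(logits, temp, rng)
        print(temp, [round(p, 3) for p in probs], token)
```

Lower temperatures concentrate probability on the top token and higher ones flatten the distribution, so mixing them within a group tends to produce more varied completions and, in turn, more varied rewards.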

Topics

Reinforcement Learning
Instruction Following
Group Relative Policy Optimization
Multi-Constraint Optimization
Llama-3.2-3B
Policy Optimization

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer


Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.