Critic-Free, Not Bias-Free: Correcting Advantage Bias in RL from Verifier Feedback
Summary
A new paper identifies a fundamental bias in group-based reinforcement learning from verifier feedback, the setting of methods like GRPO, GSPO, and DAPO that are commonly used to train reasoning-oriented LLMs. When the estimate is conditioned on "non-degenerate groups" (groups with at least one success and one failure), the group-relative advantage estimator systematically underestimates advantages on prompts where the model is weak (success probability < 0.5) and overestimates them on easy prompts (success probability > 0.5). The bias is significant for group sizes up to 8 and leads to over-exploitation of easy tasks and under-training on challenging ones. To mitigate it, the authors propose History-Aware Adaptive Difficulty Weighting (HA-DW), a plug-in reweighting scheme that adjusts advantages based on a prompt's empirical success rate relative to a running "difficulty anchor." HA-DW consistently improves accuracy by several points on benchmarks like MATH500 and Minerva for Qwen3-4B, Qwen3-8B, and Llama 3.2 3B Instruct, and it also improves sample efficiency.
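The direction of the bias can be checked with a short exact calculation. The sketch below is not taken from the paper: it assumes binary rewards drawn independently with success probability p, a group-mean baseline k/G, and an unnormalized advantage (reward minus baseline). The function names are illustrative. Under those assumptions, excluding the all-success and all-failure groups pulls the expected baseline toward 0.5, so the advantage is too small when p < 0.5 and too large when p > 0.5.

```python
from math import comb


def conditional_baseline(p: float, G: int) -> float:
    """Expected group-mean baseline E[k/G | 0 < k < G] for k ~ Binomial(G, p)."""
    if G < 2:
        raise ValueError("a non-degenerate group needs G >= 2")
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    # Binomial probability of seeing k successes among G sampled responses.
    pmf = [comb(G, k) * p**k * (1 - p) ** (G - k) for k in range(G + 1)]
    # Probability that the group is non-degenerate (neither 0 nor G successes).
    mass = sum(pmf[1:G])
    return sum((k / G) * pmf[k] for k in range(1, G)) / mass


def advantage_bias(p: float, G: int) -> float:
    """Expected (estimated - true) advantage; it is the same for r = 0 and r = 1.

    Estimated advantage is r - k/G, true advantage is r - p, so the gap is
    p - E[k/G | non-degenerate]. Negative values mean underestimation.
    """
    return p - conditional_baseline(p, G)


if __name__ == "__main__":
    for G in (2, 4, 8, 16, 64):
        hard = advantage_bias(0.2, G)
        easy = advantage_bias(0.8, G)
        print(f"G={G:>2}  bias at p=0.2: {hard:+.4f}  bias at p=0.8: {easy:+.4f}")
```

Running the script prints the bias for a hard prompt (p = 0.2) and an easy prompt (p = 0.8) at several group sizes; the gap shrinks as G grows, in line with the summary's remark that the effect matters most for small groups.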
Key takeaway
AI engineers who train reasoning-oriented LLMs with group-based RL from verifier feedback should be aware that the advantage estimate is biased in a way that impedes learning on challenging prompts. History-Aware Adaptive Difficulty Weighting (HA-DW) can provably reduce this bias, leading to more effective training and improved performance on benchmarks like MATH500 and Minerva. Consider integrating HA-DW into existing GRPO, GSPO, or DAPO pipelines to improve accuracy and sample efficiency.
Key insights
Group-based RL methods for LLMs that learn from verifier feedback exhibit an advantage bias that hinders learning on difficult prompts.
Principles
- Group-relative advantage estimators are biased once degenerate groups (all successes or all failures) are excluded.
- The bias leads to over-exploitation of easy prompts and under-training on hard ones.
Method
History-Aware Adaptive Difficulty Weighting (HA-DW) reweights advantages based on prompt difficulty relative to a running anchor, amplifying hard prompts and damping easy ones.
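The summary does not give the HA-DW formula, so the following is only a hypothetical sketch of the general idea: track a running difficulty anchor and scale each group's advantages up when the prompt's empirical success rate is below the anchor and down when it is above. The exponential weight, the EMA update, and every name and default here (`DifficultyAnchor`, `momentum`, `strength`) are assumptions made for illustration, not the authors' method.

```python
import math
from dataclasses import dataclass


@dataclass
class DifficultyAnchor:
    """Running estimate of the typical success rate (assumed to be an EMA)."""

    value: float = 0.5      # assumed starting point
    momentum: float = 0.9   # assumed smoothing factor

    def update(self, batch_success_rate: float) -> float:
        # Exponential moving average over the batch-level success rates seen so far.
        self.value = self.momentum * self.value + (1 - self.momentum) * batch_success_rate
        return self.value


def reweighted_advantages(rewards, anchor: float, strength: float = 1.0):
    """Group-relative advantages scaled by a difficulty weight (hypothetical form).

    rewards: binary verifier rewards for one prompt's sampled group.
    anchor:  current difficulty anchor.
    The weight exp(strength * (anchor - p_hat)) exceeds 1 for prompts harder
    than the anchor and falls below 1 for easier ones.
    """
    p_hat = sum(rewards) / len(rewards)            # empirical success rate
    weight = math.exp(strength * (anchor - p_hat))
    return [weight * (r - p_hat) for r in rewards]


if __name__ == "__main__":
    anchor = DifficultyAnchor()
    groups = {"hard prompt": [1, 0, 0, 0], "easy prompt": [1, 1, 1, 0]}
    for name, rewards in groups.items():
        print(name, [round(a, 3) for a in reweighted_advantages(rewards, anchor.value)])
    # After the batch, move the anchor toward the observed mean success rate.
    batch_rate = sum(sum(r) for r in groups.values()) / sum(len(r) for r in groups.values())
    anchor.update(batch_rate)
```

Because the weight only rescales the advantages that a group-based method already computes, a scheme of this shape can sit in front of an existing policy-gradient loss without changing the loss itself, which matches the summary's description of HA-DW as a plug-in.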
In practice
- Apply HA-DW to existing GRPO/GSPO/DAPO losses.
- Improve sample efficiency when training with verifier feedback.
Topics
- RLHF Bias Correction
- Rotary Positional Embeddings
- Machine Translation
- Attention Mechanisms
- LLM Fine-tuning
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by The Salt - Curated AI.