Critic-Free, Not Bias-Free: Correcting Advantage Bias in RL from Verifier Feedback

Source: The Salt - Curated AI · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Emerging Technologies & Innovation · Depth: Advanced, medium

Summary

A new paper identifies a fundamental bias in group-based reinforcement learning from verifier feedback methods such as GRPO, GSPO, and DAPO, which are commonly used to train reasoning-oriented LLMs. These methods are critic-free: they estimate each response's advantage relative to the other responses sampled for the same prompt. When conditioning on "non-degenerate" groups (those containing at least one success and one failure), the group-relative advantage estimator systematically underestimates advantages for hard prompts, where the model's success probability is below 0.5, and overestimates them for easy prompts, where it is above 0.5. The bias is significant for group sizes up to 8 and leads to over-exploitation of easy tasks and under-training on challenging ones. To mitigate it, the authors propose History-Aware Adaptive Difficulty Weighting (HA-DW), a plug-in reweighting scheme that adjusts advantages according to a prompt's empirical success rate relative to a running "difficulty anchor." HA-DW consistently improves accuracy by several points on benchmarks such as MATH500 and Minerva for Qwen3-4B, Qwen3-8B, and Llama 3.2 3B Instruct, and also yields sample-efficiency gains.
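The direction of the bias can be checked with a small calculation. The sketch below is an illustration under simplifying assumptions, not the paper's derivation: rewards are binary with success probability `p`, the baseline is the plain group mean (no standard-deviation normalization), and the function names are made up for this example. It computes the expected group-mean baseline given that the group is non-degenerate and compares it with `p`.

```python
from math import comb


def conditional_baseline(p: float, group_size: int) -> float:
    """E[group mean reward | at least one success and one failure].

    Assumes i.i.d. Bernoulli(p) rewards; sums over k = 1..G-1 successes.
    """
    G = group_size
    num = 0.0  # sum of P(k successes) * (k / G)
    den = 0.0  # P(group is non-degenerate)
    for k in range(1, G):
        prob_k = comb(G, k) * p**k * (1 - p) ** (G - k)
        num += prob_k * (k / G)
        den += prob_k
    return num / den


def advantage_bias(p: float, group_size: int) -> float:
    """Expected (estimated - true) advantage for any single response.

    True advantage is r - p; the estimate is r - group_mean, so the
    difference is p - E[group_mean | non-degenerate].
    """
    return p - conditional_baseline(p, group_size)


if __name__ == "__main__":
    for G in (2, 4, 8, 16):
        row = ", ".join(
            f"p={p:.1f}: {advantage_bias(p, G):+.3f}" for p in (0.1, 0.3, 0.5, 0.7, 0.9)
        )
        print(f"G={G:2d}  {row}")
```

Under these assumptions the bias is negative for `p < 0.5`, zero at `p = 0.5`, and positive for `p > 0.5`, and its magnitude shrinks as the group size grows, which matches the summary's description of the effect being most pronounced for small groups.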

Key takeaway

AI engineers training reasoning-oriented LLMs with group-based RL should be aware that group-relative advantage estimation carries an inherent bias that can impede learning on challenging prompts. Implementing History-Aware Adaptive Difficulty Weighting (HA-DW) can provably reduce this bias, leading to more effective training and better performance on benchmarks such as MATH500 and Minerva. Consider integrating HA-DW into existing GRPO, GSPO, or DAPO pipelines to improve model capability and sample efficiency.

Key insights

Group-based RL methods for LLMs, such as GRPO, GSPO, and DAPO, exhibit an advantage-estimation bias that hinders learning on difficult prompts.

Method

History-Aware Adaptive Difficulty Weighting (HA-DW) reweights advantages based on prompt difficulty relative to a running anchor, amplifying hard prompts and damping easy ones.
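To make the idea concrete, here is a minimal sketch of difficulty-aware reweighting. The summary does not give HA-DW's actual formula, so everything specific here is an assumption for illustration: the exponential-moving-average anchor, the exponential weight, and the names `DifficultyAnchor`, `reweight_advantages`, `momentum`, and `strength` are all hypothetical and should not be read as the authors' implementation.

```python
import math
from dataclasses import dataclass


@dataclass
class DifficultyAnchor:
    """Running estimate of typical success rate (hypothetical EMA form)."""

    value: float = 0.5
    momentum: float = 0.9

    def update(self, batch_success_rate: float) -> float:
        # Blend the previous anchor with the latest batch-level success rate.
        self.value = self.momentum * self.value + (1 - self.momentum) * batch_success_rate
        return self.value


def reweight_advantages(rewards, anchor: float, strength: float = 1.0):
    """Scale group-relative advantages by prompt difficulty.

    rewards: binary rewards for one prompt's sampled responses.
    anchor: current difficulty anchor (a success rate in [0, 1]).
    The exponential weight is an assumed form: above 1 when the prompt's
    empirical success rate is below the anchor (hard), below 1 when above (easy).
    """
    p_hat = sum(rewards) / len(rewards)           # empirical success rate
    weight = math.exp(strength * (anchor - p_hat))
    return [weight * (r - p_hat) for r in rewards]


if __name__ == "__main__":
    anchor = DifficultyAnchor()
    hard = reweight_advantages([1, 0, 0, 0], anchor.value)  # amplified
    easy = reweight_advantages([1, 1, 1, 0], anchor.value)  # damped
    print(hard, easy)
    anchor.update(batch_success_rate=0.5)
```

Because the weight is a per-prompt scalar, a scheme of this shape can sit between the advantage computation and the policy-gradient loss of an existing group-based pipeline without changing either, which is what "plug-in" suggests.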

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by The Salt - Curated AI.