The Hidden Bias of Process Reward Models: PRISM for Rewarding the Right Reasoning
Summary
Process Reward Models (PRMs) give step-level feedback to improve credit assignment in reasoning, but they exhibit a hidden bias. The bias stems from severe imbalance in step-level training data, which standard cross-entropy training amplifies: PRMs overcredit plausible but incorrect steps and produce high false-positive rates. False negatives mainly slow exploration, whereas false positives steer Best-of-N selection, guided decoding, and policy optimization toward flawed reasoning. To address this, the paper introduces PRISM (Precision Ranking for Improved Step Modeling), a policy-aware PRM training framework. PRISM learns from contrastive step-level comparisons and from hard negatives generated via a temporal lookahead strategy, so it requires no new human labels. It also employs a difficulty-aware curriculum to optimize the contrastive step margin. Across PRMBench and ProcessBench, PRISM reduces false positives (by 22% on PRMBench) and improves macro F1. Applied to policy optimization and search, it improves accuracy by up to 22% for guided decoding and up to 33% for Best-of-N, and it enhances robustness.
Key takeaway
Machine learning engineers developing process-supervised models should re-evaluate their PRM training strategies. Standard cross-entropy training amplifies the bias introduced by imbalanced step-level data, and the resulting false positives actively degrade reasoning quality in guided decoding and Best-of-N selection. Consider PRISM's contrastive learning framework, which uses temporal lookahead to generate hard negatives, reduces false positives by 22% on PRMBench, and improves Best-of-N accuracy by up to 33%, so that the model rewards genuinely correct reasoning.
Key insights
Process Reward Models (PRMs) suffer from a hidden bias toward false positives; PRISM is a contrastive learning framework that rewards correct reasoning and reduces those false positives.
Principles
- PRM training should prioritize reliable relative comparisons over pointwise label fitting.
- False positives in PRMs actively steer reasoning toward flaws, unlike false negatives.
- Trustworthy process supervision requires rewarding the right reasoning for the right reasons.
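To make the false-positive concern concrete, the following minimal sketch computes the false-positive rate and macro F1 for binary step labels. The function names and the toy label distribution are illustrative assumptions, not taken from the original article or from either benchmark's evaluation code.

```python
def confusion(labels, preds):
    """Return (tp, fp, fn, tn) for binary step labels, where 1 = correct step."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    return tp, fp, fn, tn


def false_positive_rate(labels, preds):
    """Fraction of truly incorrect steps that the model accepts as correct."""
    _, fp, _, tn = confusion(labels, preds)
    return fp / (fp + tn) if (fp + tn) else 0.0


def _f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(labels, preds):
    """Unweighted mean of the F1 scores for the 'correct' and 'incorrect' classes."""
    tp, fp, fn, tn = confusion(labels, preds)
    f1_correct = _f1(tp, fp, fn)
    f1_incorrect = _f1(tn, fn, fp)  # roles swap when class 0 is the target
    return (f1_correct + f1_incorrect) / 2


# Toy imbalanced data (assumed): 8 correct steps, 2 incorrect steps.
labels = [1] * 8 + [0] * 2
# A degenerate scorer that accepts every step.
preds = [1] * 10

accuracy = sum(y == p for y, p in zip(labels, preds)) / len(labels)
fpr = false_positive_rate(labels, preds)
mf1 = macro_f1(labels, preds)
```

On this toy data the accept-everything scorer reaches 80% accuracy while accepting every incorrect step, which is why macro F1 and the false-positive rate are more informative than accuracy when step labels are imbalanced.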
Method
PRISM trains PRMs using contrastive step-level comparisons and hard negatives from a temporal lookahead strategy, optimizing the contrastive step margin with a difficulty-aware curriculum.
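The summary does not give PRISM's exact objective, so the sketch below shows only a generic pairwise margin (hinge) ranking loss, one common way to train on step-level comparisons. The function names, the scores, and the margin value are all assumptions for illustration; the temporal lookahead negative generation and the difficulty-aware curriculum are not shown.

```python
def pairwise_margin_loss(score_preferred, score_rejected, margin):
    """Hinge loss that is zero once the preferred step outscores the
    rejected step by at least `margin` (generic form, not PRISM's own)."""
    return max(0.0, margin - (score_preferred - score_rejected))


def batch_loss(pairs, margin):
    """Mean pairwise loss over (score_preferred, score_rejected) pairs."""
    if not pairs:
        return 0.0
    return sum(pairwise_margin_loss(p, r, margin) for p, r in pairs) / len(pairs)


# Hypothetical PRM scores for (preferred step, hard negative step).
pairs = [
    (0.9, 0.2),  # well separated: contributes no loss
    (0.6, 0.7),  # plausible but wrong step is overcredited: penalized
]
loss = batch_loss(pairs, margin=0.5)
```

Unlike pointwise cross-entropy, this loss depends only on the score gap within each pair, so a plausible but incorrect step is penalized whenever it scores close to or above its correct counterpart, regardless of how frequent each class is in the data.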
In practice
- Apply PRISM to reduce PRM false positives by 22% on PRMBench.
- Improve guided decoding accuracy by up to 22% using PRISM.
- Enhance Best-of-N selection accuracy by up to 33% with PRISM.
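As a rough illustration of why false positives matter for Best-of-N, the sketch below selects among candidate solutions using per-step PRM scores. Aggregating a solution by its minimum step score is an assumed choice (the summary does not say how scores are aggregated), and the candidates and scores are invented for the example.

```python
def solution_score(step_scores):
    """Score a solution by its weakest step (assumed aggregation rule)."""
    return min(step_scores)


def best_of_n(candidates):
    """Return the answer whose solution score is highest.

    `candidates` is a list of (answer, step_scores) tuples.
    """
    answer, _ = max(candidates, key=lambda c: solution_score(c[1]))
    return answer


# Hypothetical case: answer "17" contains a flawed second step.
overcrediting_prm = [
    ("42", [0.90, 0.80, 0.85]),
    ("17", [0.95, 0.90, 0.92]),  # flawed step wrongly scored 0.90
]
stricter_prm = [
    ("42", [0.90, 0.80, 0.85]),
    ("17", [0.95, 0.30, 0.92]),  # flawed step now scored 0.30
]

picked_with_false_positive = best_of_n(overcrediting_prm)
picked_without = best_of_n(stricter_prm)
```

In this toy case a single overcredited step is enough to flip the selection to the flawed solution, whereas an undercredited correct step would only lower the ranking of a valid candidate.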
Topics
- Process Reward Models
- PRISM Framework
- Contrastive Learning
- Bias Mitigation
- Guided Decoding
- Policy Optimization
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.