A Pre-Registered Causal Partition of Self-Consistency Elicitation and Reward Design in RLVR
Summary
The paper argues that the common "reward-design" effect estimand in Reinforcement Learning from Verifiable Rewards (RLVR), Δ_naive = acc(True) - acc(Random), is systematically biased because it conflates self-consistency elicitation with genuine reward-design signal. Using a controlled tabular-GRPO simulator, the authors derive an exact telescoping decomposition, Δ_total = Δ_null + Δ_elicit + Δ_rd, and measure each term at five prior-strength levels (p_s ∈ {0.20, 0.35, 0.50, 0.65, 0.80}). The reward-design fraction of Δ_naive falls from 139% at the weakest prior (p_s = 0.20) to 5% at the strongest (p_s = 0.80); a share above 100% means the remainder of Δ_naive is negative, and the elicitation term does flip sign at the self-consistency crossover. A pre-registered 2 × 2 × 2 factorial confirmed non-additivity (interaction ratio 0.385). Re-audits of two published results returned one Elicitation_Dominated verdict (0.98 elicitation share) and one Reward_Design_Dominated verdict (1.18 rd share), which illustrates the partition's diagnostic value. A reusable one-command audit harness is released.
Key takeaway
If you are an AI scientist or ML engineer optimizing reasoning language models with RLVR, partition your observed gains before attributing them to reward design. If your base model has a strong prior (p_s ≥ 0.65), your Δ_naive is likely elicitation-dominated, which suggests little marginal value from further reward engineering. Conversely, for weak-prior models (p_s ≤ 0.35), spurious rewards can hurt, which makes genuine reward-design investment a high priority. Run the provided audit protocol to attribute performance gains accurately and to guide resource allocation.
Key insights
The Δ_naive estimand in RLVR conflates self-consistency elicitation with genuine reward design, leading to biased attribution.
Principles
- Δ_naive is non-transferable across model families.
- The self-consistency elicitation term changes sign with prior strength (p_s).
- The reward-design effect is strongly prior-dependent.
Method
The method defines four reward conditions (Frozen, Random, Spurious, True) and uses an exact telescoping decomposition Δ_total = Δ_null + Δ_elicit + Δ_rd to causally partition RLVR gains.
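The decomposition can be sketched in a few lines of Python. This is a minimal illustration, not the released harness. It assumes the four conditions are ordered Frozen, Random, Spurious, True, so that Δ_null = acc(Random) - acc(Frozen), Δ_elicit = acc(Spurious) - acc(Random), and Δ_rd = acc(True) - acc(Spurious); the summary does not spell out this mapping. The names `Partition` and `partition_gain` are hypothetical, and the accuracies in the usage example are made-up numbers, not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    """Three telescoping terms of an RLVR gain (assumed condition ordering)."""
    null: float    # acc(Random)   - acc(Frozen)
    elicit: float  # acc(Spurious) - acc(Random)
    rd: float      # acc(True)     - acc(Spurious)

    @property
    def total(self) -> float:
        # The terms telescope to acc(True) - acc(Frozen).
        return self.null + self.elicit + self.rd

    @property
    def naive(self) -> float:
        # Δ_naive = acc(True) - acc(Random) = Δ_elicit + Δ_rd under this ordering.
        return self.elicit + self.rd

    @property
    def rd_share(self) -> float:
        # Fraction of Δ_naive attributed to reward design; undefined if Δ_naive is 0.
        if self.naive == 0:
            raise ValueError("Δ_naive is zero; shares are undefined")
        return self.rd / self.naive


def partition_gain(acc_frozen: float, acc_random: float,
                   acc_spurious: float, acc_true: float) -> Partition:
    """Split an observed gain into null, elicitation, and reward-design terms."""
    return Partition(
        null=acc_random - acc_frozen,
        elicit=acc_spurious - acc_random,
        rd=acc_true - acc_spurious,
    )


# Hypothetical weak-prior accuracies (illustrative only, not from the paper).
p = partition_gain(acc_frozen=0.40, acc_random=0.41,
                   acc_spurious=0.38, acc_true=0.50)
print(f"null={p.null:+.2f} elicit={p.elicit:+.2f} rd={p.rd:+.2f}")
print(f"naive={p.naive:+.2f} rd_share={p.rd_share:.2f}")
```

In this made-up case the elicitation term is negative, so the reward-design share of Δ_naive exceeds 100%, the same pattern the summary reports at weak prior.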
In practice
- Estimate your base model's prior strength (p_s) on your task.
- Use the diagnostic protocol to partition your RLVR gain.
- Run the one-command audit harness for alignment papers.
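As a rough illustration of the first step above, the sketch below estimates a proxy for p_s as the fraction of prompts on which the base model's majority-vote answer matches the reference. This proxy is an assumption: the summary does not give the paper's exact definition of p_s, and the function name `estimate_prior_strength` and the sample data are hypothetical.

```python
from collections import Counter
from typing import Sequence


def estimate_prior_strength(samples: Sequence[Sequence[str]],
                            references: Sequence[str]) -> float:
    """Proxy for p_s: share of prompts whose majority-vote answer is correct.

    samples[i] holds several base-model answers sampled for prompt i;
    references[i] is the verified answer for that prompt.
    """
    if not samples or len(samples) != len(references):
        raise ValueError("need one non-empty answer list per reference")
    hits = 0
    for answers, ref in zip(samples, references):
        if not answers:
            raise ValueError("each prompt needs at least one sampled answer")
        # Ties go to the answer that was sampled first.
        majority, _ = Counter(answers).most_common(1)[0]
        hits += int(majority == ref)
    return hits / len(references)


# Made-up samples for three prompts (illustrative only).
samples = [["4", "4", "5"], ["7", "8", "8"], ["1", "2", "1"]]
references = ["4", "7", "1"]
p_s = estimate_prior_strength(samples, references)
print(f"estimated p_s (proxy): {p_s:.2f}")
```

An estimate like this can then be compared against the thresholds in the takeaway (p_s ≥ 0.65 or p_s ≤ 0.35) before deciding how much to invest in reward engineering.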
Topics
- Reinforcement Learning from Verifiable Rewards
- Causal Decomposition
- Self-Consistency Elicitation
- Reward Design
- Language Model Evaluation
- Prior Strength
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.