A Pre-Registered Causal Partition of Self-Consistency Elicitation and Reward Design in RLVR
Summary
A new analysis of Reinforcement Learning from Verifiable Rewards (RLVR) finds that the common "acc(TRUE) - acc(RANDOM)" estimand for the reward-design effect is systematically biased. The bias arises because the estimand conflates self-consistency elicitation, which sharpens the policy toward its modal answer, with the true reward-design signal. The authors derive an exact telescoping decomposition, "total = null + elicit + rd", and use a controlled tabular-GRPO simulator to measure each term across five prior-strength levels. The reward-design fraction of the naive estimator ranges from 0.139 at a weak prior (ps=0.20) to 0.05 at a strong prior (ps=0.80), and the elicitation term changes sign at the self-consistency crossover. A pre-registered 2x2x2 factorial confirmed non-additivity (interaction ratio 0.385; AxC effect -0.089). Re-audits of two published results illustrate the partition's diagnostic value, yielding an elicitation share of 0.98 for one and an rd share of 1.18 for the other. A reusable one-command harness is released for auditing alignment papers.
Key takeaway
AI scientists and machine learning engineers who design or evaluate Reinforcement Learning from Verifiable Rewards (RLVR) systems should recognize that the "acc(TRUE) - acc(RANDOM)" metric for the reward-design effect is systematically biased. The estimand conflates genuine reward signal with self-consistency elicitation, which can lead to misreading system performance. Adopt the proposed "total = null + elicit + rd" causal partition to diagnose the true impact of your reward design, and consider using the released one-command harness to audit existing or new alignment papers for these distinct effects.
Key insights
The common RLVR reward-design estimand is biased, conflating self-consistency elicitation with genuine reward signal.
Principles
- Naive reward-design estimands are systematically biased.
- Self-consistency elicitation can dominate reward-design effects.
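As a toy illustration of why elicitation can help or hurt, the sketch below sharpens a fixed answer distribution toward its mode and reports the change in probability mass on the correct answer. The power-and-renormalize rule, the function names, and the example distribution are all assumptions made for illustration; this is not the paper's tabular-GRPO simulator.

```python
def sharpen(probs, beta):
    """Raise each probability to the power beta and renormalize.

    beta > 1 concentrates mass on the most likely (modal) answer.
    This rule is a hypothetical stand-in for self-consistency elicitation.
    """
    powered = [p ** beta for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


def accuracy_change(probs, correct_index, beta=2.0):
    """Change in probability of the correct answer after sharpening."""
    return sharpen(probs, beta)[correct_index] - probs[correct_index]


# Made-up policy over three candidate answers; index 0 is the mode.
policy = [0.6, 0.3, 0.1]

# Correct answer is the mode: sharpening raises accuracy.
gain_when_modal = accuracy_change(policy, correct_index=0)

# Correct answer is not the mode: in this example sharpening lowers accuracy.
gain_when_not_modal = accuracy_change(policy, correct_index=1)

print(f"correct = mode:  {gain_when_modal:+.3f}")
print(f"correct != mode: {gain_when_not_modal:+.3f}")
```

In this example the sign of the change depends on whether the modal answer is the correct one, which is one way to read the sign change of the elicitation term at the self-consistency crossover.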
Method
Derive an exact telescoping decomposition "total = null + elicit + rd" to causally partition effects in RLVR, then measure each term across five prior-strength levels in a controlled tabular-GRPO simulator.
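A minimal sketch of how such a telescoping partition could be computed is shown below. The summary does not define the intermediate conditions, so the four accuracies used here (an untrained base policy, a random-reward run, a self-consistency-reward run, and a true-reward run) are assumptions, as are all names and the example numbers, which are made up and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    null: float    # assumed: acc(RANDOM) - acc(BASE)
    elicit: float  # assumed: acc(CONSISTENCY) - acc(RANDOM)
    rd: float      # assumed: acc(TRUE) - acc(CONSISTENCY)

    @property
    def total(self) -> float:
        # Telescopes to acc(TRUE) - acc(BASE) under the assumptions above.
        return self.null + self.elicit + self.rd

    @property
    def naive(self) -> float:
        # The naive estimand acc(TRUE) - acc(RANDOM) equals elicit + rd.
        return self.elicit + self.rd

    @property
    def rd_fraction(self) -> float:
        # Share of the naive estimand attributable to reward design.
        return self.rd / self.naive


def partition(acc_base, acc_random, acc_consistency, acc_true) -> Partition:
    """Split accuracy differences into null, elicit, and rd terms."""
    return Partition(
        null=acc_random - acc_base,
        elicit=acc_consistency - acc_random,
        rd=acc_true - acc_consistency,
    )


# Hypothetical accuracies, for illustration only.
example = partition(acc_base=0.40, acc_random=0.42,
                    acc_consistency=0.55, acc_true=0.58)
print(f"null={example.null:.3f} elicit={example.elicit:.3f} rd={example.rd:.3f}")
print(f"naive={example.naive:.3f} rd_fraction={example.rd_fraction:.3f}")
```

Under these assumptions the intermediate accuracies cancel, so the three terms sum exactly to the total. A negative elicit term pushes the rd share of the naive estimand above 1, which is consistent with the rd share of 1.18 reported in one of the re-audits.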
In practice
- Re-audit published RLVR results using the causal partition.
- Apply the one-command harness to audit alignment papers.
Topics
- Reinforcement Learning
- Reward Design
- Causal Inference
- Self-Consistency
- RLVR
- Alignment Audits
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.