JURY-RL: Votes Propose, Proofs Dispose for Label-Free RLVR
Summary
JURY-RL is a label-free reinforcement learning framework for improving large language model (LLM) reasoning in machine-checkable domains such as mathematics. Existing label-free reward signals, such as majority voting or LLM-as-a-judge, are prone to false positives and training instability. JURY-RL decouples answer proposal from reward disposal: a plurality vote over model rollouts proposes a candidate answer, and a Lean theorem prover then formally verifies it. If the candidate is verified, the rollouts that support it receive positive reward. When verification is inconclusive, JURY-RL falls back to ResZero (Residual-Zero), a reward mechanism that discards the unverified proposal and redistributes a zero-mean, variance-preserving signal among the residual answers. This keeps optimization gradients stable without reinforcing unverified consensus. In experiments on mathematical reasoning benchmarks, JURY-RL consistently outperforms other label-free baselines, reaches pass@1 performance comparable to supervised ground-truth training, and generalizes better in pass@k and response diversity.
Key takeaway
For research scientists working on LLM reasoning, JURY-RL shows how to train without labels while avoiding the failure modes of heuristic rewards. Consider integrating a formal verification pipeline, such as the Lean-based system, so that rewards stay truth-aligned and optimization stays stable. This mitigates the false positives and training collapse common in heuristic reward systems and yields models with better generalization and solution diversity, especially in domains that require verifiable correctness.
Key insights
JURY-RL uses formal verification to ensure truth-aligned, label-free reinforcement learning for LLMs, enhancing reasoning stability.
Principles
- Decouple answer proposal from reward disposal.
- Ensure reward scalability, truth-alignment, and optimization stability.
- Maintain zero-mean, variance-preserving gradients for inconclusive verification.
Method
JURY-RL proposes a candidate answer by plurality vote over model rollouts, then uses a Lean verifier for reward disposal. When verification is inconclusive, ResZero penalizes the unverified proposal and redistributes reward among the residual answers.
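The propose-then-verify flow described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the paper's reward formula: the function name `jury_rewards`, the `verify` callback (standing in for a Lean check), and the `penalty` parameter are all hypothetical. The fallback branch only enforces the zero-mean property; the variance-preserving scaling attributed to ResZero is not modeled here.

```python
from collections import Counter
from typing import Callable, List


def jury_rewards(
    answers: List[str],
    verify: Callable[[str], bool],
    penalty: float = 1.0,  # hypothetical knob, not taken from the paper
) -> List[float]:
    """Return one reward per rollout answer.

    `verify` is a stand-in for a formal check (e.g. a Lean verifier)
    and returns True only when the proposed answer is proven correct.
    """
    counts = Counter(answers)
    # Votes propose: the most frequent answer becomes the candidate.
    proposal, _ = counts.most_common(1)[0]

    # Proofs dispose: reward supporters only if the candidate is verified.
    if verify(proposal):
        return [1.0 if a == proposal else 0.0 for a in answers]

    # Inconclusive verification: assumed zero-mean fallback.
    n_residual = sum(1 for a in answers if a != proposal)
    if n_residual == 0:
        # Every rollout agrees on an unverified answer: give no signal.
        return [0.0] * len(answers)

    # Penalize the unverified proposal; score each residual answer by its
    # share of the residual votes, then center so the rewards sum to zero.
    raw = [
        -penalty if a == proposal else counts[a] / n_residual
        for a in answers
    ]
    mean = sum(raw) / len(raw)
    return [r - mean for r in raw]
```

For example, with answers `["4", "4", "5", "6"]` and a verifier that cannot confirm `"4"`, the two rollouts that voted for `"4"` receive a negative reward, the other two receive a positive one, and the rewards sum to zero, so the unverified consensus is never reinforced.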
In practice
- Integrate formal verifiers (e.g., Lean) for high-fidelity reward signals.
- Implement caching for verification results to amortize computational cost.
- Tune the "c" hyperparameter to balance exploration and task performance.
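The caching advice above can be illustrated with the standard-library `functools.lru_cache`. This is a minimal sketch under assumptions: `run_lean_check` is a hypothetical placeholder for an expensive call into a Lean verification pipeline (its body here is dummy logic), and the cache key is assumed to be the (problem, answer) pair.

```python
from functools import lru_cache

# Counts how often the expensive verifier actually runs (for illustration).
verifier_calls = {"n": 0}


def run_lean_check(problem: str, answer: str) -> bool:
    """Hypothetical stand-in for an expensive Lean verification call."""
    verifier_calls["n"] += 1
    return answer.strip() == "4"  # placeholder logic, not a real proof check


@lru_cache(maxsize=4096)
def cached_verify(problem: str, answer: str) -> bool:
    """Memoize verification results keyed by (problem, answer)."""
    return run_lean_check(problem, answer)


# The same proposal recurring across training steps hits the cache.
cached_verify("2 + 2 = ?", "4")
cached_verify("2 + 2 = ?", "4")
```

Because the same candidate answer tends to be proposed repeatedly for a given problem, memoizing by (problem, answer) means the verifier runs once per distinct pair. An in-process cache like this does not persist across runs or workers; a shared store would be needed for that.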
Topics
- Reinforcement Learning with Verifiable Rewards
- Label-Free Reinforcement Learning
- Formal Verification
- Lean Theorem Prover
- ResZero Reward
Code references
- huggingface/lighteval
- ruixin31/Spurious_Rewards
- LiveCodeBench/LiveCodeBench
- WildEval/ZeroEval
- EleutherAI/lm-evaluation-harness
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.