JURY-RL: Votes Propose, Proofs Dispose for Label-Free RLVR

2025-09-10 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

JURY-RL is a novel, label-free reinforcement learning framework designed to enhance large language model (LLM) reasoning, particularly in machine-checkable domains like mathematics. It addresses the limitations of existing label-free methods, such as majority voting or LLM-as-a-judge, which are prone to false positives and training instability. JURY-RL decouples answer proposal from reward disposal: a plurality vote from model rollouts proposes a candidate answer, which is then formally verified by a Lean theorem prover. If verified, supporting rollouts receive positive reward. When verification is inconclusive, JURY-RL employs ResZero (Residual-Zero), a fallback reward mechanism that discards the unverified proposal and redistributes a zero-mean, variance-preserving signal among residual answers. This approach maintains stable optimization gradients without reinforcing unverified consensus. Experiments show JURY-RL consistently outperforms other label-free baselines on mathematical reasoning benchmarks, achieving pass@1 performance comparable to supervised ground-truth training and superior generalization in pass@k and response diversity.

Key takeaway

For research scientists developing advanced LLM reasoning capabilities, JURY-RL offers a robust framework to overcome the limitations of label-free training. You should consider integrating formal verification pipelines, like the Lean-based system, to ensure truth-aligned rewards and stable optimization. This approach mitigates false positives and training collapse common in heuristic reward systems, leading to models with superior generalization and solution diversity, especially in domains requiring verifiable correctness.

Key insights

JURY-RL uses formal verification to ensure truth-aligned, label-free reinforcement learning for LLMs, enhancing reasoning stability.

Principles

Decouple answer proposal from reward disposal.
Ensure reward scalability, truth-alignment, and optimization stability.
Maintain zero-mean, variance-preserving gradients for inconclusive verification.
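The zero-mean principle for the inconclusive case can be pictured with a small sketch. This is not the paper's ResZero formula, which this summary does not give. The function name `zero_mean_fallback`, the `penalty` parameter, and the use of residual vote share as the raw score are illustrative assumptions. The sketch demonstrates only two properties: the unverified candidate earns no positive credit, and the rewards sum to zero across the group. It does not attempt to reproduce the variance-preserving scaling.

```python
from collections import Counter
from typing import List, Sequence


def zero_mean_fallback(
    answers: Sequence[str], candidate: str, penalty: float = 1.0
) -> List[float]:
    """Illustrative fallback: no credit for the unverified candidate, zero-mean overall."""
    if not answers:
        return []
    # Count votes only among rollouts that did NOT back the unverified candidate.
    residual = Counter(a for a in answers if a != candidate)
    total = sum(residual.values())
    # Hypothetical raw scores: a flat penalty for the candidate's supporters,
    # and the residual vote share for everyone else.
    raw = [
        -penalty if a == candidate else residual[a] / total
        for a in answers
    ]
    # Center the rewards so they sum to zero across the whole group.
    mean = sum(raw) / len(raw)
    return [r - mean for r in raw]


rewards = zero_mean_fallback(["a", "a", "b", "b", "c"], candidate="a")
```

When every rollout backs the unverified candidate, the centered rewards are all zero, so the group contributes no learning signal instead of reinforcing an unchecked consensus.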

Method

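JURY-RL proposes a candidate answer by plurality vote over the model's rollouts, then uses a Lean verifier to decide whether that candidate earns reward. If verification succeeds, the rollouts supporting the candidate receive positive reward. If verification is inconclusive, ResZero takes over: it penalizes the unverified plurality answer and redistributes a zero-mean reward among the residual answers.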
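The vote-then-verify flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The names `jury_rewards`, `verify`, and `fallback` are hypothetical, the verifier is modeled as a callable returning `True` only on a successful proof, and assigning 1.0 to supporters and 0.0 to everyone else in the verified case is an assumed reward scale.

```python
from collections import Counter
from typing import Callable, List, Optional, Sequence


def jury_rewards(
    answers: Sequence[str],
    verify: Callable[[str], Optional[bool]],
    fallback: Callable[[Sequence[str], str], List[float]],
) -> List[float]:
    """Assign one reward per rollout from a vote plus a verifier verdict."""
    # Votes propose: the most frequent final answer becomes the candidate.
    candidate, _ = Counter(answers).most_common(1)[0]
    # Proofs dispose: only a successful verification yields positive reward.
    if verify(candidate) is True:
        return [1.0 if a == candidate else 0.0 for a in answers]
    # Inconclusive: hand off to a fallback that must not reward the candidate.
    return fallback(answers, candidate)


# Stand-in verifier and fallback, for illustration only.
verified = jury_rewards(
    ["4", "4", "5"],
    verify=lambda a: a == "4",
    fallback=lambda ans, cand: [0.0] * len(ans),
)
inconclusive = jury_rewards(
    ["4", "4", "5"],
    verify=lambda a: None,
    fallback=lambda ans, cand: [0.0] * len(ans),
)
```

Keeping the fallback as a separate callable mirrors the decoupling of proposal from disposal: the vote never decides the reward on its own.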

In practice

Integrate formal verifiers (e.g., Lean) for high-fidelity reward signals.
Implement caching for verification results to amortize computational cost.
Tune the "c" hyperparameter to balance exploration and task performance.
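The caching suggestion above might look like the following sketch. The `CachedVerifier` class and its `backend` callable are hypothetical stand-ins. No real Lean invocation is shown, since the summary does not describe how the verifier is called. The idea is only that repeated (problem, answer) pairs, which are common when many rollouts agree, reach the expensive backend once.

```python
from typing import Callable, Dict, Optional, Tuple


class CachedVerifier:
    """Memoize verifier verdicts keyed by (problem, answer)."""

    def __init__(self, backend: Callable[[str, str], Optional[bool]]) -> None:
        self._backend = backend  # hypothetical expensive call, e.g. a proof check
        self._cache: Dict[Tuple[str, str], Optional[bool]] = {}
        self.backend_calls = 0

    def __call__(self, problem: str, answer: str) -> Optional[bool]:
        # Strip surrounding whitespace so trivially different strings share a key.
        key = (problem.strip(), answer.strip())
        if key not in self._cache:
            self.backend_calls += 1
            self._cache[key] = self._backend(*key)
        return self._cache[key]


# Stand-in backend for illustration; a real one would run the formal check.
checker = CachedVerifier(lambda problem, answer: answer == "4")
first = checker("2 + 2", "4")
second = checker("2 + 2", " 4 ")
```

This sketch caches every verdict, including inconclusive ones. If inconclusive results can stem from timeouts, it may be preferable to skip caching them so a later retry is possible.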

Topics

Reinforcement Learning with Verifiable Rewards
Label-Free Reinforcement Learning
Formal Verification
Lean Theorem Prover
ResZero Reward

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.