Reinforcement Learning without Ground-Truth Solutions can Improve LLMs
Summary
Ranking-induced VERifiable (RiVER) is a framework for training Large Language Models (LLMs) on score-based optimization tasks without ground-truth solutions. It uses deterministic execution feedback as continuous-valued supervision and addresses challenges such as "scale dominance" and "frequency dominance" through calibrated reward shaping, which compares solutions instance by instance and emphasizes top-ranked solvers. Applied to Qwen3-8B and GLM-Z1-9B-0414, RiVER improved their rating rank on the Algorithm Engineering Benchmark (ALE-Bench) by 8.9% and 9.4%, respectively. Despite training exclusively on score-based tasks, RiVER also improved performance on exact-solution benchmarks such as LiveCodeBench and USACO by an absolute average of 2.4% and 3.5%, a transfer not observed with baselines that use raw execution scores.
Key takeaway
For Machine Learning Engineers training LLMs for coding tasks where ground-truth solutions are scarce, RiVER shows that score-based optimization tasks can supply the training signal instead. With calibrated reward shaping, training on these tasks improved general coding ability, including on exact-solution benchmarks, without depending on ground-truth solutions.
Key insights
RiVER trains LLMs on score-based tasks without ground-truth solutions, using calibrated reward shaping to improve general coding ability.
Principles
- Calibrated reward shaping is crucial for continuous rewards.
- Instance-wise comparisons mitigate score magnitude issues.
- Emphasize top-ranked solutions in reward feedback.
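The principles above can be illustrated with a small sketch. It is not RiVER's actual reward formula, which this summary does not give; it is one plausible reading in which candidates are ranked separately on each instance and the rank is raised to a power to favor the top. The names `instance_ranks`, `shaped_rewards`, and `top_power` are hypothetical, and the sketch only illustrates the score-magnitude problem, not "frequency dominance".

```python
from typing import List


def instance_ranks(column: List[float]) -> List[float]:
    """Fraction of the other candidates beaten on ONE instance (ties count half)."""
    n = len(column)
    if n < 2:
        return [1.0] * n  # arbitrary choice: a lone candidate gets full credit
    ranks = []
    for s in column:
        beaten = sum(1 for t in column if t < s)
        tied = sum(1 for t in column if t == s) - 1  # exclude the candidate itself
        ranks.append((beaten + 0.5 * tied) / (n - 1))
    return ranks


def shaped_rewards(scores: List[List[float]], top_power: float = 2.0) -> List[float]:
    """scores[i][j] is the raw score of candidate i on instance j (higher is better).

    Assumption: every candidate is scored on the same instances. Rewards depend
    only on per-instance rank, so an instance with huge raw scores cannot
    outweigh the others. top_power > 1 shrinks low ranks more than high ranks.
    """
    if not scores:
        return []
    n_cand, n_inst = len(scores), len(scores[0])
    totals = [0.0] * n_cand
    for j in range(n_inst):
        ranks = instance_ranks([scores[i][j] for i in range(n_cand)])
        for i, r in enumerate(ranks):
            totals[i] += r ** top_power
    return [t / n_inst for t in totals]


# Candidate A wins two small-scale instances; B wins one large-scale instance.
scores = [
    [10.0, 12.0, 1_000_000.0],  # A
    [5.0, 6.0, 1_200_000.0],    # B
]
raw_mean = [sum(row) / len(row) for row in scores]  # B looks better on raw scores
shaped = shaped_rewards(scores)                     # A looks better after ranking
```

Averaging raw scores favors B purely because the third instance has a larger scale, while the rank-based reward favors A, which wins on two of the three instances.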
Method
RiVER applies group-relative reinforcement learning with calibrated reward shaping. It addresses "scale dominance" and "frequency dominance" by comparing solutions instance by instance and emphasizing top-ranked solvers.
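Group-relative reinforcement learning generally scores each sampled solution against the other samples drawn for the same prompt. The sketch below assumes GRPO-style mean and standard-deviation normalization; the summary does not state which estimator RiVER uses, and `group_relative_advantages` and `eps` are hypothetical names.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of samples for the same prompt.

    Assumption: GRPO-style advantage, (reward - group mean) / (group std + eps).
    Samples above the group mean get positive advantages, samples below get
    negative ones, and a group with identical rewards yields all zeros.
    """
    if not rewards:
        return []
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Shaped rewards for three sampled solutions to the same task (illustrative values).
advantages = group_relative_advantages([0.0, 0.25, 1.0])
```

In this setup the shaped reward, not the raw execution score, would be the input, so the advantage reflects how a sample ranks within its group rather than the raw score scale.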
In practice
- Apply RiVER to LLM training for coding tasks.
- Use deterministic execution feedback for supervision.
- Consider score-based optimization tasks as training data for general coding ability.
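To make "deterministic execution feedback" concrete, here is a toy scorer for a score-based task. The task (a closed tour over 2D points) and the names `score_tour` and `Point` are illustrative assumptions, not taken from the paper or from ALE-Bench. The point is that any candidate output can be scored without knowing an optimal solution.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def score_tour(points: Sequence[Point], tour: Sequence[int]) -> float:
    """Deterministic score for a candidate tour (higher is better).

    An output that is not a permutation of all point indices scores 0.0.
    A valid tour scores 1 / (1 + closed tour length), so shorter tours score
    higher. No reference (optimal) tour is needed to compute the score.
    """
    n = len(points)
    if sorted(tour) != list(range(n)):
        return 0.0
    length = sum(
        math.dist(points[tour[k]], points[tour[(k + 1) % n]]) for k in range(n)
    )
    return 1.0 / (1.0 + length)


# Corners of a unit square, listed in perimeter order.
square = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
perimeter_score = score_tour(square, [0, 1, 2, 3])  # follows the perimeter
crossing_score = score_tour(square, [0, 2, 1, 3])   # crosses both diagonals
invalid_score = score_tour(square, [0, 0, 1, 2])    # repeats a point
```

Because the scorer is a pure function of the instance and the output, the same candidate always receives the same score, which is what makes it usable as a training signal.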
Topics
- Reinforcement Learning
- Large Language Models
- Reward Shaping
- Code Generation
- ALE-Bench
- Qwen3-8B
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.