Reinforcement Learning without Ground-Truth Solutions can Improve LLMs

2026-06-25 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, quick

Summary

RiVER (Ranking-induced VERifiable) is a framework for training Large Language Models (LLMs) on score-based optimization tasks without ground-truth solutions. It uses deterministic execution feedback as continuous-valued supervision and addresses two challenges of such rewards, "scale dominance" and "frequency dominance", through calibrated reward shaping: solvers are compared instance by instance, and the reward emphasizes top-ranked solvers. Applied to Qwen3-8B and GLM-Z1-9B-0414, RiVER improved their ALE-Bench (Algorithm Engineering Benchmark) rating rank by 8.9% and 9.4%, respectively. Although trained exclusively on score-based tasks, the models also improved on the exact-solution benchmarks LiveCodeBench and USACO, by an average of 2.4 and 3.5 percentage points (absolute). Baselines trained on raw execution scores did not show this transfer.

Key takeaway

For Machine Learning Engineers training LLMs on coding tasks where ground-truth solutions are scarce, RiVER is worth evaluating. The reported results suggest that calibrated reward shaping on score-based optimization tasks can improve general coding ability, including on exact-solution benchmarks, without depending on reference solutions.

Key insights

RiVER trains LLMs on score-based tasks without ground-truth solutions, and its calibrated reward shaping yields gains that carry over to general coding ability.

Principles

Calibrated reward shaping is crucial for continuous rewards.
Instance-wise comparisons keep differences in score magnitude from distorting the reward.
Emphasize top-ranked solutions in reward feedback.
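
A small sketch can make the score-magnitude point concrete. The function names and the win-rate formula here are assumptions chosen for illustration, not the shaping RiVER actually uses, which this summary does not specify. The example contrasts summing raw scores with comparing candidates one test instance at a time.

```python
from typing import List


def raw_sum_rewards(scores: List[List[float]]) -> List[float]:
    """scores[i][j] is candidate i's score on instance j (higher is better)."""
    return [sum(row) for row in scores]


def instance_wise_rewards(scores: List[List[float]]) -> List[float]:
    """Hypothetical shaping: mean per-instance win rate against the other candidates.

    Only the ordering within each instance matters, so an instance whose
    scores happen to be numerically large cannot outweigh the others.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two candidates to compare")
    m = len(scores[0])
    rewards = []
    for i in range(n):
        total = 0.0
        for j in range(m):
            wins = sum(1.0 for k in range(n) if k != i and scores[k][j] < scores[i][j])
            ties = sum(0.5 for k in range(n) if k != i and scores[k][j] == scores[i][j])
            total += (wins + ties) / (n - 1)
        rewards.append(total / m)
    return rewards


# Three candidates, three instances; instance 0 has a much larger score scale.
scores = [
    [1_000_000.0, 1.0, 2.0],   # candidate 0: best only on the large-scale instance
    [990_000.0, 9.0, 5.0],     # candidate 1: middle everywhere
    [980_000.0, 10.0, 8.0],    # candidate 2: best on two of three instances
]
raw = raw_sum_rewards(scores)
shaped = instance_wise_rewards(scores)
```

With these toy numbers the raw sum prefers candidate 0, purely because of instance 0's scale, while the instance-wise reward prefers candidate 2.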

Method

RiVER applies group-relative reinforcement learning with calibrated reward shaping, addressing "scale dominance" and "frequency dominance" via instance-wise comparisons and emphasizing top-ranked solvers.
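
As a rough illustration of how rank-based shaping might feed a group-relative update, the sketch below maps the raw scores of one sampled group to rewards that depend only on rank and decay quickly below the top, then normalizes them within the group in the style of GRPO. The exponential decay, the `temperature` parameter, and both function names are assumptions. The summary does not give RiVER's actual formula, and this sketch does not attempt to model "frequency dominance".

```python
import math
from typing import List


def top_weighted_rewards(scores: List[float], temperature: float = 0.5) -> List[float]:
    """Hypothetical rank-based reward for one group of sampled solutions.

    A candidate's rank is the number of candidates with a strictly higher
    score, so ties share a rank. Rewards decay exponentially with rank,
    which concentrates the signal on the top of the ranking.
    """
    n = len(scores)
    rewards = []
    for s in scores:
        rank = sum(1 for other in scores if other > s)
        rewards.append(math.exp(-rank / (temperature * n)))
    return rewards


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within the group (mean 0, roughly unit variance)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


# Two groups with the same ordering but very different score magnitudes.
group_a = [10.0, 1000.0, 12.0, 11.0]
group_b = [10.0, 13.0, 12.0, 11.0]
rewards_a = top_weighted_rewards(group_a)
rewards_b = top_weighted_rewards(group_b)
advantages_a = group_relative_advantages(rewards_a)
```

Because only ranks enter the reward, the two groups receive identical rewards despite the outlier score in `group_a`. The gap between first and second place is also larger than the gap between the last two places, which is one simple way to emphasize top-ranked solvers.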

In practice

Apply RiVER when training LLMs on coding tasks that lack ground-truth solutions.
Use deterministic execution feedback as the supervision signal.
Consider score-based optimization tasks as training data even when the goal is general coding ability.
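
To show what deterministic execution feedback without a reference solution can look like, here is a minimal sketch of a scorer for a toy routing task. The task, the `tour_score` function, and the choice of negative tour length as the score are illustrative assumptions and are not taken from the paper or from ALE-Bench.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def tour_score(points: Sequence[Point], tour: Sequence[int]) -> float:
    """Score a candidate tour by its negative length (higher is better).

    No optimal tour is needed: the score is computed from the candidate
    alone, and the same input always produces the same score.
    Invalid tours (not a permutation of all point indices) score -inf.
    """
    n = len(points)
    if n == 0 or sorted(tour) != list(range(n)):
        return float("-inf")
    length = 0.0
    # Pair each stop with the next one, wrapping back to the start.
    for a, b in zip(tour, list(tour[1:]) + [tour[0]]):
        length += math.dist(points[a], points[b])
    return -length


square = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
perimeter_tour = tour_score(square, [0, 1, 2, 3])   # walks around the square
crossing_tour = tour_score(square, [0, 2, 1, 3])    # uses both diagonals
invalid_tour = tour_score(square, [0, 0, 1, 2])     # repeats a point
```

Scores like these are continuous and comparable across candidates for the same instance, which is the kind of signal that ranking-based shaping can consume.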

Topics

Reinforcement Learning
Large Language Models
Reward Shaping
Code Generation
ALE-Bench
Qwen3-8B

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.