REVES: REvision and VErification-Augmented Training for Test-Time Scaling

2026-06-18 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

REVES, a two-stage iterative framework, enhances Large Language Model (LLM) reasoning through test-time scaling via sequential revision. It addresses the misalignment between standard single-shot training objectives and multi-step inference dynamics. REVES alternates between online data/prompt augmentation and policy optimization, converting intermediate "near-miss" answers from successful recovery trajectories into decoupled revision and verification prompts. This approach concentrates training on effective answer transformation and error identification, enabling efficient off-policy data generation and reducing computational overhead compared to standard multi-turn RL. Empirically, REVES achieved +6.5 points over the RL baseline and +4.0 points over standard multi-turn training on LiveCodeBench. It also matched previously reported best results on circle packing using a Qwen3-4B model with fewer rollouts than larger evolutionary search systems, and generalized to out-of-distribution puzzles like n_queens and mini_sudoku.

Key takeaway

Machine Learning Engineers developing LLMs for complex reasoning tasks should consider REVES as a way to align training with test-time sequential revision. The framework combines mistake-driven data augmentation with decoupled revision and verification prompts, and it reports gains on coding, math, and puzzle benchmarks. Augment data continually so that the model learns from its evolving failure modes rather than from a static dataset that quickly goes out of date.

Key insights

REVES improves LLM sequential revision by training on "near-miss" intermediate steps, aligning training with multi-step inference.

Principles

Standard single-shot RL objectives misalign with multi-step test-time inference.
Improving sequential revision capability transfers to other test-time scaling (TTS) algorithms that rely on revision.
Continual data augmentation is crucial for aligning supervision with current model failures.
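
As a point of reference for the principles above, sequential revision at test time can be sketched as a loop that drafts an answer, checks it, and revises until the check passes or a turn budget runs out. This is a minimal illustration under stated assumptions, not the paper's implementation: `generate`, `revise`, and `verify` are hypothetical callables standing in for model calls (or an external checker), and `max_turns` is an assumed budget parameter.

```python
from typing import Callable, List


def sequential_revision(
    problem: str,
    generate: Callable[[str], str],      # hypothetical: first-attempt model call
    revise: Callable[[str, str], str],   # hypothetical: (problem, answer) -> new answer
    verify: Callable[[str, str], bool],  # hypothetical: (problem, answer) -> accepted?
    max_turns: int,
) -> List[str]:
    """Return every answer produced, in order; the last one is the final answer."""
    if max_turns < 1:
        raise ValueError("max_turns must be at least 1")
    answer = generate(problem)
    history = [answer]
    # Each extra turn spends more test-time compute on revising the latest answer.
    for _ in range(max_turns - 1):
        if verify(problem, answer):
            break
        answer = revise(problem, answer)
        history.append(answer)
    return history
```

The returned history is the kind of multi-step trajectory that a single-shot training objective never sees, which is the misalignment the first principle describes.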

Method

REVES is a two-stage iterative framework. Stage I augments the data by converting intermediate "near-miss" answers from successful sequential revision (SR) rollouts into decoupled revision and verification prompts. Stage II trains the policy with single-turn RL on this augmented data.
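
The Stage I conversion can be sketched roughly as follows. This is an illustration under assumptions, not the authors' code: the `Trajectory` and `AugmentedPrompt` types, the two prompt templates, and the rule that a "near-miss" is any incorrect answer preceding the final correct one are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical templates; the source does not give the actual prompt wording.
REVISION_TEMPLATE = (
    "Problem:\n{problem}\n\n"
    "Previous answer:\n{answer}\n\n"
    "Revise the previous answer to produce an improved one."
)
VERIFICATION_TEMPLATE = (
    "Problem:\n{problem}\n\n"
    "Candidate answer:\n{answer}\n\n"
    "Is the candidate answer correct? Identify any error."
)


@dataclass
class Trajectory:
    problem: str
    answers: List[str]   # answers in the order the model produced them
    correct: List[bool]  # correct[i] is the checker's verdict on answers[i]


@dataclass
class AugmentedPrompt:
    kind: str  # "revision" or "verification"
    text: str


def augment(trajectories: List[Trajectory]) -> List[AugmentedPrompt]:
    """Turn near-miss answers from successful trajectories into single-turn prompts."""
    prompts: List[AugmentedPrompt] = []
    for traj in trajectories:
        # Keep only trajectories that recovered, i.e. ended on a correct answer.
        if not traj.answers or not traj.correct[-1]:
            continue
        # Assumption: every incorrect intermediate answer counts as a near-miss.
        for answer, ok in zip(traj.answers[:-1], traj.correct[:-1]):
            if ok:
                continue
            fields = {"problem": traj.problem, "answer": answer}
            prompts.append(AugmentedPrompt("revision", REVISION_TEMPLATE.format(**fields)))
            prompts.append(AugmentedPrompt("verification", VERIFICATION_TEMPLATE.format(**fields)))
    return prompts
```

Each near-miss yields two independent single-turn prompts, so Stage II can train revision and verification separately without replaying the whole multi-turn rollout.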

In practice

Use decoupled prompts: revision prompts to train answer transformation, and verification prompts to train error identification.
Filter for successful trajectories to generate high-quality "near-miss" training data.
Employ continual data augmentation to keep training signals relevant.
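
The practices above fit into an alternating loop, sketched below. It is a schematic under assumptions rather than the released training code: `rollout`, `augment`, and `train` are hypothetical callables supplied by the caller, and `num_rounds` is an assumed hyperparameter.

```python
from typing import Callable, List, Sequence, TypeVar

Policy = TypeVar("Policy")
Traj = TypeVar("Traj")
Prompt = TypeVar("Prompt")


def reves_style_loop(
    policy: Policy,
    problems: Sequence[str],
    rollout: Callable[[Policy, Sequence[str]], List[Traj]],  # hypothetical rollout sampler
    augment: Callable[[List[Traj]], List[Prompt]],           # hypothetical Stage I conversion
    train: Callable[[Policy, List[Prompt]], Policy],         # hypothetical single-turn RL step
    num_rounds: int,
) -> Policy:
    """Alternate data augmentation (Stage I) and policy optimization (Stage II)."""
    for _ in range(num_rounds):
        # Stage I: sample with the *current* policy so the augmented prompts
        # reflect its present failure modes, not those of an earlier checkpoint.
        trajectories = rollout(policy, problems)
        prompts = augment(trajectories)
        if not prompts:
            continue  # nothing to learn from this round
        # Stage II: update the policy on the freshly augmented prompts.
        policy = train(policy, prompts)
    return policy
```

Regenerating the prompts every round is what makes the augmentation continual: a dataset built once from the initial policy would keep targeting mistakes the model no longer makes.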

Topics

Large Language Models
Sequential Revision
Reinforcement Learning
Test-Time Scaling
Data Augmentation
Error Identification
Code Generation

Code references

yxliu02/REVES

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.