REVES: REvision and VErification-Augmented Training for Test-Time Scaling
Summary
REVES is a two-stage iterative framework that improves Large Language Model (LLM) reasoning under test-time scaling with sequential revision. It addresses the mismatch between standard single-shot training objectives and multi-step inference dynamics. REVES alternates between online data/prompt augmentation and policy optimization, converting intermediate "near-miss" answers from successful recovery trajectories into decoupled revision and verification prompts. This concentrates training on effective answer transformation and error identification, enables efficient off-policy data generation, and reduces computational overhead compared to standard multi-turn RL. Empirically, REVES gained 6.5 points over the RL baseline and 4.0 points over standard multi-turn training on LiveCodeBench. It also matched previously reported best results on circle packing using a Qwen3-4B model with fewer rollouts than larger evolutionary search systems, and generalized to out-of-distribution puzzles such as n_queens and mini_sudoku.
Key takeaway
Machine Learning Engineers developing LLMs for complex reasoning tasks should adopt REVES to align training with test-time sequential revision. The framework combines mistake-driven data augmentation with decoupled revision/verification prompts, and the reported results show gains on coding, math, and puzzle benchmarks. Use continual data augmentation so the model learns from its evolving failure modes rather than from static datasets that quickly become outdated.
Key insights
REVES improves LLM sequential revision by training on "near-miss" intermediate steps, aligning training with multi-step inference.
Principles
- Standard single-shot RL objectives misalign with multi-step test-time inference.
- Improving sequential revision capability transfers to other test-time scaling (TTS) algorithms that rely on revision.
- Continual data augmentation is crucial for aligning supervision with current model failures.
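To make "sequential revision" at test time concrete, here is a minimal sketch of the inference loop that single-shot training does not directly optimize. The function name `sequential_revision`, the three callables (`propose`, `verify`, `revise`), and the `max_turns` budget are all hypothetical stand-ins for calls to the model; the summary does not specify the exact inference procedure.

```python
from typing import Callable, List


def sequential_revision(
    problem: str,
    propose: Callable[[str], str],       # hypothetical: first-attempt generation
    verify: Callable[[str, str], bool],  # hypothetical: model judges its own answer
    revise: Callable[[str, str], str],   # hypothetical: model rewrites a flawed answer
    max_turns: int = 4,
) -> List[str]:
    """Return every answer produced, with the final answer last."""
    answers = [propose(problem)]
    # Keep revising until the answer passes verification or the budget runs out.
    while len(answers) < max_turns and not verify(problem, answers[-1]):
        answers.append(revise(problem, answers[-1]))
    return answers


# Toy stand-ins for the model: start at "3" and add one until the answer is "5".
trace = sequential_revision(
    "What is 2 + 3?",
    propose=lambda p: "3",
    verify=lambda p, a: a == "5",
    revise=lambda p, a: str(int(a) + 1),
)
print(trace)  # prints ['3', '4', '5']
```

A policy trained only to maximize the reward of its first answer is never rewarded for the `verify` and `revise` steps, which is the misalignment the first principle describes.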
Method
REVES is a two-stage iterative framework. Stage I augments data by converting intermediate "near-miss" answers from successful sequential revision (SR) rollouts into decoupled revision and verification prompts. Stage II trains the policy with single-turn RL on this augmented data.
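A minimal sketch of what Stage I could look like follows. It is an illustration under stated assumptions, not the authors' implementation: the `Attempt` and `TrainingPrompt` types, the `augment` function, and both prompt templates are hypothetical, and a "near miss" is assumed here to mean any incorrect intermediate answer in a rollout whose final answer is correct.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Attempt:
    answer: str
    is_correct: bool  # judged by the task's checker (for example, unit tests)


@dataclass
class TrainingPrompt:
    kind: str                               # "revision" or "verification"
    text: str
    expected_verdict: Optional[str] = None  # only set for verification prompts


# Hypothetical templates; the paper's actual wording is not given in this summary.
REVISION_TEMPLATE = (
    "Problem:\n{problem}\n\nPrevious attempt:\n{answer}\n\n"
    "The previous attempt is flawed. Write a corrected answer."
)
VERIFICATION_TEMPLATE = (
    "Problem:\n{problem}\n\nCandidate answer:\n{answer}\n\n"
    "Is this answer correct? Reply 'correct' or 'incorrect' and explain."
)


def augment(problem: str, trajectory: List[Attempt]) -> List[TrainingPrompt]:
    """Turn one sequential-revision rollout into single-turn training prompts."""
    # Keep only recovery trajectories: the final attempt must be correct.
    if not trajectory or not trajectory[-1].is_correct:
        return []
    prompts: List[TrainingPrompt] = []
    for attempt in trajectory[:-1]:
        if attempt.is_correct:
            continue  # assumption: only incorrect intermediate answers are near misses
        fields = {"problem": problem, "answer": attempt.answer}
        # Decoupled prompts: one asks for a fix, the other asks for a judgment.
        prompts.append(TrainingPrompt("revision", REVISION_TEMPLATE.format(**fields)))
        prompts.append(
            TrainingPrompt("verification", VERIFICATION_TEMPLATE.format(**fields), "incorrect")
        )
    return prompts


rollout = [Attempt("x = 1", False), Attempt("x = 2", False), Attempt("x = 3", True)]
prompts = augment("Solve x + 1 = 4.", rollout)
print(len(prompts))  # prints 4
```

Because each prompt is a standalone single-turn example, Stage II can score a revision prompt with the task checker and a verification prompt against `expected_verdict`, without replaying the full multi-turn rollout.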
In practice
- Use decoupled revision and verification prompts for error identification and transformation.
- Filter for successful trajectories to generate high-quality "near-miss" training data.
- Employ continual data augmentation to keep training signals relevant.
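The practices above can be sketched as one loop that regenerates its training data from the current policy every round instead of reusing a fixed dataset. This is a toy illustration under assumptions: `iterate_reves`, `ToyPolicy`, and the dictionary layout of each example are hypothetical, and the "training" step is a stand-in for single-turn RL.

```python
from typing import Callable, List, Sequence, Tuple

Trajectory = List[Tuple[str, bool]]  # (answer, is_correct) per revision step


def iterate_reves(
    problems: Sequence[str],
    rollout: Callable[[str], Trajectory],
    train: Callable[[List[dict]], None],
    num_rounds: int,
) -> List[int]:
    """Alternate augmentation and training; return the augmented-set size per round."""
    sizes = []
    for _ in range(num_rounds):
        batch = []
        for problem in problems:
            traj = rollout(problem)  # fresh rollouts from the *current* policy
            if not traj or not traj[-1][1]:
                continue  # filter: keep only trajectories that end in success
            for answer, ok in traj[:-1]:
                if not ok:
                    batch.append({"problem": problem, "near_miss": answer})
        train(batch)  # stand-in for single-turn RL on the augmented data
        sizes.append(len(batch))
    return sizes


class ToyPolicy:
    """Fake policy that makes fewer mistakes each time it is trained."""

    def __init__(self) -> None:
        self.skill = 0

    def rollout(self, problem: str) -> Trajectory:
        n_wrong = max(0, 2 - self.skill)
        return [(f"wrong-{i}", False) for i in range(n_wrong)] + [("right", True)]

    def train(self, batch: List[dict]) -> None:
        if batch:
            self.skill += 1


policy = ToyPolicy()
sizes = iterate_reves(["p1", "p2"], policy.rollout, policy.train, num_rounds=3)
print(sizes)  # prints [4, 2, 0]
```

The shrinking counts show why a static dataset goes stale: the mistakes collected in round one no longer match what the policy gets wrong by round three.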
Topics
- Large Language Models
- Sequential Revision
- Reinforcement Learning
- Test-Time Scaling
- Data Augmentation
- Error Identification
- Code Generation
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.