Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards
Summary
Correction-OrIented Policy Optimization (CIPO) is an extension to Reinforcement Learning with Verifiable Rewards (RLVR) designed to improve large language model (LLM) reasoning and error correction. Traditional RLVR methods, like Group Relative Policy Optimization (GRPO), often struggle with sparse binary rewards and weak credit assignment, so failed trajectories yield ambiguous optimization signals. CIPO addresses this by converting on-policy failed trajectories into correction-oriented supervision without external signals. It jointly optimizes these correction samples with the standard RLVR objective, enhancing the model's ability to self-correct. Experiments across 11 benchmarks in mathematical reasoning and code generation show that CIPO consistently and significantly outperforms strong baselines. With Seed-Coder-8B, it achieves a 7.63% gain on DebugBench for correction. With Qwen-3-4B, it achieves a 17.56% average accuracy gain across six mathematical benchmarks, surpassing GRPO by 4.55%. CIPO also yields stronger pass@K gains, indicating an expansion of intrinsic reasoning capacity.
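To see why failed trajectories give GRPO a weak signal, it helps to look at how group-relative advantages are usually computed. The sketch below is illustrative only: the function name, the use of the population standard deviation, and the `eps` constant are assumptions, and real implementations differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (GRPO-style).

    Assumption: population std plus a small eps; libraries vary here.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One success among four samples: the three failures all receive the
# same negative advantage, regardless of where each one went wrong.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

# Every sample failed: all advantages are 0.0, so the group contributes
# no gradient signal at all.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

With binary rewards, a failed trajectory is either penalized identically to every other failure in its group or ignored entirely, which is the gap that correction-oriented supervision is meant to fill.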
Key takeaway
For AI Engineers and Research Scientists developing or fine-tuning LLMs for complex reasoning tasks, CIPO offers a robust method to significantly improve both reasoning and error-correction capabilities. You should consider implementing CIPO's approach of transforming failed trajectories into explicit correction signals, as it demonstrably expands intrinsic reasoning capacity beyond simple probability redistribution. This can lead to more reliable and generalizable models for applications like mathematical problem-solving and code generation, reducing reliance on costly external annotations or auxiliary models.
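The claim about expanded reasoning capacity rests on pass@K results. The summary does not say which estimator the paper uses; the sketch below assumes the widely used unbiased estimator, which computes the chance that at least one of `k` samples drawn from `n` generations (of which `c` are correct) passes. The function name and argument names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: total samples generated per problem
    c: number of those samples that pass the verifier
    k: sample budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate becomes 1.0
    # whenever any k-subset must contain at least one correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


# 4 generations, 1 correct, budget of 2 samples.
estimate = pass_at_k(n=4, c=1, k=2)
```

If a method only redistributed probability toward answers the base model could already reach, pass@K at large K would be expected to stay flat, which is why gains there are read as evidence of new capability.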
Key insights
CIPO enhances LLM reasoning by converting failed trajectories into explicit, directional correction signals for policy optimization.
Principles
- Failed trajectories offer rich, exploitable learning signals.
- Directional guidance is superior to uniform penalty.
- Adaptive mechanisms prevent policy degradation during correction.
Method
CIPO constructs correction pairs from on-policy failed trajectories, conditioning the model on the original prompt and its erroneous output to sample refined solutions. This correction objective is jointly optimized with the standard GRPO objective, incorporating adaptive replay and risk-averse reward shaping.
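The correction-pair construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `Trajectory`, `CORRECTION_TEMPLATE`, `collect_correction_samples`, `joint_loss`, and the `weight` hyperparameter are all hypothetical names, the prompt wording is invented, and adaptive replay and risk-averse reward shaping are omitted because the summary does not specify them.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    prompt: str
    response: str
    reward: float  # 1.0 if the verifier accepts the response, else 0.0


# Hypothetical wording: the summary does not give the actual correction prompt.
CORRECTION_TEMPLATE = (
    "{prompt}\n\nPrevious attempt (incorrect):\n{response}\n\n"
    "Find the error and write a corrected solution."
)


def collect_correction_samples(
    rollouts: List[Trajectory],
    sample_fn: Callable[[str], str],
    verify_fn: Callable[[str, str], float],
    n_samples: int = 2,
) -> List[Trajectory]:
    """Turn failed on-policy rollouts into verified correction trajectories."""
    corrections = []
    for t in rollouts:
        if t.reward > 0.0:
            continue  # only failed trajectories become correction prompts
        correction_prompt = CORRECTION_TEMPLATE.format(
            prompt=t.prompt, response=t.response
        )
        for _ in range(n_samples):
            refined = sample_fn(correction_prompt)
            # The refined answer is scored against the original task.
            reward = verify_fn(t.prompt, refined)
            corrections.append(Trajectory(correction_prompt, refined, reward))
    return corrections


def joint_loss(rlvr_loss: float, correction_loss: float, weight: float = 1.0) -> float:
    """Weighted sum of both objectives; `weight` is an assumed hyperparameter."""
    return rlvr_loss + weight * correction_loss


# Toy usage with stand-ins for the policy and the verifier.
rollouts = [
    Trajectory("What is 2 + 2?", "5", 0.0),
    Trajectory("What is 3 + 3?", "6", 1.0),
]
corrections = collect_correction_samples(
    rollouts,
    sample_fn=lambda prompt: "4",
    verify_fn=lambda task, answer: float(task == "What is 2 + 2?" and answer == "4"),
)
```

In a real training loop, `sample_fn` would be the current policy and `verify_fn` the task's verifier (a unit-test runner or answer checker), so no external annotation is needed to build the correction data.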
In practice
- Integrate self-correction into RLVR training for LLMs.
- Prioritize medium-difficulty prompts for efficient learning.
- Use risk-averse reward shaping to prevent capability regressions.
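One plausible way to act on the medium-difficulty advice is to filter prompts by their empirical pass rate within a sampling group. The sketch below is an assumption about how that could be done, not the selection rule from the paper; the function names and the `low` and `high` thresholds are hypothetical.

```python
from typing import Dict, List


def pass_rate(rewards: List[float]) -> float:
    """Fraction of sampled responses that the verifier accepted."""
    return sum(1 for r in rewards if r > 0) / len(rewards)


def select_medium_difficulty(
    group_rewards: Dict[str, List[float]],
    low: float = 0.2,
    high: float = 0.8,
) -> List[str]:
    """Keep prompts whose pass rate lies in [low, high].

    Thresholds are illustrative; prompts with no samples are skipped.
    """
    return [
        prompt
        for prompt, rewards in group_rewards.items()
        if rewards and low <= pass_rate(rewards) <= high
    ]


# Toy usage: rewards per prompt from one round of sampling.
observed = {
    "easy": [1.0, 1.0, 1.0, 1.0],
    "medium": [1.0, 0.0, 0.0, 1.0],
    "hard": [0.0, 0.0, 0.0, 0.0],
}
selected = select_medium_difficulty(observed)
```

Prompts the model always solves or always fails produce little contrast within a group, so a band like this concentrates compute where both successes and correctable failures occur.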
Topics
- Reinforcement Learning with Verifiable Rewards
- Correction-Oriented Policy Optimization
- Large Language Models
- Mathematical Reasoning
- Code Generation
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.