SuCo: Sufficiency-guided Continuous Adaptive Reasoning
Summary
Sufficiency-guided Continuous Adaptive Reasoning (SuCo) is a two-stage training framework that targets a known inefficiency of Large Reasoning Models (LRMs): they often produce overly long Chains of Thought (CoT), which inflates computational cost. SuCo introduces the Minimal Sufficient CoT (MSC), defined as the shortest CoT prefix necessary for a correct answer; empirically, MSC reduces reasoning tokens and improves accuracy. The first stage, MSC-Aligned Fine-Tuning (MFT), generates MSC data using problem-adaptive sufficiency thresholds and fine-tunes the model to internalize concise reasoning patterns. The second stage, Sufficiency-Aware Policy Optimization (SAPO), applies reinforcement learning with dynamic complexity tracking and rewards that penalize both over- and under-thinking. Extensive experiments on mathematics, code, and science benchmarks demonstrate that SuCo consistently improves both accuracy and reasoning efficiency.
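The MSC definition above can be illustrated with a minimal sketch. It assumes the CoT has already been split into steps and that the caller supplies a check, here the hypothetical `answers_correctly`, reporting whether a given prefix is enough to reach the right answer. Neither the step splitting nor the check is specified in this summary, so both are illustrative stand-ins rather than SuCo's actual procedure.

```python
from typing import Callable, List, Optional, Sequence


def minimal_sufficient_cot(
    steps: Sequence[str],
    answers_correctly: Callable[[List[str]], bool],
) -> Optional[List[str]]:
    """Return the shortest prefix of `steps` that passes `answers_correctly`.

    `answers_correctly` is a hypothetical stand-in for a real sufficiency
    check (for example, probing the model for an answer after the prefix).
    Returns None when no prefix passes.
    """
    for k in range(1, len(steps) + 1):
        prefix = list(steps[:k])
        if answers_correctly(prefix):
            return prefix
    return None


# Toy usage: a prefix counts as sufficient once it contains the solved value.
cot = [
    "Let 2x + 3 = 11.",
    "Then 2x = 8.",
    "So x = 4.",
    "Check: 2*4 + 3 = 11.",
    "Wait, let me recheck the arithmetic...",
]
msc = minimal_sufficient_cot(cot, lambda p: any("x = 4" in s for s in p))
```

In this toy case the two trailing verification steps are dropped, which is the kind of redundancy a shortest-sufficient-prefix criterion is meant to trim.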
Key takeaway
Machine Learning Engineers optimizing Large Reasoning Models should consider sufficiency-guided training frameworks such as SuCo. By defining and targeting the Minimal Sufficient CoT, this approach offers a principled way to cut the computational cost of excessively long Chains of Thought while also improving model accuracy. With adaptive reasoning control and sufficiency-aware rewards, teams can deploy LRMs that are more efficient and more accurate across diverse tasks.
Key insights
Optimizing Chain-of-Thought length via sufficiency-guided adaptive reasoning improves LRM efficiency and accuracy.
Principles
- Minimal Sufficient CoT (MSC) reduces tokens and improves accuracy.
- Problem-adaptive sufficiency thresholds scale with question difficulty.
- Penalizing both over- and under-thinking optimizes reasoning.
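The second principle, a sufficiency threshold that scales with question difficulty, might look like the sketch below. The summary does not say how difficulty is measured or in which direction the threshold moves, so everything here is an assumption: difficulty is taken as one minus an estimated pass rate, harder questions get a stricter threshold, and the bounds `low` and `high` are hypothetical.

```python
def adaptive_threshold(pass_rate: float, low: float = 0.5, high: float = 0.9) -> float:
    """Map an estimated pass rate in [0, 1] to a sufficiency threshold.

    Assumptions (not taken from the source): difficulty = 1 - pass_rate,
    and the threshold is interpolated linearly from `low` (easiest
    questions) to `high` (hardest questions).
    """
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must lie in [0, 1]")
    if low > high:
        raise ValueError("low must not exceed high")
    difficulty = 1.0 - pass_rate
    return low + (high - low) * difficulty


easy = adaptive_threshold(pass_rate=1.0)  # question solved on every attempt
hard = adaptive_threshold(pass_rate=0.0)  # question never solved
```

A prefix would then count as sufficient only when its sufficiency score meets the threshold for that particular question, so harder questions keep more of their reasoning under this assumed scheme.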
Method
A two-stage framework: MSC-Aligned Fine-Tuning (MFT) constructs MSC data and fine-tunes for concise patterns, followed by Sufficiency-Aware Policy Optimization (SAPO) using RL with dynamic complexity tracking and sufficiency-aware rewards.
In practice
- Construct MSC data using adaptive sufficiency thresholds.
- Fine-tune models for concise reasoning patterns.
- Apply RL with dynamic complexity tracking and sufficiency-aware rewards.
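The RL step in the list above can be sketched as a toy reward that penalizes both over- and under-thinking relative to a tracked per-problem budget. The summary gives no formulas, so the `ComplexityTracker` class, the moving-average update, and the shape of `sufficiency_reward` are all hypothetical illustrations rather than SAPO's actual definitions.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ComplexityTracker:
    """Hypothetical tracker: moving average of correct-response length per problem."""

    momentum: float = 0.9
    _budget: Dict[str, float] = field(default_factory=dict)

    def update(self, problem_id: str, num_tokens: int, correct: bool) -> None:
        # Only correct responses inform how much reasoning a problem needs.
        if not correct:
            return
        old = self._budget.get(problem_id)
        if old is None:
            self._budget[problem_id] = float(num_tokens)
        else:
            self._budget[problem_id] = (
                self.momentum * old + (1.0 - self.momentum) * num_tokens
            )

    def budget(self, problem_id: str, default: float) -> float:
        return self._budget.get(problem_id, default)


def sufficiency_reward(
    correct: bool, num_tokens: int, budget: float, penalty: float = 0.5
) -> float:
    """Toy reward (an assumption, not the paper's formula).

    Correct answers earn 1.0 minus a capped penalty for exceeding the budget.
    Wrong answers earn a negative reward that grows the further the response
    falls short of the budget, treating that as under-thinking.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = num_tokens / budget
    if correct:
        overshoot = min(max(0.0, ratio - 1.0), 1.0)
        return 1.0 - penalty * overshoot
    shortfall = max(0.0, 1.0 - ratio)
    return -penalty * shortfall


tracker = ComplexityTracker(momentum=0.5)
tracker.update("p1", 100, correct=True)
tracker.update("p1", 200, correct=True)
reward = sufficiency_reward(True, 300, tracker.budget("p1", default=256.0))
```

Tying the budget to observed correct responses lets the length target move with each problem during training instead of being fixed in advance.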
Topics
- Large Reasoning Models
- Chain-of-Thought Optimization
- Sufficiency-guided Reasoning
- Reinforcement Learning
- Model Efficiency
- Adaptive Reasoning
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.