Pruning Long Chain-of-Thought of Large Reasoning Models via Small-Scale Preference Optimization
Summary
Researchers from the University of Science and Technology of China and Meituan have developed Length Controlled Preference Optimization (LCPO), a method to shorten the lengthy Chain-of-Thought (CoT) outputs of Large Reasoning Models (LRMs) such as DeepSeek-R1-Distill-Qwen-1.5B and DeepSeek-R1-Distill-Qwen-7B. LRMs often generate verbose responses, which increases computational cost and can lead to "overthinking" on simpler tasks. LCPO addresses this by analyzing the distribution of generation paths and filtering trajectories based on difficulty estimation. It then applies small-scale preference optimization that balances the implicit reward related to the Negative Log-Likelihood (NLL) loss. Experiments across six math reasoning benchmarks, including MATH-500 and GSM8K, show that LCPO reduces average output length by over 50% while preserving reasoning performance. The method requires only 0.8k training samples and 50 training steps, significantly lowering computational demands compared to prior approaches.
Key takeaway
For AI engineers optimizing Large Reasoning Models for efficiency, LCPO offers a practical way to significantly reduce output length without sacrificing performance. Consider using LCPO to prune unnecessary CoT steps, especially when computational resources are constrained or when models exhibit overthinking. Its low data and training requirements (0.8k samples, 50 steps) make it well suited to fine-tuning existing LRMs such as DeepSeek-R1-Distill-Qwen-7B on specific reasoning benchmarks.
Key insights
LCPO efficiently prunes lengthy LRM Chain-of-Thought outputs by balancing NLL loss with small-scale preference optimization.
Principles
- Shorter reasoning paths exist within LRM generation spaces.
- NLL loss can hinder length preference learning.
- Reward margins are crucial for length control.
Method
LCPO analyzes LRM generation paths, filters data by difficulty, and uses preference optimization to balance NLL-related implicit rewards, enabling efficient length control with minimal training data.
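For orientation, the generic Bradley-Terry preference loss with an implicit reward, as used in standard direct preference optimization, is written out below. This is not the exact LCPO objective, which this summary does not reproduce: the symbols are assumptions for illustration, with y_w taken to be the shorter preferred response, y_l the longer rejected one, pi_ref a frozen reference model, and beta a scaling hyperparameter. LCPO's additional balancing of the NLL-related reward is not shown.

```latex
\mathcal{L}_{\mathrm{BT}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
    \left[ \log \sigma\big( r_\theta(x, y_w) - r_\theta(x, y_l) \big) \right],
\qquad
r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
```

The difference r_theta(x, y_w) - r_theta(x, y_l) is the reward margin: the loss shrinks as the policy assigns relatively more probability to the shorter response than the reference model does.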
In practice
- Filter LRM outputs by question difficulty for concise training data.
- Use preference optimization to bias LRM trajectory distribution.
- Balance NLL loss with a counterpart term for length preference.
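The first two steps above can be sketched as a small data-preparation routine. This is a minimal illustration under stated assumptions, not the authors' pipeline: difficulty is approximated by the pass rate over sampled responses, the shortest correct response is treated as preferred and the longest correct one as rejected, and the names `Sample`, `pass_rate`, `build_pairs`, `min_pass_rate`, and `min_len_ratio` (with their default values) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    text: str       # full response, including the chain of thought
    n_tokens: int   # response length in tokens
    correct: bool   # whether the final answer matched the reference


def pass_rate(samples):
    """Fraction of sampled responses that are correct (a crude difficulty proxy)."""
    return sum(s.correct for s in samples) / len(samples)


def build_pairs(question_to_samples, min_pass_rate=0.5, min_len_ratio=1.5):
    """Build (prompt, chosen, rejected) length-preference pairs.

    Assumed heuristic: keep only questions the model already solves often
    enough, then prefer the shortest correct response over the longest one,
    provided the two differ in length by at least `min_len_ratio`.
    """
    pairs = []
    for question, samples in question_to_samples.items():
        # Skip questions with no samples or that look too hard for the model.
        if not samples or pass_rate(samples) < min_pass_rate:
            continue
        correct = sorted((s for s in samples if s.correct), key=lambda s: s.n_tokens)
        if len(correct) < 2:
            continue
        chosen, rejected = correct[0], correct[-1]
        # Require a clear length gap so the pair carries a length signal.
        if rejected.n_tokens >= min_len_ratio * chosen.n_tokens:
            pairs.append(
                {"prompt": question, "chosen": chosen.text, "rejected": rejected.text}
            )
    return pairs
```

The `prompt`/`chosen`/`rejected` layout mirrors the record format commonly consumed by preference-optimization trainers, so the resulting pairs could be passed to such a trainer. The NLL-balancing term from the third step is specific to LCPO and is not sketched here.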
Topics
- Large Reasoning Models
- Chain-of-Thought Pruning
- Length Controlled Preference Optimization
- Preference Optimization
- Bradley-Terry Loss
Best for: AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.