Pruning Long Chain-of-Thought of Large Reasoning Models via Small-Scale Preference Optimization

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, long

Summary

Researchers from the University of Science and Technology of China and Meituan propose Length Controlled Preference Optimization (LCPO), a method for shortening the long Chain-of-Thought (CoT) outputs of Large Reasoning Models (LRMs) such as DeepSeek-R1-Distill-Qwen-1.5B and DeepSeek-R1-Distill-Qwen-7B. LRMs often generate verbose responses, which raises inference cost and can lead to "overthinking" on simpler tasks. LCPO first analyzes the distribution of generation paths and filters trajectories using a difficulty estimate. It then applies small-scale preference optimization that balances the implicit reward related to the Negative Log-Likelihood (NLL) loss. Experiments on six math reasoning benchmarks, including MATH-500 and GSM8K, show that LCPO cuts average output length by over 50% while preserving reasoning performance. The method needs only 0.8k training samples and 50 training steps, far less data and compute than prior approaches.

Key takeaway

For AI engineers optimizing Large Reasoning Models for efficiency, LCPO offers a practical way to cut output length by over 50% without sacrificing reasoning performance. Consider it for pruning unnecessary CoT steps, especially when compute is constrained or when a model overthinks simple problems. Its low data and training requirements (0.8k samples, 50 steps) make it inexpensive to apply to existing LRMs such as DeepSeek-R1-Distill-Qwen-7B on specific reasoning benchmarks.

Key insights

LCPO efficiently prunes lengthy LRM Chain-of-Thought outputs by balancing NLL loss with small-scale preference optimization.
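
To make "balancing NLL loss with preference optimization" concrete, the block below writes out the standard DPO objective together with a length-normalized NLL term on the preferred response. This is a generic sketch and not the exact LCPO objective, which the summary does not state. The symbols are assumptions introduced here: y_s is the shorter (preferred) response, y_l the longer (rejected) one, \pi_ref a frozen reference model, and \beta and \lambda are weighting hyperparameters.

```latex
\begin{aligned}
r_\theta(x, y) &= \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
  && \text{(implicit reward)} \\
\mathcal{L}_{\mathrm{DPO}} &= -\,\mathbb{E}_{(x,\, y_s,\, y_l)}
  \left[ \log \sigma\big( r_\theta(x, y_s) - r_\theta(x, y_l) \big) \right] \\
\mathcal{L}_{\mathrm{NLL}} &= -\,\mathbb{E}_{(x,\, y_s)}
  \left[ \frac{1}{|y_s|} \log \pi_\theta(y_s \mid x) \right] \\
\mathcal{L} &= \mathcal{L}_{\mathrm{DPO}} + \lambda\, \mathcal{L}_{\mathrm{NLL}}
\end{aligned}
```

Because the implicit reward is itself a log-likelihood ratio, raising the reward of y_s and lowering the NLL of y_s push in related directions. In this sketch, \lambda is the knob that trades the two off.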

Principles

Method

LCPO analyzes LRM generation paths, filters data by difficulty, and uses preference optimization to balance NLL-related implicit rewards, enabling efficient length control with minimal training data.
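
The filtering and pairing step described above might look roughly like the following. This is a minimal sketch under stated assumptions, not the authors' pipeline: difficulty is approximated here by the pass rate over several sampled responses, and the names `Sample`, `build_pair`, `min_pass`, and `min_gap`, along with their default values, are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Sample:
    """One sampled response to a question (hypothetical container)."""
    text: str
    n_tokens: int
    correct: bool


def pass_rate(samples: List[Sample]) -> float:
    """Fraction of correct samples; used here as a proxy for difficulty."""
    return sum(s.correct for s in samples) / len(samples)


def build_pair(
    samples: List[Sample],
    min_pass: float = 0.5,  # assumed threshold: keep questions the model mostly solves
    min_gap: float = 1.5,   # assumed ratio: rejected must be clearly longer than chosen
) -> Optional[Tuple[Sample, Sample]]:
    """Return (chosen, rejected) for one question, or None if it is filtered out.

    chosen   = shortest correct response
    rejected = longest correct response
    """
    if not samples or pass_rate(samples) < min_pass:
        return None  # too hard (or no data): long reasoning may be needed here
    correct = [s for s in samples if s.correct]
    if not correct:
        return None
    chosen = min(correct, key=lambda s: s.n_tokens)
    rejected = max(correct, key=lambda s: s.n_tokens)
    if rejected.n_tokens < min_gap * chosen.n_tokens:
        return None  # lengths too similar to carry a useful preference signal
    return chosen, rejected
```

Both sides of each pair are correct, so the preference signal in this sketch is about length alone. Filtering out low pass-rate questions is one way to avoid teaching the model to cut reasoning short on problems that may need it.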

Best for: AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.