Pruning Long Chain-of-Thought of Large Reasoning Models via Small-Scale Preference Optimization

2026-04-16 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, long

Summary

Researchers from the University of Science and Technology of China and Meituan have developed Length Controlled Preference Optimization (LCPO), a method for shortening the lengthy Chain-of-Thought (CoT) outputs of Large Reasoning Models (LRMs) such as DeepSeek-R1-Distill-Qwen-1.5B and DeepSeek-R1-Distill-Qwen-7B. LRMs often generate verbose responses, which raises computational cost and can lead to "overthinking" on simpler tasks. LCPO first analyzes the distribution of generation paths and filters trajectories using a difficulty estimate. It then applies small-scale preference optimization that balances the implicit reward related to the Negative Log-Likelihood (NLL) loss. Experiments on six math reasoning benchmarks, including MATH-500 and GSM8K, show that LCPO reduces average output length by over 50% while preserving reasoning performance. The method requires only 0.8k training samples and 50 training steps, far less data and compute than prior approaches.

Key takeaway

For AI engineers optimizing Large Reasoning Models for efficiency, LCPO offers a practical way to cut output length by more than half without sacrificing reasoning performance. Consider it for pruning unnecessary CoT steps when computational resources are constrained or when a model overthinks simple tasks. Its low data and training requirements (0.8k samples, 50 steps) make it inexpensive to apply to existing LRMs such as DeepSeek-R1-Distill-Qwen-7B on specific reasoning benchmarks.

Key insights

LCPO prunes lengthy LRM Chain-of-Thought outputs with small-scale preference optimization that balances the NLL-related implicit reward.

Principles

Shorter reasoning paths exist within LRM generation spaces.
NLL loss can hinder length preference learning.
Reward margins are crucial for length control.
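
For readers unfamiliar with the term "reward margin", the standard Bradley-Terry preference model and the DPO-style implicit reward are written out here as background. This is the textbook formulation, assumed for illustration; the summary does not give LCPO's exact objective, which may differ. Here y_w is the preferred (shorter) response, y_l the rejected (longer) one, and beta a scaling hyperparameter.

```latex
% Bradley-Terry preference probability with reward r
P(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big)

% DPO-style implicit reward (assumed background, not confirmed as LCPO's form)
r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}

% Reward margin and the resulting loss
\Delta_\theta = r_\theta(x, y_w) - r_\theta(x, y_l), \qquad
\mathcal{L}_{\mathrm{BT}} = -\log \sigma(\Delta_\theta)
```

Since log pi_theta(y | x) is the negative of the summed token-level NLL of y, the margin is built from NLL differences between the two responses.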

Method

LCPO analyzes LRM generation paths, filters data by difficulty, and uses preference optimization to balance NLL-related implicit rewards, enabling efficient length control with minimal training data.
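
The data-preparation side of this description could be sketched as below. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `Sample` record, the error-rate difficulty proxy, the `max_difficulty` threshold of 0.5, and the "shortest correct vs. longest correct" pairing rule are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    """One sampled response to a question (hypothetical record layout)."""
    text: str
    num_tokens: int
    is_correct: bool


def estimate_difficulty(samples: list[Sample]) -> float:
    """Difficulty proxy (assumption): fraction of sampled responses that are wrong."""
    if not samples:
        raise ValueError("need at least one sample")
    wrong = sum(1 for s in samples if not s.is_correct)
    return wrong / len(samples)


def build_pair(
    samples: list[Sample], max_difficulty: float = 0.5
) -> Optional[tuple[Sample, Sample]]:
    """Return (chosen, rejected) for one question, or None if it is filtered out.

    Assumed rule: keep questions at or below the difficulty threshold, then
    prefer the shortest correct response over the longest correct one.
    """
    if estimate_difficulty(samples) > max_difficulty:
        return None
    correct = [s for s in samples if s.is_correct]
    if len(correct) < 2:
        return None
    chosen = min(correct, key=lambda s: s.num_tokens)
    rejected = max(correct, key=lambda s: s.num_tokens)
    if chosen.num_tokens == rejected.num_tokens:
        return None  # no length contrast to learn from
    return chosen, rejected
```

Calling `build_pair` once per question over a pool of sampled responses would yield a small set of preference pairs; pairing only correct responses keeps the length preference from being confounded with correctness.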

In practice

Filter LRM outputs by question difficulty for concise training data.
Use preference optimization to bias LRM trajectory distribution.
Balance NLL loss with a counterpart term for length preference.
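
To make the preference-optimization step concrete, here is a minimal sketch of a standard DPO-style Bradley-Terry loss on one (shorter, longer) pair. It is not the LCPO objective, whose balancing term is not specified in this summary; the function names, the `beta` value, and the example log-probabilities are assumptions for illustration. Inputs are sequence log-probabilities, i.e. the negative summed token NLL of each response.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def implicit_reward(logp_policy: float, logp_ref: float, beta: float) -> float:
    """DPO-style implicit reward: beta * (log pi_theta - log pi_ref)."""
    return beta * (logp_policy - logp_ref)


def preference_loss(
    logp_w: float,      # policy log-prob of the preferred (shorter) response
    logp_l: float,      # policy log-prob of the rejected (longer) response
    ref_logp_w: float,  # reference-model log-prob of the preferred response
    ref_logp_l: float,  # reference-model log-prob of the rejected response
    beta: float = 0.1,  # assumed value, not taken from the paper
) -> tuple[float, float]:
    """Return (Bradley-Terry loss, reward margin) for one preference pair."""
    margin = implicit_reward(logp_w, ref_logp_w, beta) - implicit_reward(
        logp_l, ref_logp_l, beta
    )
    return -log_sigmoid(margin), margin


# Policy identical to the reference: zero margin, loss equals log(2).
loss_start, margin_start = preference_loss(-50.0, -120.0, -50.0, -120.0)

# Policy has shifted probability toward the shorter response: positive margin.
loss_later, margin_later = preference_loss(-40.0, -130.0, -50.0, -120.0)
```

The loss falls as the margin grows, which is one way to read the emphasis on reward margins for length control.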

Topics

Large Reasoning Models
Chain-of-Thought Pruning
Length Controlled Preference Optimization
Preference Optimization
Bradley-Terry Loss

Best for: AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.