ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning
Summary
ThoughtFold is a framework for mitigating "over-thinking" in Large Reasoning Models (LRMs) trained with Reinforcement Learning with Verifiable Rewards (RLVR) on Chain-of-Thought (CoT) trajectories. Because RLVR rewards only the final outcome, existing methods often reinforce redundant explorations inside long, outcome-correct CoT trajectories, which makes reasoning inefficient. ThoughtFold addresses this with fine-grained introspective preference learning. It identifies redundancy within individual correct trajectories and uses it to generate a spectrum of candidate sub-trajectories. A masked preference optimization objective then explicitly penalizes the redundant explorations, encouraging the model to form more concise reasoning paths. In the reported experiments, ThoughtFold reduces the token usage of DeepSeek-R1-Distill-Qwen-7B by approximately 56% while maintaining accuracy.
Key takeaway
For Machine Learning Engineers optimizing Large Reasoning Models, ThoughtFold offers a way to improve inference efficiency. If your models over-think or spend many tokens on Chain-of-Thought output, consider exploring introspective preference learning techniques. Fewer generated tokens generally mean lower compute cost and latency; the reported result is an approximately 56% token reduction on DeepSeek-R1-Distill-Qwen-7B with accuracy maintained. Similar fine-grained optimization could make your LRM deployments more practical and scalable.
Key insights
ThoughtFold uses introspective preference learning to prune redundant steps in reasoning chains, significantly boosting LRM efficiency.
Principles
- Redundant explorations in CoTs hinder LRM efficiency.
- Fine-grained preference learning can optimize reasoning paths.
- Introspection identifies sub-trajectories for conciseness.
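To make the idea of a "spectrum of candidate sub-trajectories" concrete, here is a minimal sketch. It assumes the redundant steps of a correct trajectory have already been identified and are passed in as indices; how ThoughtFold performs that identification is not described in this summary. The function name, the step strings, and the ordering of candidates are illustrative assumptions, not the paper's procedure.

```python
from typing import List, Sequence


def candidate_sub_trajectories(
    steps: Sequence[str], redundant: Sequence[int]
) -> List[List[str]]:
    """Return candidates ordered from the full trajectory to the most folded one.

    Candidate k drops the first k step indices listed in `redundant`.
    (Hypothetical helper: the redundancy indices are an assumed input.)
    """
    candidates = []
    for k in range(len(redundant) + 1):
        dropped = set(redundant[:k])
        candidates.append([s for i, s in enumerate(steps) if i not in dropped])
    return candidates


# Toy trajectory; steps 2 and 4 are assumed to be flagged as redundant.
steps = [
    "set up equation",
    "try substitution",
    "re-check setup",
    "solve",
    "re-verify answer",
    "state answer",
]
spectrum = candidate_sub_trajectories(steps, redundant=[2, 4])
for candidate in spectrum:
    print(len(candidate), candidate)
```

Shorter candidates that still reach the correct answer could then serve as the preferred side of a preference pair, with longer ones as the dispreferred side.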
Method
ThoughtFold applies introspective analysis to outcome-correct CoT trajectories to identify redundant segments and generate candidate sub-trajectories. It then uses masked preference optimization to penalize the redundant steps, folding reasoning chains into more efficient paths.
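The summary does not give the exact form of the masked preference objective. As a rough illustration only, the sketch below assumes a DPO-style pairwise loss in which a 0/1 mask selects which tokens of each trajectory contribute to the log-probability ratio. The function names, the `beta` value, and the mask semantics are all assumptions made for this example.

```python
import math
from typing import Sequence


def masked_logratio(
    policy_lp: Sequence[float], ref_lp: Sequence[float], mask: Sequence[int]
) -> float:
    """Sum of (policy - reference) token log-probs over tokens where mask == 1."""
    return sum(m * (p - r) for p, r, m in zip(policy_lp, ref_lp, mask))


def masked_preference_loss(
    chosen_policy: Sequence[float],
    chosen_ref: Sequence[float],
    chosen_mask: Sequence[int],
    rejected_policy: Sequence[float],
    rejected_ref: Sequence[float],
    rejected_mask: Sequence[int],
    beta: float = 0.1,  # assumed temperature, as in DPO-style losses
) -> float:
    """Assumed DPO-style loss: -log(sigmoid(beta * (chosen - rejected)))."""
    margin = beta * (
        masked_logratio(chosen_policy, chosen_ref, chosen_mask)
        - masked_logratio(rejected_policy, rejected_ref, rejected_mask)
    )
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Toy pair: the concise trajectory is preferred; in the longer one only the
# last token (assumed to be the redundant exploration) is unmasked.
loss = masked_preference_loss(
    chosen_policy=[-0.5, -0.5],
    chosen_ref=[-1.0, -1.0],
    chosen_mask=[1, 1],
    rejected_policy=[-1.0, -1.0, -2.0],
    rejected_ref=[-1.0, -1.0, -1.0],
    rejected_mask=[0, 0, 1],
)
print(loss)
```

Masking keeps the penalty focused on the tokens marked as redundant, so tokens shared by both trajectories do not push the loss in either direction. In this toy pair the loss falls below log 2, the value it takes when the policy and reference agree on every unmasked token.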
In practice
- Reduce LRM inference token usage (approximately 56% reported on DeepSeek-R1-Distill-Qwen-7B).
- Optimize CoT generation for specific tasks.
- Improve reasoning efficiency on DeepSeek-R1-Distill-Qwen-7B.
Topics
- Large Reasoning Models
- Chain-of-Thought
- Preference Learning
- AI Efficiency
- Introspective Learning
- Reinforcement Learning
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.