ADaPT: Token-Level Decoupling for Efficient Large Reasoning Models
Summary
ADaPT (Adaptive Dual-Process Thinking) is a token-level dual-process framework that aims to make large reasoning models more efficient without sacrificing performance. It targets the high computational cost of long chain-of-thought reasoning. Existing methods that shorten reasoning or mix reasoning strategies often degrade reasoning capability in the process. ADaPT instead decouples efficiency and correctness signals during training by introducing a "mode-selection token" that controls whether the model reasons in fast or slow mode. Efficiency-related rewards are applied solely to this token, so correct long reasoning is not penalized while efficient reasoning is still encouraged where it suffices. The framework also offers precise, continuous control over the efficiency-performance trade-off at inference time: a single trained model can move along the Pareto frontier by adjusting the mode-selection token's generation probability. The reported experiments show that ADaPT significantly reduces inference cost while maintaining strong reasoning performance across multiple benchmarks.
Key takeaway
For Machine Learning Engineers optimizing large reasoning models, ADaPT addresses the efficiency-performance trade-off. If long chain-of-thought reasoning is driving up your compute costs, consider token-level decoupling with a mode-selection token. The approach gives you control over the trade-off at inference time, and the reported results indicate reduced cost without degraded reasoning capability.
Key insights
ADaPT decouples efficiency and correctness in large reasoning models via a token-level dual-process framework.
Principles
- Decouple efficiency and correctness signals.
- Apply efficiency rewards to a dedicated token.
- Enable continuous trade-off control at inference.
Method
ADaPT introduces a mode-selection token to control fast/slow reasoning, applying efficiency rewards exclusively to this token during training to avoid penalizing correct long reasoning.
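The credit assignment described here can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the policy-gradient-style loss, the position of the mode-selection token, and the choice to let that token also share the correctness signal are all assumptions made for illustration. The only property taken from the summary is that the efficiency term touches exactly one token.

```python
from typing import List


def per_token_advantages(
    num_tokens: int,
    mode_token_index: int,
    correctness_adv: float,
    efficiency_adv: float,
) -> List[float]:
    """Build one advantage value per generated token (hypothetical scheme)."""
    if not 0 <= mode_token_index < num_tokens:
        raise ValueError("mode_token_index out of range")
    # Assumption: every token shares the same correctness signal.
    advantages = [correctness_adv] * num_tokens
    # The efficiency signal is added to the mode-selection token only,
    # so the reasoning tokens are never penalized for length.
    advantages[mode_token_index] += efficiency_adv
    return advantages


def policy_gradient_loss(log_probs: List[float], advantages: List[float]) -> float:
    """Advantage-weighted negative log-likelihood (assumed loss form)."""
    if len(log_probs) != len(advantages):
        raise ValueError("log_probs and advantages must have equal length")
    return -sum(lp * adv for lp, adv in zip(log_probs, advantages))


# A correct but long response whose mode token (index 0) chose slow reasoning:
# the efficiency penalty lowers the advantage of that one token only.
example = per_token_advantages(
    num_tokens=5, mode_token_index=0, correctness_adv=1.0, efficiency_adv=-0.5
)
```

In this toy case the four reasoning tokens keep their full correctness advantage, while only the mode-selection decision is nudged toward the cheaper mode.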
In practice
- Implement a mode-selection token for efficiency control.
- Adjust mode-selection token probability for inference trade-offs.
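One possible way to adjust the mode-selection token's probability at inference is to add a bias to its logit before sampling. The sketch below assumes exactly two competing mode tokens (called "fast" and "slow" here) and an additive logit bias; the summary does not specify the mechanism, so the function names and the `bias` parameter are hypothetical.

```python
import math
import random
from typing import Optional


def fast_mode_probability(fast_logit: float, slow_logit: float, bias: float = 0.0) -> float:
    """Probability of the fast-mode token after adding `bias` to its logit."""
    # A two-way softmax reduces to a sigmoid of the (biased) logit gap.
    gap = fast_logit + bias - slow_logit
    # Numerically stable sigmoid: avoid calling exp() on large positive values.
    if gap >= 0:
        return 1.0 / (1.0 + math.exp(-gap))
    e = math.exp(gap)
    return e / (1.0 + e)


def choose_mode(
    fast_logit: float,
    slow_logit: float,
    bias: float = 0.0,
    rng: Optional[random.Random] = None,
) -> str:
    """Sample the reasoning mode; a larger bias favors cheaper, fast reasoning."""
    rng = rng or random.Random()
    p_fast = fast_mode_probability(fast_logit, slow_logit, bias)
    return "fast" if rng.random() < p_fast else "slow"


# Sweeping the bias moves one trained model between mostly-slow and mostly-fast.
sweep = [fast_mode_probability(0.0, 1.0, bias=b) for b in (-2.0, 0.0, 2.0)]
```

Because the bias is a continuous knob, the same model can be tuned toward lower cost or higher accuracy without retraining, which is the kind of control the bullets above describe.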
Topics
- Large Reasoning Models
- Chain-of-Thought
- Inference Efficiency
- Token-Level Decoupling
- Dual-Process Thinking
- Computational Cost
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.