Dynamic Thinking-Token Selection for Efficient Reasoning in Large Reasoning Models
Summary
Dynamic Thinking-Token Selection (DynTS) is a KV cache compression method that makes Large Reasoning Models (LRMs) such as DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B more efficient. It targets the substantial memory and computational overhead of LRMs' extended reasoning traces. DynTS identifies "decision-critical tokens", retains only their Key-Value (KV) cache states, and evicts the redundant entries. Across six mathematical and scientific reasoning benchmarks (AIME24, AIME25, AMC23, GK23EN, MATH500, GPQA-D), DynTS improves Pass@1 by 2.6% compared to state-of-the-art methods. It also reduces inference latency by 1.84–2.62x and peak KV-cache memory footprint by 3.32–5.73x without compromising LRM reasoning performance.
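To give a sense of why long reasoning traces are memory-hungry, the sketch below applies the standard KV cache size formula (keys plus values, per layer, per KV head). The layer count, KV head count, head dimension, and trace length are illustrative assumptions, not figures from the source.

```python
def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2 covers the separate key and value tensors.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Illustrative (assumed) configuration: 28 layers, 4 KV heads,
# head dimension 128, fp16 entries, a 16,384-token reasoning trace.
full = kv_cache_bytes(seq_len=16384, num_layers=28, num_kv_heads=4, head_dim=128)
print(f"uncompressed: {full / 2**30:.3f} GiB per sequence")
```

Because the size grows linearly with the number of cached tokens, evicting most reasoning tokens shrinks the per-sequence footprint roughly in proportion.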
Key takeaway
For MLOps Engineers and Machine Learning Engineers deploying Large Reasoning Models, DynTS offers a way to reduce inference cost and memory footprint. Consider integrating this dynamic KV cache compression method, especially for long-decoding LRM tasks. Tuning the cache budget and token retention ratio for your specific model architecture and task difficulty will be key to balancing efficiency and reasoning accuracy.
Key insights
DynTS efficiently prunes redundant reasoning tokens in LRMs by identifying decision-critical KV cache states.
Principles
- Only 20-30% of LRM reasoning tokens are critical for the final answer.
- Retaining redundant tokens significantly degrades LRM reasoning performance.
- Local linguistic coherence is crucial for stable LRM performance.
Method
DynTS employs a learnable Importance Predictor to score each token's contribution to the final answer, periodically retaining high-scoring KV cache entries along with a local window, while evicting others.
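One selection step of this kind can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `select_kv_indices`, its parameters, and the assumption that importance scores arrive as a plain list of floats are all hypothetical.

```python
from typing import List, Sequence


def select_kv_indices(scores: Sequence[float], retention_ratio: float,
                      local_window: int) -> List[int]:
    """Return the sorted cache positions to keep after one selection step.

    scores[i] is the predicted importance of cached token i (assumed to
    come from some importance predictor; higher means more critical).
    """
    n = len(scores)
    window_start = max(0, n - local_window)
    # The most recent `local_window` positions are always kept.
    keep = set(range(window_start, n))
    # Older positions compete on predicted importance.
    older = range(window_start)
    k = int(len(older) * retention_ratio)
    top = sorted(older, key=lambda i: scores[i], reverse=True)[:k]
    keep.update(top)
    return sorted(keep)


scores = [0.9, 0.1, 0.8, 0.2, 0.3, 0.05, 0.4, 0.6]
print(select_kv_indices(scores, retention_ratio=0.5, local_window=2))
# → [0, 2, 4, 6, 7]
```

In a real model the returned indices would be used to gather the matching key and value tensors in every layer; the entries at all other positions are evicted.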
In practice
- Train an Importance Predictor on correct reasoning traces to identify critical tokens.
- Implement periodic KV cache selection based on predicted token importance scores.
- Optimize local window size and retention ratio for specific LRM architectures.
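When tuning these settings, a toy simulation can show how the selection interval, retention ratio, and local window interact to bound cache size. The sketch below tracks token positions only, not real tensors, and assumes selection fires every `interval` decoded tokens. The trigger rule, the function name `simulate_peak_cache`, and the stand-in `score_fn` are assumptions for illustration.

```python
from typing import Callable, List, Tuple


def simulate_peak_cache(num_tokens: int, interval: int, retention_ratio: float,
                        local_window: int,
                        score_fn: Callable[[int], float]) -> Tuple[int, int]:
    """Return (peak cache size, final cache size) over a decoding run."""
    cache: List[int] = []  # token positions currently held in the cache
    peak = 0
    for pos in range(num_tokens):
        cache.append(pos)
        peak = max(peak, len(cache))
        if (pos + 1) % interval == 0:
            # Split the cache into the protected recent window and the rest.
            recent = cache[-local_window:] if local_window > 0 else []
            older = cache[:-local_window] if local_window > 0 else cache[:]
            k = int(len(older) * retention_ratio)
            # Keep the k highest-scoring older positions, in original order.
            kept_older = sorted(sorted(older, key=score_fn, reverse=True)[:k])
            cache = kept_older + recent
    return peak, len(cache)


# Stand-in scorer; a trained importance predictor would replace this.
toy_score = lambda pos: (pos * 37) % 101
print(simulate_peak_cache(8, interval=4, retention_ratio=0.5,
                          local_window=2, score_fn=toy_score))
# → (7, 4)
```

Sweeping `retention_ratio` and `local_window` in such a simulation gives a quick estimate of peak memory before measuring accuracy on the actual model.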
Topics
- KV Cache Compression
- Large Reasoning Models
- Inference Efficiency
- Token Pruning
- DeepSeek-R1
- Attention Mechanisms
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.