Dynamic Thinking-Token Selection for Efficient Reasoning in Large Reasoning Models

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering) · Depth: Expert, extended

Summary

Dynamic Thinking-Token Selection (DynTS) is a KV cache compression method for Large Reasoning Models (LRMs) such as DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B. It targets the memory and compute overhead of the long reasoning traces these models generate. DynTS identifies "decision-critical tokens", retains their Key-Value (KV) cache states, and evicts redundant entries. Across six mathematical and scientific reasoning benchmarks (AIME24, AIME25, AMC23, GK23EN, MATH500, GPQA-D), DynTS improves Pass@1 by 2.6% compared to state-of-the-art methods. It also reduces inference latency by 1.84–2.62x and peak KV-cache memory footprint by 3.32–5.73x without compromising reasoning performance.
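To make the memory figures concrete, the sketch below estimates KV-cache size with the standard per-token formula (one key tensor and one value tensor per layer) and applies the reported 3.32–5.73x range to it. The model dimensions, sequence length, and the function name kv_cache_bytes are illustrative assumptions, not values taken from the source.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem=2, batch_size=1):
    """Estimate KV-cache size in bytes for a decoder-only transformer.

    The leading factor of 2 accounts for storing both keys and values
    in every layer. bytes_per_elem=2 corresponds to fp16/bf16 storage.
    """
    return (2 * batch_size * num_layers * num_kv_heads
            * head_dim * seq_len * bytes_per_elem)


# Hypothetical configuration (assumed for illustration only).
full = kv_cache_bytes(seq_len=32768, num_layers=32,
                      num_kv_heads=8, head_dim=128)
print(f"uncompressed: {full / 2**30:.2f} GiB")

# Apply the reduction factors quoted in the summary above.
for ratio in (3.32, 5.73):
    print(f"{ratio}x smaller: {full / ratio / 2**30:.2f} GiB")
```

Under these assumed dimensions the uncompressed cache is 4 GiB at 32,768 tokens, so the reported range would correspond to roughly 0.7 to 1.2 GiB. Actual savings depend on the model and on how long the decode runs.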

Key takeaway

For MLOps and machine learning engineers deploying Large Reasoning Models, DynTS offers a way to reduce inference costs and memory footprint. Consider integrating this dynamic KV cache compression method, especially for long-decoding LRM tasks. Tuning the cache budget and token retention ratio for your model architecture and task difficulty will be key to balancing efficiency and reasoning accuracy.

Key insights

DynTS efficiently prunes redundant reasoning tokens in LRMs by identifying decision-critical KV cache states.

Method

DynTS uses a learnable Importance Predictor to score each token's contribution to the final answer. At periodic intervals it retains the high-scoring KV cache entries together with a local window and evicts the rest.
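The retention step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: importance scores are passed in as plain floats (in DynTS they would come from the learned predictor), the local window is interpreted as the most recent positions, and the names select_kv_indices, simulate_decode, budget, window, and interval are hypothetical.

```python
from typing import List, Tuple


def select_kv_indices(scores: List[float], budget: int, window: int) -> List[int]:
    """Pick which cache positions to keep.

    Keeps the last `window` positions (assumed local window) plus the
    highest-scoring older positions, for at most `budget` entries total.
    """
    n = len(scores)
    if n <= budget:
        return list(range(n))  # nothing to evict yet
    window = min(window, budget)
    local_start = n - window
    local = list(range(local_start, n))
    k = budget - window  # slots left for older, high-scoring tokens
    older = sorted(range(local_start), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(older + local)


def simulate_decode(token_scores: List[float], budget: int, window: int,
                    interval: int) -> Tuple[List[int], int]:
    """Append one cache entry per decoded token; compress every `interval` steps.

    Returns the surviving token positions and the peak cache length seen.
    """
    kept: List[Tuple[int, float]] = []  # (original position, score)
    peak = 0
    for pos, score in enumerate(token_scores):
        kept.append((pos, score))
        peak = max(peak, len(kept))
        if (pos + 1) % interval == 0:
            idx = select_kv_indices([s for _, s in kept], budget, window)
            kept = [kept[i] for i in idx]
    return [p for p, _ in kept], peak


# Hypothetical scores for eight decoded tokens.
scores = [0.9, 0.1, 0.2, 0.8, 0.1, 0.3, 0.7, 0.1]
positions, peak = simulate_decode(scores, budget=4, window=2, interval=4)
print(positions, peak)  # [0, 3, 6, 7] 8
```

In this sketch the cache can grow to budget + interval entries between compressions, which is one reason the budget and the selection interval would need to be tuned together.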

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.