Dynamic Thinking-Token Selection for Efficient Reasoning in Large Reasoning Models

2026-06-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

Dynamic Thinking-Token Selection (DynTS) is a KV cache compression method for Large Reasoning Models (LRMs) such as DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B. It targets the memory and compute overhead of the long reasoning traces these models generate. DynTS identifies "decision-critical tokens", retains their Key-Value (KV) cache states, and evicts the redundant entries. Across six mathematical and scientific reasoning benchmarks (AIME24, AIME25, AMC23, GK23EN, MATH500, GPQA-D), DynTS improves Pass@1 by 2.6% over state-of-the-art methods. It also cuts inference latency by a factor of 1.84 to 2.62 and peak KV cache memory by a factor of 3.32 to 5.73, without compromising reasoning performance.
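
To put the memory figures in context, the size of an uncompressed KV cache can be estimated with the standard formula: 2 (keys and values) x layers x KV heads x head dimension x sequence length x bytes per element. The sketch below applies it to an illustrative configuration. The layer count, head count, head dimension, and trace length are assumptions chosen for the example, not values taken from the paper, and dividing by the reported reduction factors is only a rough indication of scale.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Estimate KV cache size in bytes for one sequence (keys plus values)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical configuration for illustration only (fp16, one 32k-token trace).
full = kv_cache_bytes(num_layers=28, num_kv_heads=4, head_dim=128, seq_len=32768)
gib = 1024 ** 3
print(f"uncompressed: {full / gib:.2f} GiB")

# Apply the reported peak-memory reduction factors to the same estimate.
for factor in (3.32, 5.73):
    print(f"at {factor}x reduction: {full / factor / gib:.2f} GiB")
```

Because the cache grows linearly with sequence length, the saving matters most for long decoding runs, which is the setting the summary describes.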

Key takeaway

For MLOps Engineers and Machine Learning Engineers deploying Large Reasoning Models, DynTS offers a way to reduce inference cost and memory footprint. Consider integrating this dynamic KV cache compression method, especially for tasks with long decoding. Tuning the cache budget and token retention ratio for your model architecture and task difficulty will be key to balancing efficiency and reasoning accuracy.

Key insights

DynTS efficiently prunes redundant reasoning tokens in LRMs by identifying decision-critical KV cache states.

Principles

Only 20-30% of LRM reasoning tokens are critical for the final answer.
Retaining redundant tokens significantly degrades LRM reasoning performance.
Local linguistic coherence is crucial for stable LRM performance.

Method

DynTS employs a learnable Importance Predictor to score each token's contribution to the final answer, periodically retaining high-scoring KV cache entries along with a local window, while evicting others.
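
The selection step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the DynTS implementation: the function name select_kv_indices, its parameters, and the example scores are hypothetical, and in practice the scores would come from the trained Importance Predictor.

```python
def select_kv_indices(scores, retention_ratio, local_window):
    """Return the sorted cache positions to keep after one selection step.

    scores: per-token importance scores, one per cached position (assumed input).
    retention_ratio: fraction of positions outside the local window to keep.
    local_window: number of most recent positions that are always kept.
    """
    n = len(scores)
    window_start = max(0, n - local_window)
    recent = list(range(window_start, n))   # always retained
    older = list(range(window_start))       # candidates for eviction
    k = int(len(older) * retention_ratio)
    # Keep the k highest-scoring older positions; evict the rest.
    top = sorted(older, key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top) + recent


scores = [0.9, 0.1, 0.8, 0.2, 0.05, 0.7, 0.3, 0.4]
keep = select_kv_indices(scores, retention_ratio=0.5, local_window=2)
print(keep)  # [0, 2, 5, 6, 7]
```

The returned indices would then be used to gather the matching key and value tensors in each layer. Keeping the local window regardless of score reflects the point that recent context is needed for coherent generation.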

In practice

Train an Importance Predictor on correct reasoning traces to identify critical tokens.
Implement periodic KV cache selection based on predicted token importance scores.
Optimize local window size and retention ratio for specific LRM architectures.
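
The periodic selection in the steps above can be simulated without a model to see how the period, retention ratio, and local window interact. The sketch below is a toy under stated assumptions: random numbers stand in for Importance Predictor scores, selection is triggered every fixed number of decoded tokens, and the names compress and simulate_decode are hypothetical. The trigger rule used by DynTS itself may differ.

```python
import random


def compress(cache, scores, retention_ratio, local_window):
    """Keep the newest local_window positions plus the top-scoring older ones."""
    recent = cache[-local_window:] if local_window > 0 else []
    older = cache[:len(cache) - len(recent)]
    k = int(len(older) * retention_ratio)
    kept = sorted(older, key=lambda pos: scores[pos], reverse=True)[:k]
    return sorted(kept) + recent


def simulate_decode(num_tokens, period, retention_ratio, local_window, seed=0):
    """Decode num_tokens positions, compressing the cache every `period` tokens.

    Returns the final list of cached positions and the peak cache length.
    """
    rng = random.Random(seed)
    scores, cache, peak = {}, [], 0
    for pos in range(num_tokens):
        scores[pos] = rng.random()  # stand-in for a predicted importance score
        cache.append(pos)
        peak = max(peak, len(cache))
        if (pos + 1) % period == 0:
            cache = compress(cache, scores, retention_ratio, local_window)
    return cache, peak


cache, peak = simulate_decode(num_tokens=1000, period=100,
                              retention_ratio=0.25, local_window=16)
print(f"final cache: {len(cache)} entries, peak: {peak} of 1000 tokens")
```

Sweeping period, retention_ratio, and local_window in a harness like this shows the memory side of the trade-off. The accuracy side has to be measured on the target model and benchmark.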

Topics

KV Cache Compression
Large Reasoning Models
Inference Efficiency
Token Pruning
DeepSeek-R1
Attention Mechanisms

Code references

Robin930/DynTS

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.