ReSET: Accurate Latency-Critical NVFP4 Reasoning via Step-Aware Temperature Scaling
Summary
ReSET is a novel method addressing accuracy and latency limitations when deploying Large Reasoning Models (LRMs) with NVFP4 low-precision inference. NVFP4, supported by Blackwell Tensor-Cores, offers significant computational and memory cost reductions. However, direct application degrades reasoning accuracy due to increased incorrect sampling at low-entropy symbolic tokens and over-concentration in high-uncertainty steps. Furthermore, existing NVFP4 kernels fail to deliver latency benefits in small-batch autoregressive decoding (M ≤ 8), where Tensor-Core utilization drops below 1%. ReSET proposes a reasoning-step entropy-based temperature scaling method, adapting decoding temperature using both token-level and step-level entropy signals. It also introduces a CUDA-core small-M NVFP4 kernel. This approach improves NVFP4 reasoning accuracy by up to ∼2 points, achieving +2.6 on AIME-120, and delivers up to 2.5× kernel-level speedup over NVFP4 vLLM and approximately 2× end-to-end decoding speedup over BF16 on models like Qwen3-32B.
Key takeaway
For MLOps engineers deploying latency-critical Large Reasoning Models with NVFP4, recognize that NVFP4's peak throughput advantages collapse at small batch sizes (M ≤ 8). You should implement ReSET's step-aware temperature scaling to mitigate quantization-induced accuracy degradation and integrate its custom CUDA-core small-M NVFP4 GEMV kernel to achieve significant latency reductions, delivering up to 1.97× end-to-end speedup over BF16.
Key insights
Step-level uncertainty, not just token-level, is critical for accurate NVFP4 reasoning in Large Reasoning Models.
Principles
- NVFP4 quantization increases incorrect sampling at low-entropy symbolic tokens.
- Token-level entropy is strongly influenced by the surrounding reasoning step's uncertainty.
- Adapting decoding temperature based on step-level entropy improves quantized reasoning accuracy.
Method
ReSET employs a step-aware threshold for temperature scaling, which adapts to online step-entropy estimates, combined with a custom CUDA-core small-M NVFP4 GEMV kernel for latency-critical decoding.
In practice
- Implement step-aware temperature scaling for NVFP4 LRM inference.
- Utilize custom CUDA-core small-M NVFP4 GEMV kernels.
- Monitor step-level entropy to guide decoding decisions.
Topics
- NVFP4
- Large Reasoning Models
- Quantization
- Temperature Scaling
- CUDA Kernels
- Low-Latency Inference
- Autoregressive Decoding
Code references
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.