ReSET: Accurate Latency-Critical NVFP4 Reasoning via Step-Aware Temperature Scaling

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Expert, long

Summary

ReSET addresses the accuracy and latency problems that arise when Large Reasoning Models (LRMs) are deployed with NVFP4 low-precision inference. NVFP4, supported by Blackwell Tensor Cores, substantially reduces compute and memory cost. Applied directly, however, it degrades reasoning accuracy in two ways: sampling errors increase at low-entropy symbolic tokens, and the output distribution becomes over-concentrated in high-uncertainty steps. In addition, existing NVFP4 kernels deliver no latency benefit in small-batch autoregressive decoding (M ≤ 8), where Tensor Core utilization drops below 1%.

ReSET proposes a reasoning-step entropy-based temperature scaling method that adapts the decoding temperature using both token-level and step-level entropy signals. It also introduces a CUDA-core small-M NVFP4 kernel. Together these improve NVFP4 reasoning accuracy by up to ∼2 points, achieving +2.6 on AIME-120, and deliver up to 2.5× kernel-level speedup over NVFP4 vLLM and approximately 2× end-to-end decoding speedup over BF16 on models such as Qwen3-32B.

Key takeaway

MLOps engineers deploying latency-critical Large Reasoning Models with NVFP4 should note that NVFP4's peak-throughput advantage collapses at small batch sizes (M ≤ 8). ReSET's step-aware temperature scaling mitigates quantization-induced accuracy degradation, and its custom CUDA-core small-M NVFP4 GEMV kernel reduces decoding latency, delivering up to 1.97× end-to-end speedup over BF16.
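One way to see why small M limits Tensor Core benefit is a back-of-envelope roofline estimate. The sketch below is an illustration, not the paper's measurement. It assumes 4.5 bits moved per weight (a 4-bit value plus one 8-bit scale shared by 16 weights), 2-byte activations, a hypothetical 8192 × 8192 layer, and placeholder hardware figures that are not vendor specifications. The function names are invented for this example.

```python
def arithmetic_intensity(m, n=8192, k=8192, bits_per_weight=4.5, act_bytes=2.0):
    """FLOPs per byte moved for Y[m, n] = X[m, k] @ W[k, n] with low-bit weights.

    bits_per_weight=4.5 is an assumption: 4-bit values plus one 8-bit
    scale per 16-weight block. n and k are a hypothetical layer shape.
    """
    flops = 2.0 * m * n * k                       # one multiply and one add per weight per row
    weight_bytes = n * k * bits_per_weight / 8.0  # weights are read once, regardless of m
    io_bytes = (m * k + m * n) * act_bytes        # input and output activations
    return flops / (weight_bytes + io_bytes)


def peak_fraction(m, peak_flops, mem_bw, **layer):
    """Roofline upper bound on the fraction of peak compute reachable at batch size m."""
    ridge = peak_flops / mem_bw                   # FLOPs/byte needed to become compute-bound
    return min(1.0, arithmetic_intensity(m, **layer) / ridge)


# Placeholder hardware numbers (assumptions, not measured or vendor-quoted).
PEAK_FLOPS = 1.0e16   # hypothetical peak low-precision FLOP/s
MEM_BW = 8.0e12       # hypothetical memory bandwidth in bytes/s

for m in (1, 2, 4, 8, 64, 512):
    frac = peak_fraction(m, PEAK_FLOPS, MEM_BW)
    print(f"M={m:4d}  bound on compute utilization: {frac:.2%}")
```

Under these assumptions the weight traffic dominates and the bound grows roughly linearly with M, so single-digit batch sizes sit far below the compute roof. This is consistent in spirit with the small-M collapse described above, though the exact percentages depend entirely on the placeholder numbers.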

Key insights

Step-level uncertainty, not just token-level, is critical for accurate NVFP4 reasoning in Large Reasoning Models.

Principles

Method

ReSET employs a step-aware threshold for temperature scaling, which adapts to online step-entropy estimates, combined with a custom CUDA-core small-M NVFP4 GEMV kernel for latency-critical decoding.
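The idea of a step-aware threshold can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' algorithm: the summary does not give the actual rule, so the class name `StepAwareTemperature`, the threshold form (`scale` times the running mean entropy of the current step), the three temperature values, and the `high_step` cutoff are all hypothetical.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits)."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]    # shift by max for numerical stability
    z = sum(exps)
    return -sum((e / z) * math.log(e / z) for e in exps if e > 0.0)


class StepAwareTemperature:
    """Hypothetical scheduler: picks a temperature from token and step entropy."""

    def __init__(self, base_temp=0.6, low_temp=0.3, high_temp=0.9,
                 scale=0.5, high_step=1.0):
        self.base_temp, self.low_temp, self.high_temp = base_temp, low_temp, high_temp
        self.scale = scale            # assumed: threshold = scale * step entropy
        self.high_step = high_step    # assumed cutoff for a "high-uncertainty" step
        self.end_step()

    def end_step(self):
        """Reset the online step-entropy estimate at a reasoning-step boundary."""
        self.entropy_sum, self.count = 0.0, 0

    def temperature(self, logits):
        h = token_entropy(logits)
        self.entropy_sum += h
        self.count += 1
        step_h = self.entropy_sum / self.count    # online estimate for the current step
        if h < self.scale * step_h:
            return self.low_temp      # near-deterministic token: sharpen
        if step_h > self.high_step:
            return self.high_temp     # uncertain step: flatten to avoid over-concentration
        return self.base_temp


def scaled_probs(logits, temp):
    """Softmax of logits divided by the chosen temperature."""
    top = max(logits)
    exps = [math.exp((x - top) / temp) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

In a decoding loop, `temperature(logits)` would be called once per token before sampling from `scaled_probs(logits, temp)`, and `end_step()` whenever a step delimiter is emitted. How a reasoning step is delimited is left open here, since the summary does not specify it.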

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.