ReSET: Accurate Latency-Critical NVFP4 Reasoning via Step-Aware Temperature Scaling

2026-05-05 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Expert, long

Summary

ReSET addresses the accuracy and latency problems that arise when Large Reasoning Models (LRMs) are deployed with NVFP4 low-precision inference. NVFP4, supported by Blackwell Tensor Cores, substantially reduces compute and memory cost. Applied directly, however, it degrades reasoning accuracy in two ways: it increases incorrect sampling at low-entropy symbolic tokens, and it causes over-concentration in high-uncertainty steps. Existing NVFP4 kernels also fail to deliver latency benefits in small-batch autoregressive decoding (M ≤ 8), where Tensor Core utilization drops below 1%. ReSET combines two components: a reasoning-step entropy-based temperature scaling method that adapts the decoding temperature using both token-level and step-level entropy signals, and a CUDA-core NVFP4 kernel for small M. Together they improve NVFP4 reasoning accuracy by roughly 2 points, with +2.6 reported on AIME-120, and deliver up to 2.5× kernel-level speedup over NVFP4 vLLM and approximately 2× end-to-end decoding speedup over BF16 on models such as Qwen3-32B.

Key takeaway

For MLOps engineers deploying latency-critical Large Reasoning Models with NVFP4: the format's peak-throughput advantage collapses at small batch sizes (M ≤ 8). Use ReSET's step-aware temperature scaling to mitigate quantization-induced accuracy degradation, and integrate its custom CUDA-core small-M NVFP4 GEMV kernel to cut decoding latency. The reported end-to-end speedup over BF16 is up to 1.97×.
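Why small-batch decoding is a memory problem rather than a compute problem can be seen from a back-of-the-envelope estimate. The sketch below assumes the commonly described NVFP4 layout (4-bit values plus one 8-bit scale per 16-element block, with the per-tensor scale ignored) and a nominal 32B parameter count; the function names are illustrative and none of this is taken from the paper.

```python
# Rough weight-traffic estimate for memory-bound decoding.
# Assumptions (not from the source): 4-bit values, one 8-bit scale per
# 16-element block, per-tensor scale ignored, nominal 32e9 parameters.

BF16_BITS = 16.0
FP4_BITS = 4.0
SCALE_BITS = 8.0
BLOCK = 16


def nvfp4_bits_per_weight() -> float:
    """Average storage bits per weight, including the per-block scale."""
    return FP4_BITS + SCALE_BITS / BLOCK


def ideal_weight_traffic_ratio() -> float:
    """Ceiling on speedup if decode time were purely weight-read time."""
    return BF16_BITS / nvfp4_bits_per_weight()


def weight_gib(num_params: float, bits_per_weight: float) -> float:
    """Total weight storage in GiB for a given bit width."""
    return num_params * bits_per_weight / 8 / 2**30


params = 32e9  # illustrative parameter count
print(f"NVFP4 bits per weight: {nvfp4_bits_per_weight():.2f}")
print(f"BF16 weights:  {weight_gib(params, BF16_BITS):.1f} GiB")
print(f"NVFP4 weights: {weight_gib(params, nvfp4_bits_per_weight()):.1f} GiB")
print(f"Ideal traffic ratio: {ideal_weight_traffic_ratio():.2f}x")
```

Under these assumptions the ceiling is about 3.6×. The reported end-to-end figure sits below it, which is expected: decoding also spends time on attention, activations, and KV-cache reads that this estimate ignores.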

Key insights

Step-level uncertainty, not just token-level, is critical for accurate NVFP4 reasoning in Large Reasoning Models.

Principles

NVFP4 quantization increases incorrect sampling at low-entropy symbolic tokens.
Token-level entropy is strongly influenced by the surrounding reasoning step's uncertainty.
Adapting decoding temperature based on step-level entropy improves quantized reasoning accuracy.
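The quantities these principles refer to can be made concrete with a minimal sketch. It computes token-level entropy from logits and shows how a lower temperature shrinks the probability of sampling anything other than the top token. The logits and the perturbation standing in for quantization noise are hypothetical, and `off_top_mass` is an illustrative helper, not a metric from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def off_top_mass(probs):
    """Probability of sampling any token other than the most likely one."""
    return 1.0 - max(probs)


# Hypothetical logits: a near-deterministic symbolic token and an uncertain one.
peaked = [8.0, 1.0, 0.5, 0.0]
flat = [1.2, 1.0, 0.9, 0.8]

# Hypothetical perturbation of the peaked logits, standing in for
# quantization noise that narrows the gap to the runner-up token.
noisy = [7.0, 2.0, 0.5, 0.0]

for name, logits in (("peaked", peaked), ("noisy", noisy), ("flat", flat)):
    probs = softmax(logits)
    print(name, f"H={entropy(probs):.4f}", f"off-top={off_top_mass(probs):.5f}")

# Lowering the temperature on the perturbed low-entropy token
# pulls probability mass back toward the top candidate.
print("noisy @ T=0.5 off-top:", off_top_mass(softmax(noisy, temperature=0.5)))
```

A step-level signal would aggregate these per-token entropies over the tokens of one reasoning step; how ReSET aggregates them is not specified in this summary.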

Method

ReSET applies temperature scaling governed by a step-aware threshold that adapts to online step-entropy estimates, and pairs it with a custom CUDA-core small-M NVFP4 GEMV kernel for latency-critical decoding.
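One way to picture a step-aware threshold is the sketch below. It is not ReSET's actual rule, which this summary does not give. Everything here is an assumption made for illustration: the class name, the default values, the use of a blank line as the step delimiter, the running mean as the online step-entropy estimate, and the choice to shrink the threshold as step entropy grows so that fewer tokens are sharpened inside uncertain steps.

```python
from dataclasses import dataclass


@dataclass
class StepAwareTemperature:
    """Hypothetical controller; all defaults are illustrative, not tuned."""

    base_temperature: float = 0.6   # used when the token is not sharpened
    sharp_temperature: float = 0.3  # used for tokens judged low-entropy
    base_threshold: float = 0.5     # nats, threshold at zero step entropy
    step_gain: float = 1.0          # how strongly step entropy shrinks it
    step_delimiter: str = "\n\n"    # assumed marker that ends a step
    _entropy_sum: float = 0.0
    _count: int = 0

    @property
    def step_entropy(self) -> float:
        """Running mean of token entropies in the current step."""
        return self._entropy_sum / self._count if self._count else 0.0

    def threshold(self) -> float:
        """Entropy threshold, lowered as the current step gets more uncertain."""
        return self.base_threshold / (1.0 + self.step_gain * self.step_entropy)

    def temperature_for(self, token_entropy: float) -> float:
        """Pick the sampling temperature for a token.

        token_entropy is assumed to be measured at temperature 1.0.
        """
        if token_entropy < self.threshold():
            return self.sharp_temperature
        return self.base_temperature

    def observe(self, token_text: str, token_entropy: float) -> None:
        """Update the online estimate; reset when the step ends."""
        self._entropy_sum += token_entropy
        self._count += 1
        if self.step_delimiter in token_text:
            self._entropy_sum = 0.0
            self._count = 0


controller = StepAwareTemperature()
print(controller.temperature_for(0.2))  # fresh step, threshold is base_threshold
controller.observe("x", 2.0)
controller.observe("=", 2.0)
print(controller.temperature_for(0.2))  # same token entropy, uncertain step
```

In a decoding loop, `temperature_for` would be called on the entropy of the next-token distribution before sampling, and `observe` afterwards with the sampled token's text. The same token entropy can then yield different temperatures depending on the step it occurs in, which is the behaviour the step-aware design is after.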

In practice

Implement step-aware temperature scaling for NVFP4 LRM inference.
Use the custom CUDA-core small-M NVFP4 GEMV kernel for small-batch decoding (M ≤ 8).
Monitor step-level entropy to guide decoding decisions.
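For the kernel item in the list above, a plain-Python reference of the arithmetic can help when validating any fast implementation. The sketch is not the ReSET kernel and says nothing about its performance. It assumes the E2M1 value grid (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6 with a sign bit) and 16-element blocks; as simplifications it keeps each block scale as a Python float instead of FP8, omits the per-tensor scale, and leaves codes unpacked. All function names are illustrative.

```python
# Reference arithmetic for a block-scaled 4-bit GEMV/GEMM at small M.
# Simplifications (assumptions): float block scales, no per-tensor scale,
# one code per list entry instead of two codes packed per byte.

E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16


def decode_e2m1(code):
    """Decode a 4-bit code: bit 3 is the sign, bits 0-2 index the magnitude."""
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def quantize_block(values):
    """Quantize one block to (scale, codes), mapping the block max to 6.0.

    Rounds to the nearest grid value; ties go to the smaller magnitude,
    which is a simplification of hardware rounding.
    """
    amax = max(abs(v) for v in values)
    scale = amax / 6.0 if amax > 0.0 else 1.0
    codes = []
    for v in values:
        target = abs(v) / scale
        index = min(range(8), key=lambda i: abs(E2M1_MAGNITUDES[i] - target))
        codes.append(index | (0x8 if v < 0 else 0))
    return scale, codes


def quantize_row(row):
    """Split a weight row into blocks and quantize each one."""
    return [quantize_block(row[i:i + BLOCK]) for i in range(0, len(row), BLOCK)]


def dequantize_row(blocks):
    """Expand (scale, codes) blocks back to floats."""
    return [scale * decode_e2m1(c) for scale, codes in blocks for c in codes]


def small_m_matmul(activations, quantized_weights):
    """Y[m][n] = sum_k X[m][k] * W[n][k], dequantizing W on the fly."""
    output = []
    for x in activations:
        out_row = []
        for blocks in quantized_weights:
            acc, k = 0.0, 0
            for scale, codes in blocks:
                partial = 0.0
                for code in codes:
                    partial += x[k] * decode_e2m1(code)
                    k += 1
                acc += scale * partial  # apply the block scale once per block
            out_row.append(acc)
        output.append(out_row)
    return output


weights = [[0.5, -1.0, 1.5, 6.0] + [0.0] * 12]  # one output row, K = 16
quantized = [quantize_row(row) for row in weights]
print(small_m_matmul([[1.0] * 16], quantized))
```

Applying the scale once per block, after accumulating the 16 products, mirrors how block-scaled formats are usually consumed and keeps the inner loop free of per-element scale multiplies.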

Topics

NVFP4
Large Reasoning Models
Quantization
Temperature Scaling
CUDA Kernels
Low-Latency Inference
Autoregressive Decoding

Code references

aiha-lab/ReSET

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.