ReQAT: Achieving Full-Precision Reasoning Accuracy with 4-bit Floating-Point Quantization-Aware Training
Summary
ReQAT is a 4-bit floating-point quantization-aware training (QAT) framework for Large Reasoning Models (LRMs). It targets the severe reasoning degradation that appears when weights, activations, and KV caches are all quantized to 4 bits (W4A4KV4), a setting in which existing quantization methods fail to recover accuracy. The degradation concentrates on low-entropy tokens such as digits and operators, where quantization noise inflates sampling errors. ReQAT addresses this with three components: Trace-Aligned QAT (TAQ) revisits identical reasoning traces to focus updates on critical low-entropy decisions; Selective Entropy Minimization (SEM) reinforces confidence at those positions; and Q-FIT provides a quantization-friendly initialization and calibrates RoPE-consistent KV cache transformations. Under the same training budget, the framework recovers and then surpasses BF16 fine-tuning accuracy, with throughput speedups of up to 3.9x on NVIDIA DGX Spark and 3.1x on B200.
Key takeaway
For Machine Learning Engineers deploying Large Reasoning Models, ReQAT offers a path to lower inference cost and a smaller KV cache footprint without sacrificing accuracy. It is worth evaluating 4-bit floating-point quantization-aware training for your own models, with particular attention to techniques that protect precision on low-entropy tokens. The reported results are up to 3.9x throughput speedup with accuracy that can surpass BF16 fine-tuning, which makes LRM deployment feasible in resource-constrained environments.
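To make the KV cache footprint claim concrete, here is a back-of-the-envelope sketch. The model dimensions are hypothetical and not taken from the article, and the calculation ignores the per-block scale factors that a real 4-bit format stores alongside the values, so the true saving is somewhat below 4x.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits_per_elem, batch=1):
    """Approximate KV cache size in bytes (scale-factor overhead ignored)."""
    # Factor 2 accounts for storing both keys and values.
    elems = 2 * layers * kv_heads * head_dim * seq_len * batch
    return elems * bits_per_elem // 8


# Hypothetical model configuration, chosen only for illustration.
cfg = dict(layers=32, kv_heads=8, head_dim=128, seq_len=32768)

bf16 = kv_cache_bytes(**cfg, bits_per_elem=16)
fp4 = kv_cache_bytes(**cfg, bits_per_elem=4)

print(f"BF16 KV cache: {bf16 / 2**30:.1f} GiB")
print(f"FP4  KV cache: {fp4 / 2**30:.1f} GiB")
print(f"Reduction:     {bf16 / fp4:.1f}x")
```

For long reasoning traces the sequence length dominates this product, which is why KV cache precision matters as much as weight precision for LRMs.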
Key insights
ReQAT makes full 4-bit floating-point quantization (W4A4KV4) viable for LRMs: it recovers full-precision reasoning accuracy by targeting the sensitivity of low-entropy tokens to quantization noise.
Principles
- Low-entropy tokens are critical for reasoning accuracy.
- Quantization noise amplifies errors in symbolic commitments.
- Targeted QAT can recover reasoning performance.
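The principles above depend on identifying low-entropy positions, that is, positions where the model's next-token distribution is sharply peaked. A minimal sketch of that measurement follows; the threshold value and the function names are illustrative assumptions, not values or APIs from the article.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def low_entropy_mask(logits_per_position, threshold):
    """True where the predictive distribution is confident (entropy < threshold)."""
    return [entropy(softmax(l)) < threshold for l in logits_per_position]


# One confident position (e.g. a digit in an arithmetic step) and one uncertain one.
logits = [[10.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
print(low_entropy_mask(logits, threshold=0.5))  # -> [True, False]
```

A mask like this is one way to select the positions on which a targeted QAT objective would concentrate.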
Method
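ReQAT combines three components: Trace-Aligned QAT (TAQ), which concentrates updates on critical low-entropy decisions; Selective Entropy Minimization (SEM), which reinforces confidence at those positions; and Q-FIT, which provides a quantization-friendly initialization and calibrates RoPE-consistent KV cache transformations.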
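The summary does not give the exact SEM objective. As a rough illustration of the idea only, the sketch below adds an entropy penalty to the usual cross-entropy loss at positions whose entropy is already below a threshold. The threshold `tau`, the weight `lam`, and the way the two terms are combined are all assumptions made for this example, not the paper's formulation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def sem_style_loss(logits_seq, targets, tau=0.5, lam=0.1):
    """Mean cross-entropy plus lam * mean entropy over low-entropy positions.

    tau and lam are hypothetical hyperparameters for this sketch.
    Returns (loss, number_of_selected_positions).
    """
    ce_total, ent_total, n_selected = 0.0, 0.0, 0
    for logits, target in zip(logits_seq, targets):
        logp = log_softmax(logits)
        ce_total += -logp[target]
        ent = -sum(math.exp(lp) * lp for lp in logp)
        if ent < tau:  # penalize entropy only where the model is already confident
            ent_total += ent
            n_selected += 1
    ce = ce_total / len(targets)
    ent_term = ent_total / n_selected if n_selected else 0.0
    return ce + lam * ent_term, n_selected


logits_seq = [[10.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
loss, selected = sem_style_loss(logits_seq, targets=[0, 1])
print(loss, selected)
```

In real training the same computation would run on framework tensors so that gradients flow through the entropy term; plain Python is used here only to keep the sketch self-contained.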
In practice
- Deploy LRMs with W4A4KV4 for efficiency.
- Focus QAT on low-entropy token precision.
- Use RoPE-consistent KV cache transformations.
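To show what 4-bit floating-point values look like in a QAT forward pass, here is a minimal fake-quantization sketch. It assumes the common E2M1 layout (representable magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and a single absmax scale per block; the article does not state which FP4 layout or scaling scheme ReQAT uses, so both are assumptions.

```python
import math

# Non-negative magnitudes representable in FP4 E2M1 (assumed format).
FP4_E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def fake_quant_fp4(values):
    """Quantize then dequantize one block of floats using an absmax scale."""
    absmax = max(abs(v) for v in values)
    if absmax == 0.0:
        return list(values)
    # Map the largest magnitude in the block onto the largest grid value.
    scale = absmax / FP4_E2M1_GRID[-1]
    out = []
    for v in values:
        # Round the scaled magnitude to the nearest representable grid point.
        mag = min(FP4_E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append(math.copysign(mag * scale, v))
    return out


print(fake_quant_fp4([6.0, -3.0, 0.4, 1.2]))  # -> [6.0, -3.0, 0.5, 1.0]
```

The grid is dense near zero and coarse near the block maximum, so a single outlier in a block raises the scale and pushes the remaining values onto fewer distinct levels.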
Topics
- Quantization-Aware Training
- Large Reasoning Models
- FP4 Quantization
- KV Cache Optimization
- Model Inference Speedup
- Low-Entropy Tokens
Best for: AI Engineer, NLP Engineer, AI Architect, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.