ReQAT: Achieving Full-Precision Reasoning Accuracy with 4-bit Floating-Point Quantization-Aware Training

· Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

ReQAT is a 4-bit floating-point quantization-aware training (QAT) framework designed to counter the severe reasoning degradation that Large Reasoning Models (LRMs) suffer when weights, activations, and KV caches are all quantized to 4 bits (W4A4KV4). Existing quantization methods fail to recover accuracy in this setting. The degradation particularly affects low-entropy tokens such as digits and operators, where quantization noise inflates sampling errors. ReQAT addresses this with three components: Trace-Aligned QAT (TAQ) focuses updates on critical low-entropy decisions by revisiting identical reasoning traces; Selective Entropy Minimization (SEM) reinforces confidence at these positions; and Q-FIT provides a quantization-friendly initialization and calibrates RoPE-consistent KV cache transformations. Under the same training budget, the framework not only recovers but surpasses BF16 fine-tuning accuracy, and it achieves up to 3.9x throughput speedup on NVIDIA DGX Spark and 3.1x on B200.
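To make the "4-bit floating-point" setting concrete, here is a minimal sketch of fake quantization, the round-then-dequantize step that QAT typically inserts into the forward pass. It assumes the E2M1 layout (1 sign, 2 exponent, 1 mantissa bit) and a per-block absmax scale. The summary does not name the exact FP4 format or block size ReQAT uses, so the grid, the `block_size` parameter, and the function name are illustrative assumptions, not the authors' implementation.

```python
# Magnitudes representable by an E2M1 4-bit float (1 sign, 2 exponent, 1 mantissa bit).
# Assumption: the summary says "4-bit floating-point" without naming the exact format.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def fake_quantize_fp4(values, block_size=16):
    """Round each block of values to a scaled E2M1 grid and return plain floats.

    block_size is a hypothetical parameter chosen for illustration.
    """
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block)
        # The per-block scale maps the largest magnitude onto the largest grid value.
        scale = absmax / E2M1_GRID[-1] if absmax > 0 else 1.0
        for v in block:
            magnitude = abs(v) / scale
            nearest = min(E2M1_GRID, key=lambda g: abs(g - magnitude))
            out.append((nearest if v >= 0 else -nearest) * scale)
    return out


weights = [0.03, -0.41, 0.27, 1.2, -0.9, 0.05, 0.6, -0.12]
dequantized = fake_quantize_fp4(weights, block_size=8)
errors = [abs(w - d) for w, d in zip(weights, dequantized)]
print(max(errors))  # worst-case rounding error within this block
```

The grid is coarse near the top of the range (4.0 jumps straight to 6.0), so the rounding error grows with the block's scale. This sketch keeps the scale in full precision, whereas deployed formats usually store it in low precision as well.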

Key takeaway

For machine learning engineers deploying Large Reasoning Models, ReQAT offers a path to significantly lower inference cost and KV cache footprint without sacrificing accuracy. Consider investigating 4-bit floating-point quantization-aware training, with particular attention to techniques that protect precision on low-entropy tokens. The reported results are up to 3.9x throughput speedup with accuracy that can surpass BF16 fine-tuning, which makes LRM deployment more feasible in resource-constrained environments.
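A back-of-the-envelope calculation shows what a 4-bit KV cache means for memory footprint. The model dimensions below are hypothetical and not taken from the source, and the arithmetic ignores the small overhead of storing quantization scales.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_value):
    """Bytes needed to cache keys and values for one sequence."""
    # The factor of 2 covers one key tensor and one value tensor per layer.
    num_values = 2 * num_layers * num_kv_heads * head_dim * seq_len
    return num_values * bits_per_value / 8


# Hypothetical model shape, chosen only for illustration.
config = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32768)

bf16_bytes = kv_cache_bytes(bits_per_value=16, **config)
fp4_bytes = kv_cache_bytes(bits_per_value=4, **config)

gib = 1024 ** 3
print(bf16_bytes / gib)        # 4.0
print(fp4_bytes / gib)         # 1.0
print(bf16_bytes / fp4_bytes)  # 4.0
```

Under these assumptions the cache for one 32K-token sequence shrinks from 4 GiB to 1 GiB. This matters for reasoning models in particular because long reasoning traces keep many tokens in the cache.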

Key insights

ReQAT enables 4-bit quantization for LRMs, recovering full-precision reasoning accuracy by targeting low-entropy token sensitivity.

Method

ReQAT employs Trace-Aligned QAT (TAQ) for critical decision updates, Selective Entropy Minimization (SEM) for confidence reinforcement, and Q-FIT for quantization-friendly initialization and KV cache calibration.
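The idea of reinforcing confidence only at low-entropy positions can be sketched in plain Python. The source does not give the SEM loss formula, so this is an illustration of the general technique (entropy minimization restricted to positions that are already confident), not the paper's definition. The `threshold` value and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def selective_entropy_loss(per_token_logits, threshold=0.5):
    """Mean entropy over positions whose entropy is below `threshold`.

    `threshold` is a hypothetical hyperparameter. Returns (loss, selected_indices).
    """
    entropies = [entropy(softmax(row)) for row in per_token_logits]
    selected = [i for i, h in enumerate(entropies) if h < threshold]
    if not selected:
        return 0.0, selected
    loss = sum(entropies[i] for i in selected) / len(selected)
    return loss, selected


logits = [
    [8.0, 0.0, 0.0, 0.0],   # near-deterministic choice, e.g. a digit
    [1.0, 1.0, 1.0, 1.0],   # uniform distribution, entropy ln(4)
    [5.0, 0.5, 0.0, -1.0],  # fairly confident choice
]
loss, selected = selective_entropy_loss(logits, threshold=0.5)
print(selected)  # [0, 2]
```

Only the first and third positions contribute to the loss. The uniform position is left alone, so the objective does not push the model toward overconfidence where it is genuinely uncertain. In actual training this would be written as a differentiable loss in a tensor framework and combined with the main QAT objective.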

Best for: AI Engineer, NLP Engineer, AI Architect, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.