Unleashing Low-Bit Inference on Ascend NPUs: A Comprehensive Evaluation of HiFloat Formats
Summary
Researchers at Huawei Technologies Co., Ltd. evaluated HiFloat (HiF8 and HiF4), a family of low-bit floating-point formats optimized for Ascend NPUs, for Large Language Model (LLM) inference. The study compared HiFloat against industry-standard formats such as MXFP and NVFP4 on weight-activation and KV-cache quantization tasks, using models such as Qwen3-8B and openPangu-7B. For 8-bit quantization, the findings indicate that INT8 excels on narrow-range weight data, while floating-point formats such as HiF8 and MXFP8 are superior for high-variance activations. In the 4-bit regime, HiF4's hierarchical scaling proved crucial: it prevented the accuracy collapse observed in integer formats and outperformed other 4-bit floating-point formats in many scenarios, especially when combined with Post-Training Quantization (PTQ) frameworks such as SmoothQuant and SVDQuant. For W4A4 quantization (4-bit weights and 4-bit activations), HiF4 maintained over 96.5% of BF16 baseline accuracy on Qwen3-8B and 97.0% on openPangu-7B.
Key takeaway
NLP engineers optimizing LLM inference on Ascend NPUs should consider HiF4 for 4-bit quantization. Its hierarchical scaling mitigates accuracy degradation, especially when combined with Post-Training Quantization frameworks such as SmoothQuant or SVDQuant, and it preserves accuracy significantly better than integer-based methods at ultra-low bit widths. This approach can yield substantial efficiency gains without catastrophic accuracy loss.
Key insights
HiFloat formats, particularly HiF4, offer robust low-bit quantization for LLM inference on Ascend NPUs, especially at 4-bit precision.
Principles
- INT8 suits narrow-range data; floating-point formats excel with high-variance data.
- Hierarchical scaling prevents accuracy collapse in 4-bit quantization.
- Outlier mitigation strategies complement HiFloat's representational capacity.
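The intuition behind hierarchical scaling can be illustrated with a toy experiment. The sketch below is not the HiF4 format: it assumes a plain symmetric 4-bit integer grid, and the names `per_tensor_quant` and `two_level_quant`, the block size of 16, and the power-of-two block refinement are all hypothetical choices made for illustration. It compares one shared scale against a tensor-level scale that each block refines, on data containing two outliers.

```python
import math
import random

QMAX = 7  # symmetric signed 4-bit grid: integers in [-7, 7]

def fake_quant(values, scale):
    """Quantize to the 4-bit grid, then dequantize back to floats."""
    if scale == 0.0:
        return [0.0 for _ in values]
    return [max(-QMAX, min(QMAX, round(v / scale))) * scale for v in values]

def per_tensor_quant(values):
    """One scale shared by the whole tensor."""
    scale = max(abs(v) for v in values) / QMAX
    return fake_quant(values, scale)

def two_level_quant(values, block_size=16, max_shift=15):
    """Tensor-level scale, refined per block by a power-of-two shift (assumed scheme)."""
    global_max = max(abs(v) for v in values)
    global_scale = global_max / QMAX
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        block_max = max(abs(v) for v in block)
        if block_max == 0.0:
            shift = max_shift
        else:
            # Largest shift that still keeps the block maximum inside the grid.
            shift = min(max_shift, int(math.floor(math.log2(global_max / block_max))))
        out.extend(fake_quant(block, global_scale / (2 ** shift)))
    return out

def mse(reference, approx):
    """Mean squared error between two equal-length lists."""
    return sum((r - a) ** 2 for r, a in zip(reference, approx)) / len(reference)

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(256)]
data[10] = 50.0    # two synthetic outliers stretch the tensor-level range
data[200] = -45.0

err_single = mse(data, per_tensor_quant(data))
err_two_level = mse(data, two_level_quant(data))
print(f"per-tensor MSE: {err_single:.4f}, two-level MSE: {err_two_level:.4f}")
```

With a single scale, the outliers force a coarse step that rounds most ordinary values to zero. The per-block refinement confines that damage to the blocks that actually contain an outlier, which is the general effect the hierarchical-scaling principle describes.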
Method
The study evaluated HiFloat formats (HiF8, HiF4) for LLM inference on Ascend NPUs, compared them against MXFP and NVFP4 on weight-activation and KV-cache quantization tasks, and assessed their synergy with PTQ frameworks such as SmoothQuant and SVDQuant.
In practice
- Use HiF4 for 4-bit LLM inference on Ascend NPUs.
- Combine HiFloat with SmoothQuant or SVDQuant for improved accuracy.
- Prioritize INT8 for 8-bit weight quantization if the weight range is narrow.
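To make the SmoothQuant recommendation concrete, here is a minimal sketch of its per-channel smoothing step, which migrates activation outliers into the weights before quantization using s_j = max|X_j|^alpha / max|W_j|^(1 - alpha). It is pure Python on small lists, not an Ascend or HiFloat API; the helper names, the example matrices, and alpha = 0.5 are assumptions chosen for illustration.

```python
def smooth_scales(act_absmax, weight_absmax, alpha=0.5, eps=1e-8):
    """Per-input-channel smoothing factors s_j = a_j^alpha / w_j^(1 - alpha)."""
    return [
        max(a, eps) ** alpha / max(w, eps) ** (1.0 - alpha)
        for a, w in zip(act_absmax, weight_absmax)
    ]

def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def column_absmax(m):
    """Largest magnitude in each column (one value per activation channel)."""
    return [max(abs(row[j]) for row in m) for j in range(len(m[0]))]

def row_absmax(m):
    """Largest magnitude in each row (one value per weight input channel)."""
    return [max(abs(v) for v in row) for row in m]

# Toy activations (tokens x channels) with an outlier channel, and toy weights.
X = [[0.5, 40.0, -1.0],
     [-0.2, -35.0, 0.8]]
W = [[0.3, -0.1],
     [0.02, 0.01],
     [-0.4, 0.2]]

s = smooth_scales(column_absmax(X), row_absmax(W))
X_smooth = [[v / s[j] for j, v in enumerate(row)] for row in X]  # X * diag(s)^-1
W_smooth = [[v * s[i] for v in row] for i, row in enumerate(W)]  # diag(s) * W

Y = matmul(X, W)
Y_smooth = matmul(X_smooth, W_smooth)
print("activation abs-max before:", max(column_absmax(X)))
print("activation abs-max after: ", max(column_absmax(X_smooth)))
```

The product is mathematically unchanged, since the scaling cancels between the two factors, but the activation range is much narrower. A low-bit format applied afterwards therefore has less dynamic range to cover.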
Topics
- Low-Bit Quantization
- HiFloat Formats
- LLM Inference
- Ascend NPUs
- KV Cache Optimization
Best for: NLP Engineer, AI Scientist, Research Scientist, AI Researcher, AI Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.