Unleashing Low-Bit Inference on Ascend NPUs: A Comprehensive Evaluation of HiFloat Formats

Source: cs.CL updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation) · Depth: Expert, extended

Summary

Researchers at Huawei Technologies Co., Ltd. evaluated HiFloat (HiF8 and HiF4), a family of low-bit floating-point formats designed for Ascend NPUs, for Large Language Model (LLM) inference. The study compared HiFloat against the MXFP formats and NVFP4 on weight-activation and KV-cache quantization, using models such as Qwen3-8B and openPangu-7B. At 8 bits, INT8 performs best on narrow-range weight data, while floating-point formats such as HiF8 and MXFP8 handle high-variance activations better. At 4 bits, HiF4's hierarchical scaling proved crucial: it avoided the accuracy collapse observed in integer formats and outperformed other 4-bit floating-point formats in many scenarios, especially when combined with Post-Training Quantization (PTQ) frameworks such as SmoothQuant and SVDQuant. For W4A4 quantization (4-bit weights, 4-bit activations), HiF4 retained over 96.5% of BF16 baseline accuracy on Qwen3-8B and 97.0% on openPangu-7B.
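To make the role of scaling granularity concrete, the sketch below compares a single per-tensor scale with a two-level (group plus block) scale on a 4-bit signed integer grid. It is only an illustration of why finer, hierarchical scales help when a tensor contains outliers. It is not the HiF4 encoding, which this summary does not specify; the function names, the group and block sizes, and the power-of-two block multiplier are all assumptions chosen for the example.

```python
import math
import random


def quantize_symmetric(values, scale, qmax=7):
    """Round to a signed integer grid in [-qmax, qmax], then dequantize."""
    if scale == 0.0:
        return [0.0 for _ in values]
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def per_tensor_quant(values, qmax=7):
    """Baseline: a single scale shared by the whole tensor."""
    scale = max(abs(v) for v in values) / qmax
    return quantize_symmetric(values, scale, qmax)


def two_level_quant(values, group_size=64, block_size=16, qmax=7, min_exp=-3):
    """Hypothetical two-level scaling (NOT the HiF4 bit layout).

    Each group gets a coarse scale; each block inside the group refines it
    with a cheap power-of-two multiplier 2**exp, where min_exp <= exp <= 0.
    """
    out = []
    for g in range(0, len(values), group_size):
        group = values[g:g + group_size]
        group_max = max(abs(v) for v in group)
        group_scale = group_max / qmax
        for b in range(0, len(group), block_size):
            block = group[b:b + block_size]
            block_max = max(abs(v) for v in block)
            if group_max == 0.0 or block_max == 0.0:
                exp = min_exp
            else:
                # Round the ratio up to a power of two so the block maximum
                # still fits on the grid, then clamp to the allowed range.
                exp = math.ceil(math.log2(block_max / group_max))
                exp = max(min_exp, min(0, exp))
            out.extend(quantize_symmetric(block, group_scale * 2.0 ** exp, qmax))
    return out


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Synthetic data (an assumption for the demo): small values plus one outlier.
random.seed(0)
data = [random.gauss(0.0, 0.02) for _ in range(1024)]
data[10] = 1.0

err_tensor = mse(data, per_tensor_quant(data))
err_two_level = mse(data, two_level_quant(data))
print(f"per-tensor MSE: {err_tensor:.2e}")
print(f"two-level MSE:  {err_two_level:.2e}")
```

With one scale for the whole tensor, the outlier stretches the grid so far that nearly every small value rounds to zero. With local scales, only the block containing the outlier pays that cost, which is the general intuition behind block-scaled and hierarchically scaled 4-bit formats.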

Key takeaway

NLP engineers optimizing LLM inference on Ascend NPUs should consider HiF4 for 4-bit quantization. Its hierarchical scaling mitigates accuracy degradation, especially when combined with Post-Training Quantization frameworks such as SmoothQuant or SVDQuant, and it preserves accuracy significantly better than integer formats at ultra-low bit widths. This approach can yield substantial efficiency gains without catastrophic accuracy loss.

Key insights

HiFloat formats, particularly HiF4, offer robust low-bit quantization for LLM inference on Ascend NPUs, especially at 4-bit precision.

Method

The study evaluated the HiFloat formats (HiF8, HiF4) for LLM inference on Ascend NPUs, compared them against MXFP and NVFP4 on weight-activation and KV-cache quantization, and assessed how well they combine with PTQ frameworks such as SmoothQuant and SVDQuant.
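As background on one of the PTQ frameworks named above, here is a minimal sketch of the smoothing step from SmoothQuant, which moves quantization difficulty from activations to weights using per-channel scales s_j = max|X_j|^alpha / max|W_j|^(1 - alpha). The helper names, the toy matrices, and alpha = 0.5 are assumptions for illustration; the summary does not say how the study configured SmoothQuant.

```python
def smooth_scales(X, W, alpha=0.5):
    """Per-input-channel scales s_j = max|X[:, j]|**alpha / max|W[j, :]|**(1 - alpha)."""
    scales = []
    for j in range(len(W)):
        act_max = max(abs(row[j]) for row in X)
        w_max = max(abs(w) for w in W[j])
        if act_max == 0.0 or w_max == 0.0:
            scales.append(1.0)  # nothing to rebalance on this channel
        else:
            scales.append(act_max ** alpha / w_max ** (1.0 - alpha))
    return scales


def apply_smoothing(X, W, scales):
    """Divide activation columns by s and multiply weight rows by s."""
    X_s = [[x / s for x, s in zip(row, scales)] for row in X]
    W_s = [[w * s for w in W[j]] for j, s in enumerate(scales)]
    return X_s, W_s


def matmul(A, B):
    """Plain-Python matrix product, enough for a toy check."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# Toy data (assumed): activations X are (tokens x in), weights W are (in x out).
# Input channel 1 carries an activation outlier.
X = [[0.1, 20.0, -0.2],
     [-0.3, -16.0, 0.1]]
W = [[0.5, -0.4],
     [0.02, 0.01],
     [0.3, 0.6]]

scales = smooth_scales(X, W)
X_s, W_s = apply_smoothing(X, W, scales)

Y = matmul(X, W)
Y_s = matmul(X_s, W_s)
max_before = max(abs(x) for row in X for x in row)
max_after = max(abs(x) for row in X_s for x in row)
print("largest activation before/after smoothing:", max_before, max_after)
```

The layer output is mathematically unchanged, because each scale is divided out of the activations and multiplied into the weights. What changes is the activation range, which becomes easier to represent in a low-bit format.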

Best for: NLP Engineer, AI Scientist, Research Scientist, AI Researcher, AI Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.