SharQ: Bridging Activation Sparsity and FP4 Quantization for LLM Inference
Summary
SharQ is a training-free inference method that combines activation sparsity with FP4 quantization for Large Language Model (LLM) inference, addressing two problems: input-dependent outliers and coupled sparsification/quantization errors. It performs an online sparse-dense decomposition: an input-adaptive N:M mask extracts an outlier-dominated sparse backbone, which is quantized to FP4, and a dense residual is defined relative to this quantized sparse backbone. SharQ processes the backbone with a sparse FP4 GEMM and uses a dense FP4 GEMM to compensate for mask-induced loss and sparse-path quantization error; both paths share a single FP4 weight payload. It requires no calibration, retraining, or model-specific tuning. Evaluated on models such as Llama-3.1-8B and Qwen3-30B-A3B, SharQ recovers 43-63% of the NVFP4-to-FP16 accuracy gap. On an RTX 5090, it achieves a 2.2-2.4x latency reduction over FP16 and a 1.2-1.4x throughput improvement over FP8.
Key takeaway
For ML engineers optimizing LLM inference on modern accelerators, SharQ is a training-free way to combine FP4 quantization with activation sparsity. The reported results are a 2.2-2.4x latency reduction over FP16 and a 1.2-1.4x throughput improvement over FP8 on an RTX 5090, while recovering 43-63% of the NVFP4-to-FP16 accuracy gap. It is worth evaluating when retraining or calibration is impractical, especially for models like Llama-3.1-8B and the Qwen series.
Key insights
SharQ combines activation sparsity and FP4 quantization for LLM inference via an online sparse-dense decomposition.
Principles
- Input-dependent outliers dominate FP4 block scales.
- Applying N:M sparsity directly couples mask-induced loss with quantization error.
- Online sparse-dense decomposition improves accuracy.
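The first principle can be illustrated with a small simulation. The sketch below is an assumption-laden toy, not SharQ's code: it emulates block-scaled FP4 (E2M1 magnitudes, blocks of 16) in NumPy with a plain float scale, whereas real NVFP4 stores block scales in a low-precision format. The helper name `fake_quant_fp4` is hypothetical. It shows how a single outlier sets the block scale and pushes the remaining small values in that block to zero.

```python
import numpy as np

# Magnitudes representable by FP4 E2M1 (the sign is handled separately).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_fp4(x, block=16):
    """Simulated block-scaled FP4 quantize-dequantize (hypothetical helper).

    Assumption: the per-block scale is kept in float64 for simplicity.
    """
    x = np.asarray(x, dtype=np.float64)
    blocks = x.reshape(-1, block)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    # Map the largest magnitude in each block onto the largest FP4 value.
    scale = np.where(amax > 0, amax / FP4_GRID[-1], 1.0)
    scaled = np.abs(blocks) / scale
    # Round every element to the nearest representable magnitude.
    idx = np.abs(scaled[..., None] - FP4_GRID).argmin(axis=-1)
    return (np.sign(blocks) * FP4_GRID[idx] * scale).reshape(x.shape)

# One block of small activations, then the same block with a single outlier.
small = np.linspace(-0.15, 0.15, 16)
with_outlier = small.copy()
with_outlier[-1] = 50.0

err_clean = np.abs(fake_quant_fp4(small)[:15] - small[:15]).mean()
err_outlier = np.abs(fake_quant_fp4(with_outlier)[:15] - small[:15]).mean()
print("mean abs error on the 15 small values, no outlier:  ", err_clean)
print("mean abs error on the 15 small values, with outlier:", err_outlier)
```

In this toy setting the outlier is represented well, but the 15 small values sharing its block all collapse to zero, which is the effect the decomposition is meant to avoid.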
Method
SharQ generates an N:M mask, quantizes the sparse backbone to FP4, defines a dense residual, and processes both paths with shared FP4 weights and path-specific scales.
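The data flow described above can be sketched in NumPy. This is a minimal emulation under stated assumptions, not the authors' implementation: the function names (`fake_quant_fp4`, `nm_mask`, `sharq_like_matmul`) are hypothetical, the mask rule (keep the N largest magnitudes per group of M) and the 2:4 default are assumptions, FP4 is simulated with float scales, and both GEMMs run as ordinary dense matmuls. Path-specific scales are only loosely modeled, in that each path's activations are quantized separately.

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes

def fake_quant_fp4(x, block=16):
    """Simulated block-scaled FP4 quantize-dequantize (float scale, assumption)."""
    x = np.asarray(x, dtype=np.float64)
    blocks = x.reshape(-1, block)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    scale = np.where(amax > 0, amax / FP4_GRID[-1], 1.0)
    idx = np.abs((np.abs(blocks) / scale)[..., None] - FP4_GRID).argmin(axis=-1)
    return (np.sign(blocks) * FP4_GRID[idx] * scale).reshape(x.shape)

def nm_mask(x, n=2, m=4):
    """Input-adaptive N:M mask: keep the n largest magnitudes in each group of m."""
    groups = np.abs(x).reshape(-1, m)
    keep = np.argsort(groups, axis=1)[:, -n:]
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return mask.reshape(x.shape)

def sharq_like_matmul(x, w, n=2, m=4):
    """Sparse backbone + dense residual, both multiplied by the same FP4 weights."""
    w_q = fake_quant_fp4(w.T).T                 # shared weight payload, blocks along K
    mask = nm_mask(x, n, m)
    sparse_q = fake_quant_fp4(np.where(mask, x, 0.0))   # quantized sparse backbone
    residual_q = fake_quant_fp4(x - sparse_q)   # residual relative to the quantized backbone
    # Two GEMMs: the first would be a sparse FP4 GEMM on real hardware.
    return sparse_q @ w_q + residual_q @ w_q

# Toy activations with two outlier channels (made-up data for illustration).
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 64)) * 0.1
x[:, 3], x[:, 37] = 40.0, -60.0
w = rng.normal(size=(64, 32))

w_q = fake_quant_fp4(w.T).T
reference = x @ w_q
err_plain = np.linalg.norm(fake_quant_fp4(x) @ w_q - reference)
err_split = np.linalg.norm(sharq_like_matmul(x, w) - reference)
print("plain FP4 activation error:", err_plain)
print("sparse + residual error:   ", err_split)
```

Because the residual is computed against the already quantized backbone, the dense path absorbs both the values dropped by the mask and the rounding error of the sparse path, which is why a single residual GEMM can compensate for both.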
In practice
- Recover 43-63% of the NVFP4-to-FP16 accuracy gap.
- Achieve a 2.2-2.4x latency reduction over FP16 on an RTX 5090.
- Improve throughput 1.2-1.4x over FP8 on an RTX 5090.
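The gap-recovery figure can be read as a ratio. The summary does not spell out the definition, so the formula below assumes the usual one; the symbol A (a task accuracy score for each configuration) is notation introduced here for illustration.

```latex
\text{gap recovered} \;=\;
\frac{A_{\text{SharQ}} - A_{\text{NVFP4}}}{A_{\text{FP16}} - A_{\text{NVFP4}}}
```

Under this reading, a value of 0 means SharQ matches plain NVFP4, a value of 1 means it matches FP16, and the reported results correspond to values between 0.43 and 0.63.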
Topics
- LLM Inference
- FP4 Quantization
- Activation Sparsity
- Model Compression
- GPU Acceleration
- N:M Sparsity
Best for: Research Scientist, NLP Engineer, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.