SharQ: Bridging Activation Sparsity and FP4 Quantization for LLM Inference

2026-06-25 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Advanced, quick

Summary

SharQ is a training-free method that combines activation sparsity with FP4 quantization for Large Language Model (LLM) inference. It targets two problems: input-dependent outliers and the coupling of sparsification and quantization errors. Using an online sparse-dense decomposition, it generates an input-adaptive N:M mask to extract an outlier-dominated sparse backbone and quantizes that backbone to FP4. A dense residual is then defined relative to the quantized sparse backbone. SharQ processes the backbone with a sparse FP4 GEMM and uses a dense FP4 GEMM to compensate for mask-induced loss and sparse-path quantization error; both paths share a single FP4 weight payload. It requires no calibration, retraining, or model-specific tuning. Evaluated on models such as Llama-3.1-8B and Qwen3-30B-A3B, SharQ recovers 43-63% of the NVFP4-to-FP16 accuracy gap. On an RTX 5090, it achieves a 2.2-2.4x latency reduction over FP16 and a 1.2-1.4x throughput improvement over FP8.

Key takeaway

For ML engineers optimizing LLM inference on modern accelerators, SharQ is a training-free way to combine FP4 quantization with activation sparsity. Reported results on an RTX 5090 are a 2.2-2.4x latency reduction over FP16 and a 1.2-1.4x throughput improvement over FP8, with 43-63% of the NVFP4-to-FP16 accuracy gap recovered. Because it needs no retraining or calibration, it is worth evaluating for models such as Llama-3.1-8B and Qwen3-30B-A3B.

Key insights

SharQ combines activation sparsity and FP4 quantization for LLM inference via an online sparse-dense decomposition.

Principles

Input-dependent outliers dominate FP4 block scales.
Applying N:M sparsity directly couples mask-induced loss with quantization error.
Online sparse-dense decomposition improves accuracy.
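
The first principle can be illustrated with a toy simulation. This is a minimal sketch, not SharQ code: `fake_quant_fp4` is a hypothetical helper that assumes the FP4 E2M1 value grid, blocks of 16 values, and a full-precision per-block scale of absmax / 6. Real FP4 formats store the scale in a narrower type, which is ignored here.

```python
import numpy as np

# Magnitudes representable in FP4 E2M1 (the sign is handled separately).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_fp4(x, block=16):
    """Quantize then dequantize x in blocks of `block` values (simulation only).

    Assumption: one scale per block, chosen so the block's absmax maps to 6.
    """
    x = np.asarray(x, dtype=np.float64)
    blocks = x.reshape(-1, block)
    absmax = np.abs(blocks).max(axis=1, keepdims=True)
    scale = np.where(absmax > 0, absmax / FP4_GRID[-1], 1.0)
    scaled = blocks / scale
    # Round each scaled magnitude to the nearest representable FP4 value.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    return (np.sign(scaled) * FP4_GRID[idx] * scale).reshape(x.shape)

def mse(a, b):
    return float(np.mean((a - b) ** 2))

# One block of well-behaved values, and the same block with a single outlier.
clean = np.linspace(-1.5, 1.5, 16)
spiked = clean.copy()
spiked[0] = 100.0

clean_q = fake_quant_fp4(clean)
spiked_q = fake_quant_fp4(spiked)

# Compare the error on the 15 values that are identical in both blocks.
err_clean = mse(clean[1:], clean_q[1:])
err_spiked = mse(spiked[1:], spiked_q[1:])
print(err_clean, err_spiked)
```

With the outlier present, the block scale grows so large that every other value in the block rounds to zero, which is the effect the decomposition is meant to avoid.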

Method

SharQ generates an N:M mask, quantizes the sparse backbone to FP4, defines a dense residual, and processes both paths with shared FP4 weights and path-specific scales.
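
The steps above can be sketched numerically. The following is an illustrative simulation under stated assumptions, not the authors' implementation: `fake_quant_fp4`, `nm_mask`, and `sparse_dense_linear` are hypothetical names, the mask keeps the 2 largest magnitudes in every group of 4, quantization is simulated in float64 with one scale per block of 16, and both GEMMs are ordinary dense NumPy matmuls rather than sparse FP4 kernels.

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes

def fake_quant_fp4(x, block=16):
    """Simulated block-scaled FP4: quantize then dequantize along the last axis."""
    x = np.asarray(x, dtype=np.float64)
    blocks = x.reshape(-1, block)
    absmax = np.abs(blocks).max(axis=1, keepdims=True)
    scale = np.where(absmax > 0, absmax / FP4_GRID[-1], 1.0)
    scaled = blocks / scale
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    return (np.sign(scaled) * FP4_GRID[idx] * scale).reshape(x.shape)

def nm_mask(x, n=2, m=4):
    """Boolean mask keeping the n largest-magnitude entries in each group of m."""
    groups = np.abs(x).reshape(-1, m)
    keep = np.argsort(groups, axis=1)[:, -n:]
    mask = np.zeros(groups.shape, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return mask.reshape(x.shape)

def sparse_dense_linear(x, w, n=2, m=4):
    """x: (tokens, in_features), w: (out_features, in_features)."""
    w_q = fake_quant_fp4(w)                    # one weight payload for both paths
    mask = nm_mask(x, n, m)                    # computed online from the input
    backbone_q = fake_quant_fp4(np.where(mask, x, 0.0))  # sparse path, own scales
    # The residual is taken against the *quantized* backbone, so it carries both
    # the masked-out values and the backbone's quantization error.
    residual_q = fake_quant_fp4(x - backbone_q)          # dense path, own scales
    return backbone_q @ w_q.T + residual_q @ w_q.T

# Toy data: Gaussian activations with one large outlier per block of 16.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 64))
x[:, ::16] = 40.0
w = rng.normal(size=(32, 64))

w_q = fake_quant_fp4(w)
reference = x @ w_q.T                          # unquantized activations
plain = fake_quant_fp4(x) @ w_q.T              # direct FP4 on activations
decomposed = sparse_dense_linear(x, w)

err_plain = float(np.linalg.norm(plain - reference))
err_decomposed = float(np.linalg.norm(decomposed - reference))
print(err_plain, err_decomposed)
```

Zero entries stay zero under this quantizer, so the quantized backbone keeps its N:M pattern. In this toy setup the two activation paths get separate block scales simply because each is quantized independently; how SharQ stores and applies its path-specific scales is not specified in this summary.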

In practice

Recover 43-63% of the NVFP4-to-FP16 accuracy gap.
Achieve a 2.2-2.4x latency reduction over FP16 on an RTX 5090.
Improve throughput by 1.2-1.4x over FP8 on an RTX 5090.

Topics

LLM Inference
FP4 Quantization
Activation Sparsity
Model Compression
GPU Acceleration
N:M Sparsity

Code references

actypedef/SharQ

Best for: Research Scientist, NLP Engineer, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.