XFP: Quality-Targeted Adaptive Codebook Quantization with Sparse Outlier Separation for LLM Inference

· Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

XFP is a dynamic weight quantizer for Large Language Model (LLM) inference. Instead of choosing a bit-width by hand, the operator specifies a reconstruction-quality floor expressed as per-channel cosine similarity, and XFP configures the rest automatically. It requires no Hessian information, calibration data, or manual bit-width selection. XFP decomposes each weight matrix into a sparse fp16 outlier residual and a dense sub-byte index tensor that references per-group learned codebooks. Its two storage modes, V2 and V2a, share an auto-select frontend and a fused decode kernel. On Qwen3.5-122B-A10B, XFP V2 achieves 138 tok/s single-stream decode on RTX PRO 6000 Blackwell hardware, outperforming Marlin INT4 by 49% at TP=1. For larger models, the H-Process iteratively adjusts the cosine thresholds until the model fits within the memory constraint; it fits Qwen3.5-397B-A17B into 2×96 GB at ~3.4 effective bits, yielding 100.9 tok/s and 66.72% GSM8K strict-match.
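
The decomposition described above can be illustrated with a small NumPy sketch. It is not the XFP implementation: the function names, the k-means codebook fit, the group size, the fixed 1% outlier fraction, and the choice of rows as "channels" are all assumptions made for illustration, and indices are kept as one uint8 each rather than packed into sub-byte storage.

```python
import numpy as np

def quantize_group(values, n_codes, iters=10):
    """Fit a 1-D codebook to `values` with a few k-means (Lloyd) steps."""
    # Assumed initialisation: evenly spaced quantiles of the group.
    codebook = np.quantile(values, np.linspace(0.0, 1.0, n_codes))
    for _ in range(iters):
        idx = np.abs(values[:, None] - codebook[None, :]).argmin(axis=1)
        for k in range(n_codes):
            members = values[idx == k]
            if members.size:  # empty clusters keep their old centroid
                codebook[k] = members.mean()
    idx = np.abs(values[:, None] - codebook[None, :]).argmin(axis=1)
    return codebook, idx.astype(np.uint8)

def decompose(W, bits=4, group_size=64, outlier_frac=0.01):
    """Split W into per-group codebooks, dense indices, and a sparse fp16 residual."""
    flat = W.astype(np.float32).ravel()
    assert flat.size % group_size == 0, "toy sketch: size must divide into groups"
    n_out = int(round(outlier_frac * flat.size))
    # Positions of the largest-magnitude weights (the assumed outlier rule).
    out_pos = np.argsort(np.abs(flat))[flat.size - n_out:]
    dense = flat.copy()
    dense[out_pos] = 0.0  # keep outliers from skewing the codebooks
    groups = dense.reshape(-1, group_size)
    n_codes = 2 ** bits
    codebooks = np.empty((groups.shape[0], n_codes), dtype=np.float32)
    indices = np.empty(groups.shape, dtype=np.uint8)
    for g, vals in enumerate(groups):
        codebooks[g], indices[g] = quantize_group(vals, n_codes)
    deq = np.take_along_axis(codebooks, indices.astype(np.int64), axis=1).ravel()
    # Residual = what the dense part fails to represent at the outlier positions.
    residual = (flat[out_pos] - deq[out_pos]).astype(np.float16)
    return codebooks, indices, out_pos, residual

def reconstruct(shape, codebooks, indices, out_pos, residual):
    """Decode the dense indices, then add the sparse residual back in."""
    deq = np.take_along_axis(codebooks, indices.astype(np.int64), axis=1).ravel()
    deq[out_pos] += residual.astype(np.float32)
    return deq.reshape(shape)

def per_channel_cosine(W, W_hat):
    """Cosine similarity between each row of W and the matching row of W_hat."""
    num = (W * W_hat).sum(axis=1)
    den = np.linalg.norm(W, axis=1) * np.linalg.norm(W_hat, axis=1)
    return num / np.maximum(den, 1e-12)

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 128)).astype(np.float32)
W[0, :4] *= 25.0  # inject a few large-magnitude weights
parts = decompose(W)
W_hat = reconstruct(W.shape, *parts)
print("worst per-channel cosine:", per_channel_cosine(W, W_hat).min())
```

The worst-channel cosine printed at the end is the kind of quantity a quality floor would be checked against; how XFP actually picks outliers and trains its codebooks is not specified in this summary.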

Key takeaway

For NLP engineers and research scientists optimizing LLM inference, XFP replaces manual bit-width selection with quality targets. It is worth evaluating for throughput and memory efficiency, especially on large models such as Qwen3.5-397B-A17B, where it is reported to outperform traditional INT4 methods with expert pruning. The approach could simplify deployment and reduce the expertise required for efficient LLM serving.

Key insights

XFP is a dynamic LLM quantizer that automates codebook size and outlier budget based on operator-defined quality floors.

Principles

Method

The operator sets a per-channel cosine-similarity quality floor; XFP then automatically determines codebook size, outlier budget, and packing for each layer, without manual tuning.
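
One way to picture a quality-targeted selection loop is the sketch below: candidate configurations are ranked by an estimated storage cost, and the cheapest one whose worst channel meets the cosine floor is returned. Everything here is a hypothetical stand-in rather than XFP's auto-select frontend: the quantile-based codebook, the candidate grid, the group size, and the cost model (16-bit codebook entries, 16-bit value plus 32-bit position per outlier) are assumptions chosen to keep the example short.

```python
import itertools
import numpy as np

def codebook_roundtrip(W, bits, group_size, outlier_frac):
    """Quantize then dequantize W; a simple stand-in for a real encoder."""
    flat = W.astype(np.float32).ravel()
    n_out = int(round(outlier_frac * flat.size))
    out_pos = np.argsort(np.abs(flat))[flat.size - n_out:]
    dense = flat.copy()
    dense[out_pos] = 0.0
    groups = dense.reshape(-1, group_size)
    # Assumed codebook: per-group quantiles at the bin midpoints.
    q = (np.arange(2 ** bits) + 0.5) / 2 ** bits
    codebooks = np.quantile(groups, q, axis=1).T  # (n_groups, 2**bits)
    idx = np.abs(groups[:, :, None] - codebooks[:, None, :]).argmin(axis=2)
    deq = np.take_along_axis(codebooks, idx, axis=1).ravel()
    deq[out_pos] = flat[out_pos].astype(np.float16)  # outliers kept at fp16
    return deq.reshape(W.shape)

def min_channel_cosine(W, W_hat):
    """Worst-case cosine similarity over rows (rows are the assumed channels)."""
    num = (W * W_hat).sum(axis=1)
    den = np.linalg.norm(W, axis=1) * np.linalg.norm(W_hat, axis=1)
    return float((num / np.maximum(den, 1e-12)).min())

def effective_bits(bits, group_size, outlier_frac):
    """Hypothetical cost model in bits per weight."""
    codebook_cost = (2 ** bits) * 16 / group_size  # fp16 codebook entries
    outlier_cost = outlier_frac * (16 + 32)        # fp16 value + int32 position
    return bits + codebook_cost + outlier_cost

def auto_select(W, cos_floor, group_size=256):
    """Return the cheapest (bits, outlier_frac, cost) meeting the floor, or None."""
    candidates = itertools.product((2, 3, 4, 5), (0.0, 0.005, 0.01, 0.02))
    ranked = sorted(candidates,
                    key=lambda c: effective_bits(c[0], group_size, c[1]))
    for bits, frac in ranked:
        W_hat = codebook_roundtrip(W, bits, group_size, frac)
        if min_channel_cosine(W, W_hat) >= cos_floor:
            return bits, frac, effective_bits(bits, group_size, frac)
    return None  # no candidate in the grid reaches the floor

rng = np.random.default_rng(1)
W = rng.standard_normal((64, 256)).astype(np.float32)
for floor in (0.90, 0.98):
    print(floor, auto_select(W, floor))
```

Because candidates are tried in order of increasing cost, raising the floor can only keep or increase the selected cost, which mirrors the idea of trading quality against size through a single threshold.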

Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.