XFP: Quality-Targeted Adaptive Codebook Quantization with Sparse Outlier Separation for LLM Inference
Summary
XFP is a dynamic weight quantizer for Large Language Model (LLM) inference. Operators specify a reconstruction-quality floor, expressed as per-channel cosine similarity, and XFP chooses the quantization parameters needed to meet it. It requires no Hessian information, calibration data, or manual bit-width selection. XFP decomposes each weight matrix into a sparse fp16 outlier residual and a dense sub-byte index tensor backed by per-group learned codebooks. It offers two storage modes, V2 and V2a, which share an auto-select frontend and a fused decode kernel. On Qwen3.5-122B-A10B, XFP V2 achieves 138 tok/s single-stream decode on RTX PRO 6000 Blackwell hardware, outperforming Marlin INT4 by 49% at TP=1. For larger models, the H-Process iteratively adjusts the cosine thresholds until the model fits within a memory budget. It fits Qwen3.5-397B-A17B into 2x96 GB at ~3.4 effective bits, yielding 100.9 tok/s and 66.72% GSM8K strict-match.
Key takeaway
For NLP engineers and research scientists optimizing LLM inference, XFP replaces manual bit-width selection with quantization driven by quality targets. Consider evaluating it for throughput and memory efficiency, especially on large models such as Qwen3.5-397B-A17B, where it is reported to outperform traditional INT4 methods with expert pruning. This approach could simplify deployment and reduce the expertise required for efficient LLM serving.
Key insights
XFP is a dynamic LLM quantizer that automatically selects codebook size and outlier budget from operator-defined quality floors.
Principles
- Quality floors drive automatic quantization parameters.
- Decompose weights into sparse fp16 outliers and dense sub-byte indices.
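The second principle can be illustrated with a minimal sketch in plain Python. It is not XFP's implementation: the function names are hypothetical, the outlier rule (largest magnitudes first) is an assumption, and a uniform min/max codebook stands in for the learned per-group codebooks the summary mentions.

```python
def split_outliers(weights, outlier_fraction):
    """Separate the largest-magnitude weights into a sparse {index: value} map."""
    k = int(len(weights) * outlier_fraction)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    outlier_idx = set(order[:k])
    outliers = {i: weights[i] for i in outlier_idx}   # kept at full precision
    dense = [0.0 if i in outlier_idx else w for i, w in enumerate(weights)]
    return outliers, dense


def quantize_group(dense, bits):
    """Map one group to indices into a 2**bits-entry codebook.

    Assumption: the codebook is uniform between min and max, not learned.
    """
    lo, hi = min(dense), max(dense)
    levels = 2 ** bits
    step = (hi - lo) / (levels - 1) if hi > lo else 1.0
    codebook = [lo + j * step for j in range(levels)]
    indices = [min(levels - 1, max(0, round((w - lo) / step))) for w in dense]
    return codebook, indices


def reconstruct(codebook, indices, outliers):
    """Look up dense values, then overwrite the outlier positions exactly."""
    out = [codebook[i] for i in indices]
    for i, v in outliers.items():
        out[i] = v
    return out


weights = [0.1, -0.2, 0.05, 8.0, -0.15, 0.12, -7.5, 0.02]
outliers, dense = split_outliers(weights, outlier_fraction=0.25)
codebook, indices = quantize_group(dense, bits=2)
approx = reconstruct(codebook, indices, outliers)
```

Removing the two large values before quantizing keeps the codebook range narrow, so the remaining small weights share a much finer step than they would if 8.0 and -7.5 set the range.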
Method
The operator sets a per-channel cosine-similarity quality floor. XFP then determines the codebook size, outlier budget, and packing for each layer automatically, without manual tuning.
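A quality-floor search of this kind can be sketched as follows. This is an illustration under stated assumptions, not the XFP algorithm: each matrix row is treated as a channel, a uniform quantizer replaces the learned codebooks, the candidate bit-widths are arbitrary, and `select_bits` is a hypothetical name.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0 if na == nb else 0.0
    return dot / (na * nb)


def quantize_row(row, bits):
    """Uniform round-to-nearest quantization of one channel (stand-in codebook)."""
    lo, hi = min(row), max(row)
    if hi == lo:
        return list(row)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((w - lo) / step) * step for w in row]


def select_bits(matrix, floor, candidates=(2, 3, 4, 5, 6, 8)):
    """Return the smallest candidate bit-width whose worst channel meets the floor."""
    worst = 0.0
    for bits in candidates:
        worst = min(cosine(row, quantize_row(row, bits)) for row in matrix)
        if worst >= floor:
            return bits, worst
    return candidates[-1], worst  # floor not reachable with these candidates


# Four synthetic channels of 64 weights each.
matrix = [[math.sin(0.37 * i + c) for i in range(64)] for c in range(4)]
loose_bits, loose_cos = select_bits(matrix, floor=0.90)
tight_bits, tight_cos = select_bits(matrix, floor=0.999)
```

Because the check uses the worst channel rather than the average, a tighter floor can only keep the bit-width the same or raise it, which is the behaviour a quality floor is meant to guarantee.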
In practice
- Achieve 138 tok/s on Qwen3.5-122B-A10B with XFP V2.
- Use H-Process to fit large MoE models into constrained memory.
- Improve throughput and accuracy over INT4 with expert pruning.
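The memory-fitting idea behind the H-Process can be checked with simple arithmetic. The sketch below assumes that "397B" in the model name is the total parameter count and counts weights only (no KV cache or activations). The summary does not describe the H-Process internals, so `fit_to_budget` is only the simplest reading of "iteratively adjusts cosine thresholds", and `toy_bits_for_floor` is a made-up stand-in for running the quantizer.

```python
def footprint_gb(num_params, effective_bits):
    """Weight storage in decimal gigabytes at a given average bit-width."""
    return num_params * effective_bits / 8 / 1e9


def fit_to_budget(num_params, budget_gb, bits_for_floor,
                  start_floor=0.999, step=0.001, min_floor=0.90):
    """Lower the cosine floor step by step until the weights fit the budget.

    bits_for_floor is a hypothetical callback that reports the effective
    bits a quantizer would need to meet a given floor.
    """
    floor = start_floor
    while floor >= min_floor:
        bits = bits_for_floor(floor)
        size = footprint_gb(num_params, bits)
        if size <= budget_gb:
            return floor, bits, size
        floor = round(floor - step, 6)
    raise ValueError("no floor in the allowed range fits the budget")


def toy_bits_for_floor(floor):
    """Made-up monotone relation between floor and bits, for illustration only."""
    return 2.0 + 40.0 * (floor - 0.90)


# Figures from the summary: ~3.4 effective bits into 2 x 96 GB.
reported_size = footprint_gb(397e9, 3.4)      # roughly 168.7 GB
budget = 2 * 96.0

floor, bits, size = fit_to_budget(397e9, budget, toy_bits_for_floor)
```

At about 168.7 GB the reported configuration leaves some of the 192 GB budget free, which a real deployment would need for the KV cache and runtime buffers.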
Topics
- XFP Quantization
- LLM Inference Optimization
- Adaptive Codebook Quantization
- Sparse Outlier Separation
- H-Process
Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.