FairyFuse: Multiplication-Free LLM Inference on CPUs via Fused Ternary Kernels
Summary
FairyFuse is an inference system for large language models (LLMs) on CPU-only platforms that eliminates floating-point multiplications by using ternary weights (values in $\{-1,0,+1\}$). A companion to the Fairy2i complex-valued quantization-aware training method, it fuses the eight real-valued sub-GEMVs of each widely-linear layer into a single AVX-512 loop built from masked additions and subtractions. FairyFuse achieves a $29.6\times$ kernel speedup over FP32 and sustains 32.4 tokens per second on a single Intel Xeon 8558P socket, $1.24\times$ faster than llama.cpp Q4_K_M. Quality is near-lossless: WikiText-2 perplexity is 5.52 (vs. 5.47 for FP16) and average downstream accuracy is 66.0%, with a 3.3 GB model footprint.
Key takeaway
For NLP engineers and research scientists deploying LLMs on CPU-only platforms, FairyFuse demonstrates that extreme quantization with ternary weights can raise inference throughput with near-lossless model quality. Consider multiplication-free inference systems like FairyFuse to ease the memory bandwidth bottleneck and match or exceed 4-bit baselines, especially for privacy-sensitive edge deployments where GPUs are not available.
Key insights
Ternary weights enable multiplication-free LLM inference on CPUs, reaching 32.4 tokens per second ($1.24\times$ over llama.cpp Q4_K_M) at near-lossless quality.
Principles
- Memory bandwidth is the dominant bottleneck for CPU LLM inference.
- Ternary weights offer $16\times$ compression over FP32.
- Fusing sub-operations reduces overhead and improves efficiency.
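The $16\times$ figure follows directly from the packing described in this summary (16 ternary weights per 32-bit word, so 2 bits per weight). A minimal sketch of that arithmetic is below; the helper name `packed_bytes` and the 4096 x 4096 matrix size are illustrative assumptions, not values from the paper.

```python
# Storage arithmetic for ternary weights packed 16 per 32-bit word.
WEIGHTS_PER_WORD = 16
WORD_BITS = 32

bits_per_weight = WORD_BITS / WEIGHTS_PER_WORD   # 2 bits per ternary weight
compression_vs_fp32 = 32 / bits_per_weight       # FP32 uses 32 bits per weight
compression_vs_fp16 = 16 / bits_per_weight       # FP16 uses 16 bits per weight


def packed_bytes(n_weights: int) -> int:
    """Bytes needed for n_weights ternary values, rounded up to whole words."""
    words = -(-n_weights // WEIGHTS_PER_WORD)    # ceiling division
    return words * (WORD_BITS // 8)


# Hypothetical 4096 x 4096 weight matrix, chosen only for illustration.
n = 4096 * 4096
print("packed bytes:", packed_bytes(n))
print("FP32 bytes:  ", n * 4)
print("ratio:       ", (n * 4) / packed_bytes(n))
```

Because CPU decoding is bound by how many weight bytes must be streamed from memory per token, a smaller weight footprint translates fairly directly into fewer bytes moved per GEMV.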
Method
FairyFuse packs 16 ternary weights into a uint32, decodes positive/negative masks via BMI2 _pext_u32, and performs masked AVX-512 additions/subtractions, consolidating eight sub-GEMVs into one loop.
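The decode step can be illustrated with a scalar Python model. This is a sketch under stated assumptions, not the FairyFuse kernel: the bit layout (even bits flag $+1$, odd bits flag $-1$) is assumed, since the summary does not specify it, and `pack16`, `pext_u32`, and `masked_dot16` are hypothetical names. `pext_u32` is a software model of the BMI2 `_pext_u32` instruction, and the per-lane `if` stands in for a masked AVX-512 add or subtract.

```python
POS_LANES = 0x55555555  # even bits carry "+1" flags (assumed layout)
NEG_LANES = 0xAAAAAAAA  # odd bits carry "-1" flags (assumed layout)


def pack16(weights):
    """Pack 16 ternary weights into one 32-bit word, 2 bits per weight."""
    assert len(weights) == 16
    word = 0
    for i, w in enumerate(weights):
        if w == 1:
            word |= 1 << (2 * i)
        elif w == -1:
            word |= 1 << (2 * i + 1)
        elif w != 0:
            raise ValueError("weights must be in {-1, 0, +1}")
    return word


def pext_u32(src, mask):
    """Software model of BMI2 _pext_u32: gather the bits of src selected
    by mask into the contiguous low bits of the result."""
    out, k = 0, 0
    for bit in range(32):
        if (mask >> bit) & 1:
            out |= ((src >> bit) & 1) << k
            k += 1
    return out


def masked_dot16(word, x):
    """Dot product of 16 packed ternary weights with x using only add/sub."""
    pos = pext_u32(word, POS_LANES)  # 16-bit mask of lanes to add
    neg = pext_u32(word, NEG_LANES)  # 16-bit mask of lanes to subtract
    acc = 0.0
    for lane in range(16):
        if (pos >> lane) & 1:
            acc += x[lane]
        elif (neg >> lane) & 1:
            acc -= x[lane]
    return acc


weights = [1, -1, 0, 1, 0, 0, -1, 1, 1, 0, -1, -1, 0, 1, 0, -1]
x = [float(i + 1) for i in range(16)]
word = pack16(weights)
result = masked_dot16(word, x)
reference = sum(w * xi for w, xi in zip(weights, x))
print(result == reference)
```

In a vectorized kernel the two 16-bit masks would map naturally onto AVX-512 mask registers, with all 16 lanes updated per instruction rather than one lane per loop iteration.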
In practice
- Deploy ternary LLMs on CPUs for high throughput.
- Utilize AVX-512 and BMI2 for multiplication-free execution.
- Fuse complex-valued sub-GEMVs to optimize memory access.
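To make the fusion idea concrete, the sketch below assumes the standard widely-linear form $y = Ux + W\bar{x}$ with complex $U = U_r + iU_i$, $W = W_r + iW_i$, and $x = x_r + ix_i$, which expands into eight real sub-GEMVs. The summary does not spell out Fairy2i's exact parameterization, so treat this form, the function names, and the small example matrices as assumptions. The fused version visits each activation once and uses only add, subtract, or skip.

```python
def gemv(M, v):
    """Plain real matrix-vector product (reference only)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def widely_linear_unfused(Ur, Ui, Wr, Wi, xr, xi):
    """Eight separate real GEMVs, each producing a temporary vector."""
    a, b = gemv(Ur, xr), gemv(Ui, xi)
    c, d = gemv(Wr, xr), gemv(Wi, xi)
    e, f = gemv(Ur, xi), gemv(Ui, xr)
    g, h = gemv(Wi, xr), gemv(Wr, xi)
    n = len(a)
    y_re = [a[k] - b[k] + c[k] + d[k] for k in range(n)]
    y_im = [e[k] + f[k] + g[k] - h[k] for k in range(n)]
    return y_re, y_im


def signed(w, v):
    """Return +v, -v, or 0.0 for ternary w without multiplying."""
    return v if w == 1 else (-v if w == -1 else 0.0)


def widely_linear_fused(Ur, Ui, Wr, Wi, xr, xi):
    """Single loop: all eight contributions accumulate in one pass over x."""
    y_re, y_im = [], []
    for r in range(len(Ur)):
        re = im = 0.0
        for c in range(len(xr)):
            p, q = xr[c], xi[c]
            re += (signed(Ur[r][c], p) - signed(Ui[r][c], q)
                   + signed(Wr[r][c], p) + signed(Wi[r][c], q))
            im += (signed(Ur[r][c], q) + signed(Ui[r][c], p)
                   + signed(Wi[r][c], p) - signed(Wr[r][c], q))
        y_re.append(re)
        y_im.append(im)
    return y_re, y_im


# Hypothetical 2 x 4 ternary matrices and inputs, for illustration only.
Ur = [[1, 0, -1, 1], [0, 1, 1, -1]]
Ui = [[-1, 1, 0, 0], [1, 0, -1, 1]]
Wr = [[0, -1, 1, 1], [1, 1, 0, -1]]
Wi = [[1, 0, 0, -1], [-1, 1, 1, 0]]
xr = [1.0, -2.0, 0.5, 3.0]
xi = [0.5, 1.0, -1.5, 2.0]

fused = widely_linear_fused(Ur, Ui, Wr, Wi, xr, xi)
unfused = widely_linear_unfused(Ur, Ui, Wr, Wi, xr, xi)
print(fused == unfused)
```

The unfused version reads the activations eight times and materializes eight temporaries, while the fused loop reads each activation pair once per output row, which is the memory-access saving the bullet above refers to.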
Topics
- LLM Inference Optimization
- Ternary Quantization
- CPU Performance
- AVX-512 Kernels
- Memory Bandwidth
Best for: NLP Engineer, Research Scientist, Machine Learning Engineer, AI Engineer, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.