FairyFuse: Multiplication-Free LLM Inference on CPUs via Fused Ternary Kernels

Source: cs.LG updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering) · Depth: Expert, extended

Summary

FairyFuse is an inference system for large language models (LLMs) on CPU-only platforms that eliminates floating-point multiplications by exploiting ternary weights (values in $\{-1,0,+1\}$): multiplying an activation by such a weight reduces to adding it, subtracting it, or skipping it. The system is a companion to the Fairy2i complex-valued quantization-aware training method. It fuses the eight real-valued sub-GEMVs of each widely-linear layer into a single AVX-512 loop built from masked additions and subtractions. FairyFuse achieves a $29.6\times$ kernel speedup over FP32 and sustains 32.4 tokens per second on a single Intel Xeon 8558P socket, outperforming llama.cpp Q4_K_M by $1.24\times$. Quality is near-lossless: WikiText-2 perplexity is 5.52 (vs. 5.47 for FP16) and average downstream accuracy is 66.0%, with a 3.3 GB model footprint.
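To make the "eight real-valued sub-GEMVs" concrete, the following is a minimal sketch that assumes the standard widely-linear form $y = Ux + W\bar{x}$ with complex $U = U_r + iU_i$ and $W = W_r + iW_i$. The summary does not spell out Fairy2i's exact layer definition, so the form, the function names, and the small test matrices are illustrative assumptions. Each complex product expands into four real matrix-vector products, giving eight in total, and the sketch checks that expansion against Python's built-in complex arithmetic.

```python
def matvec(M, v):
    """Plain matrix-vector product over lists (real or complex entries)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def widely_linear_complex(Ur, Ui, Wr, Wi, xr, xi):
    """Reference: y = U x + W conj(x), evaluated with complex numbers."""
    U = [[complex(a, b) for a, b in zip(ra, rb)] for ra, rb in zip(Ur, Ui)]
    W = [[complex(a, b) for a, b in zip(ra, rb)] for ra, rb in zip(Wr, Wi)]
    x = [complex(a, b) for a, b in zip(xr, xi)]
    x_conj = [z.conjugate() for z in x]
    return [p + q for p, q in zip(matvec(U, x), matvec(W, x_conj))]

def widely_linear_real(Ur, Ui, Wr, Wi, xr, xi):
    """Same result from eight real sub-GEMVs (assumed decomposition)."""
    g = [
        matvec(Ur, xr), matvec(Ui, xi), matvec(Wr, xr), matvec(Wi, xi),  # real part
        matvec(Ur, xi), matvec(Ui, xr), matvec(Wr, xi), matvec(Wi, xr),  # imaginary part
    ]
    y_re = [a - b + c + d for a, b, c, d in zip(g[0], g[1], g[2], g[3])]
    y_im = [a + b - c + d for a, b, c, d in zip(g[4], g[5], g[6], g[7])]
    return y_re, y_im

# Hypothetical ternary component matrices and a small input vector.
Ur = [[1, 0], [-1, 1]]
Ui = [[0, -1], [1, 1]]
Wr = [[1, 1], [0, -1]]
Wi = [[-1, 0], [1, 0]]
xr = [2.0, -3.0]
xi = [0.5, 4.0]

y_ref = widely_linear_complex(Ur, Ui, Wr, Wi, xr, xi)
y_re, y_im = widely_linear_real(Ur, Ui, Wr, Wi, xr, xi)
```

Because every component matrix here holds only values in $\{-1,0,+1\}$, each of the eight products can in principle be evaluated with additions and subtractions alone, which is what makes a single fused loop attractive.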

Key takeaway

For NLP engineers and research scientists deploying LLMs on CPU-only platforms, FairyFuse shows that ternary-weight quantization can raise inference throughput at near-lossless quality. Multiplication-free inference systems like FairyFuse are worth considering when decoding is limited by memory bandwidth: they are competitive with 4-bit baselines, which matters most for privacy-sensitive edge deployments where GPUs are not available.
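A rough back-of-the-envelope check shows why memory bandwidth is the relevant bottleneck. The sketch below uses only figures quoted in the summary and assumes, as a simplification not stated in the source, that decoding one token streams the full weight footprint from memory exactly once (dense model, batch size 1, no reuse across tokens). It also reads "outperforming by $1.24\times$" as a throughput ratio. The variable names are illustrative.

```python
# Figures quoted in the summary above.
model_footprint_gb = 3.3     # FairyFuse model size
tokens_per_second = 32.4     # sustained throughput on one Xeon 8558P socket
speedup_vs_q4_k_m = 1.24     # reported advantage over llama.cpp Q4_K_M

# Assumption: every generated token reads all weights once.
weight_traffic_gb_s = model_footprint_gb * tokens_per_second

# Assumption: the 1.24x figure is a ratio of tokens per second.
implied_q4_k_m_tokens_per_second = tokens_per_second / speedup_vs_q4_k_m

print(f"weight traffic: ~{weight_traffic_gb_s:.1f} GB/s")
print(f"implied Q4_K_M throughput: ~{implied_q4_k_m_tokens_per_second:.1f} tokens/s")
```

Under these assumptions the kernel moves on the order of 100 GB/s of weights, so a smaller footprint translates fairly directly into more tokens per second.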

Key insights

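Ternary weights enable multiplication-free LLM inference on CPUs: FairyFuse sustains 32.4 tokens per second on one Xeon 8558P socket ($1.24\times$ llama.cpp Q4_K_M) while keeping WikiText-2 perplexity at 5.52, against 5.47 for FP16.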

Principles

Method

FairyFuse packs 16 ternary weights into a uint32, decodes positive/negative masks via BMI2 _pext_u32, and performs masked AVX-512 additions/subtractions, consolidating eight sub-GEMVs into one loop.
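The packing and decoding steps can be modeled in plain Python. This is a sketch under stated assumptions, not the FairyFuse kernel: the bit layout (two bits per weight, even bit set for $+1$, odd bit set for $-1$), the helper names, and the one-row-at-a-time accumulation with a final lane reduction are all hypothetical choices made for illustration. `pext_u32` is a software model of the BMI2 parallel-bit-extract instruction, and `mask_add` / `mask_sub` stand in for 16-lane masked AVX-512 additions and subtractions.

```python
POS_BITS = 0x55555555  # even bit positions: assumed "+1" flags
NEG_BITS = 0xAAAAAAAA  # odd bit positions: assumed "-1" flags

def pack16(weights):
    """Pack 16 ternary weights into one 32-bit word (assumed layout)."""
    if len(weights) != 16:
        raise ValueError("expected exactly 16 weights")
    word = 0
    for lane, w in enumerate(weights):
        if w == 1:
            word |= 1 << (2 * lane)
        elif w == -1:
            word |= 1 << (2 * lane + 1)
        elif w != 0:
            raise ValueError("weights must be -1, 0, or +1")
    return word

def pext_u32(src, mask):
    """Gather the bits of src selected by mask into the low bits of the result."""
    out, out_pos = 0, 0
    for bit in range(32):
        if (mask >> bit) & 1:
            out |= ((src >> bit) & 1) << out_pos
            out_pos += 1
    return out

def mask_add(acc, mask, x):
    """Add x[i] to acc[i] only in lanes whose mask bit is set."""
    return [a + v if (mask >> i) & 1 else a for i, (a, v) in enumerate(zip(acc, x))]

def mask_sub(acc, mask, x):
    """Subtract x[i] from acc[i] only in lanes whose mask bit is set."""
    return [a - v if (mask >> i) & 1 else a for i, (a, v) in enumerate(zip(acc, x))]

def ternary_gemv(packed_rows, x):
    """y = W x with W given as rows of packed words; no multiplications."""
    y = []
    for words in packed_rows:
        acc = [0.0] * 16                        # models one 16-lane register
        for block, word in enumerate(words):
            chunk = x[16 * block:16 * block + 16]
            acc = mask_add(acc, pext_u32(word, POS_BITS), chunk)
            acc = mask_sub(acc, pext_u32(word, NEG_BITS), chunk)
        y.append(sum(acc))                      # horizontal reduction
    return y

def pack_matrix(W):
    """Pack each row (length a multiple of 16) into 32-bit words."""
    return [[pack16(row[i:i + 16]) for i in range(0, len(row), 16)] for row in W]
```

As a usage example, `ternary_gemv(pack_matrix(W), x)` should agree with an ordinary multiply-and-sum over the same ternary matrix `W`. In a real kernel the two masks would feed hardware masked add and subtract instructions directly, and the accumulator layout would be chosen to suit the fused eight-way loop.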

Best for: NLP Engineer, Research Scientist, Machine Learning Engineer, AI Engineer, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.