Qwen3.5 9B MoQ: Inside a Strong 3.6-bit GGUF
Summary
This brief covers three advances in model efficiency and inference performance.
First, it analyzes the Qwen3.5 9B MoQ GGUFs, a quantized model averaging about 3.6 bits per weight (3.5877 bpw) that shows stronger benchmark performance and token efficiency than UD-Q2_K_XL. The MoQ model uses a more uniform 3-bit allocation across transformer tensors, embeddings, and the output layer.
Second, LiquidAI's LFM2.5 8B A1B is a Mixture-of-Experts model with 8.3 billion total parameters and 1.5 billion active parameters. It reaches high CPU decode speeds (253 tokens/sec on an M5 Max) in under 6 GB of memory. It was trained on 38 trillion tokens, has a 128K context window, and uses a doubled vocabulary of 128,000 tokens for multilingual efficiency.
Finally, TokenSpeed is an inference system for long-context, short-query decode on MLA (Multi-head Latent Attention) models, optimized for NVIDIA Blackwell GPUs. It improves attention kernel efficiency by reshaping computations and using split-KV parallelism, reaching up to 580 tps for Qwen3.5 397B A17B.
Key takeaway
MLOps engineers optimizing LLM deployment on edge devices or with long contexts should consider evaluating these techniques. Qwen3.5 9B MoQ GGUFs offer strong accuracy at a 3.6-bit average, which can reduce memory footprint. LiquidAI's LFM2.5 8B A1B provides a performant, multilingual MoE option for on-device inference. If you run MLA models on NVIDIA Blackwell GPUs, consider integrating TokenSpeed to speed up long-context, short-query decode.
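To make the memory-footprint point concrete, here is a rough back-of-the-envelope sketch. It assumes the nominal 9 billion parameter count implied by the model name (the exact count is not given here) and uses the 3.5877 bpw average from the summary. It counts weights only, ignoring file metadata, the KV cache, and runtime buffers; the function name is illustrative.

```python
def estimate_weight_bytes(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in bytes (weights only)."""
    return num_params * bits_per_weight / 8

NOMINAL_PARAMS = 9e9  # assumption: "9B" taken at face value
MOQ_BPW = 3.5877      # average bits per weight reported in the summary

size_bytes = estimate_weight_bytes(NOMINAL_PARAMS, MOQ_BPW)
# Report both decimal gigabytes and binary gibibytes.
print(f"{size_bytes / 1e9:.2f} GB ({size_bytes / 2**30:.2f} GiB)")
```

Under these assumptions the weights come to roughly 4 GB. An actual GGUF file may differ because the true parameter count and per-tensor overhead are not captured by this estimate.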
Key insights
Advanced quantization and MoE architectures significantly boost LLM efficiency and performance on diverse hardware.
Principles
- Uniform low-bit quantization can outperform heterogeneous schemes.
- Small MoE models can achieve high performance with extensive training.
- Optimizing attention kernels improves long-context inference speed.
Method
MoQ involves a compact, smoother allocation of 3-bit quantization across most major transformer blocks, embeddings, and output layers, avoiding high-bit exceptions. TokenSpeed optimizes attention by folding query length into head dimension and using split-KV parallelism.
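The split-KV idea can be sketched in plain Python. This is the generic technique of attending over chunks of the key/value cache independently and merging the partial results through their log-sum-exp values. It is not TokenSpeed's kernel, it does not cover the query-length folding step, and the function names are illustrative. On a GPU each chunk would be processed in parallel; here they run in a loop.

```python
import math

def attend(q, keys, values):
    """Softmax attention for one query vector over a list of keys and values.
    Returns (output vector, log-sum-exp of the scaled scores)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total
           for d in range(dim)]
    return out, m + math.log(total)

def split_kv_attend(q, keys, values, num_splits):
    """Attend over each KV chunk independently, then merge the partials."""
    n = len(keys)
    chunk = math.ceil(n / num_splits)
    partials = [attend(q, keys[i:i + chunk], values[i:i + chunk])
                for i in range(0, n, chunk)]
    # Each partial is weighted by exp(lse_i), rescaled by the largest lse.
    top = max(lse for _, lse in partials)
    coeffs = [math.exp(lse - top) for _, lse in partials]
    norm = sum(coeffs)
    dim = len(partials[0][0])
    return [sum(c * out[d] for c, (out, _) in zip(coeffs, partials)) / norm
            for d in range(dim)]
```

Because the merge reweights each chunk by its share of the total softmax mass, the result matches attention over the full cache up to floating-point error. This is why splitting helps when the query is short and the context is long: the long KV axis supplies the parallelism that the query axis cannot.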
In practice
- Evaluate MoQ GGUFs for Qwen3.5 9B for memory-constrained deployment.
- Consider LFM2.5 8B A1B for on-device, multilingual applications.
- Implement TokenSpeed for MLA models on Blackwell GPUs for long contexts.
Topics
- Mixture of Quantization
- GGUF Models
- Mixture-of-Experts
- On-Device AI
- LLM Inference Optimization
- NVIDIA Blackwell GPUs
- Multi-head Latent Attention
Best for: AI Engineer, AI Architect, NLP Engineer, Machine Learning Engineer, AI Scientist, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by The Kaitchup – AI on a Budget.