Qwen3.5 9B MoQ: Inside a Strong 3.6-bit GGUF

2026-04-15 · Source: The Kaitchup – AI on a Budget · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, medium

Summary

This brief covers three developments in model efficiency and inference performance.

First, the Qwen3.5 9B MoQ GGUF is a quantized model averaging 3.5877 bits per weight (bpw), roughly 3.6-bit, that shows better benchmark performance and token efficiency than UD-Q2_K_XL. It uses a more uniform 3-bit allocation across transformer tensors, embeddings, and the output layer.

Second, LiquidAI's LFM2.5 8B A1B is an 8.3-billion-parameter Mixture-of-Experts (MoE) model with 1.5 billion active parameters. It reaches high CPU decode speeds (253 tokens/sec on an M5 Max) in under 6 GB of memory. It was trained on 38 trillion tokens, has a 128K context window, and uses a vocabulary doubled to 128,000 tokens for multilingual efficiency.

Finally, TokenSpeed is an inference system for long-context, short-query decode on Multi-head Latent Attention (MLA) models, optimized for NVIDIA Blackwell GPUs. It improves attention kernel efficiency by reshaping computations and using split-KV parallelism, reaching up to 580 tokens/sec for Qwen3.5 397B A17B.

Key takeaway

MLOps engineers optimizing LLM deployment on edge devices or with long contexts should evaluate these three options. Qwen3.5 9B MoQ GGUFs offer strong accuracy at a 3.6-bit average, which can reduce memory footprint. LiquidAI's LFM2.5 8B A1B is a fast, multilingual MoE option for on-device inference. If you run MLA models on NVIDIA Blackwell GPUs, TokenSpeed is worth testing for long-context, short-query decode, where it reports up to 580 tokens/sec on Qwen3.5 397B A17B.
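To put the 3.6-bit figure in concrete terms, the arithmetic below gives a rough estimate of weight storage from an average bpw. It is a back-of-the-envelope sketch: the parameter count is assumed to be a nominal 9 billion (the exact tensor count is not given in the brief), and it ignores GGUF metadata, KV cache, and runtime buffers.

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in bytes for a given average bits per weight."""
    return n_params * bits_per_weight / 8.0


N_PARAMS = 9e9      # assumption: nominal "9B"; the real parameter count may differ
MOQ_BPW = 3.5877    # average bpw reported in the brief
FP16_BPW = 16.0     # unquantized half-precision baseline, for comparison only

moq_gb = weight_bytes(N_PARAMS, MOQ_BPW) / 1e9
fp16_gb = weight_bytes(N_PARAMS, FP16_BPW) / 1e9

print(f"MoQ weights:  {moq_gb:.2f} GB")
print(f"FP16 weights: {fp16_gb:.2f} GB")
print(f"Ratio:        {moq_gb / fp16_gb:.3f}")
```

Under these assumptions the weights alone come to roughly 4 GB, against 18 GB at FP16. Actual file sizes depend on the real parameter count and on per-tensor overhead.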

Key insights

Careful low-bit quantization, sparse MoE architectures, and attention-kernel optimization each raise LLM efficiency on their target hardware: memory-constrained devices, CPUs, and Blackwell GPUs respectively.

Principles

Uniform low-bit quantization can outperform heterogeneous schemes.
Small MoE models can achieve high performance with extensive training.
Optimizing attention kernels improves long-context inference speed.

Method

MoQ allocates 3-bit quantization compactly and more uniformly across most major transformer blocks, the embeddings, and the output layer, avoiding high-bit exceptions. TokenSpeed optimizes attention by folding query length into the head dimension and by splitting the KV cache into chunks that are processed in parallel (split-KV parallelism).
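The split-KV idea can be illustrated with a small, generic sketch. This is not TokenSpeed's kernel: it is a plain-Python, single-query, single-head illustration of the general technique, in which each KV chunk produces a partial softmax result and the partials are merged with a log-sum-exp rescale. All function names here are hypothetical, and the chunks run sequentially, whereas a GPU kernel would assign them to parallel thread blocks.

```python
import math


def attend_chunk(q, keys, values):
    """Attention over one KV chunk.

    Returns (chunk max score, sum of exp weights, unnormalized output).
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    out = [
        sum(w * v[d] for w, v in zip(weights, values))
        for d in range(len(values[0]))
    ]
    return m, sum(weights), out


def merge(partials):
    """Combine per-chunk partials by rescaling each to the global max score."""
    m_all = max(m for m, _, _ in partials)
    denom = 0.0
    out = [0.0] * len(partials[0][2])
    for m, s, o in partials:
        c = math.exp(m - m_all)
        denom += c * s
        out = [acc + c * x for acc, x in zip(out, o)]
    return [x / denom for x in out]


def split_kv_attention(q, keys, values, num_splits):
    """Split the KV sequence into up to num_splits chunks, attend, then merge."""
    size = math.ceil(len(keys) / num_splits)
    partials = [
        attend_chunk(q, keys[i:i + size], values[i:i + size])
        for i in range(0, len(keys), size)
    ]
    return merge(partials)


# Small deterministic example: one query against 10 cached key/value pairs.
q = [0.3, -1.2, 0.7, 0.5]
keys = [[math.sin(i * 4 + d) for d in range(4)] for i in range(10)]
values = [[math.cos(i * 3 + d) for d in range(3)] for i in range(10)]

single = split_kv_attention(q, keys, values, num_splits=1)
split = split_kv_attention(q, keys, values, num_splits=4)
# The two results should agree up to floating-point rounding.
print(max(abs(a - b) for a, b in zip(single, split)))
```

Splitting along the KV axis matters most for the long-context, short-query case the brief describes: with only a few query tokens there is little parallel work per head, so dividing the long KV sequence gives the hardware more independent chunks to process.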

In practice

Evaluate MoQ GGUFs for Qwen3.5 9B for memory-constrained deployment.
Consider LFM2.5 8B A1B for on-device, multilingual applications.
Use TokenSpeed for MLA models on Blackwell GPUs when serving long contexts.

Topics

Mixture of Quantization
GGUF Models
Mixture-of-Experts
On-Device AI
LLM Inference Optimization
NVIDIA Blackwell GPUs
Multi-head Latent Attention

Code references

lightseekorg/tokenspeed

Best for: AI Engineer, AI Architect, NLP Engineer, Machine Learning Engineer, AI Scientist, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by The Kaitchup – AI on a Budget.