Qwen3.5 9B MoQ: Inside a Strong 3.6-bit GGUF

· Source: The Kaitchup – AI on a Budget · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, medium

Summary

This brief covers three advances in model efficiency and inference performance. First, it analyzes the Qwen3.5 9B MoQ GGUFs, a quantized model averaging about 3.6 bits per weight (3.5877 bpw) that shows stronger benchmark results and better token efficiency than UD-Q2_K_XL. The MoQ model applies a more uniform 3-bit allocation across transformer tensors, embeddings, and the output layer. Second, LiquidAI's LFM2.5 8B A1B, a Mixture-of-Experts model with 8.3 billion total and 1.5 billion active parameters, reaches high CPU decode speeds (253 tokens/sec on an M5 Max) in under 6 GB of memory. It was trained on 38 trillion tokens, has a 128K context window, and uses a vocabulary doubled to 128,000 tokens for multilingual efficiency. Third, TokenSpeed is an inference system for long-context, short-query decode on MLA (Multi-head Latent Attention) models, optimized for NVIDIA Blackwell GPUs. It improves attention kernel efficiency by reshaping computations and using split-KV parallelism, reaching up to 580 tokens/sec for Qwen3.5 397B A17B.
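The 3.5877 bpw figure is an average, and an average alone does not show how bits are spread across tensors. A minimal sketch, assuming bpw is the parameter-weighted mean over tensor groups; the group sizes and per-group bit widths below are hypothetical illustration values, not the actual MoQ or UD-Q2_K_XL layouts:

```python
def average_bpw(groups):
    """Parameter-weighted mean bits per weight over (name, n_params, bpw) groups."""
    total_params = sum(n for _, n, _ in groups)
    total_bits = sum(n * bpw for _, n, bpw in groups)
    return total_bits / total_params

# Hypothetical allocation A: every group stored at the same width.
uniform = [
    ("transformer blocks", 8.0e9, 3.5),
    ("embeddings", 0.5e9, 3.5),
    ("output", 0.5e9, 3.5),
]

# Hypothetical allocation B: lower-bit blocks plus high-bit exceptions.
mixed = [
    ("transformer blocks", 8.0e9, 3.25),
    ("embeddings", 0.5e9, 5.5),
    ("output", 0.5e9, 5.5),
]

uniform_bpw = average_bpw(uniform)
mixed_bpw = average_bpw(mixed)
print(f"uniform: {uniform_bpw:.4f} bpw, mixed: {mixed_bpw:.4f} bpw")
```

Both hypothetical allocations land on the same average, which is why a comparison between quantized files should look at the per-tensor layout as well as the headline bpw.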

Key takeaway

For MLOps engineers optimizing LLM deployment on edge devices or with long contexts, these three techniques are worth evaluating. The Qwen3.5 9B MoQ GGUFs offer strong accuracy at a 3.6-bit average, which can reduce memory footprint. LiquidAI's LFM2.5 8B A1B provides a performant, multilingual MoE option for on-device inference. If you run MLA models on NVIDIA Blackwell GPUs, consider TokenSpeed to speed up long-context, short-query decode.
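To put the footprint point in rough numbers, weight storage scales as parameters times bits per weight, divided by 8. A back-of-the-envelope sketch, assuming a nominal 9 billion parameters (the exact count may differ) and the 3.5877 bpw average quoted in the summary; it ignores KV cache, activations, and runtime buffers:

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate bytes needed to store the weights alone."""
    return n_params * bits_per_weight / 8

N_PARAMS = 9e9  # nominal "9B"; an assumption, not the exact parameter count

moq_bytes = weight_bytes(N_PARAMS, 3.5877)   # average bpw from the summary
fp16_bytes = weight_bytes(N_PARAMS, 16)      # 16-bit baseline for comparison

print(f"MoQ weights:  {moq_bytes / 1e9:.2f} GB")
print(f"FP16 weights: {fp16_bytes / 1e9:.2f} GB")
print(f"Reduction:    {fp16_bytes / moq_bytes:.2f}x")
```

Under these assumptions the quantized weights come to roughly 4 GB against 18 GB at 16 bits. Actual GGUF file sizes will differ somewhat because of metadata and the real parameter count.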

Key insights

Uniform low-bit quantization, sparse MoE architectures, and decode-specific attention kernels each improve LLM efficiency on different hardware, from CPUs to NVIDIA Blackwell GPUs.

Method

MoQ applies 3-bit quantization compactly and evenly across most major transformer blocks, the embeddings, and the output layer, avoiding high-bit exceptions. TokenSpeed optimizes attention by folding the query length into the head dimension and by using split-KV parallelism.
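The two attention ideas can be illustrated in plain Python. This is a sketch of the general technique, not TokenSpeed's kernel: it assumes a single key/value cache shared by all heads (as in MLA-style latent caches), reads "folding" as stacking the few query positions with the heads into one row axis, lets every query row attend to the whole cache (no causal mask among query tokens), and merges the per-split partial results with the usual log-sum-exp rescaling. All function names and sizes are hypothetical.

```python
import math
import random

def attend_chunk(q_rows, keys, values):
    """Partial attention of each query row over one KV chunk.
    Returns per row: (local max score, local exp-sum, unnormalized output)."""
    scale = 1.0 / math.sqrt(len(q_rows[0]))
    partials = []
    for q in q_rows:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        out = [sum(wi * v[j] for wi, v in zip(w, values))
               for j in range(len(values[0]))]
        partials.append((m, sum(w), out))
    return partials

def merge(partials_per_chunk):
    """Combine chunk partials into exact softmax attention outputs."""
    merged = []
    for r in range(len(partials_per_chunk[0])):
        parts = [chunk[r] for chunk in partials_per_chunk]
        m_all = max(m for m, _, _ in parts)
        denom = sum(l * math.exp(m - m_all) for m, l, _ in parts)
        numer = [sum(o[j] * math.exp(m - m_all) for m, _, o in parts)
                 for j in range(len(parts[0][2]))]
        merged.append([x / denom for x in numer])
    return merged

def split_kv_attention(q, keys, values, n_splits):
    """q: [q_len][n_heads][d]; keys and values: [kv_len][d], shared by all heads."""
    n_heads = len(q[0])
    # Fold: (q_len, n_heads, d) -> (q_len * n_heads, d) rows over one shared cache.
    q_rows = [head for token in q for head in token]
    size = math.ceil(len(keys) / n_splits)
    # Each chunk is independent, so a real kernel can process them in parallel.
    partials = [attend_chunk(q_rows, keys[i:i + size], values[i:i + size])
                for i in range(0, len(keys), size)]
    flat = merge(partials)
    # Unfold back to [q_len][n_heads][d].
    return [flat[t * n_heads:(t + 1) * n_heads] for t in range(len(q))]

random.seed(0)
Q_LEN, N_HEADS, D, KV_LEN = 2, 4, 8, 64  # short query, longer context
q = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(N_HEADS)]
     for _ in range(Q_LEN)]
keys = [[random.gauss(0, 1) for _ in range(D)] for _ in range(KV_LEN)]
values = [[random.gauss(0, 1) for _ in range(D)] for _ in range(KV_LEN)]

full = split_kv_attention(q, keys, values, n_splits=1)   # unsplit reference
split = split_kv_attention(q, keys, values, n_splits=4)  # four KV chunks
max_err = max(abs(a - b)
              for tf, ts in zip(full, split)
              for hf, hs in zip(tf, ts)
              for a, b in zip(hf, hs))
print(f"max abs difference between split and unsplit: {max_err:.2e}")
```

The split and unsplit paths agree up to floating-point error, so splitting the cache changes only how the work is scheduled. With a short query and a long cache, the split axis is where most of the available parallelism comes from.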

Best for: AI Engineer, AI Architect, NLP Engineer, Machine Learning Engineer, AI Scientist, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by The Kaitchup – AI on a Budget.