Qwen3.5 Quantization: Similar Accuracy, More Thinking — Best Models and Recipes
Summary
This analysis evaluates the impact of quantization on the Qwen3.5 model family, which ranges from 0.8B to 397B parameters. It compares several numerical formats (BF16, FP8, INT4, and NVFP4) on the Qwen3.5 9B, 27B, and 35B-A3B models, looking at both accuracy and memory usage. The study shows that 4-bit models can approach the accuracy of the original model, but getting there is not straightforward because some parts of the model are highly sensitive to quantization. For instance, a 4-bit Qwen3.5 27B can be stronger than Qwen3.5 9B at a similar memory footprint. The article also provides a practical guide to quantizing models to INT4 and NVFP4 with AutoRound and serving them with vLLM for high-throughput inference, including the GPU setup and software requirements for efficient benchmarking.
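The memory side of the 27B versus 9B comparison can be checked with back-of-the-envelope arithmetic. The sketch below is only an illustration: it assumes the nominal parameter counts from the model names, counts weights only, and ignores quantization scales, layers kept in 16-bit, the KV cache, and activations, so real checkpoints will be somewhat larger. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Nominal sizes taken from the model names (an assumption, not measured values).
qwen_9b_16bit = weight_memory_gb(9, 16)   # 9B parameters at 16 bits each
qwen_27b_4bit = weight_memory_gb(27, 4)   # 27B parameters at 4 bits each

print(f"9B  @ 16-bit: ~{qwen_9b_16bit:.1f} GB")
print(f"27B @ 4-bit:  ~{qwen_27b_4bit:.1f} GB")
```

Under these assumptions the two models land in the same rough memory class, which is what makes the accuracy comparison between them meaningful.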
Key takeaway
AI engineers deploying Qwen3.5 models should choose a quantization strategy that balances memory footprint against performance. Prioritize keeping shared expert and linear attention layers in 16-bit precision, especially for reasoning-heavy tasks or long sequence generation, to limit accuracy degradation and avoid excessive token generation. Use AutoRound and vLLM for efficient quantization and serving, but be aware that evaluation requires significant GPU memory bandwidth.
Key insights
Quantization significantly reduces LLM memory footprint while often retaining substantial accuracy, but careful layer selection is crucial.
Principles
- 4-bit quantization can often approach the original model's accuracy.
- Some model parts are highly sensitive to quantization.
- Evaluation is more compute-intensive than quantization itself.
Method
Quantize Qwen3.5 models to INT4 or NVFP4 with AutoRound by specifying the W4A16 or NVFP4 scheme and excluding sensitive layers, such as shared experts or linear attention, from quantization. Then serve the resulting model with vLLM.
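One way to express "exclude the sensitive layers" is a per-layer override that keeps matching modules at 16 bits. The sketch below is plain Python and does not call AutoRound. The module names, the `shared_expert` and `linear_attn` patterns, and the `{"bits": 16}` override shape are all assumptions made for illustration, so check them against the actual model's module names and the AutoRound documentation before use.

```python
from fnmatch import fnmatch

# Hypothetical name patterns for the layers the article flags as sensitive.
SENSITIVE_PATTERNS = ["*shared_expert*", "*linear_attn*"]


def build_layer_config(module_names, patterns=SENSITIVE_PATTERNS, keep_bits=16):
    """Map every module whose name matches a pattern to a 16-bit override."""
    return {
        name: {"bits": keep_bits}
        for name in module_names
        if any(fnmatch(name, pattern) for pattern in patterns)
    }


# Illustrative module names only; real names come from the loaded model.
example_names = [
    "model.layers.0.linear_attn.in_proj",
    "model.layers.0.mlp.experts.0.up_proj",
    "model.layers.0.mlp.shared_expert.up_proj",
    "model.layers.3.self_attn.q_proj",
]

layer_config = build_layer_config(example_names)
for name, override in layer_config.items():
    print(name, override)
```

Matching by name pattern keeps the rule short and easy to audit. Routed experts and standard attention projections fall through and would be quantized with the default scheme.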
In practice
- Use AutoRound for efficient block-by-block quantization.
- Avoid quantizing shared expert layers for MoE models.
- Keep linear attention layers in 16-bit for long sequence reasoning.
Topics
- Qwen3.5 Models
- LLM Quantization
- AutoRound
- GPU Infrastructure
- Model Benchmarking
Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by The Kaitchup – AI on a Budget.