Lessons from GGUF Evaluations: Ternary Qwen3.5, Bricked Minimax
Summary
This report evaluates GGUF-quantized checkpoints of Qwen3.5 397B-A17B and Minimax M2.5 and finds that the two models differ sharply in how well they tolerate quantization. The evaluation ran on H200 GPUs at $2.59/hour and used subsets of the MMLU-Pro, GPQA Diamond, LiveCodeBench v6, and Math-500 benchmarks. Qwen3.5 397B-A17B proved remarkably resilient: ternary weights (TQ1_0) increased benchmark error by only ~18.4% while cutting memory from ~800 GB to ~94 GB. Minimax M2.5, by contrast, degraded severely even with Q4 variants. The author also discusses why tooling incompatibilities make Qwen3.5 27B hard to quantize, announces the release of NVFP4, MXFP4, and INT4 versions compatible with vLLM, and briefly mentions Liquid AI's new LFM2 24B A2B model.
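As a rough sanity check on the memory figures in the summary, the arithmetic can be sketched as below. The parameter count (397 billion, read off the model name), the 2 bytes per weight for the unquantized checkpoint, and the use of decimal gigabytes are all assumptions made for illustration; the original article is not quoted as stating them.

```python
# Back-of-the-envelope check of the memory figures quoted in the summary.
# Assumptions (not from the article): 397e9 total parameters, 2 bytes per
# weight for a 16-bit checkpoint, and 1 GB = 1e9 bytes.
TOTAL_PARAMS = 397e9
BYTES_PER_16BIT_WEIGHT = 2


def checkpoint_size_gb(n_params: float, bytes_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return n_params * bytes_per_weight / 1e9


def effective_bits_per_weight(size_gb: float, n_params: float) -> float:
    """Average number of bits stored per parameter for a file of size_gb."""
    return size_gb * 1e9 * 8 / n_params


full_gb = checkpoint_size_gb(TOTAL_PARAMS, BYTES_PER_16BIT_WEIGHT)
tq1_bpw = effective_bits_per_weight(94, TOTAL_PARAMS)
compression = 800 / 94

print(f"16-bit checkpoint: ~{full_gb:.0f} GB")
print(f"~94 GB file: ~{tq1_bpw:.2f} bits per weight on average")
print(f"Reduction from ~800 GB: ~{compression:.1f}x")
```

The per-weight figure is only an average over the whole file; it says nothing about how individual tensors are stored.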
Key takeaway
AI engineers evaluating quantized large language models for deployment should prioritize empirical evaluation on real-world benchmarks that require token generation. Do not rely solely on perplexity or assume that all models are equally robust: some, like Minimax M2.5, degrade severely even under common quantization schemes, which can impair performance in production.
Key insights
Model robustness to quantization varies significantly, necessitating empirical evaluation beyond perplexity metrics.
Principles
- Not all models tolerate aggressive low-bit quantization equally.
- Perplexity alone is insufficient for evaluating quantization quality.
Method
Evaluate GGUF models by prompting with real benchmarks and generating tokens, rather than relying solely on logits or perplexity.
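A minimal sketch of what generation-based scoring could look like follows. It is not the author's harness: `generate` is a hypothetical callable standing in for whatever runtime serves the GGUF file, the answer-extraction pattern is an assumed output format, and the four benchmark items and canned completions are invented purely so the example runs.

```python
import re
from typing import Callable, Iterable, Optional, Tuple

# Assumed answer format: the model writes something like "Answer: C" or
# "the answer is (C)". Real benchmarks may need a different parser.
ANSWER_PATTERN = re.compile(r"[Aa]nswer(?: is)?:?\s*\(?([A-J])\b")


def extract_choice(completion: str) -> Optional[str]:
    """Return the last answer letter found in the generated text, if any."""
    found = ANSWER_PATTERN.findall(completion)
    return found[-1] if found else None


def generation_error_rate(
    generate: Callable[[str], str],
    items: Iterable[Tuple[str, str]],
) -> float:
    """Score a model by generating a full answer for every prompt.

    A completion with no parseable answer counts as an error, so degenerate
    or looping output is penalized rather than silently skipped.
    """
    results = [extract_choice(generate(prompt)) == gold for prompt, gold in items]
    if not results:
        raise ValueError("no benchmark items supplied")
    return 1 - sum(results) / len(results)


# Hypothetical stand-in for a real model: canned completions keyed by prompt.
CANNED = {
    "Q1": "Let me think. The answer is (A).",
    "Q2": "Answer: C",
    "Q3": "Answer: D",        # wrong letter
    "Q4": "the the the the",  # degenerate output with no answer at all
}


def fake_generate(prompt: str) -> str:
    return CANNED[prompt]


items = [("Q1", "A"), ("Q2", "C"), ("Q3", "B"), ("Q4", "D")]
demo_error = generation_error_rate(fake_generate, items)
print(f"error rate: {demo_error:.2f}")
```

To use this on a real checkpoint, `fake_generate` would be replaced by a function that sends the prompt to the inference runtime and returns the decoded text. Because the score depends on the text the model actually emits, a checkpoint that produces broken output fails here even in cases where a logit-level metric might look acceptable.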
In practice
- Test quantized models against full-precision originals.
- Use real-world benchmarks for quantization quality assessment.
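The comparison against the full-precision original can be sketched as a relative increase in error. This assumes the quantization penalty is expressed relative to the original model's error rate, which is one plausible reading of figures such as the ~18.4% quoted in the summary; the accuracies below are made-up illustrations, not results from the article.

```python
def relative_error_increase(acc_full: float, acc_quant: float) -> float:
    """Relative growth in error rate when moving from full precision to a quant.

    Both arguments are accuracies in [0, 1] measured on the same benchmark
    subset with the same prompts and decoding settings.
    """
    err_full = 1 - acc_full
    err_quant = 1 - acc_quant
    if err_full <= 0:
        raise ValueError("full-precision error is zero; relative increase is undefined")
    return (err_quant - err_full) / err_full


# Illustrative numbers only (hypothetical, not taken from the article).
acc_full, acc_quant = 0.80, 0.76
increase = relative_error_increase(acc_full, acc_quant)
print(f"accuracy drop: {(acc_full - acc_quant) * 100:.1f} points")
print(f"relative error increase: {increase * 100:.1f}%")
```

Reporting both the absolute accuracy drop and the relative error increase helps avoid misreading a small-looking change on a benchmark where the original model already makes few mistakes.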
Topics
- LLM Quantization
- GGUF Evaluation
- Qwen3.5
- Minimax M2.5
- LFM2 Architecture
Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by The Kaitchup – AI on a Budget.