Lessons from GGUF Evaluations: Ternary Qwen3.5, Bricked Minimax

2025-07-07 · Source: The Kaitchup – AI on a Budget · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Cloud Computing & IT Infrastructure · Depth: Advanced, medium

Summary

This report evaluates GGUF-quantized checkpoints of Qwen3.5 397B-A17B and Minimax M2.5 and finds that the two models differ sharply in how well they tolerate quantization. The evaluation ran on H200 GPUs at $2.59/hour, using subsets of the MMLU-Pro, GPQA Diamond, LiveCodeBench v6, and Math-500 benchmarks. Qwen3.5 397B-A17B proved remarkably resilient: ternary weights (TQ1_0) increased benchmark error by only ~18.4% while cutting memory from ~800 GB to ~94 GB. Minimax M2.5, by contrast, degraded severely even with Q4 variants. The author also discusses the difficulty of quantizing Qwen3.5 27B because of tooling incompatibilities, announces the release of NVFP4, MXFP4, and INT4 versions compatible with vLLM, and briefly mentions Liquid AI's new LFM2 24B A2B model.
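The memory figures quoted above imply a rough compression ratio and an average storage cost per weight. The following is a back-of-the-envelope sketch, assuming 1 GB = 10^9 bytes and treating the "397B" in the model name as the total parameter count; neither assumption is stated in the summary.

```python
# Back-of-the-envelope check of the memory figures quoted in the summary.
# Assumptions (not from the source): 1 GB = 1e9 bytes, and "397B" in the
# model name is the total parameter count.
TOTAL_PARAMS = 397e9
FULL_SIZE_GB = 800.0   # approximate size of the original checkpoint
TQ1_0_SIZE_GB = 94.0   # approximate size of the TQ1_0 checkpoint


def bits_per_weight(size_gb: float, n_params: float) -> float:
    """Average storage cost per parameter, in bits."""
    return size_gb * 1e9 * 8 / n_params


compression_ratio = FULL_SIZE_GB / TQ1_0_SIZE_GB
full_bpw = bits_per_weight(FULL_SIZE_GB, TOTAL_PARAMS)
tq1_bpw = bits_per_weight(TQ1_0_SIZE_GB, TOTAL_PARAMS)

print(f"compression ratio: {compression_ratio:.1f}x")
print(f"original: {full_bpw:.1f} bits/weight")
print(f"TQ1_0:    {tq1_bpw:.1f} bits/weight")
```

Under these assumptions the original checkpoint works out to roughly 16 bits per weight and the TQ1_0 file to just under 2 bits per weight on average, a reduction of about 8.5x.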

Key takeaway

AI engineers evaluating quantized large language models for deployment should prioritize empirical evaluation on real-world benchmarks that involve token generation. Do not rely solely on perplexity or assume that all models are equally robust: some, like Minimax M2.5, degrade severely even under common quantization schemes, which can impair performance in production.

Key insights

Model robustness to quantization varies significantly, necessitating empirical evaluation beyond perplexity metrics.

Principles

Not all models tolerate aggressive low-bit quantization equally.
Perplexity alone is insufficient for evaluating quantization quality.

Method

Evaluate GGUF models by prompting with real benchmarks and generating tokens, rather than relying solely on logits or perplexity.
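A minimal sketch of what generation-based scoring could look like for a multiple-choice benchmark. This is not the author's harness: `generate` is a hypothetical callable standing in for whatever backend serves the GGUF model, and the answer-extraction pattern is an illustrative assumption about the output format.

```python
import re
from typing import Callable, Iterable, Optional, Tuple

# One multiple-choice item: (prompt, gold answer letter).
Item = Tuple[str, str]

# Illustrative pattern: matches "answer is (B)", "Answer: C", and similar.
ANSWER_RE = re.compile(r"[Aa]nswer\s*(?:is|:)?\s*\(?([A-J])\b")


def extract_choice(completion: str) -> Optional[str]:
    """Return the last answer letter stated in the generated text, if any."""
    matches = ANSWER_RE.findall(completion)
    return matches[-1] if matches else None


def generation_accuracy(generate: Callable[[str], str],
                        items: Iterable[Item]) -> float:
    """Score a model by generating a full completion for each prompt.

    `generate` is a hypothetical stand-in for the real model call.
    Completions with no parseable answer count as wrong.
    """
    items = list(items)
    if not items:
        raise ValueError("no benchmark items provided")
    correct = sum(extract_choice(generate(prompt)) == gold
                  for prompt, gold in items)
    return correct / len(items)


# Stub model for demonstration only: always answers "B".
def fake_generate(prompt: str) -> str:
    return "Let me think step by step. The answer is (B)."


demo_items = [("Question 1 ...", "B"), ("Question 2 ...", "A")]
print(generation_accuracy(fake_generate, demo_items))
```

Scoring the generated text, rather than logits, means that failures such as malformed or unparseable outputs count against the quantized model, which a perplexity measurement would not capture.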

In practice

Test quantized models against full-precision originals.
Use real-world benchmarks for quantization quality assessment.
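One way to compare a quantized checkpoint with its full-precision original is the relative increase in error rate. The sketch below takes that reading as an assumption, since the summary does not say how its ~18.4% figure was computed, and the accuracies are hypothetical placeholders, not results from the article.

```python
def relative_error_increase(baseline_acc: float, quantized_acc: float) -> float:
    """Relative growth in error rate (1 - accuracy) caused by quantization."""
    baseline_err = 1.0 - baseline_acc
    if baseline_err <= 0:
        raise ValueError("baseline has zero error; relative increase is undefined")
    quantized_err = 1.0 - quantized_acc
    return (quantized_err - baseline_err) / baseline_err


# Hypothetical (full-precision, quantized) accuracies, for illustration only.
scores = {
    "benchmark_a": (0.80, 0.76),
    "benchmark_b": (0.50, 0.45),
}

for name, (base, quant) in scores.items():
    print(f"{name}: {relative_error_increase(base, quant):+.1%}")
```

Reporting the change relative to the baseline error makes results comparable across benchmarks of different difficulty: a 4-point accuracy drop matters more when the original model was already at 80% than a 5-point drop from 50%.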

Topics

LLM Quantization
GGUF Evaluation
Qwen3.5
Minimax M2.5
LFM2 Architecture

Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by The Kaitchup – AI on a Budget.