Qwen3.5 27B Latency and Throughput: INT4 vs NVFP4 vs FP8 vs BF16
Summary
This analysis benchmarks the inference speed of Qwen3.5 27B in four formats (BF16, FP8, NVFP4, INT4) on three NVIDIA GPUs: RTX Pro 6000, H100, and B200. It covers both synchronous (one request at a time) and saturated workloads. Quantized models generally stay close to the original model's accuracy; their main benefit is lower memory use, with INT4 typically consuming the least. Hybrid-attention models such as Qwen3.5 also need a much smaller KV cache than their full-attention predecessors, which leaves room for more concurrent requests. High concurrency does not guarantee efficient serving, however, because GPU memory bandwidth can saturate, which is why inference speed and latency matter as much as memory. The benchmarks use vLLM for model serving and GuideLLM for LLM-specific metrics, with 1,000 prompt tokens and 1,000 output tokens per request.
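To make the KV cache point concrete, here is a rough sketch of how the number of attention layers that keep a growing KV cache limits concurrency. The layer counts, head shapes, and memory budget are hypothetical placeholders, not Qwen3.5's actual configuration, and the sketch ignores the fixed-size state that non-attention layers in a hybrid model also hold.

```python
def kv_bytes_per_token(attn_layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache added per token: one key and one value vector
    per KV head in every layer that keeps a growing cache."""
    return 2 * attn_layers * kv_heads * head_dim * bytes_per_value


def max_concurrent_requests(kv_budget_bytes: int, tokens_per_request: int,
                            per_token_bytes: int) -> int:
    """How many full-length requests fit in the memory left for the KV cache."""
    return kv_budget_bytes // (tokens_per_request * per_token_bytes)


# Hypothetical shapes for illustration only (NOT the real model config).
full_attention = kv_bytes_per_token(attn_layers=64, kv_heads=8, head_dim=128)
hybrid = kv_bytes_per_token(attn_layers=16, kv_heads=8, head_dim=128)

kv_budget = 20 * 1024**3   # assumed 20 GiB left after loading weights
tokens = 1000 + 1000       # prompt + output tokens, as in the benchmark setup

full_slots = max_concurrent_requests(kv_budget, tokens, full_attention)
hybrid_slots = max_concurrent_requests(kv_budget, tokens, hybrid)
print(full_attention, hybrid)    # bytes per token
print(full_slots, hybrid_slots)  # requests that fit
```

Under these assumed numbers, keeping a cache in a quarter of the layers roughly quadruples the number of requests that fit, which is the effect the summary describes.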
Key takeaway
If you optimize LLM deployments, note that quantization significantly reduces memory footprint and improves single-query latency, but does not by itself solve latency under heavy, saturated workloads. Benchmark your own model and hardware configuration, tracking both memory efficiency and inference-speed metrics such as time to first token and inter-token latency, especially when targeting high concurrency with hybrid-attention architectures.
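As an illustration of the two latency metrics named above, the following sketch derives them from the arrival times of streamed tokens. The function name and the timestamps are made up for the example; tools such as GuideLLM report these metrics directly, and this is not their implementation.

```python
from statistics import mean


def latency_metrics(request_start: float, token_times: list[float]) -> dict[str, float]:
    """Compute time to first token, mean inter-token latency, and output
    throughput from one request's token arrival timestamps (in seconds)."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_times[0] - request_start
    # Gaps between consecutive tokens; empty if only one token arrived.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    itl = mean(gaps) if gaps else 0.0
    total = token_times[-1] - request_start
    return {
        "ttft_s": ttft,
        "itl_s": itl,
        "output_tokens_per_s": len(token_times) / total,
    }


# Made-up timestamps: first token after 0.5 s, then one token every 0.25 s.
metrics = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(metrics)
```

Time to first token is dominated by prompt processing, while inter-token latency reflects decode speed, so the two can move independently as load increases.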
Key insights
Quantization improves LLM memory efficiency and single-query latency, but saturated workloads face GPU memory bandwidth limits.
Principles
- Lower memory consumption enables larger KV caches and more concurrent requests.
- Inference speed and latency are as critical as memory efficiency for LLM serving.
- For single-query latency, reducing weight traffic (the weight bytes read from GPU memory for each token) matters more than adding compute.
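The weight-traffic principle can be sketched with back-of-the-envelope arithmetic: if single-query decoding has to read every weight once per token, memory bandwidth sets a floor on per-token latency. The bytes-per-parameter values below are nominal (quantization scales and any layers left unquantized are ignored), and the 1,000 GB/s bandwidth is a round illustrative number, not a specification of any of the benchmarked GPUs.

```python
# Nominal storage per parameter; real checkpoints add scales and metadata.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5, "INT4": 0.5}


def weight_gb(params_billion: float, fmt: str) -> float:
    """Approximate weight size in GB (1e9 bytes) for a given format."""
    return params_billion * BYTES_PER_PARAM[fmt]


def min_decode_ms_per_token(params_billion: float, fmt: str,
                            bandwidth_gb_s: float) -> float:
    """Lower bound on per-token decode time if all weights are read once
    per token and memory bandwidth is the only limit."""
    return weight_gb(params_billion, fmt) / bandwidth_gb_s * 1000.0


ILLUSTRATIVE_BANDWIDTH_GB_S = 1000.0  # assumption, not a real GPU spec

for fmt in BYTES_PER_PARAM:
    size = weight_gb(27, fmt)
    floor_ms = min_decode_ms_per_token(27, fmt, ILLUSTRATIVE_BANDWIDTH_GB_S)
    print(f"{fmt}: ~{size:.1f} GB weights, >= {floor_ms:.1f} ms/token")
```

Halving the bytes per weight halves this floor regardless of available compute, which is why the lower-precision formats help most in the single-request case.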
Method
Qwen3.5 27B was benchmarked in BF16, FP8, NVFP4, and INT4 on RTX Pro 6000, H100, and B200 GPUs, using vLLM for serving and GuideLLM for measurement, under both synchronous and saturated workloads.
In practice
- Prioritize 4-bit quantization for optimal single-query latency.
- On B200 GPUs, consider NVFP4 for the best single-query results.
- Use GuideLLM for LLM-specific inference benchmarking.
Topics
- Qwen3.5 27B
- LLM Quantization
- Inference Benchmarking
- GPU Performance
- NVFP4
Best for: NLP Engineer, Machine Learning Engineer, MLOps Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by The Kaitchup – AI on a Budget.