Qwen3.5 27B Latency and Throughput: INT4 vs NVFP4 vs FP8 vs BF16

Source: The Kaitchup – AI on a Budget · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure) · Depth: Advanced, quick

Summary

This analysis benchmarks the inference speed of Qwen3.5 27B in four numeric formats (BF16, plus FP8, NVFP4, and INT4 quantization) on three NVIDIA GPUs: RTX Pro 6000, H100, and B200. It covers both synchronous (one request at a time) and saturated workloads. Quantized models generally stay close to the original in accuracy, so their main benefit is lower memory use, with INT4 typically consuming the least. Newer hybrid-attention models such as Qwen3.5 also have a much smaller KV cache than their full-attention predecessors, which leaves room for more concurrent requests. However, the study shows that high concurrency does not always mean efficient serving, because GPU memory bandwidth can saturate; inference speed and latency therefore need to be measured directly. The benchmarks used vLLM for model serving and GuideLLM for detailed LLM-specific metrics, focusing on a configuration of 1,000 prompt tokens and 1,000 output tokens.
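
To make the memory argument concrete, the following back-of-the-envelope sketch estimates weight size per format. It is not taken from the article: the 27e9 parameter count is read off the model name, and the bits-per-weight values are nominal assumptions that ignore quantization scales, layers left unquantized, the KV cache, and activations.

```python
# Rough weight-memory estimate. Bits per weight are nominal assumptions and
# ignore quantization scales, unquantized layers, KV cache, and activations.
NOMINAL_BITS = {"BF16": 16, "FP8": 8, "NVFP4": 4, "INT4": 4}

PARAMS = 27e9  # assumed from the "27B" in the model name


def weight_gb(num_params: float, bits_per_weight: float) -> float:
    """Return the approximate weight size in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for fmt, bits in NOMINAL_BITS.items():
        print(f"{fmt:>5}: ~{weight_gb(PARAMS, bits):.1f} GB")
```

At nominal bit widths the two 4-bit formats come out equal, so any real difference between NVFP4 and INT4 checkpoints would come from the details this sketch leaves out, such as scale storage and which layers stay in higher precision.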

Key takeaway

NLP engineers optimizing LLM deployment should note that quantization reduces memory footprint and improves single-query latency, but does not by itself solve latency problems under heavy, saturated workloads. Benchmark your own model and hardware configurations, and measure both memory efficiency and inference speed metrics such as time to first token and inter-token latency, especially when targeting high concurrency with hybrid-attention architectures.
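
As an illustration of the two latency metrics named above, here is a minimal sketch that derives them from per-token arrival timestamps. The function names and the timestamp layout are hypothetical; benchmarking tools may define or aggregate these metrics somewhat differently.

```python
from statistics import mean


def time_to_first_token(request_start: float, token_times: list[float]) -> float:
    """Seconds between sending the request and receiving the first token."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    return token_times[0] - request_start


def inter_token_latency(token_times: list[float]) -> float:
    """Mean gap in seconds between consecutive tokens after the first one."""
    if len(token_times) < 2:
        raise ValueError("need at least two token timestamps")
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return mean(gaps)


if __name__ == "__main__":
    # Hypothetical timestamps (seconds) for one streamed response.
    start = 0.0
    arrivals = [0.5, 0.75, 1.0]
    print("TTFT:", time_to_first_token(start, arrivals))
    print("ITL:", inter_token_latency(arrivals))
```

Tracking the two separately matters because time to first token is dominated by prompt processing, while inter-token latency reflects the decode phase.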

Key insights

Quantization improves LLM memory efficiency and single-query latency, but saturated workloads face GPU memory bandwidth limits.
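
One way to reason about this insight is a simple bandwidth-bound model of decoding: if each decode step has to stream the weights once plus the KV cache of every active sequence, the step time cannot be lower than bytes read divided by memory bandwidth. The sketch below encodes that model under stated assumptions. The bandwidth and KV cache figures are hypothetical placeholders, not measurements from the article, and the model ignores compute, kernel overheads, and scheduling.

```python
def min_decode_step_ms(
    weight_bytes: float,
    kv_bytes_per_sequence: float,
    concurrent_sequences: int,
    bandwidth_bytes_per_s: float,
) -> float:
    """Lower bound on one decode step in ms, assuming it is memory-bandwidth bound.

    Assumes each step reads all weights once and the full KV cache of every
    active sequence. Compute time and other overheads are ignored.
    """
    bytes_read = weight_bytes + kv_bytes_per_sequence * concurrent_sequences
    return bytes_read / bandwidth_bytes_per_s * 1e3


if __name__ == "__main__":
    bandwidth = 1e12  # hypothetical 1 TB/s GPU memory bandwidth
    kv_per_seq = 0.1e9  # hypothetical KV cache size per sequence, in bytes
    for label, weights in [("16-bit", 54e9), ("4-bit", 13.5e9)]:
        single = min_decode_step_ms(weights, kv_per_seq, 1, bandwidth)
        busy = min_decode_step_ms(weights, kv_per_seq, 256, bandwidth)
        print(f"{label}: 1 sequence >= {single:.1f} ms, 256 sequences >= {busy:.1f} ms")
```

Under this model, smaller weights lower the floor for a single request, while at high concurrency the per-sequence traffic grows and takes up a larger share of each step.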

Method

Benchmarking Qwen3.5 27B in BF16, FP8, NVFP4, and INT4 on RTX Pro 6000, H100, and B200 GPUs using vLLM and GuideLLM for synchronous and saturated workloads.
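
The experimental grid described above can be sketched as a list of run configurations. This is only an enumeration aid: the dictionary keys are hypothetical, and it assumes every format was run on every GPU, which the summary does not confirm.

```python
from itertools import product

FORMATS = ["BF16", "FP8", "NVFP4", "INT4"]
GPUS = ["RTX Pro 6000", "H100", "B200"]
WORKLOADS = ["synchronous", "saturated"]

PROMPT_TOKENS = 1000
OUTPUT_TOKENS = 1000

# Assumes a full cross product; some format/GPU pairs may not have been tested.
runs = [
    {
        "format": fmt,
        "gpu": gpu,
        "workload": workload,
        "prompt_tokens": PROMPT_TOKENS,
        "output_tokens": OUTPUT_TOKENS,
    }
    for fmt, gpu, workload in product(FORMATS, GPUS, WORKLOADS)
]

if __name__ == "__main__":
    print(f"{len(runs)} candidate runs")
    for run in runs[:3]:
        print(run)
```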

Best for: NLP Engineer, Machine Learning Engineer, MLOps Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by The Kaitchup – AI on a Budget.