The Fastest and Cheapest 120B LLM?
Summary
This analysis compares three Mixture-of-Experts (MoE) large language models in the 120B-parameter class: Mistral-Small-4-119B-2603, NVIDIA-Nemotron-3-Super-120B-A12B-BF16, and Qwen3.5-122B-A10B. It focuses on their NVFP4-quantized checkpoints and how they perform on Blackwell GPUs. Qwen3.5 122B is generally the most accurate at full precision, but NVFP4 quantization narrows the gap and Nemotron 3 Super performs comparably. Nemotron 3 Super stands out for efficiency: it achieves higher accuracy with shorter reasoning traces, uses a smaller KV cache, and decodes faster, which makes it markedly cheaper to operate. Mistral Small 4 lags in both accuracy and efficiency, though it shows strengths in specific areas such as coding benchmarks. The article also gives the vLLM installation and launch commands for each model's NVFP4 checkpoint.
Key takeaway
MLOps engineers deploying large MoE models on Blackwell GPUs should prioritize Nemotron 3 Super 120B-A12B-NVFP4: its token efficiency and faster decoding translate directly into lower operating costs. Qwen3.5 122B-A10B-NVFP4 offers comparable accuracy, but Nemotron's efficiency gains make it the more economical choice for high-throughput inference. Configure vLLM carefully, including the speculative decoding parameters, to get the full performance and cost benefit.
Key insights
Nemotron 3 Super is the most efficient and cost-effective of the three models in NVFP4, even though Qwen3.5 is slightly more accurate at full precision.
Principles
- Quantization impacts model accuracy and efficiency differently across architectures.
- Token efficiency (fewer generated tokens for the same answer quality) significantly reduces operational costs for LLMs.
- Hybrid architectures can optimize for specific performance characteristics.
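The link between token efficiency and cost can be made concrete with a back-of-the-envelope calculation. The sketch below is only an illustration: the function name, the token counts, the decoding speeds, and the GPU price are all hypothetical values chosen for the example, not measurements from the article. It also assumes a single request stream with no batching.

```python
def cost_per_request(reasoning_tokens, answer_tokens, decode_tokens_per_s, gpu_usd_per_hour):
    """Approximate GPU cost (USD) of generating one response on a single stream."""
    total_tokens = reasoning_tokens + answer_tokens
    seconds = total_tokens / decode_tokens_per_s
    return seconds * gpu_usd_per_hour / 3600.0


# Hypothetical numbers for illustration only (not taken from the article).
# Model A: shorter reasoning trace and faster decoding.
cost_a = cost_per_request(reasoning_tokens=2000, answer_tokens=500,
                          decode_tokens_per_s=100.0, gpu_usd_per_hour=3.6)
# Model B: longer reasoning trace and slower decoding.
cost_b = cost_per_request(reasoning_tokens=6000, answer_tokens=500,
                          decode_tokens_per_s=80.0, gpu_usd_per_hour=3.6)

print(f"model A: ${cost_a:.5f} per request")
print(f"model B: ${cost_b:.5f} per request")
print(f"B costs {cost_b / cost_a:.2f}x as much as A")
```

Under these assumptions both effects compound: a model that emits fewer tokens and decodes each token faster is cheaper on both counts.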
Method
The analysis compares NVFP4 quantized versions of 120B-parameter MoE LLMs, assessing memory footprint, accuracy across benchmarks (with and without "thinking"), and inference speed using vLLM.
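As a rough guide to the memory footprint comparison, the weight storage of an NVFP4 checkpoint can be estimated from the format itself: 4-bit values plus one 8-bit scale per block of 16 values, or about 4.5 bits per weight. The sketch below assumes every weight is quantized and uses a round 120B parameter count. Real checkpoints usually keep some layers in higher precision, and the KV cache and activations come on top, so treat the result as a lower bound and not as a figure from the article.

```python
def weight_footprint_gb(num_params, bits_per_weight):
    """Weight storage in gigabytes (10^9 bytes) for a given bit width."""
    return num_params * bits_per_weight / 8 / 1e9


# NVFP4 layout: 4-bit values with one 8-bit (FP8) scale per 16-value block.
NVFP4_VALUE_BITS = 4
NVFP4_SCALE_BITS = 8
NVFP4_BLOCK_SIZE = 16
nvfp4_bits = NVFP4_VALUE_BITS + NVFP4_SCALE_BITS / NVFP4_BLOCK_SIZE  # 4.5 bits per weight

NUM_PARAMS = 120e9  # assumed round figure for a 120B-class model

bf16_gb = weight_footprint_gb(NUM_PARAMS, 16)
nvfp4_gb = weight_footprint_gb(NUM_PARAMS, nvfp4_bits)

print(f"BF16 weights:  {bf16_gb:.1f} GB")
print(f"NVFP4 weights: {nvfp4_gb:.1f} GB (idealized lower bound)")
```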
In practice
- Use NVFP4 quantization for efficient LLM deployment on Blackwell GPUs.
- Configure vLLM with specific flags for each MoE model (e.g., `--moe_backend flashinfer_cutlass`).
- Tune `num_speculative_tokens` for MTP to optimize inference speed.
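One way to tune `num_speculative_tokens` is to generate one launch command per candidate value and benchmark each in turn. The sketch below only builds and prints the commands. It is not the article's command line: the helper name `build_serve_command`, the `"mtp"` method string, and the `<nvfp4-checkpoint>` placeholder are assumptions, and the exact flags and method names accepted depend on the vLLM version and the model, so check them against the installed release before use.

```python
import json
import shlex


def build_serve_command(model, num_speculative_tokens,
                        moe_backend="flashinfer_cutlass", method="mtp"):
    """Build a `vllm serve` argument list for one speculative-decoding setting.

    The method name and flag spellings are assumptions to verify per vLLM version.
    """
    spec_config = {
        "method": method,
        "num_speculative_tokens": num_speculative_tokens,
    }
    return [
        "vllm", "serve", model,
        "--moe_backend", moe_backend,
        "--speculative-config", json.dumps(spec_config),
    ]


MODEL = "<nvfp4-checkpoint>"  # placeholder: substitute the real checkpoint name

# Print one candidate command per draft length to benchmark separately.
for k in (1, 2, 3):
    print(shlex.join(build_serve_command(MODEL, k)))
```

Larger values draft more tokens per step but lower the acceptance rate, so the best setting is usually found empirically on a representative workload.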
Topics
- Large Language Models
- Mixture-of-Experts
- NVFP4 Quantization
- Model Efficiency
- vLLM Inference
Best for: Machine Learning Engineer, AI Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by The Kaitchup – AI on a Budget.