The Fastest and Cheapest 120B LLM?

2026-04-01 · Source: The Kaitchup – AI on a Budget · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, medium

Summary

This analysis compares three Mixture-of-Experts (MoE) large language models in the 120B-parameter class: Mistral-Small-4-119B-2603, NVIDIA-Nemotron-3-Super-120B-A12B-BF16, and Qwen3.5-122B-A10B, focusing on how their NVFP4-quantized versions perform on Blackwell GPUs. Qwen3.5 122B is generally more accurate at full precision, but NVFP4 quantization narrows the gap, and Nemotron 3 Super performs comparably. Nemotron 3 Super stands out for efficiency: it reaches higher accuracy with shorter reasoning traces, uses a smaller KV cache, and decodes faster, which makes it significantly cheaper to operate. Mistral Small 4 lags in both accuracy and efficiency, though it shows strengths in specific areas such as coding benchmarks. The article also gives the vLLM installation and launch commands for each model's NVFP4 checkpoint.

Key takeaway

MLOps engineers deploying large MoE models on Blackwell GPUs should prioritize Nemotron 3 Super 120B-A12B-NVFP4: its token efficiency and faster decoding translate directly into lower operational costs. Qwen3.5 122B-A10B-NVFP4 offers comparable accuracy, but Nemotron's efficiency gains make it the more economical choice for high-throughput inference. Configure vLLM carefully, including the speculative decoding parameters, to maximize performance and cost savings.

Key insights

In NVFP4, Nemotron 3 Super is the more efficient and cost-effective model, even though Qwen3.5 is slightly more accurate at full precision.

Principles

Quantization impacts model accuracy and efficiency differently across architectures.
Token efficiency (reaching the same accuracy with fewer generated tokens) significantly reduces LLM operating costs.
Hybrid architectures can optimize for specific performance characteristics.
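
The token-efficiency principle above can be made concrete with simple arithmetic. The sketch below is illustrative only: the function name, the token counts, the decoding speeds, and the GPU price are hypothetical placeholders, not figures from the article. It assumes a single dedicated GPU billed by the hour and ignores prefill time and batching.

```python
def cost_per_request(output_tokens: int, tokens_per_second: float,
                     gpu_hourly_usd: float) -> float:
    """USD cost of generating one response on one dedicated GPU.

    Simplification: only decode time is counted (no prefill, no batching).
    """
    decode_seconds = output_tokens / tokens_per_second
    return decode_seconds * gpu_hourly_usd / 3600.0


# Hypothetical numbers, chosen only to show how the two effects multiply:
# a model with longer reasoning traces and slower decoding...
verbose = cost_per_request(output_tokens=8000, tokens_per_second=100.0,
                           gpu_hourly_usd=4.0)
# ...versus one that emits half the tokens and decodes 25% faster.
concise = cost_per_request(output_tokens=4000, tokens_per_second=125.0,
                           gpu_hourly_usd=4.0)

print(f"verbose: ${verbose:.4f}  concise: ${concise:.4f}  "
      f"ratio: {verbose / concise:.2f}x")
```

Because cost scales with tokens generated divided by decoding speed, shorter reasoning traces and faster decoding compound rather than add.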

Method

The analysis compares the NVFP4-quantized versions of three 120B-class MoE LLMs on memory footprint, benchmark accuracy (with and without "thinking"), and inference speed under vLLM.

In practice

Use NVFP4 quantization for efficient LLM deployment on Blackwell GPUs.
Configure vLLM with specific flags for each MoE model (e.g., `--moe_backend flashinfer_cutlass`).
Tune `num_speculative_tokens` for multi-token prediction (MTP) speculative decoding to optimize inference speed.
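
Why `num_speculative_tokens` needs tuning can be seen from a standard back-of-the-envelope model of speculative decoding. The sketch below assumes each drafted token is accepted independently with the same probability, which is a simplification and not a measurement from the article. The function name and the 0.7 acceptance rate are hypothetical.

```python
def expected_tokens_per_step(acceptance_rate: float,
                             num_speculative_tokens: int) -> float:
    """Expected tokens emitted per verification pass of the target model.

    Assumes i.i.d. acceptance: drafting stops at the first rejected token,
    and the target model always contributes one token of its own.
    This is the geometric sum 1 + a + a^2 + ... + a^k.
    """
    a, k = acceptance_rate, num_speculative_tokens
    if not 0.0 <= a <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if k < 0:
        raise ValueError("num_speculative_tokens must be >= 0")
    if a == 1.0:
        return float(k + 1)
    return (1.0 - a ** (k + 1)) / (1.0 - a)


# Hypothetical acceptance rate; real values depend on model and workload.
for k in range(0, 5):
    print(k, round(expected_tokens_per_step(0.7, k), 3))
```

Under this model each extra drafted token adds less than the previous one, while still costing draft and verification compute, so the best setting is found empirically per model and workload.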

Topics

Large Language Models
Mixture-of-Experts
NVFP4 Quantization
Model Efficiency
vLLM Inference

Best for: Machine Learning Engineer, AI Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by The Kaitchup – AI on a Budget.