Model Quantization: Turn FP8 Checkpoints into High-Performance Inference Engines with NVIDIA TensorRT

2026-06-09 · Source: NVIDIA Technical Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Advanced, medium

Summary

This article explains how to convert FP8-quantized Contrastive Language-Image Pretraining (CLIP) checkpoints into NVIDIA TensorRT inference engines. ModelOpt checkpoints are exported to ONNX at opset 20 or later, and ModelOpt's export helper folds weight-side Q/DQ pairs, which makes the ONNX files ~34% smaller for the text encoder and ~50% smaller for the image encoder. Benchmarks ran on an NVIDIA RTX 6000 Ada GPU with TensorRT 10.16 at a static batch size of 128. The image encoder's TensorRT engine shrank from 588 MB to 306 MB (a 48% reduction), and its inference latency fell from 166.2 ms to 119.8 ms (1.39x speedup). The text encoder saw a 34% size reduction (238 MB to 156 MB) and a 1.45x speedup (13.2 ms to 9.1 ms). These gains are attributed to TensorRT fusing Q/DQ nodes into specialized low-precision kernels that run on FP8 Tensor Cores.
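The percentages and speedups quoted in the summary follow directly from the raw size and latency figures. A minimal sketch of that arithmetic, using only the numbers reported above (the dictionary layout and function names are illustrative, not from the original article):

```python
def percent_reduction(before: float, after: float) -> float:
    """Size reduction as a percentage of the original size."""
    return 100.0 * (before - after) / before


def speedup(before_ms: float, after_ms: float) -> float:
    """Latency speedup factor: baseline latency divided by FP8 latency."""
    return before_ms / after_ms


# (baseline, FP8) pairs as reported in the summary.
results = {
    "image encoder": {"size_mb": (588, 306), "latency_ms": (166.2, 119.8)},
    "text encoder": {"size_mb": (238, 156), "latency_ms": (13.2, 9.1)},
}

for name, r in results.items():
    size = percent_reduction(*r["size_mb"])
    gain = speedup(*r["latency_ms"])
    print(f"{name}: {size:.0f}% smaller, {gain:.2f}x faster")
```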

Key takeaway

For MLOps engineers deploying models such as CLIP, FP8 quantization with NVIDIA TensorRT offers measurable efficiency gains. Consider integrating Model Optimizer and TensorRT into your CI/CD pipeline: in the reported benchmarks on an RTX 6000 Ada GPU, model size fell by roughly a third to a half and inference ran 1.39x to 1.45x faster. Smaller engines and lower latency translate into lower operating costs and higher throughput in production. Confirm that your target hardware supports FP8 Tensor Cores, since the speedups depend on them.

Key insights

On GPUs with FP8 Tensor Cores, FP8 quantization with NVIDIA TensorRT reduced CLIP model size by 34% to 48% and delivered 1.39x to 1.45x inference speedups in the reported benchmarks.

Principles

Quantization bridges optimization and production deployment.
TensorRT fuses Q/DQ nodes into specialized FP8 kernels.
FP8 gains derive from FP8 Tensor Core math and reduced memory traffic.
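To make the Q/DQ idea concrete, here is a minimal pure-Python sketch of what a quantize-dequantize pair computes numerically. It assumes the E4M3 variant of FP8 (maximum finite value 448, 3 mantissa bits) and a single per-tensor scale; both are assumptions for illustration, and this is not TensorRT's or ModelOpt's implementation. Fused kernels skip the dequantize step and operate on the 8-bit values directly.

```python
import math

E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3


def round_to_e4m3(value: float) -> float:
    """Round a float to the nearest E4M3-representable value, saturating at +/-448."""
    if value == 0.0:
        return 0.0
    sign = -1.0 if value < 0 else 1.0
    magnitude = min(abs(value), E4M3_MAX)
    # frexp gives magnitude = m * 2**exp with 0.5 <= m < 1, so the
    # power-of-two exponent is exp - 1. Clamp to -6, the smallest normal
    # exponent, so smaller values fall on the subnormal grid.
    _, exp = math.frexp(magnitude)
    exponent = max(exp - 1, -6)
    step = 2.0 ** (exponent - 3)  # 3 mantissa bits: 8 steps per binade
    return sign * min(round(magnitude / step) * step, E4M3_MAX)


def quantize_dequantize(values, amax=None):
    """Simulate a per-tensor Q/DQ pair: scale, round to E4M3, scale back."""
    if amax is None:
        amax = max(abs(v) for v in values)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) * scale for v in values]


print(quantize_dequantize([0.3, -1.0, 448.0]))
```

The gap between each input and its round-tripped value is the quantization error that calibration tries to keep small.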

Method

The workflow has three steps: quantize the model with ModelOpt, export it to ONNX with Q/DQ nodes (opset 20+), then build and benchmark an NVIDIA TensorRT engine with "trtexec", passing "--stronglyTyped" so that the precisions specified in the ONNX graph, including FP8, are honored.
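The final step can be sketched as a small helper that assembles the "trtexec" command line. The flags "--onnx", "--saveEngine", and "--stronglyTyped" are standard trtexec options; the file names are hypothetical placeholders, and the exact invocation used in the original article is not shown here.

```python
import shlex


def build_trtexec_command(onnx_path: str, engine_path: str) -> list[str]:
    """Assemble a trtexec invocation that builds a strongly typed engine."""
    return [
        "trtexec",
        f"--onnx={onnx_path}",          # ONNX model containing Q/DQ nodes
        f"--saveEngine={engine_path}",  # where to write the serialized engine
        "--stronglyTyped",              # take tensor types from the ONNX graph
    ]


# Hypothetical file names, for illustration only.
cmd = build_trtexec_command(
    "clip_image_encoder_fp8.onnx", "clip_image_encoder_fp8.engine"
)
print(shlex.join(cmd))
```

On a machine with TensorRT installed, the list can be passed to subprocess.run(cmd, check=True). By default trtexec also times inference after building, which yields the latency figures used for benchmarking.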

In practice

Use ModelOpt's helper for ONNX export to shrink file size.
Inspect ONNX graphs and profile with Nsight Deep Learning Designer.
Deploy TensorRT engines via standalone runtime or Triton Inference Server.
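Before opening a graph in a visual tool, a quick scripted check can confirm that the export contains Q/DQ nodes and targets the expected opset. The sketch below assumes the third-party "onnx" Python package is installed; the function names are illustrative. If weight-side quantizers were folded as the summary describes, one would expect fewer QuantizeLinear than DequantizeLinear nodes, though that expectation is an assumption here rather than a documented guarantee.

```python
from collections import Counter


def summarize_qdq(op_types):
    """Count QuantizeLinear / DequantizeLinear nodes among ONNX op types."""
    counts = Counter(op_types)
    return {
        "QuantizeLinear": counts["QuantizeLinear"],
        "DequantizeLinear": counts["DequantizeLinear"],
        "total_nodes": sum(counts.values()),
    }


def inspect_onnx(path):
    """Return (default-domain opset version, Q/DQ summary) for an ONNX file."""
    import onnx  # third-party; imported lazily so summarize_qdq works without it

    model = onnx.load(path)
    # The default ONNX operator domain is identified by "" or "ai.onnx".
    opset = max(
        (o.version for o in model.opset_import if o.domain in ("", "ai.onnx")),
        default=None,
    )
    return opset, summarize_qdq(node.op_type for node in model.graph.node)


# Works on any iterable of op-type strings, so it can be tried without a model.
print(summarize_qdq(["QuantizeLinear", "DequantizeLinear", "DequantizeLinear", "MatMul"]))
```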

Topics

Model Quantization
NVIDIA TensorRT
FP8 Precision
ONNX Export
CLIP Model
GPU Inference Optimization

Best for: Machine Learning Engineer, AI Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.