Model Quantization: Post-Training Quantization Using NVIDIA Model Optimizer

2026-05-07 · Source: NVIDIA Technical Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, medium

Summary

NVIDIA Model Optimizer (ModelOpt) is a library designed to compress and accelerate AI models through techniques like quantization, distillation, and pruning. This post details how to use ModelOpt for post-training quantization (PTQ) of a CLIP model to FP8 format, aiming to reduce VRAM usage and improve inference performance on consumer GPUs like NVIDIA GeForce RTX. The process involves preparing the CLIP-ViT-L-14-laion2B-s32B-b82K model and a 10K subset of the MS-COCO dataset for calibration. A code example demonstrates FP8 (E4M3) per-tensor static quantization using the AbsMax algorithm, including specific handling for CLIPAttention layers. Evaluation on CIFAR-100, ImageNet-1k, and MS-COCO Captions benchmarks shows that the FP8 quantized CLIP model maintains comparable quality to its FP16 baseline, especially when quantizers are selectively disabled in layers like patch embedding.

Key takeaway

For ML engineers optimizing vision-language models for deployment on NVIDIA GPUs, consider using NVIDIA Model Optimizer to apply FP8 post-training quantization. As demonstrated with CLIP, this approach can reduce memory footprint and improve inference speed while keeping quality comparable to the FP16 baseline. Experiment with disabling quantizers in sensitive layers, such as patch embeddings, to tune the balance between performance and model quality for your specific application.

Key insights

NVIDIA Model Optimizer enables efficient FP8 post-training quantization for models like CLIP, preserving quality while reducing resource demands.

Principles

Quantization reduces VRAM usage and can improve inference performance.
Fake quantization simulates precision loss for evaluation.
Iterative refinement optimizes quantization configurations.
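
To make the fake-quantization principle concrete, here is a minimal pure-Python sketch of per-tensor static FP8 (E4M3) quantization with AbsMax calibration. It is not ModelOpt's implementation: the function names are hypothetical, tensors are stood in for by plain lists of floats, and NaN/Inf handling is left out. It assumes the common E4M3 variant with a maximum finite value of 448.

```python
import math

E4M3_MAX = 448.0  # largest finite value in the assumed FP8 E4M3 variant


def round_to_e4m3(x):
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    _, exp = math.frexp(a)   # a = m * 2**exp with 0.5 <= m < 1
    e = max(exp - 1, -6)     # -6 is the smallest normal exponent; below it, subnormals
    step = 2.0 ** (e - 3)    # 3 mantissa bits give 8 steps per power of two
    return sign * round(a / step) * step


def calibrate_amax(batches):
    """AbsMax calibration: the largest magnitude seen in the calibration data."""
    return max(abs(v) for batch in batches for v in batch)


def fake_quantize(values, amax):
    """Quantize to E4M3 and immediately dequantize, so values stay ordinary floats."""
    if amax <= 0:
        raise ValueError("amax must be positive")
    scale = amax / E4M3_MAX  # one static scale for the whole tensor
    return [round_to_e4m3(v / scale) * scale for v in values]


amax = calibrate_amax([[0.1, -2.5], [7.0, 3.0]])
print(fake_quantize([0.1, -2.5, 7.0], amax))
```

With an amax of 7.0, the value 0.1 comes back as 0.1015625 while -2.5 and 7.0 survive unchanged. That small rounding error is the precision loss fake quantization exposes, which lets accuracy be evaluated before any real low-precision kernels are involved.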

Method

The ModelOpt PTQ flow involves preparing models and data, setting quantization config, calibrating with representative data, performing fake quantization, evaluating accuracy, iterating on configurations, and exporting for deployment with engines like TensorRT.
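
The calibration and fake-quantization steps of this flow might look roughly like the sketch below. It assumes the `nvidia-modelopt` package, where `mtq.quantize(model, config, forward_loop)` and `mtq.FP8_DEFAULT_CFG` are the entry points as I understand them; the helper names and the batch format (a dict of keyword arguments for the model's forward pass) are hypothetical, and evaluation and export are left as comments.

```python
import copy


def make_forward_loop(calib_batches):
    """Build the calibration callback: run every batch through the model."""
    def forward_loop(model):
        for batch in calib_batches:
            model(**batch)  # assumed: each batch is a dict of forward() kwargs
    return forward_loop


def quantize_clip_fp8(model, calib_batches):
    """Calibrate and fake-quantize a PyTorch model to FP8 with ModelOpt."""
    # Imported lazily so the helper above works without ModelOpt installed.
    import modelopt.torch.quantization as mtq

    # Copy the stock FP8 config so later edits do not mutate the library default.
    cfg = copy.deepcopy(mtq.FP8_DEFAULT_CFG)

    # Inserts quantizers and runs forward_loop to collect calibration statistics.
    model = mtq.quantize(model, cfg, make_forward_loop(calib_batches))
    mtq.print_quant_summary(model)

    # Next steps (not shown): evaluate accuracy, adjust cfg and repeat,
    # then export for a deployment engine such as TensorRT.
    return model
```

Keeping the forward loop in its own helper makes it easy to check against a stub model before spending GPU time on a real calibration run.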

In practice

Use ModelOpt for FP8 PTQ on Hugging Face, PyTorch, or ONNX models.
Register custom quantized modules for attention blocks.
Disable quantizers in sensitive layers to restore accuracy.
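
Selective disabling is typically expressed as wildcard patterns in the quantization config. The sketch below imitates that idea in pure Python; the config layout mirrors ModelOpt's FP8 config as I understand it, but the matching logic (`fnmatch`, later entries overriding earlier ones) and the quantizer names are assumptions made for illustration, not the library's actual resolution rules.

```python
import copy
from fnmatch import fnmatch

# Assumed layout: wildcard pattern -> quantizer attributes, plus a default.
BASE_CFG = {
    "quant_cfg": {
        "*weight_quantizer": {"num_bits": (4, 3), "axis": None},
        "*input_quantizer": {"num_bits": (4, 3), "axis": None},
        "default": {"enable": False},
    },
    "algorithm": "max",
}


def disable_layers(cfg, patterns):
    """Return a copy of cfg with quantizers matching each pattern turned off."""
    new_cfg = copy.deepcopy(cfg)
    for pattern in patterns:
        new_cfg["quant_cfg"][pattern] = {"enable": False}
    return new_cfg


def is_enabled(quantizer_name, quant_cfg):
    """Resolve whether a quantizer is on; later matching patterns win (assumed)."""
    enabled = quant_cfg.get("default", {}).get("enable", True)
    for pattern, attrs in quant_cfg.items():
        if pattern != "default" and fnmatch(quantizer_name, pattern):
            enabled = attrs.get("enable", True)
    return enabled


cfg = disable_layers(BASE_CFG, ["*patch_embedding*"])
for name in (
    "vision_model.embeddings.patch_embedding.weight_quantizer",  # illustrative name
    "text_model.encoder.layers.0.mlp.fc1.weight_quantizer",      # illustrative name
):
    print(name, "->", "on" if is_enabled(name, cfg["quant_cfg"]) else "off")
```

Working on a copy of the base config keeps each experiment isolated, which suits the iterate-and-evaluate loop: try a pattern, measure accuracy, then keep or discard it.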

Topics

NVIDIA Model Optimizer
Model Quantization
Post-Training Quantization
CLIP Model
FP8 Quantization

Best for: Machine Learning Engineer, AI Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.