Post-Training Memory Reduction Techniques for Model Inference
Summary
This article covers post-training memory reduction techniques for running large language models and other neural networks at inference time. The first step is casting weights from FP32 to a 16-bit floating point format such as BF16, which halves weight memory and enables faster Tensor Core kernels. INT8 quantization goes further, cutting memory 4x relative to FP32, and LLM.int8() preserves accuracy for large models. For the largest savings, weight-only INT4 quantization (NF4, GPTQ, AWQ) can shrink a 70B model from 140 GB in 16-bit precision to approximately 35 GB. The article also covers graph compilation with ONNX Runtime and TensorRT to reduce runtime overhead and intermediate allocations, and introduces `torch.compile` for memory-efficient inference. The techniques apply to both Hugging Face checkpoints and custom PyTorch models, and the article recommends an order in which to apply them.
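The memory figures above follow from simple bytes-per-parameter arithmetic. The sketch below is illustrative only: the helper name `weight_memory_gb` is hypothetical, it assumes decimal gigabytes (1 GB = 1e9 bytes), and it counts weights only, ignoring activations, the KV cache, and quantization metadata such as scales.

```python
# Approximate weight-only memory footprint at different precisions.
# Assumptions: 1 GB = 1e9 bytes; activations, KV cache, and quantization
# metadata (scales, zero points) are not counted.

BYTES_PER_PARAM = {
    "fp32": 4.0,
    "bf16": 2.0,
    "int8": 1.0,
    "int4": 0.5,  # two 4-bit weights packed into one byte
}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return the approximate size of the weights in GB for a given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    # A 70B-parameter model, as in the summary above.
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_memory_gb(70e9, dtype):.0f} GB")
```

For a 70B-parameter model this gives roughly 280 GB in FP32, 140 GB in BF16, 70 GB in INT8, and 35 GB in INT4, matching the ratios quoted in the summary.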
Key takeaway
For MLOps engineers deploying large language models, post-training memory reduction directly determines the VRAM budget and serving cost. Start by casting the model to BF16, then add `torch.compile` for an initial reduction in overhead. For weight-heavy models, apply weight-only INT4 quantization (e.g., `bnb.nn.Linear4bit`) to achieve significant memory savings. Finally, export the validated model to ONNX Runtime or TensorRT to optimize runtime performance, keeping in mind that this step adds engineering complexity.
Key insights
Post-training memory reduction lowers VRAM use, latency, and cost at inference time without retraining.
Principles
- BF16 is the default precision for transformer inference.
- Weight-only INT4 offers the largest memory reduction for LLMs among the techniques covered.
- Exporting to a dedicated runtime (ONNX Runtime, TensorRT) increases engineering complexity.
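To make the weight-only INT4 principle concrete, here is a minimal pure-Python sketch of symmetric 4-bit quantization with one absmax scale per block and two codes packed per byte. It is a simplified stand-in and not NF4, GPTQ, or AWQ, which choose their quantization levels and scales more carefully. The function names and the `block_size` default are assumptions made for illustration.

```python
# Simplified weight-only 4-bit quantization: symmetric absmax scaling per
# block, two 4-bit codes packed into each byte. Illustrative only; this is
# NOT the NF4, GPTQ, or AWQ algorithm.


def quantize_int4(weights, block_size=4):
    """Return (packed_bytes, scales) for a flat list of float weights."""
    codes, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        absmax = max(abs(w) for w in block)
        scale = absmax / 7 if absmax > 0 else 1.0  # map the block to [-7, 7]
        scales.append(scale)
        # Clamp to [-7, 7], then shift by 8 to get an unsigned code in 1..15.
        codes.extend(max(-7, min(7, round(w / scale))) + 8 for w in block)
    if len(codes) % 2:
        codes.append(8)  # pad with the code for zero so codes pair up evenly
    packed = bytes(lo | (hi << 4) for lo, hi in zip(codes[0::2], codes[1::2]))
    return packed, scales


def dequantize_int4(packed, scales, count, block_size=4):
    """Recover approximate float weights from packed codes and block scales."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)  # low nibble holds the first weight
        codes.append(byte >> 4)    # high nibble holds the second weight
    return [(codes[i] - 8) * scales[i // block_size] for i in range(count)]


if __name__ == "__main__":
    weights = [0.7, -0.35, 0.1, 0.0, -1.4, 0.2, 0.9, 1.4]
    packed, scales = quantize_int4(weights)
    restored = dequantize_int4(packed, scales, len(weights))
    print(f"{len(weights)} weights stored in {len(packed)} bytes "
          f"plus {len(scales)} scales")
    print("restored:", [round(r, 3) for r in restored])
```

Each weight costs half a byte plus a share of its block's scale, which is where the roughly 4x saving over 16-bit storage comes from. The rounding error per weight is bounded by half the block's scale, so smaller blocks trade a little extra metadata for lower error.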
Method
Apply the techniques in order: cast to BF16, add `torch.compile`, quantize with `bnb.nn.Linear4bit` (or FX INT8), and finally export to ONNX Runtime or TensorRT for deployment.
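One way to reason about how far down this sequence to go is to stop at the first precision whose weights fit the available VRAM. The sketch below is a hypothetical planning helper and not part of the original article: `pick_precision`, the `headroom` fraction reserved for activations and the KV cache, and the decimal-GB convention are all assumptions.

```python
# Hypothetical planning helper: walk the precisions from least to most
# aggressive and return the first whose weights fit the VRAM budget.
# Assumptions: 1 GB = 1e9 bytes; `headroom` is the fraction of VRAM that
# may be used for weights (the rest is left for activations and KV cache).

PRECISION_ORDER = [("bf16", 2.0), ("int8", 1.0), ("int4", 0.5)]


def pick_precision(num_params, vram_budget_gb, headroom=0.8):
    """Return the least aggressive precision that fits, or None if none do."""
    usable_bytes = vram_budget_gb * 1e9 * headroom
    for name, bytes_per_param in PRECISION_ORDER:
        if num_params * bytes_per_param <= usable_bytes:
            return name
    return None  # does not fit on one device even at 4 bits


if __name__ == "__main__":
    for params, budget in [(7e9, 24), (13e9, 24), (70e9, 80), (70e9, 24)]:
        choice = pick_precision(params, budget)
        print(f"{params / 1e9:.0f}B params, {budget} GB VRAM -> {choice}")
```

Under these assumptions a 7B model fits a 24 GB card in BF16, while a 70B model needs INT4 even on an 80 GB card. Real deployments should validate accuracy after each step rather than rely on size alone.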
In practice
- Pass `torch_dtype=torch.bfloat16` to `from_pretrained` when loading Hugging Face models.
- Replace `nn.Linear` layers with `bnb.nn.Linear4bit` (from bitsandbytes) in custom models.
- Export models to ONNX for portability and runtime optimization.
Topics
- Model Inference
- Post-Training Quantization
- Large Language Models
- BF16 Precision
- INT4 Quantization
- ONNX Runtime
Best for: Machine Learning Engineer, AI Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by LLM on Medium.