Optimizing LLM latency
Summary
This analysis, published July 14, 2025, compares tools and platforms for reducing inference latency in open-source Large Language Models (LLMs). All measurements use the `meta-llama/Llama-2-7b-hf` model on an Nvidia A6000 GPU with a batch size of one and a maximum of 200 output tokens. MLC was the fastest at 117.1 tokens/second, though its output quality was not fully assessed. CTranslate2 emerged as a favored option, balancing speed (up to 62.6 tokens/second with int8 quantization) with ease of use. vLLM reached 46.4 tokens/second and stands out for its distributed inference support, which makes it suitable for very large models. Text Generation Inference (TGI) was rated an "ok" option at 21.1 tokens/second: it offers HuggingFace ecosystem integration and telemetry, but the analysis notes its restrictive license and slower performance with `bitsandbytes` quantization. HuggingFace Transformers and Text Generation WebUI are covered only briefly.
Key takeaway
For AI Engineers focused on minimizing LLM inference latency, consider CTranslate2 for its balance of speed and usability, especially with int8 quantization. If your models require distributed inference or are exceptionally large, vLLM is likely your optimal choice despite potential setup complexities. Be cautious with Text Generation Inference (TGI) due to its restrictive license and slower `bitsandbytes` performance, though it offers ecosystem integration.
Key insights
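MLC, CTranslate2, and vLLM all deliver significant latency improvements for LLM inference, but with different trade-offs: MLC is the fastest but its output quality was not fully assessed, CTranslate2 balances speed with ease of use, and vLLM adds distributed inference for very large models.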
Principles
- Quantization can significantly boost inference speed.
- Distributed inference is crucial for very large models.
- Inference servers often include optimization techniques.
Method
Benchmarking LLM inference latency involves fixing batch size, output tokens, and GPU, then measuring tokens/second across various optimization tools and servers using a consistent set of prompts.
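The measurement loop described above can be sketched roughly as follows. This is a minimal illustration and not the original article's benchmark harness: `tokens_per_second` and its `generate` callback are hypothetical names, and `generate` is assumed to wrap whichever backend is under test and to return the number of tokens it produced.

```python
import time
from statistics import mean
from typing import Callable, Sequence


def tokens_per_second(
    generate: Callable[[str, int], int],
    prompts: Sequence[str],
    max_new_tokens: int = 200,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Average tokens/second over a fixed prompt set, one prompt at a time.

    `generate(prompt, max_new_tokens)` is a hypothetical adapter around the
    backend being measured; it must return the count of generated tokens.
    Processing prompts one by one mirrors a batch size of one.
    """
    rates = []
    for prompt in prompts:
        start = clock()
        n_tokens = generate(prompt, max_new_tokens)
        elapsed = clock() - start
        rates.append(n_tokens / elapsed)
    return mean(rates)
```

Passing the same `prompts` and `max_new_tokens` to every backend, on the same GPU, keeps the comparison consistent. The `clock` parameter exists only so the timing can be replaced in tests.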
In practice
- Use `ct2-transformers-converter` for CTranslate2 quantization.
- Configure `GPTQ_BITS` and `GPTQ_GROUPSIZE` for TGI pre-quantized models.
- Install `vLLM` from git for the latest features and CUDA 11.8 compatibility.
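As a rough sketch of the first two tips, the helpers below assemble a `ct2-transformers-converter` invocation and the TGI environment variables. The helper names, the output directory, and the GPTQ values (4 bits, group size 128) are illustrative assumptions, not settings confirmed by the summary; the GPTQ values must match how the pre-quantized model was actually produced.

```python
import shlex


def ct2_convert_command(model: str, output_dir: str, quantization: str = "int8") -> list[str]:
    """Build the argument list for CTranslate2's Transformers converter CLI."""
    return [
        "ct2-transformers-converter",
        "--model", model,
        "--output_dir", output_dir,
        "--quantization", quantization,
    ]


def tgi_gptq_env(bits: int = 4, groupsize: int = 128) -> dict[str, str]:
    """Environment variables for serving a pre-quantized GPTQ model with TGI.

    The defaults are assumed example values, not recommendations.
    """
    return {"GPTQ_BITS": str(bits), "GPTQ_GROUPSIZE": str(groupsize)}


# Example: the int8 conversion of the benchmarked model as a shell-ready string.
# "llama-2-7b-ct2" is a hypothetical output directory.
command = ct2_convert_command("meta-llama/Llama-2-7b-hf", "llama-2-7b-ct2")
command_line = shlex.join(command)
```

The argument list can be passed to `subprocess.run(command, check=True)` once CTranslate2 is installed, and the dictionary can be merged into the environment of the TGI container or process.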
Topics
- LLM Inference Latency
- Model Optimization
- Inference Servers
- Quantization Techniques
- Distributed Inference
Code references
- OpenNMT/CTranslate2
- huggingface/text-generation-inference
- vllm-project/vllm
- turboderp/exllama
Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Hamel Husain's Blog.