Compute scarcity is an engineering problem
Summary
Angelos Perivolaropoulos of ElevenLabs presented at RAAIS on optimizing GPU utilization for voice inference workloads under GPU scarcity. His talk detailed how standard engineering raises the number of users served per GPU from one to seventy, and how architectural changes raise it to as many as one hundred forty. He explained that token cost in autoregressive transformers is bottlenecked by compute in the prefill phase and by memory bandwidth in the decode phase, with KV cache size being a critical factor. The optimizations build on one another: continuous batching lifts throughput from one to fifteen users per GPU, FP8 quantization brings it to twenty, speculative decoding or multi-token prediction to twenty-eight, and KV cache compression with distilled models to seventy. Frontier labs reach up to one hundred forty users per GPU through architectural changes such as DeepSeek's multi-head latent attention, Qwen's linear networks, and NVIDIA's state-space models. Perivolaropoulos emphasized that each optimization has costs and that real-world performance can differ from benchmarks.
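To make the relative size of each step easier to see, the following minimal Python sketch computes the incremental gain of every stage from the users-per-GPU figures quoted in the summary. The stage labels and the helper name `incremental_gains` are illustrative choices, not taken from the talk.

```python
# Users-per-GPU figures as quoted in the summary; labels are paraphrased.
STAGES = [
    ("baseline (one user per GPU)", 1),
    ("continuous batching", 15),
    ("FP8 quantization", 20),
    ("speculative decoding / multi-token prediction", 28),
    ("KV cache compression with distilled models", 70),
    ("architectural changes", 140),
]


def incremental_gains(stages):
    """Return (name, users, gain over the previous stage) for each stage after the first."""
    rows = []
    for (_, prev_users), (name, users) in zip(stages, stages[1:]):
        rows.append((name, users, users / prev_users))
    return rows


if __name__ == "__main__":
    for name, users, gain in incremental_gains(STAGES):
        print(f"{name:48s} {users:4d} users/GPU  x{gain:.2f}")
```

Read this way, continuous batching accounts for a fifteenfold step on its own, while each later stage multiplies the previous figure by a factor between roughly 1.3 and 2.5.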
Key takeaway
For MLOps engineers scaling LLM inference, prioritize continuous batching first: it raises the number of users served per GPU from one to fifteen. Then implement FP8 quantization and explore multi-token prediction to reach twenty-eight users per GPU. Consider KV cache compression with distilled models for up to seventy users, but evaluate carefully whether it degrades accuracy in production. These optimizations matter for managing GPU scarcity and improving cost-efficiency, especially while token prices remain subsidized.
Key insights
GPU scarcity necessitates deep engineering optimization to maximize users served per GPU for LLM inference.
Principles
- Batching is the biggest single win.
- Memory is the next constraint.
- Every optimization carries a cost.
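The second principle can be illustrated with the standard KV cache size calculation: every generated token stores one key and one value vector per layer and per KV head. The sketch below is a rough estimate only. The model shape, the free-memory budget, and the sequence length are hypothetical values chosen for illustration, not figures from the talk.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: one key vector plus one value vector per layer and KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_concurrent_sequences(free_bytes, tokens_per_seq, per_token_bytes):
    """How many full-length sequences fit into the memory left for the cache."""
    return free_bytes // (tokens_per_seq * per_token_bytes)


# Hypothetical model shape and memory budget (assumptions for illustration).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128
FREE_BYTES = 40 * 1024**3   # memory left for the cache after loading weights
TOKENS_PER_SEQ = 4096

fp16_per_token = kv_cache_bytes_per_token(N_LAYERS, N_KV_HEADS, HEAD_DIM, 2)
fp8_per_token = kv_cache_bytes_per_token(N_LAYERS, N_KV_HEADS, HEAD_DIM, 1)

if __name__ == "__main__":
    for label, per_token in (("FP16", fp16_per_token), ("FP8", fp8_per_token)):
        seqs = max_concurrent_sequences(FREE_BYTES, TOKENS_PER_SEQ, per_token)
        print(f"{label}: {per_token // 1024} KiB per token, {seqs} sequences fit")
```

Under these assumptions, halving the bytes per cached element doubles the number of sequences that fit in memory, which is why the cache becomes the limiting factor once batching is in place.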
Method
Optimize LLM inference in order: implement continuous batching first, then apply quantization (e.g., FP8), then speculative decoding or multi-token prediction, and finally KV cache compression with distilled models.
In practice
- Implement continuous batching.
- Use FP8 quantization for weights.
- Distill models so the KV cache can be stored in FP8.
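The first item in the list can be sketched as a toy scheduler simulation in Python. It counts decode steps only and ignores prefill, memory limits, and per-step latency, so it shows the scheduling idea rather than any real serving engine. The function names and the request lengths in the usage example are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch must fully finish before the next starts."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        # A static batch runs as long as its longest request.
        total += max(lengths[i:i + batch_size])
    return total


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled before the next step."""
    waiting = deque(lengths)   # output lengths of queued requests (each >= 1)
    active = []                # remaining tokens for each running request
    steps = 0
    while waiting or active:
        # Admit queued requests into any free slots.
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        # One decode step emits one token for every active request;
        # requests that reach zero remaining tokens leave the batch.
        active = [r - 1 for r in active if r - 1 > 0]
        steps += 1
    return steps


if __name__ == "__main__":
    # Hypothetical mix of short and long requests, two slots on the GPU.
    lengths = [2, 8, 2, 8, 2, 8, 2, 8]
    print("static:    ", static_batching_steps(lengths, batch_size=2))
    print("continuous:", continuous_batching_steps(lengths, batch_size=2))
```

In the static case, short requests sit idle in their slot until the longest request in the batch finishes. The continuous scheduler hands that slot to the next queued request, so the same work completes in fewer steps.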
Topics
- GPU Optimization
- LLM Inference
- Continuous Batching
- Quantization
- KV Cache Compression
- Model Architecture
Best for: NLP Engineer, AI Architect, CTO, Machine Learning Engineer, AI Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Air Street Press.