Compute scarcity is an engineering problem

Source: Air Street Press · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Expert, medium

Summary

Angelos Perivolaropoulos of ElevenLabs spoke at RAAIS about raising GPU utilization for voice inference workloads while GPUs remain scarce. His talk traced how the number of users served per GPU can grow from one to seventy with standard engineering, and to as many as one hundred forty with architectural changes. Token cost in autoregressive transformers, he explained, is bound by compute during the prefill phase and by memory bandwidth during the decode phase, which makes KV cache size a critical factor. The gains stack in sequence: continuous batching lifts throughput from one to fifteen users per GPU, FP8 quantization raises it to twenty, speculative decoding or multi-token prediction to twenty-eight, and KV cache compression with distilled models to seventy. Frontier labs reach up to one hundred forty users per GPU through architectural changes such as DeepSeek's multi-head latent attention, Qwen's linear attention, and NVIDIA's state-space models. Perivolaropoulos emphasized that each optimization carries costs and that real-world performance can differ from benchmarks.
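
To make the role of KV cache size concrete, here is a minimal sketch of the standard cache-size arithmetic for a transformer with grouped key/value heads. Every decode step has to read this cache back from GPU memory, which is why its size weighs on both capacity and bandwidth. The model dimensions, the 80 GiB GPU, and the 16 GiB of weights are hypothetical values chosen for illustration, not figures from the talk, and the function names are invented for this example.

```python
GIB = 1024 ** 3

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value):
    """Bytes of KV cache held for one sequence of seq_len tokens."""
    # Factor 2: one key vector and one value vector per token, head and layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

def max_sequences_in_memory(gpu_bytes, weight_bytes, per_sequence_bytes):
    """Capacity bound only: how many caches fit next to the weights."""
    return (gpu_bytes - weight_bytes) // per_sequence_bytes

# Hypothetical model configuration (an assumption, not from the talk).
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)

fp16_cache = kv_cache_bytes(**cfg, bytes_per_value=2)  # 16-bit cache values
fp8_cache = kv_cache_bytes(**cfg, bytes_per_value=1)   # 8-bit cache values

# Hypothetical 80 GiB GPU holding 16 GiB of weights.
fits_fp16 = max_sequences_in_memory(80 * GIB, 16 * GIB, fp16_cache)
fits_fp8 = max_sequences_in_memory(80 * GIB, 16 * GIB, fp8_cache)

print(f"FP16 cache per sequence: {fp16_cache / GIB} GiB, fits {fits_fp16}")
print(f"FP8 cache per sequence:  {fp8_cache / GIB} GiB, fits {fits_fp8}")
```

Halving the bytes per cached value doubles how many sequences fit in the same memory under these assumptions. This is a capacity bound, not a throughput prediction, so it should not be read as reproducing the users-per-GPU figures quoted above.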

Key takeaway

MLOps engineers scaling LLM inference should start with continuous batching, which raises the number of users served per GPU from one to fifteen. Next, apply FP8 quantization and explore multi-token prediction to reach twenty-eight users per GPU. KV cache compression with distilled models can reach seventy users, but evaluate its potential for accuracy degradation in production carefully. These optimizations matter for managing GPU scarcity and improving cost-efficiency, especially while token prices are subsidized.
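
The users-per-GPU figures quoted in the summary can be turned into per-step multipliers and a relative GPU cost per user. The sketch below does only that arithmetic; it assumes cost per user scales as one over users per GPU, which is a simplification introduced here and not a claim from the talk.

```python
# Users served per GPU after each stage, as quoted in the summary.
stages = [
    ("baseline", 1),
    ("continuous batching", 15),
    ("FP8 quantization", 20),
    ("speculative decoding / multi-token prediction", 28),
    ("KV cache compression + distilled models", 70),
    ("architectural changes", 140),
]

def step_gains(ladder):
    """Multiplier each stage contributes over the stage before it."""
    return [
        (name, users / prev_users)
        for (_, prev_users), (name, users) in zip(ladder, ladder[1:])
    ]

def relative_cost_per_user(ladder):
    """GPU cost per user relative to the baseline (assumes cost ~ 1 / users)."""
    base = ladder[0][1]
    return [(name, base / users) for name, users in ladder]

gains = step_gains(stages)
costs = relative_cost_per_user(stages)

for name, gain in gains:
    print(f"{name}: x{gain:.2f}")
for name, cost in costs:
    print(f"{name}: {cost:.4f} of baseline cost per user")
```

Read this way, continuous batching accounts for by far the largest single multiplier, which is consistent with the advice to prioritize it first.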

Key insights

GPU scarcity necessitates deep engineering optimization to maximize users served per GPU for LLM inference.

Principles

Method

Optimize LLM inference by first implementing continuous batching, then applying quantization (e.g., FP8), followed by speculative decoding or multi-token prediction, and finally, KV cache compression with distilled models.
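
As an illustration of the first step, the toy simulation below contrasts static batching, where a batch runs until its longest request finishes, with continuous batching, where a finished request's slot is refilled before the next decode step. It is a simplified model under stated assumptions (one token per active request per step, no prefill cost, hypothetical request lengths) and not the scheduler of any particular serving framework.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total

def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled before the next step."""
    waiting = deque(lengths)
    active = []  # tokens still to generate for each in-flight request
    steps = 0
    while waiting or active:
        # Admit waiting requests into any free slots.
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        # One decode step emits one token for every active request.
        steps += 1
        # Requests with one token left finish now; the rest count down.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps

# Hypothetical output lengths in tokens (each must be at least 1).
request_lengths = [1, 4, 1, 4]
static_steps = static_batching_steps(request_lengths, batch_size=2)
continuous_steps = continuous_batching_steps(request_lengths, batch_size=2)
print(static_steps, continuous_steps)
```

With mixed short and long requests, the continuous scheduler finishes the same work in fewer steps because slots do not sit idle waiting for the longest request in a batch. The size of the gain depends on the length distribution and batch size, so this toy example does not reproduce the one-to-fifteen figure.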

Best for: NLP Engineer, AI Architect, CTO, Machine Learning Engineer, AI Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Air Street Press.