Threshold-Based Exclusive Batching for LLM Inference
Summary
Threshold-Based Exclusive Batching (EB) is a new scheduling strategy that targets an inefficiency of Mixed Batching (MB) in Large Language Model (LLM) inference. MB, which interleaves prefill and decode in the same batch, is the standard way to maximize compute utilization, but experiments show that prefill-decode interference raises MB's per-step marginal cost. The effect is pronounced on bandwidth-constrained GPUs such as the RTX PRO 6000 (1.792 TB/s), where interference sets in once decode tokens exceed 20% of the batch. On the high-bandwidth H200 (4.8 TB/s), the threshold rises to 80%. The better choice between EB and MB therefore depends on GPU memory bandwidth, model size, and workload, and a closed-form condition determines where the EB-MB performance crossover lies. Optimized EB achieves up to 41.9% higher throughput than MB on bandwidth-constrained GPUs, while MB remains superior for larger models on high-bandwidth hardware. The hybrid EB+ scheduler applies the condition dynamically and achieves up to 36.4% higher throughput than MB under non-stationary traffic.
Key takeaway
MLOps engineers optimizing LLM inference should evaluate their GPU's memory bandwidth and workload characteristics before choosing a batching strategy. On bandwidth-constrained hardware such as the RTX PRO 6000, consider Exclusive Batching or a dynamic hybrid scheduler like EB+. In the reported results, optimized EB delivers up to 41.9% higher throughput than standard Mixed Batching, and EB+ delivers up to 36.4% higher throughput than MB under fluctuating traffic.
Key insights
Optimal LLM inference batching depends on GPU bandwidth and workload, with a hybrid scheduler dynamically switching for superior throughput.
Principles
- Prefill-decode interference inflates Mixed Batching costs.
- GPU memory bandwidth dictates optimal batching strategy.
- Dynamic scheduling adapts to non-stationary LLM traffic.
Method
A closed-form condition determines optimal switching between Exclusive and Mixed Batching, considering GPU bandwidth and workload. This condition, with phase-switching thresholds and memory-safe sizing, powers the dynamic EB+ hybrid scheduler.
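The summary does not reproduce the closed-form condition itself, so the sketch below is only a simplified stand-in for the idea of switching modes per GPU. It uses the two decode-share thresholds reported above (20% for the RTX PRO 6000, 80% for the H200) and ignores model size, which the actual condition also takes into account. The names `INTERFERENCE_THRESHOLD`, `BatchStats`, and `choose_mode` are hypothetical and not taken from the original article.

```python
from dataclasses import dataclass

# Decode-token share above which prefill-decode interference was reported.
# The values come from the summary; the lookup table is an illustrative assumption.
INTERFERENCE_THRESHOLD = {
    "RTX PRO 6000": 0.20,
    "H200": 0.80,
}


@dataclass
class BatchStats:
    """Token counts for one candidate batch (hypothetical structure)."""
    prefill_tokens: int
    decode_tokens: int

    @property
    def decode_fraction(self) -> float:
        total = self.prefill_tokens + self.decode_tokens
        # An empty batch has no decode share.
        return self.decode_tokens / total if total else 0.0


def choose_mode(gpu: str, stats: BatchStats) -> str:
    """Return "EB" when the decode share exceeds the GPU's threshold, else "MB".

    This is a stand-in rule, not the closed-form condition from the article.
    """
    threshold = INTERFERENCE_THRESHOLD[gpu]
    return "EB" if stats.decode_fraction > threshold else "MB"


stats = BatchStats(prefill_tokens=600, decode_tokens=400)
mode_rtx = choose_mode("RTX PRO 6000", stats)
mode_h200 = choose_mode("H200", stats)
```

With a 40% decode share, this rule selects EB on the RTX PRO 6000 and keeps MB on the H200, which matches the direction of the reported thresholds.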
In practice
- Use Exclusive Batching on bandwidth-constrained GPUs.
- Apply hybrid scheduling for variable LLM traffic.
- Consider GPU bandwidth when selecting batching strategy.
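To make the exclusive-batching idea concrete, here is a minimal sketch of a phase switcher that runs prefill-only or decode-only steps and never mixes the two. The summary mentions phase-switching thresholds but does not say what quantity they apply to, so treating the threshold as a count of waiting prefill requests is an assumption. The function `next_phase` and its parameters are hypothetical.

```python
def next_phase(current: str, waiting_prefill: int, running_decode: int,
               prefill_threshold: int) -> str:
    """Pick the next exclusive phase, "prefill" or "decode".

    Assumed policy (not from the article):
    - with no waiting prefill requests, decode (or idle in decode);
    - with nothing to decode, prefill;
    - while decoding, switch to prefill only once enough requests are waiting;
    - once in prefill, stay there until the waiting queue is drained.
    """
    if waiting_prefill == 0:
        return "decode"
    if running_decode == 0:
        return "prefill"
    if current == "decode" and waiting_prefill < prefill_threshold:
        return "decode"
    return "prefill"


# A short trace with an assumed threshold of 8 waiting requests.
trace = [
    next_phase("decode", 3, 10, 8),   # too few waiting: keep decoding
    next_phase("decode", 8, 10, 8),   # threshold reached: switch to prefill
    next_phase("prefill", 2, 10, 8),  # keep draining the prefill queue
    next_phase("prefill", 0, 10, 8),  # queue drained: back to decode
]
```

A higher threshold in this sketch means fewer phase switches at the cost of longer waits for new requests, which is the kind of trade-off a tuned threshold would have to balance.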
Topics
- LLM Inference
- Batching Optimization
- GPU Memory Bandwidth
- Dynamic Scheduling
- Mixed Batching
- Exclusive Batching
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.