Token-Budget-Aware Pool Routing for Cost-Efficient LLM Inference
Summary
A new approach called "token-budget-aware pool routing" addresses the inefficiency and reliability issues in vLLM inference fleets, which typically provision every instance for worst-case context length. This leads to 4-8x wasted concurrency and KV-cache failures for the 80-95% of requests that are short. The proposed method estimates each request's total token budget using a self-calibrating, per-category bytes-per-token ratio, then dispatches it to one of two vLLM pools: a high-throughput short pool or a high-capacity long pool, each optimized for its workload class. The ratio is learned online as an exponential moving average over `usage.prompt_tokens` feedback, eliminating the need for a tokenizer. A closed-form cost model, `savings=\alpha\,(1-1/\rho)`, predicts GPU savings from the short-traffic fraction `\alpha` and the throughput gain ratio `\rho`. Evaluations on the Azure LLM Inference Dataset and LMSYS-Chat-1M with Llama-3-70B on A100 GPUs show a 17-39% reduction in GPU instances, translating to \$1.2-2.0M/yr in savings at 1,000 req/s, with a projection of \$15.4M/yr for Qwen3-235B-A22B on AMD MI300X at 10,000 req/s.
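To make the cost model concrete, here is a minimal Python sketch that evaluates `savings=\alpha\,(1-1/\rho)`. The function names and the example values of `alpha`, `rho`, and the baseline fleet size are illustrative assumptions, not figures from the paper. One way to read the formula: the short-traffic share `\alpha` needs only `1/\rho` as many instances once it runs on the higher-throughput pool, while the remaining traffic costs the same as before.

```python
def fleet_savings(alpha: float, rho: float) -> float:
    """Fraction of GPU instances saved under the closed-form model.

    alpha: fraction of traffic routed to the short pool (0..1).
    rho:   throughput gain of the short pool over the baseline (>= 1).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if rho < 1.0:
        raise ValueError("rho must be >= 1")
    return alpha * (1.0 - 1.0 / rho)


def instances_after_routing(baseline_instances: float, alpha: float, rho: float) -> float:
    """Instances still needed after splitting the fleet into two pools."""
    return baseline_instances * (1.0 - fleet_savings(alpha, rho))


if __name__ == "__main__":
    # Illustrative inputs only: 80% short traffic, 2x short-pool throughput.
    alpha, rho = 0.8, 2.0
    print(f"savings fraction: {fleet_savings(alpha, rho):.2f}")
    print(f"instances needed from a baseline of 100: "
          f"{instances_after_routing(100, alpha, rho):.0f}")
```

The model has two simple limits: with `rho = 1` (no throughput gain) or `alpha = 0` (no short traffic) the savings are zero, and savings can never exceed `alpha`.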
Key takeaway
For CTOs or VPs of Engineering managing LLM inference fleets, token-budget-aware pool routing can yield substantial cost savings (17-39% fewer GPU instances) and improve reliability by eliminating KV-cache failures. Start with a two-pool system, route on total token budget, set the `B_short` threshold between 8K and 16K tokens, and monitor preemption rates to confirm that service level objectives are met.
Key insights
Optimizing LLM inference by routing requests to specialized short or long context pools significantly reduces GPU costs and improves reliability.
Principles
- Match configuration to traffic patterns
- Right-size resources for workload classes
- Calibrate parameters online from feedback
Method
Estimate request token budget using a self-calibrating bytes-per-token ratio, then dispatch to a high-throughput short pool or a high-capacity long pool based on a `B_short` threshold.
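The method can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class and function names, the starting ratio of 4 bytes per token, the EMA smoothing weight of 0.1, the use of the request's output-token cap as the output part of the budget, and the treatment of a category as an opaque string label are all hypothetical choices.

```python
import math
from dataclasses import dataclass, field


@dataclass
class BytesPerTokenEstimator:
    """Per-category bytes-per-token ratio, calibrated online with an EMA."""
    initial_ratio: float = 4.0   # assumed starting guess, not from the paper
    smoothing: float = 0.1       # assumed EMA weight for new observations
    ratios: dict = field(default_factory=dict)

    def estimate_prompt_tokens(self, category: str, prompt: str) -> int:
        """Estimate prompt tokens from byte length, without a tokenizer."""
        ratio = self.ratios.get(category, self.initial_ratio)
        return math.ceil(len(prompt.encode("utf-8")) / ratio)

    def update(self, category: str, prompt: str, prompt_tokens: int) -> None:
        """Fold the token count reported in `usage.prompt_tokens` into the EMA."""
        if prompt_tokens <= 0:
            return  # nothing to learn from an empty or missing count
        observed = len(prompt.encode("utf-8")) / prompt_tokens
        previous = self.ratios.get(category, self.initial_ratio)
        self.ratios[category] = (1 - self.smoothing) * previous + self.smoothing * observed


def choose_pool(estimator: BytesPerTokenEstimator, category: str, prompt: str,
                max_output_tokens: int, b_short: int = 8192) -> str:
    """Route on the total budget (estimated input + output cap) against B_short."""
    l_total = estimator.estimate_prompt_tokens(category, prompt) + max_output_tokens
    return "short" if l_total <= b_short else "long"


if __name__ == "__main__":
    est = BytesPerTokenEstimator()
    print(choose_pool(est, "chat", "Summarize this paragraph.", max_output_tokens=256))
    # After the server responds, feed the reported prompt token count back in.
    est.update("chat", "Summarize this paragraph.", prompt_tokens=5)
    print(est.ratios)
```

Because the estimate only needs the byte length of the prompt, the router stays cheap and model-agnostic; the feedback step is what corrects the ratio for each traffic category over time.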
In practice
- Set `B_short` between 8K and 16K tokens for optimal savings
- Route on `L_total` (input + output tokens)
- Monitor preemption rate, not just utilization
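As a rough illustration of the monitoring advice above, the sketch below computes a preemption rate from two samples of cumulative counters. It assumes you can read a running preemption count and a running completed-request count from your serving stack; the function names and the 1% alert threshold are hypothetical, not values from the article.

```python
def preemption_rate(prev_preemptions: int, curr_preemptions: int,
                    prev_requests: int, curr_requests: int) -> float:
    """Preemptions per completed request between two counter samples."""
    completed = curr_requests - prev_requests
    if completed <= 0:
        return 0.0  # no completed requests in this window
    # Clamp at zero so a counter reset does not produce a negative rate.
    return max(0, curr_preemptions - prev_preemptions) / completed


def should_alert(rate: float, max_rate: float = 0.01) -> bool:
    """True when the rate exceeds the (assumed) tolerated threshold."""
    return rate > max_rate


if __name__ == "__main__":
    rate = preemption_rate(prev_preemptions=10, curr_preemptions=40,
                           prev_requests=1000, curr_requests=1500)
    print(f"preemption rate: {rate:.3f}, alert: {should_alert(rate)}")
```

Tracking this per pool is useful because a pool can show comfortable GPU utilization while still preempting requests whenever its KV cache fills.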
Topics
- Token-Budget Pool Routing
- vLLM Inference Optimization
- GPU Cost Efficiency
- KV-Cache Management
- Self-Calibrating Token Estimation
Best for: CTO, VP of Engineering/Data, Director of AI/ML, MLOps Engineer, AI Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.