Token-Budget-Aware Pool Routing for Cost-Efficient LLM Inference

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Expert, extended

Summary

A new approach called "token-budget-aware pool routing" addresses inefficiency and reliability problems in vLLM inference fleets, which typically provision every instance for the worst-case context length. Because 80-95% of requests are short, this provisioning wastes concurrency by a factor of 4-8x and exposes those requests to KV-cache failures. The proposed method estimates each request's total token budget using a self-calibrating, per-category bytes-per-token ratio, then dispatches the request to one of two vLLM pools: a high-throughput short pool or a high-capacity long pool, each optimized for its workload class. The ratio is learned online as an exponential moving average of `usage.prompt_tokens` feedback, so the router needs no tokenizer. A closed-form cost model, `savings = \alpha(1 - 1/\rho)`, predicts GPU savings from the short-traffic fraction `\alpha` and the throughput gain ratio `\rho`. Evaluations on the Azure LLM Inference Dataset and LMSYS-Chat-1M with Llama-3-70B on A100 GPUs show a 17-39% reduction in GPU instances, translating to $1.2-2.0M/yr in savings at 1,000 req/s, with a projection of $15.4M/yr for Qwen3-235B-A22B on AMD MI300X at 10,000 req/s.
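To make the cost model concrete, here is a minimal Python sketch of `savings = \alpha(1 - 1/\rho)`. The function names and the input values are illustrative assumptions, not figures from the paper. The `instances_needed` helper reflects one plausible reading of the formula: if the short pool serves `\rho` times more traffic per instance, the short share of the fleet shrinks by `\rho` while the long share stays the same size.

```python
def gpu_savings(alpha: float, rho: float) -> float:
    """Fraction of GPU instances saved, per savings = alpha * (1 - 1/rho).

    alpha: fraction of traffic routed to the short pool (0..1).
    rho:   throughput gain of a short-pool instance over a baseline instance.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if rho <= 0.0:
        raise ValueError("rho must be positive")
    return alpha * (1.0 - 1.0 / rho)


def instances_needed(baseline: float, alpha: float, rho: float) -> float:
    """Instances after the split (assumed reading of the formula).

    The short share needs baseline * alpha / rho instances; the long share
    still needs baseline * (1 - alpha).
    """
    return baseline * (alpha / rho + (1.0 - alpha))


if __name__ == "__main__":
    # Hypothetical inputs chosen for easy arithmetic, not taken from the paper.
    alpha, rho = 0.8, 2.0
    print(f"savings fraction: {gpu_savings(alpha, rho):.2f}")
    print(f"instances needed out of 100: {instances_needed(100, alpha, rho):.0f}")
```

Note that when `\rho` equals 1 the short pool has no throughput advantage and the predicted savings are zero, regardless of how much traffic is short.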

Key takeaway

For CTOs or VPs of Engineering managing LLM inference fleets, token-budget-aware pool routing can yield substantial cost savings (a 17-39% reduction in GPU instances) and improve reliability by eliminating KV-cache failures. Start with a two-pool system that routes on total token budget, set the `B_short` threshold between 8K and 16K tokens, and actively monitor preemption rates to confirm that service level objectives are met.

Key insights

Optimizing LLM inference by routing requests to specialized short or long context pools significantly reduces GPU costs and improves reliability.

Principles

Method

Estimate each request's total token budget using a self-calibrating, per-category bytes-per-token ratio, then dispatch the request to a high-throughput short pool or a high-capacity long pool based on a `B_short` threshold.
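The method can be sketched as a small router class. This is an illustration under stated assumptions, not the paper's implementation: the class name `TokenBudgetRouter`, the category labels, the starting ratio of 4.0 bytes per token, the smoothing weight of 0.1, and the choice to add the requested output length to the estimated prompt length are all hypothetical. Only the overall shape (a per-category ratio, an exponential moving average fed by `usage.prompt_tokens`, and a `B_short` threshold) comes from the summary.

```python
import math
from dataclasses import dataclass, field


@dataclass
class TokenBudgetRouter:
    b_short: int = 8192          # assumed threshold, within the suggested 8K-16K range
    ema_weight: float = 0.1      # assumed smoothing factor for the moving average
    initial_ratio: float = 4.0   # assumed starting guess for bytes per token
    ratios: dict = field(default_factory=dict)  # category -> learned bytes-per-token

    def estimate_budget(self, category: str, prompt: str, max_output_tokens: int) -> int:
        """Estimate total tokens without a tokenizer: prompt bytes / ratio + output cap."""
        ratio = self.ratios.get(category, self.initial_ratio)
        prompt_bytes = len(prompt.encode("utf-8"))
        return math.ceil(prompt_bytes / ratio) + max_output_tokens

    def route(self, category: str, prompt: str, max_output_tokens: int) -> str:
        """Return the pool name: 'short' if the budget fits under b_short, else 'long'."""
        budget = self.estimate_budget(category, prompt, max_output_tokens)
        return "short" if budget <= self.b_short else "long"

    def observe(self, category: str, prompt: str, prompt_tokens: int) -> None:
        """Update the category's ratio from the server-reported usage.prompt_tokens."""
        if prompt_tokens <= 0:
            return  # nothing to learn from an empty or missing usage report
        observed = len(prompt.encode("utf-8")) / prompt_tokens
        previous = self.ratios.get(category, self.initial_ratio)
        self.ratios[category] = (1 - self.ema_weight) * previous + self.ema_weight * observed


if __name__ == "__main__":
    router = TokenBudgetRouter()
    print(router.route("chat", "Summarize this paragraph.", max_output_tokens=256))
    print(router.route("code", "x" * 40_000, max_output_tokens=1024))
    # Feed back the token count reported by the server to recalibrate the ratio.
    router.observe("chat", "abcdefgh", prompt_tokens=4)
    print(router.ratios["chat"])
```

Keeping a separate ratio per category matters because bytes-per-token varies by content type; source code and non-English text, for example, usually tokenize less compactly than English prose, so a single global ratio would misroute some requests near the threshold.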

Best for: CTO, VP of Engineering/Data, Director of AI/ML, MLOps Engineer, AI Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.