Streaming Faster Made Our LLM Hub Slower
Summary
LLMesh, an open-source mesh for LLM inference, ran into performance problems as soon as it scaled beyond a single user, because its initial architecture streamed every token individually. The system fans tasks out from a hub to agent nodes (Ollama, vLLM, MLX) and streams results back via SSE. Each token generated by vLLM (50-200 tok/s) became a separate POST to the hub, so just 10 concurrent streams could produce up to 2,000 POSTs/second (10 streams × 200 tok/s), which overwhelmed the hub.
To address this, LLMesh added a `StreamBatcher` at the agent level that batches tokens before sending them to the hub. The batcher has three flush triggers: a size cap, a time cap (defaulting to 100ms for user-perceived responsiveness), and a "done" sentinel. An adaptive batching mechanism adjusts the batch size dynamically to keep a 100ms latency contract with users and a 10 batches/second throughput contract with the hub. It avoids a "death spiral" by measuring the model's actual generation rate rather than round-trip time.
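The three flush triggers described above can be sketched roughly as follows. This is a minimal illustration, not LLMesh's actual code: only the `StreamBatcher` name comes from the article, while the constructor parameters, the `send` callback, and the `add`/`finish` methods are assumptions made for the example. For simplicity the time cap is only checked when a token arrives; a real batcher would presumably also need a timer so that a stalled stream still flushes.

```python
import time
from typing import Callable, List


class StreamBatcher:
    """Buffers tokens and flushes on a size cap, a time cap, or a done sentinel.

    Hypothetical sketch: the parameter names and the send(batch, done)
    callback are assumptions, not LLMesh's real interface.
    """

    def __init__(
        self,
        send: Callable[[List[str], bool], None],
        max_tokens: int = 20,
        max_wait_s: float = 0.1,  # 100ms time cap, the default named in the article
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._send = send
        self._max_tokens = max_tokens
        self._max_wait_s = max_wait_s
        self._clock = clock
        self._buffer: List[str] = []
        self._first_token_at = 0.0

    def add(self, token: str) -> None:
        now = self._clock()
        if not self._buffer:
            # Start the time-cap window when the first token of a batch arrives.
            self._first_token_at = now
        self._buffer.append(token)
        size_hit = len(self._buffer) >= self._max_tokens
        time_hit = (now - self._first_token_at) >= self._max_wait_s
        if size_hit or time_hit:
            self._flush(done=False)

    def finish(self) -> None:
        # "Done" sentinel: flush whatever is left and mark the stream complete.
        self._flush(done=True)

    def _flush(self, done: bool) -> None:
        batch, self._buffer = self._buffer, []
        self._send(batch, done)


if __name__ == "__main__":
    # One send() call stands in for one POST to the hub.
    batcher = StreamBatcher(lambda batch, done: print(batch, done), max_tokens=3)
    for tok in ["Hello", ",", " world", "!"]:
        batcher.add(tok)
    batcher.finish()
```

Injecting the clock keeps the time-cap logic testable without sleeping, which is the main reason it is a parameter here.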
Key takeaway
For AI Engineers building LLM inference systems, optimizing streaming performance under concurrent load is critical. You should implement agent-side token batching with adaptive controls to reduce network overhead and ensure system stability. By measuring the model's true generation rate and smoothing batch size adjustments, you can maintain a responsive user experience while preventing your hub from being overwhelmed by excessive POST requests, especially in multi-tenant environments.
Key insights
Batching LLM tokens at the agent level significantly reduces network overhead and improves system stability under load.
Principles
- Measure producer rate, not round-trip time, in feedback loops.
- Smooth noisy measurements with exponential moving averages.
- Bound buffer sizes in multi-tenant systems to prevent cascading failures.
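The first two principles can be combined in a small rate estimator. The sketch below is an assumption-laden illustration: the `RateEstimator` class, its `observe` method, and the smoothing factor `alpha` are hypothetical names, and the article does not state which smoothing constant LLMesh uses. The key point is that each sample is built from how many tokens the model emitted and how long generation took, not from how long the hub took to acknowledge a POST.

```python
from typing import Optional


class RateEstimator:
    """Exponential moving average of the producer's token rate (tokens/second).

    Hypothetical helper: names and the default alpha are illustrative only.
    """

    def __init__(self, alpha: float = 0.2) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.rate: Optional[float] = None

    def observe(self, tokens: int, elapsed_s: float) -> float:
        """Fold in one sample: tokens generated over elapsed_s of generation time."""
        if elapsed_s <= 0:
            # Ignore degenerate samples instead of dividing by zero.
            return self.rate if self.rate is not None else 0.0
        sample = tokens / elapsed_s
        if self.rate is None:
            self.rate = sample  # the first sample seeds the average
        else:
            self.rate = self.alpha * sample + (1.0 - self.alpha) * self.rate
        return self.rate


if __name__ == "__main__":
    est = RateEstimator(alpha=0.5)
    est.observe(tokens=50, elapsed_s=0.5)          # sample of 100 tok/s seeds the average
    print(est.observe(tokens=50, elapsed_s=0.25))  # a 200 tok/s sample moves it halfway
```

A smaller `alpha` reacts more slowly to bursts but gives a steadier batch size; a larger one tracks the model more closely at the cost of jitter.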
Method
Implement an adaptive batcher with size and time caps, measuring the producer's actual token generation rate to dynamically adjust batch sizes, ensuring user latency and hub throughput contracts are met.
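One way to read the two contracts is that a stream flushing 10 batches per second flushes once every 100ms, so the batch size should be roughly the number of tokens the model produces in that window. The function below is a sketch under that assumption; `target_batch_size`, its parameters, and the clamping bounds are hypothetical and not taken from LLMesh.

```python
def target_batch_size(
    rate_tok_per_s: float,
    batches_per_s: float = 10.0,  # hub contract from the article; 10/s is one flush per 100ms
    min_size: int = 1,            # assumed lower bound so slow models still stream
    max_size: int = 256,          # assumed upper bound on a single batch
) -> int:
    """Tokens per batch so a stream flushes about batches_per_s times per second."""
    ideal = rate_tok_per_s / batches_per_s
    return max(min_size, min(max_size, round(ideal)))


if __name__ == "__main__":
    for rate in (50.0, 200.0):
        print(rate, "tok/s ->", target_batch_size(rate), "tokens per batch")
```

If the 10 batches/second contract applies per stream, the 10-stream scenario from the summary drops from up to 2,000 POSTs/second to about 100, and no token waits much longer than 100ms before being sent.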
In practice
- Set `STREAM_BATCH_TIME_MS` to 100ms for perceived UI responsiveness.
- Use `STREAM_BATCH_FIXED=N` for predictable workloads.
- Implement `STREAM_BATCH_MAX_BUFFER` to cap per-stream memory usage.
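As a rough illustration of how these settings might be read on an agent, the sketch below loads them from the environment. The three variable names come from the article; the `BatchConfig` dataclass, the `load_batch_config` function, the 1024 default for `STREAM_BATCH_MAX_BUFFER`, and the choice to treat that value as an integer are assumptions made for the example.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class BatchConfig:
    time_ms: int               # time cap per batch, in milliseconds
    fixed_size: Optional[int]  # None means adaptive sizing stays enabled
    max_buffer: int            # per-stream buffer cap (unit assumed, not documented here)


def load_batch_config(env: Mapping[str, str] = os.environ) -> BatchConfig:
    """Read the batching settings, falling back to defaults when unset."""
    fixed = env.get("STREAM_BATCH_FIXED")
    return BatchConfig(
        time_ms=int(env.get("STREAM_BATCH_TIME_MS", "100")),          # 100ms default per the article
        fixed_size=int(fixed) if fixed else None,
        max_buffer=int(env.get("STREAM_BATCH_MAX_BUFFER", "1024")),   # assumed default
    )


if __name__ == "__main__":
    print(load_batch_config({"STREAM_BATCH_FIXED": "16"}))
```

Passing the environment as a mapping makes it easy to test the parsing without mutating `os.environ`.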
Topics
- LLM Inference Streaming
- Adaptive Batching
- Backpressure Management
- Control Theory Feedback Loops
- Multi-tenant LLM Infrastructure
Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by HackerNoon.