Streaming Faster Made Our LLM Hub Slower

· Source: HackerNoon · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cloud Computing & IT Infrastructure · Depth: Advanced, medium

Summary

LLMesh, an open-source mesh for LLM inference, ran into performance problems as soon as it scaled beyond a single user, because its initial architecture streamed every token individually. The system fans tasks out from a hub to agent nodes (Ollama, vLLM, MLX) and streams results back via SSE. Each token generated by vLLM (50 to 200 tok/s) triggered its own POST to the hub, so just 10 concurrent streams could produce up to 2,000 POSTs per second, which overwhelmed the hub.

To address this, LLMesh added a `StreamBatcher` at the agent level that batches tokens before sending them to the hub. The batcher has three flush triggers: a size cap, a time cap (defaulting to 100 ms so that output still feels responsive to the user), and a "done" sentinel that flushes whatever remains when the stream ends.

An adaptive mechanism then adjusts the batch size dynamically to honor two contracts at once: a 100 ms latency contract with users and a 10 batches/second throughput contract with the hub. It avoids a "death spiral" by measuring the model's actual generation rate rather than round-trip time, so a slow hub does not feed back into the batch size.
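The three flush triggers can be illustrated with a minimal sketch. This is not LLMesh's actual `StreamBatcher`: the constructor parameters, the `DONE` sentinel object, the `poll()` method, and the `send` callback (standing in for the POST to the hub) are all assumptions made for illustration. The clock is injectable so the behavior can be checked without sleeping.

```python
import time
from typing import Callable, List, Optional

DONE = object()  # hypothetical end-of-stream sentinel


class StreamBatcher:
    """Buffers tokens and flushes on a size cap, a time cap, or DONE."""

    def __init__(
        self,
        send: Callable[[List[str], bool], None],  # stand-in for the POST to the hub
        max_tokens: int = 16,                      # size cap (assumed default)
        max_wait_s: float = 0.100,                 # time cap: 100 ms, as in the summary
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._send = send
        self._max_tokens = max_tokens
        self._max_wait_s = max_wait_s
        self._clock = clock
        self._buf: List[str] = []
        self._first_at: Optional[float] = None  # arrival time of oldest buffered token

    def add(self, token: object) -> None:
        if token is DONE:
            self._flush(done=True)       # trigger 3: "done" sentinel
            return
        if not self._buf:
            self._first_at = self._clock()
        self._buf.append(str(token))
        if len(self._buf) >= self._max_tokens:
            self._flush(done=False)      # trigger 1: size cap
        else:
            self.poll()

    def poll(self) -> None:
        """Trigger 2: time cap. Call periodically so a stalled stream still flushes."""
        if self._buf and self._clock() - self._first_at >= self._max_wait_s:
            self._flush(done=False)

    def _flush(self, done: bool) -> None:
        if self._buf or done:
            self._send(list(self._buf), done)
        self._buf.clear()
        self._first_at = None
```

In a real agent the time cap would need a timer or background task calling something like `poll()`, since checking only when a token arrives would let a partial batch sit indefinitely if the model pauses.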

Key takeaway

For AI Engineers building LLM inference systems, optimizing streaming performance under concurrent load is critical. You should implement agent-side token batching with adaptive controls to reduce network overhead and ensure system stability. By measuring the model's true generation rate and smoothing batch size adjustments, you can maintain a responsive user experience while preventing your hub from being overwhelmed by excessive POST requests, especially in multi-tenant environments.

Key insights

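Batching LLM tokens at the agent, instead of sending one POST per token, replaces up to 200 POSTs per second on a stream with a target of about 10 batches per second, which cuts network overhead and keeps the hub stable under load.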

Principles

Method

Implement an adaptive batcher with size and time caps, measuring the producer's actual token generation rate to dynamically adjust batch sizes, ensuring user latency and hub throughput contracts are met.
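One way to read this method is that the size cap should track how many tokens the model produces in one 100 ms window, which also yields roughly 10 batches per second. The sketch below follows that reading. The class name, the exponential smoothing, the clamp bounds, and all defaults are assumptions for illustration, not details taken from LLMesh.

```python
class AdaptiveBatchSize:
    """Derives a batch size from the producer's measured generation rate."""

    def __init__(
        self,
        target_interval_s: float = 0.100,  # 100 ms window, i.e. ~10 batches/second
        alpha: float = 0.2,                # smoothing factor (assumed)
        min_size: int = 1,
        max_size: int = 64,
        initial_size: int = 8,
    ) -> None:
        self._target_interval_s = target_interval_s
        self._alpha = alpha
        self._min_size = min_size
        self._max_size = max_size
        self._smoothed = float(initial_size)

    def observe(self, tokens: int, elapsed_s: float) -> int:
        """Record tokens generated over elapsed_s of model time.

        elapsed_s must be time spent generating, not hub round-trip time;
        otherwise a slow hub would change the batch size and feed back on itself.
        """
        if tokens > 0 and elapsed_s > 0:
            rate = tokens / elapsed_s                 # tokens per second
            ideal = rate * self._target_interval_s    # tokens per 100 ms window
            # Move a fraction of the way toward the ideal to avoid oscillation.
            self._smoothed += self._alpha * (ideal - self._smoothed)
        return self.size

    @property
    def size(self) -> int:
        return max(self._min_size, min(self._max_size, round(self._smoothed)))
```

Under these assumptions, a model sustaining 200 tok/s converges to a batch size of 20 and one at 50 tok/s converges to 5, so both send about 10 batches per second. With 10 concurrent streams that is on the order of 100 POSTs per second at the hub, compared with up to 2,000 when every token is sent separately.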

Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by HackerNoon.