LLM Serving Fairness: No more noisy neighbours - Cohere
Summary
Cohere has introduced a "Serving Fairness" solution to mitigate the "noisy neighbor" problem in its multi-tenant LLM SaaS platform, where bursty traffic from one customer can degrade latency for others. The system combines four mechanisms: a Rate Limiter, Performance Tiering, Deficit Round Robin (DRR) scheduling, and a Priority selector. The Rate Limiter handles admission by capping requests per tenant and model and rejecting requests that could not meet latency targets. Performance Tiering prioritizes requests according to service-level agreements. Within each tier, the Deficit Round Robin algorithm distributes compute equitably: each tenant receives a budget (quantum) per turn, and each served request debits that budget by its cost. Cost can be counted per request for generative models or per token for batched endpoints such as embeddings, so that the budget reflects actual work. A Priority selector then orders requests within a tenant's queue by explicit priority, deadline, and arrival time. Together, these layers make inference performance predictable, determined by a customer's tier and fair share of capacity. Serving Fairness is now enabled for all Cohere SaaS API and marketplace customers.
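To make the quantum-and-cost mechanics concrete, here is a minimal Python sketch of Deficit Round Robin for a single tier. It is an illustration under stated assumptions, not Cohere's implementation: the names `Request`, `DeficitRoundRobin`, `request_cost`, and `token_cost` are hypothetical, as is the choice to reset an idle tenant's budget to zero (the usual textbook DRR rule).

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    tenant: str
    tokens: int  # input size; only the token-based cost model reads this


def request_cost(req: Request) -> int:
    """Request-based budgeting: every request costs one unit."""
    return 1


def token_cost(req: Request) -> int:
    """Token-based budgeting: cost tracks the work in the request."""
    return req.tokens


class DeficitRoundRobin:
    """Hypothetical single-tier DRR scheduler (not Cohere's code)."""

    def __init__(self, quantum: int, cost_fn=request_cost):
        self.quantum = quantum   # budget granted to a tenant on each turn
        self.cost_fn = cost_fn   # how much a request debits from the budget
        self.queues = {}         # tenant -> deque of pending requests
        self.deficit = {}        # tenant -> unspent budget
        self.active = deque()    # tenants with pending work, in turn order

    def enqueue(self, req: Request) -> None:
        queue = self.queues.setdefault(req.tenant, deque())
        if not queue:
            # Tenant becomes active again and starts with no banked budget.
            self.active.append(req.tenant)
            self.deficit[req.tenant] = 0
        queue.append(req)

    def next_round(self) -> list:
        """Give each active tenant one turn; return the requests dispatched."""
        dispatched = []
        for _ in range(len(self.active)):
            tenant = self.active.popleft()
            queue = self.queues[tenant]
            self.deficit[tenant] += self.quantum
            # Serve from the head of the queue while the budget covers the cost.
            while queue and self.cost_fn(queue[0]) <= self.deficit[tenant]:
                req = queue.popleft()
                self.deficit[tenant] -= self.cost_fn(req)
                dispatched.append(req)
            if queue:
                self.active.append(tenant)  # unspent budget carries over
            else:
                self.deficit[tenant] = 0    # idle tenants do not bank budget
        return dispatched


# Usage: a bursty tenant and a quiet tenant share capacity turn by turn.
scheduler = DeficitRoundRobin(quantum=1, cost_fn=request_cost)
for name in ["noisy", "noisy", "noisy", "quiet"]:
    scheduler.enqueue(Request(tenant=name, tokens=10))
first_round = [r.tenant for r in scheduler.next_round()]
```

With a quantum of one request, the quiet tenant is served in the first round even though the noisy tenant queued three requests ahead of it. Passing `cost_fn=token_cost` with a token-sized quantum applies the same rule to batched endpoints, where a large request must wait for enough budget to accumulate over several turns.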
Key takeaway
For MLOps Engineers designing multi-tenant LLM inference platforms, Cohere's layered approach to serving fairness offers a practical blueprint. Consider combining admission control, tiered service levels, and a Deficit Round Robin scheduler with an adaptable cost model (request- or token-based) to prevent "noisy neighbor" issues. The combination keeps latency and resource allocation predictable for your customers even under bursty traffic, while each tenant's own priorities and deadlines are still honored within its fair share.
Key insights
Cohere's multi-layered scheduling system ensures fair LLM inference capacity distribution across tenants, preventing "noisy neighbor" issues.
Principles
- Multi-tenant LLM serving requires robust fairness mechanisms.
- Batching efficiency must coexist with fair resource allocation.
- A tenant's fair-share budget can be debited per request or per token.
Method
Cohere applies four stages in sequence: Rate Limiting for admission, Performance Tiering, Deficit Round Robin (DRR) for tenant-level fairness, and per-tenant Priority queues for urgency.
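The admission stage at the front of this sequence can be sketched as a limiter keyed by tenant and model. The summary does not say which limiting algorithm Cohere uses, so the token bucket below is an assumption chosen for illustration, and `TokenBucketLimiter`, `rate_per_sec`, `burst`, and the tenant and model names are hypothetical. The sketch covers only the per-tenant, per-model cap, not rejection based on latency targets.

```python
import time


class TokenBucketLimiter:
    """Hypothetical admission control: one token bucket per (tenant, model)."""

    def __init__(self, rate_per_sec: float, burst: float, clock=time.monotonic):
        self.rate = rate_per_sec  # tokens refilled per second
        self.burst = burst        # bucket capacity, i.e. the largest burst
        self.clock = clock        # injectable so tests can control time
        self.buckets = {}         # (tenant, model) -> (tokens, last_seen)

    def admit(self, tenant: str, model: str) -> bool:
        """Return True if the request is admitted, False if it is rejected."""
        now = self.clock()
        key = (tenant, model)
        # A key seen for the first time starts with a full bucket.
        tokens, last_seen = self.buckets.get(key, (self.burst, now))
        # Refill for the elapsed time, capped at the bucket capacity.
        tokens = min(self.burst, tokens + (now - last_seen) * self.rate)
        if tokens >= 1:
            self.buckets[key] = (tokens - 1, now)
            return True
        self.buckets[key] = (tokens, now)
        return False


# Usage: each (tenant, model) pair is limited independently.
limiter = TokenBucketLimiter(rate_per_sec=5, burst=10)
admitted = limiter.admit("tenant-a", "example-model")
```

Keying the bucket on both tenant and model means a burst against one model does not consume the same tenant's allowance on another, which matches the per-tenant, per-model cap described in the summary.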
In practice
- Implement admission control to prevent system overload.
- Use DRR with token-based budgeting for batched endpoints.
- Order requests within each tenant's queue by explicit priority, deadline, and arrival time.
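The per-tenant ordering in the last item can be sketched with a binary heap. This is a minimal illustration, assuming that a lower priority number means more urgent and that deadline and arrival order act as successive tie-breakers. The summary lists the three keys but does not spell out these conventions, and the `TenantQueue` class is hypothetical.

```python
import heapq
import itertools


class TenantQueue:
    """Hypothetical per-tenant queue ordered by priority, deadline, arrival."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # monotonically increasing sequence

    def push(self, request_id: str, priority: int, deadline: float) -> None:
        # Tuples compare element by element, so the heap orders by priority,
        # then by deadline, then by arrival sequence number.
        entry = (priority, deadline, next(self._arrival), request_id)
        heapq.heappush(self._heap, entry)

    def pop(self) -> str:
        """Remove and return the most urgent request id."""
        return heapq.heappop(self._heap)[-1]

    def __len__(self) -> int:
        return len(self._heap)


# Usage: the tenant's own urgency rules decide what it spends its turn on.
queue = TenantQueue()
queue.push("r1", priority=1, deadline=50.0)
queue.push("r2", priority=0, deadline=90.0)
queue.push("r3", priority=1, deadline=20.0)
queue.push("r4", priority=1, deadline=20.0)
order = [queue.pop() for _ in range(len(queue))]
```

Because the arrival counter is unique per entry, two requests with equal priority and deadline are served in the order they arrived, and the heap never has to compare request ids.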
Topics
- LLM Serving Fairness
- Multi-tenant Inference
- Deficit Round Robin
- Resource Scheduling
- Rate Limiting
- Cohere API
Best for: MLOps Engineer, AI Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by cohere.com via Google News.