LLM Serving Fairness: No more noisy neighbours - Cohere

· Source: cohere.com via Google News · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Software Development & Engineering · Depth: Intermediate, medium

Summary

Cohere has introduced "Serving Fairness" to mitigate the "noisy neighbor" problem on its multi-tenant LLM SaaS platform, where bursty traffic from one customer can degrade latency for others. The system combines four mechanisms: a Rate Limiter, Performance Tiering, Deficit Round Robin (DRR) scheduling, and a Priority selector. The Rate Limiter handles admission: it caps requests per tenant and model and rejects requests that could not meet their latency targets. Performance Tiering prioritizes requests according to service-level agreements. Within each tier, the Deficit Round Robin algorithm distributes compute equitably: each tenant receives a budget (quantum) per turn, and every request served is debited from that budget at its cost. The cost can be request-based for generative models or token-based for batched endpoints such as embeddings, so that it reflects the actual work done. A Priority selector then orders the requests inside a tenant's queue by explicit priority, then deadline, then arrival time. Together, these mechanisms make inference performance predictable given a customer's tier and fair share of capacity. Serving Fairness is now enabled for all Cohere SaaS API and marketplace customers.
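To make the DRR budgeting concrete, here is a minimal Python sketch of Deficit Round Robin with a pluggable cost model. It is an illustration under stated assumptions, not Cohere's implementation: the names `Request`, `DeficitRoundRobin`, `request_cost`, and `token_cost` are hypothetical, the quantum values are arbitrary, and a single tier is assumed.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    tenant: str
    tokens: int = 1  # hypothetical size field, used only by the token-based cost


def request_cost(req):
    """Request-based cost: every request debits one unit."""
    return 1


def token_cost(req):
    """Token-based cost: a request debits as many units as it has tokens."""
    return req.tokens


class DeficitRoundRobin:
    """Textbook DRR over per-tenant FIFO queues (single tier assumed)."""

    def __init__(self, quantum, cost_fn=request_cost):
        if quantum <= 0:
            raise ValueError("quantum must be positive")
        self.quantum = quantum
        self.cost_fn = cost_fn
        self.queues = {}   # tenant -> deque of pending requests
        self.deficit = {}  # tenant -> unspent budget carried between turns

    def enqueue(self, req):
        self.queues.setdefault(req.tenant, deque()).append(req)
        self.deficit.setdefault(req.tenant, 0)

    def drain(self):
        """Serve until every queue is empty and return the service order."""
        served = []
        while any(self.queues.values()):
            for tenant, queue in self.queues.items():
                if not queue:
                    continue
                # Each turn grants the tenant one quantum of budget.
                self.deficit[tenant] += self.quantum
                # Serve head-of-line requests while the budget covers them.
                while queue and self.cost_fn(queue[0]) <= self.deficit[tenant]:
                    req = queue.popleft()
                    self.deficit[tenant] -= self.cost_fn(req)
                    served.append(req)
                # A tenant with nothing queued does not bank credit.
                if not queue:
                    self.deficit[tenant] = 0
        return served


# A bursty tenant queues four requests before a quiet tenant queues two.
drr = DeficitRoundRobin(quantum=1)
for _ in range(4):
    drr.enqueue(Request("noisy"))
for _ in range(2):
    drr.enqueue(Request("quiet"))
print([r.tenant for r in drr.drain()])
# prints ['noisy', 'quiet', 'noisy', 'quiet', 'noisy', 'noisy']
```

With the request-based cost, the quiet tenant's requests are interleaved with the burst instead of waiting behind it. Passing `cost_fn=token_cost` and a larger quantum makes a tenant that sends large batches use up its turn faster than a tenant that sends small ones.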

Key takeaway

For MLOps Engineers designing multi-tenant LLM inference platforms, Cohere's layered approach to serving fairness offers a practical blueprint. Consider combining admission control, tiered service levels, and a Deficit Round Robin scheduler with an adaptable cost model (request- or token-based) to prevent "noisy neighbor" issues. Under bursty traffic, this keeps latency and resource allocation predictable for every customer, while priority and deadline ordering is still honored inside each tenant's fair share.
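As one possible shape for the admission-control layer, the sketch below keeps a separate token bucket per (tenant, model) pair. This is an assumption for illustration: the source says only that requests are capped per tenant and model, not which algorithm is used. The names `TokenBucket` and `RateLimiter` and the rate and burst values are hypothetical, and the clock is passed in explicitly so the behavior is deterministic.

```python
class TokenBucket:
    """Classic token bucket: refills at `rate_per_s`, holds at most `burst`."""

    def __init__(self, rate_per_s, burst, now=0.0):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)
        self.last = now

    def try_acquire(self, now):
        # Refill in proportion to elapsed time, capped at the bucket size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class RateLimiter:
    """Admission control keyed by (tenant, model); limits here are illustrative."""

    def __init__(self, rate_per_s, burst):
        self.rate_per_s = rate_per_s
        self.burst = burst
        self.buckets = {}

    def admit(self, tenant, model, now):
        key = (tenant, model)
        bucket = self.buckets.get(key)
        if bucket is None:
            bucket = self.buckets[key] = TokenBucket(self.rate_per_s, self.burst, now)
        return bucket.try_acquire(now)


limiter = RateLimiter(rate_per_s=1.0, burst=2)
# Tenant "a" bursts three requests at t=0: the third exceeds its cap.
print([limiter.admit("a", "model-x", now=0.0) for _ in range(3)])
# Tenant "b" has its own bucket and is unaffected by that burst.
print(limiter.admit("b", "model-x", now=0.0))
```

Rejecting at admission keeps excess load out of the scheduler's queues entirely, so the fairness layers behind it only have to arbitrate among requests that can plausibly be served in time.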

Key insights

Cohere's multi-layered scheduling system ensures fair LLM inference capacity distribution across tenants, preventing "noisy neighbor" issues.

Method

Cohere's method applies four stages in sequence: Rate Limiting, Performance Tiering, Deficit Round Robin (DRR) for fairness across tenants, and a Priority queue for urgency within each tenant.
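The last stage, ordering requests inside a single tenant's queue, can be sketched with a binary heap keyed on (priority, deadline, arrival). This is a minimal illustration rather than Cohere's code: `QueuedRequest` and `TenantQueue` are hypothetical names, and treating a lower priority number as more urgent is an assumption.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class QueuedRequest:
    # Field order defines the comparison: priority, then deadline, then arrival.
    priority: int    # assumption: lower number means more urgent
    deadline: float
    arrival: float
    seq: int         # insertion counter, breaks exact ties deterministically
    name: str = field(compare=False)


class TenantQueue:
    """Per-tenant queue that always releases the most urgent request first."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()

    def push(self, name, priority, deadline, arrival):
        item = QueuedRequest(priority, deadline, arrival, next(self._seq), name)
        heapq.heappush(self._heap, item)

    def pop(self):
        return heapq.heappop(self._heap).name

    def __len__(self):
        return len(self._heap)


q = TenantQueue()
q.push("late-deadline", priority=1, deadline=20.0, arrival=0.0)
q.push("bulk", priority=2, deadline=5.0, arrival=0.0)
q.push("urgent-second", priority=1, deadline=10.0, arrival=1.0)
q.push("urgent-first", priority=1, deadline=10.0, arrival=0.5)
print([q.pop() for _ in range(len(q))])
# prints ['urgent-first', 'urgent-second', 'late-deadline', 'bulk']
```

Because this ordering applies only within one tenant's queue, a tenant that marks everything as high priority reorders its own work but cannot take capacity from other tenants; the DRR stage still decides how much each tenant gets.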

Best for: MLOps Engineer, AI Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cohere.com via Google News.