Article: Stragglers, Not Failures: How Adaptive Hedged Requests Reduce p99 Latency by 74 Percent
Summary
An adaptive hedged-request mechanism reduces p99 latency by 74%, from 65.0 ms to 17.3 ms, in simulated benchmarks. It targets "stragglers": requests that complete slowly rather than fail. Stragglers are a primary driver of tail latency in fan-out architectures, where traditional retries make the problem worse. The system uses DDSketch for real-time quantile estimation in constant memory, with a relative-error guarantee of ±1%. A tumbling window of two sketches, rotated every 30 seconds, lets the estimate track latency distributions that shift with load or deployments. A token bucket budget, configurable to cap the hedge rate at a share of total traffic such as 10%, prevents load amplification during genuine outages. The mechanism also extends to LLM inference, where it measures Time to First Token (TTFT) to obtain an accurate latency signal. A reference Go library, "bhope/hedge", provides drop-in HTTP and gRPC support.
Key takeaway
If you are an MLOps or software engineer managing distributed microservices and struggling with p99 tail latency caused by slow requests, consider implementing adaptive hedged requests. Unlike traditional retries, this approach proactively races a second request around stragglers using real-time latency data, improving user experience without manual tuning. Make sure the hedged operations are idempotent, and rely on the built-in token bucket budget while monitoring for load amplification during genuine outages.
Key insights
Adaptive hedging that learns the latency distribution in real time and operates under a budget mitigates straggler-induced tail latency without manual tuning.
Principles
- Stragglers, not failures, drive p99 latency in fan-out systems.
- Static latency thresholds fail in dynamic production environments.
- Hedging requires a budget to prevent load amplification during outages.
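The budget principle can be sketched as a token bucket that is refilled by traffic rather than by time. This is one possible design under stated assumptions, not the library's actual code: hedgeBudget, OnPrimary, TryHedge, and simulateOutage are hypothetical names, and the source says only that a token bucket caps the hedge rate at a percentage of total traffic. Each primary request earns a fraction of a token, and each hedge spends a whole one. Integer units are used so that the accounting does not drift the way repeated floating-point additions of 0.1 would.

```go
package main

import (
	"fmt"
	"sync"
)

// hedgeCost is the price of one hedge in budget units.
// One primary request earns `percent` units, so percent=10 allows 1 hedge per 10 primaries.
const hedgeCost = 100

// hedgeBudget is a token bucket refilled by primary traffic (hypothetical design).
type hedgeBudget struct {
	mu       sync.Mutex
	units    int
	percent  int
	maxUnits int
}

// newHedgeBudget caps hedges at `percent` of primaries and stores at most `burst` hedges.
func newHedgeBudget(percent, burst int) *hedgeBudget {
	return &hedgeBudget{percent: percent, maxUnits: burst * hedgeCost}
}

// OnPrimary credits the bucket for one primary request.
func (b *hedgeBudget) OnPrimary() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.units += b.percent
	if b.units > b.maxUnits {
		b.units = b.maxUnits
	}
}

// TryHedge spends one hedge worth of budget if it is available.
func (b *hedgeBudget) TryHedge() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.units < hedgeCost {
		return false
	}
	b.units -= hedgeCost
	return true
}

// simulateOutage models the worst case: every request is slow and wants a hedge.
func simulateOutage(b *hedgeBudget, requests int) int {
	hedges := 0
	for i := 0; i < requests; i++ {
		b.OnPrimary()
		if b.TryHedge() {
			hedges++
		}
	}
	return hedges
}

func main() {
	hedges := simulateOutage(newHedgeBudget(10, 5), 1000)
	fmt.Printf("hedged %d of 1000 requests\n", hedges) // prints "hedged 100 of 1000 requests"
}
```

Even when every request is slow, the extra load stays at 10% of primary traffic, which is the property that keeps hedging from amplifying a genuine outage.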
Method
Dispatch the primary request and start a timer set to DDSketch's p90 latency estimate. If the timer fires before a response arrives and a budget token is available, send a hedge request. Return whichever response arrives first and cancel the other request.
In practice
- Use "bhope/hedge" Go library for HTTP/gRPC services.
- Calibrate LLM inference hedging on Time to First Token (TTFT).
- Only hedge idempotent operations to avoid side effects.
Topics
- Adaptive Hedging
- Tail Latency Optimization
- DDSketch
- Microservices Architecture
- LLM Inference
- Distributed Systems
Best for: Software Engineer, MLOps Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by InfoQ.