Article: Stragglers, Not Failures: How Adaptive Hedged Requests Reduce p99 Latency by 74 Percent

· Source: InfoQ · Field: Technology & Digital — Software Development & Engineering, Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Advanced, long

Summary

An adaptive hedged-request mechanism cuts p99 latency by 74%, from 65.0 ms to 17.3 ms, in simulated benchmarks. The approach targets "stragglers", requests that complete slowly rather than fail. Stragglers are a primary driver of tail latency in fan-out architectures, and traditional retries make the problem worse. The system uses DDSketch to estimate quantiles in real time and in constant memory, with a relative-error guarantee of plus or minus one percent. It keeps a tumbling window of two sketches, rotated every 30 seconds, so the estimate follows latency distributions that shift with load or deployments. A token bucket budget, configurable to cap hedges at a share of total traffic such as ten percent, prevents load amplification during genuine outages. The mechanism also extends to LLM inference, where Time to First Token (TTFT) serves as the latency signal. A reference Go library, "bhope/hedge", provides drop-in HTTP and gRPC support.
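The hedge budget described above can be illustrated with a small Go sketch. It is not the `bhope/hedge` implementation: the `HedgeBudget` type, its method names, and the credit-per-request refill scheme are assumptions chosen for illustration. Each primary request earns a fraction of a token and each hedge spends a whole one, so hedges cannot exceed the configured percentage of traffic over time. Credit is tracked in integer hundredths of a token to avoid floating-point drift.

```go
package main

import "sync"

// HedgeBudget is a hypothetical token bucket. Credit is stored in hundredths
// of a token, so a hedge costs 100 units.
type HedgeBudget struct {
	mu        sync.Mutex
	credit    int // banked credit, in hundredths of a token
	percent   int // credit earned per primary request (10 means 10%)
	maxCredit int // cap on banked credit, which bounds bursts of hedges
}

// NewHedgeBudget returns an empty budget that allows roughly percent hedges
// per 100 primary requests and banks at most maxCredit hundredths.
func NewHedgeBudget(percent, maxCredit int) *HedgeBudget {
	return &HedgeBudget{percent: percent, maxCredit: maxCredit}
}

// OnRequest credits the budget for one primary request.
func (b *HedgeBudget) OnRequest() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.credit += b.percent
	if b.credit > b.maxCredit {
		b.credit = b.maxCredit
	}
}

// TryHedge spends one whole token if one is available and reports whether
// a hedge may be sent.
func (b *HedgeBudget) TryHedge() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.credit < 100 {
		return false
	}
	b.credit -= 100
	return true
}
```

In an outage every request looks slow and asks to hedge. With `NewHedgeBudget(10, 500)`, only about one request in ten is granted a token, so the extra load on the struggling backend stays bounded.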

Key takeaway

If you are an MLOps or software engineer running distributed microservices and slow requests are inflating your tail latency (p99), consider adaptive hedged requests. Unlike traditional retries, this approach proactively races around stragglers using real-time latency data, which improves user experience without manual tuning. Make sure the hedged operations are idempotent, and monitor for load amplification during genuine outages, relying on the built-in token bucket budget to cap it.

Key insights

Adaptive hedging, which learns the latency distribution in real time and spends from a capped hedge budget, mitigates straggler-induced tail latency without manual tuning.

Method

Dispatch the primary request and start a timer set to DDSketch's current p90 latency estimate. If the timer fires before a response arrives and a budget token is available, send a hedge request. Return whichever response arrives first and cancel the other request.

Best for: Software Engineer, MLOps Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by InfoQ.