Move the Query, Not the Cache: Characterizing Cross-Instance Latent Attention Redistribution Across GPU Fabrics

Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Expert, quick

Summary

Frontier LLMs increasingly use sparse-attention indexers to select KV-cache blocks per query, and agentic workloads over large codebases reuse those blocks frequently. When such a corpus exceeds a single GPU's capacity, it is partitioned across instances, so attention must run across instances. Traditional cross-instance KV systems move the selected cache blocks to the requester. Multi-head Latent Attention (MLA), however, compresses each token's key and value into a narrow latent vector, so a routed query row is only ~1 KB, far smaller than the chunk it attends to. This research characterizes cross-instance MLA attention on a multi-node H100 cluster and shows that routing the query is often cheaper than fetching the cache. It distills a topology-aware cost model (probe / transfer / compute / return / merge) and a closed-form route/fetch/local predicate, which tracks batched round trips to within ~7% on IBGDA (InfiniBand GPUDirect Async). At decode, routing the query replaces a ~3 ms cache re-adaptation with a round trip of tens of microseconds, so fabrics should be chosen by probe latency rather than peak bandwidth. The model applies to architectures such as DeepSeek-V3.2, V4, and GLM-5.1.

Key takeaway

Machine learning engineers optimizing distributed LLM inference should consider routing queries instead of moving KV-cache blocks when using sparse attention. With Multi-head Latent Attention in particular, this can cut the cross-instance cost at decode from a ~3 ms cache re-adaptation to a round trip of tens of microseconds. Select the fabric by probe latency, not peak bandwidth. The proposed cost model can be adapted to a new architecture by measuring just two coefficients.

Key insights

Routing small queries is often cheaper than moving large KV-cache blocks across GPUs for sparse attention.
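
The size gap behind this insight can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only. The 512 + 64 latent widths and bf16 storage are assumptions borrowed from publicly released DeepSeek-V2/V3 configurations, not figures stated in the source, and it further assumes the routed query row has the same width as one cached latent row. The function names are hypothetical.

```python
# Assumed widths (hypothetical for this sketch, not stated in the source).
LATENT_DIM = 512      # compressed KV latent width per token
ROPE_DIM = 64         # decoupled rotary key width per token
BYTES_PER_ELEM = 2    # bf16 storage


def row_bytes(latent_dim=LATENT_DIM, rope_dim=ROPE_DIM,
              bytes_per_elem=BYTES_PER_ELEM):
    """Bytes in one latent row (one cached token, or one routed query row)."""
    return (latent_dim + rope_dim) * bytes_per_elem


def chunk_bytes(num_tokens):
    """Bytes moved if the whole attended chunk is fetched instead."""
    return num_tokens * row_bytes()


def fetch_to_route_ratio(num_tokens):
    """How many times more bytes a fetch moves than a single routed row."""
    return chunk_bytes(num_tokens) / row_bytes()


if __name__ == "__main__":
    tokens = 2048  # hypothetical chunk length
    print("row bytes:  ", row_bytes())
    print("chunk bytes:", chunk_bytes(tokens))
    print("ratio:      ", fetch_to_route_ratio(tokens))
```

Under these assumptions a row is a little over 1 KB, in line with the ~1 KB figure in the summary, and the bytes moved by a fetch grow linearly with chunk length while the routed row stays fixed.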

Method

Characterize cross-instance MLA attention on a multi-node H100 cluster. Develop a topology-aware cost model (probe / transfer / compute / return / merge) and a closed-form route/fetch/local predicate, with constants measured on IBGDA.
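
To make the shape of such a cost model concrete, here is a minimal sketch of a route/fetch/local decision built from the five named terms. It is not the paper's closed-form predicate: the linear latency-plus-bandwidth transfer model, the `Fabric` class, every function name, and all numeric values in the usage example are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fabric:
    """Hypothetical two-parameter link model (an assumption of this sketch)."""
    probe_s: float        # fixed per-message latency, seconds
    bandwidth_Bps: float  # sustained bandwidth, bytes per second


def transfer_s(fabric, nbytes):
    """Probe latency plus serialization time for one message."""
    return fabric.probe_s + nbytes / fabric.bandwidth_Bps


def route_cost_s(fabric, query_bytes, result_bytes, remote_compute_s, merge_s):
    """Send the query out, compute remotely, return the partial, merge locally."""
    return (transfer_s(fabric, query_bytes)      # probe + transfer
            + remote_compute_s                   # compute
            + transfer_s(fabric, result_bytes)   # return
            + merge_s)                           # merge


def fetch_cost_s(fabric, chunk_bytes, local_compute_s, readapt_s):
    """Pull the cache chunk over, re-adapt it, then attend locally."""
    return transfer_s(fabric, chunk_bytes) + readapt_s + local_compute_s


def choose_path(chunk_is_local, route_s, fetch_s):
    """Return 'local', 'route', or 'fetch' for one attended chunk."""
    if chunk_is_local:
        return "local"
    return "route" if route_s <= fetch_s else "fetch"


if __name__ == "__main__":
    # All numbers below are made up for illustration, not measurements.
    fabric = Fabric(probe_s=5e-6, bandwidth_Bps=50e9)
    route = route_cost_s(fabric, 1024, 1024, remote_compute_s=10e-6, merge_s=1e-6)
    fetch = fetch_cost_s(fabric, 2 * 1024 * 1024, local_compute_s=10e-6, readapt_s=3e-3)
    print(choose_path(False, route, fetch), route, fetch)
```

With a payload this small, the two probe terms dominate the route cost and bandwidth contributes almost nothing, which is the intuition behind ranking fabrics by probe latency rather than peak bandwidth.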

Best for: Research Scientist, MLOps Engineer, AI Engineer, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.