CompactAttention: Accelerating Chunked Prefill with Block-Union KV Selection

Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

CompactAttention is an attention mechanism that accelerates chunked prefill for long-context large language models (LLMs) by decoupling KV selection from sparse-kernel execution. It targets two limitations of prior work: existing sparse attention methods are inefficient in the $Q \ll KV$ regime of chunked prefill, where each query chunk is far shorter than the KV cache it attends to, and query-subsampled methods such as QUOKA can miss critical KV entries and incur copy overhead. CompactAttention converts 2D block-sparse masks into per-group KV block tables that respect grouped-query attention (GQA), using a Q-block union followed by an intra-group union, which enables zero-copy paged attention. Evaluated on LLaMA-3.1-8B-Instruct and Qwen3-30B-A3B-Instruct-2507, it maintains accuracy close to dense attention on the RULER and LongBench V2 benchmarks while achieving up to $2.72\times$ attention speedup at 128K context length on H200 GPUs.

Key takeaway

For AI engineers optimizing long-context LLM serving, CompactAttention speeds up the attention step of chunked prefill by up to $2.72\times$ at 128K context length on H200 GPUs while keeping accuracy close to dense attention. Consider adopting its Block-Union KV Selection and zero-copy paged attention if chunked prefill is a bottleneck; the reported results cover LLaMA-3.1-8B-Instruct and Qwen3-30B-A3B-Instruct-2507. The approach addresses two known bottlenecks: sparse kernels that are inefficient when $Q \ll KV$, and the missed entries and copy overhead of query-subsampled KV selection.

Key insights

Decoupling KV selection from execution enables efficient sparse attention for chunked LLM prefill.

Method

CompactAttention converts 2D block-sparse masks into GQA-aware per-group KV block tables via Q-block and intra-group unions, then executes selected KV blocks in-place using a paged attention kernel.
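The two union steps named above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `build_group_block_tables`, the mask layout `[num_q_heads, num_q_blocks, num_kv_blocks]`, and the assumption that the query heads of one GQA group are contiguous (head `h` belongs to group `h // group_size`) are all hypothetical choices made for illustration. Causal masking and the paged attention kernel itself are left out.

```python
import numpy as np

def build_group_block_tables(block_mask, num_kv_heads):
    """Collapse a block-sparse mask into one KV block list per GQA group.

    block_mask: bool array [num_q_heads, num_q_blocks, num_kv_blocks]
                (assumed layout), True where a Q block selects a KV block.
    Returns a list with one sorted int array of KV block indices per KV head.
    """
    num_q_heads, _, num_kv_blocks = block_mask.shape
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads

    # Q-block union: keep a KV block for a head if ANY Q block in the
    # chunk selected it.
    per_head = block_mask.any(axis=1)            # [num_q_heads, num_kv_blocks]

    # Intra-group union: query heads that share a KV head share one table.
    # Assumes heads of a group are contiguous along axis 0.
    per_group = per_head.reshape(num_kv_heads, group_size, num_kv_blocks).any(axis=1)

    # Each row becomes a block table: indices into the paged KV cache.
    return [np.flatnonzero(row) for row in per_group]

# Toy example: 4 query heads, 2 KV heads, 2 Q blocks, 6 KV blocks.
mask = np.zeros((4, 2, 6), dtype=bool)
mask[0, 0, [0, 5]] = True   # head 0, Q block 0
mask[0, 1, [1, 5]] = True   # head 0, Q block 1
mask[1, 0, 2] = True        # head 1, Q block 0
mask[1, 1, 5] = True        # head 1, Q block 1
mask[2, :, 0] = True        # head 2, both Q blocks
mask[3, 0, 3] = True        # head 3, Q block 0
mask[3, 1, 4] = True        # head 3, Q block 1

tables = build_group_block_tables(mask, num_kv_heads=2)
print([t.tolist() for t in tables])  # [[0, 1, 2, 5], [0, 3, 4]]
```

Because each table is a union, it is a superset of what any single head or Q block selected, so no selected KV block is dropped; the cost is that some heads attend to blocks they did not pick themselves. The tables hold only indices, which is what lets a paged attention kernel read the selected blocks in place instead of copying them into a compacted buffer.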

Best for: AI Engineer, NLP Engineer, Research Scientist, Machine Learning Engineer, MLOps Engineer, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.