CompactAttention: Accelerating Chunked Prefill with Block-Union KV Selection

2026-04-29 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

CompactAttention is an attention mechanism that accelerates chunked prefill for long-context large language models (LLMs) by decoupling KV selection from sparse-kernel execution. It targets two limitations of prior work: existing sparse attention methods are inefficient in the $Q\ll KV$ regime of chunked prefill, and query-subsampled methods such as QUOKA can miss critical KV entries and incur copy overhead. CompactAttention converts 2D block-sparse masks into GQA-aware per-group KV block tables using Q-block union and intra-group union, which enables zero-copy paged attention. Evaluated on LLaMA-3.1-8B-Instruct and Qwen3-30B-A3B-Instruct-2507, CompactAttention maintains accuracy close to dense attention on the RULER and LongBench V2 benchmarks while achieving up to $2.72\times$ attention speedup at 128K context length on H200 GPUs.

Key takeaway

For AI engineers optimizing long-context LLM serving, CompactAttention offers a substantial speedup in chunked prefill. Consider implementing its Block-Union KV Selection and zero-copy paged attention: the reported attention speedup is up to $2.72\times$ at 128K context length on H200 GPUs, with accuracy close to dense attention on LLaMA-3.1-8B-Instruct and Qwen3-30B-A3B-Instruct-2507. The approach addresses two bottlenecks: inefficient sparse-kernel execution and the copy overhead of token-level KV selection.

Key insights

Decoupling KV selection from execution enables efficient sparse attention for chunked LLM prefill.

Principles

Sparse attention needs efficient execution in the $Q\ll KV$ regime.
Block-level KV selection avoids token-level copy overhead.
Zero-copy paged execution is more efficient than sparse kernels in this regime.
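
To make the zero-copy principle concrete, here is a minimal reference sketch in plain Python. It is not the paper's kernel: the function name `paged_attention_ref`, the nested-list pool layout `[block][token][dim]`, and the single-query, single-head setting are assumptions chosen for illustration. The point is that the selected KV blocks are read in place from the cache pool through a list of block ids, so no keys or values are gathered into a new buffer.

```python
import math
from typing import List

def paged_attention_ref(
    q: List[float],
    k_pool: List[List[List[float]]],  # [num_blocks][block_size][dim]
    v_pool: List[List[List[float]]],  # same shape as k_pool
    block_table: List[int],           # ids of the selected KV blocks
) -> List[float]:
    """Softmax attention for one query over the blocks named in block_table.

    Blocks are indexed directly in the pool; nothing is copied or compacted.
    A streaming (online) softmax keeps the result numerically stable.
    """
    if not block_table:
        raise ValueError("block_table must select at least one block")
    dim = len(q)
    scale = 1.0 / math.sqrt(dim)
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * dim
    for block_id in block_table:
        for k_vec, v_vec in zip(k_pool[block_id], v_pool[block_id]):
            score = scale * sum(a * b for a, b in zip(q, k_vec))
            new_max = max(running_max, score)
            # Rescale what has been accumulated so far to the new maximum.
            correction = math.exp(running_max - new_max)
            weight = math.exp(score - new_max)
            denom = denom * correction + weight
            acc = [a * correction + weight * v for a, v in zip(acc, v_vec)]
            running_max = new_max
    return [a / denom for a in acc]
```

A token-level selection scheme would typically first copy the chosen keys and values into a contiguous buffer. In this sketch the only per-request data structure is the short list of block ids.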

Method

CompactAttention converts 2D block-sparse masks into GQA-aware per-group KV block tables via Q-block and intra-group unions, then executes selected KV blocks in-place using a paged attention kernel.
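
The conversion described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: the function name `block_union_tables`, the nested-list mask layout `[query head][Q block][KV block]`, and the convention that consecutive query heads share one KV head are all hypothetical choices.

```python
from typing import List

Mask = List[List[List[bool]]]  # [num_q_heads][num_q_blocks][num_kv_blocks]

def block_union_tables(mask: Mask, num_kv_heads: int) -> List[List[int]]:
    """Collapse per-head 2D block masks into one KV block list per GQA group."""
    num_q_heads = len(mask)
    if num_kv_heads <= 0 or num_q_heads % num_kv_heads != 0:
        raise ValueError("query heads must divide evenly across KV heads")
    group_size = num_q_heads // num_kv_heads
    tables: List[List[int]] = []
    for group in range(num_kv_heads):
        selected = set()
        # Intra-group union: every query head that shares this KV head.
        for head in range(group * group_size, (group + 1) * group_size):
            # Q-block union: every query block of the current chunk.
            for q_row in mask[head]:
                selected.update(j for j, keep in enumerate(q_row) if keep)
        tables.append(sorted(selected))
    return tables

# Example: 4 query heads, 2 KV heads, 2 Q blocks, 4 KV blocks.
example_mask = [
    [[True, False, False, False], [True, True, False, False]],    # head 0
    [[True, False, False, True], [True, False, False, False]],    # head 1
    [[False, False, True, False], [False, False, True, False]],   # head 2
    [[False, False, False, False], [False, False, True, True]],   # head 3
]
print(block_union_tables(example_mask, num_kv_heads=2))  # [[0, 1, 3], [2, 3]]
```

Each resulting list plays the role of a block table for a paged attention kernel: the first group attends to KV blocks 0, 1, and 3, and the second to blocks 2 and 3. Taking unions means a block is kept if any query block or any head in the group needs it, which trades some sparsity for a layout a paged kernel can consume directly.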

In practice

Use lightweight block-sparse pattern search for selection.
Store KV cache in KV-head-major layout for zero-copy access.
Partition large GQA groups into subgroups for better sparsity.
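
The subgroup recommendation can be illustrated with a small sketch. The summary does not say how the paper chooses subgroups, so splitting a group into contiguous runs of heads is an assumption here, as are the names `subgroup_tables` and `head_block_reads`. The sketch shows why a smaller union scope can recover sparsity when heads in one group select different KV blocks.

```python
from typing import List, Set

def subgroup_tables(head_blocks: List[Set[int]], subgroup_size: int) -> List[List[int]]:
    """Union the selected KV blocks within each contiguous subgroup of heads.

    head_blocks holds, for one GQA group, the set of KV block ids each
    query head selected (already unioned over Q blocks).
    """
    if subgroup_size <= 0 or len(head_blocks) % subgroup_size != 0:
        raise ValueError("subgroup_size must evenly divide the number of heads")
    tables = []
    for start in range(0, len(head_blocks), subgroup_size):
        union: Set[int] = set()
        for blocks in head_blocks[start:start + subgroup_size]:
            union |= blocks
        tables.append(sorted(union))
    return tables

def head_block_reads(tables: List[List[int]], subgroup_size: int) -> int:
    """Count (head, KV block) pairs the kernel would process."""
    return sum(len(table) * subgroup_size for table in tables)

# Four heads in one group; heads 0-1 and heads 2-3 prefer different blocks.
heads = [{0, 1}, {0, 2}, {5, 6}, {5, 7}]

whole = subgroup_tables(heads, subgroup_size=4)
split = subgroup_tables(heads, subgroup_size=2)
print(whole, head_block_reads(whole, 4))  # [[0, 1, 2, 5, 6, 7]] 24
print(split, head_block_reads(split, 2))  # [[0, 1, 2], [5, 6, 7]] 12
```

In this toy case, splitting the group in two halves the number of head-block pairs processed, because each subgroup no longer inherits blocks that only the other subgroup needs. The trade-off is more block tables to build and store per KV head.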

Topics

Chunked Prefill
Sparse Attention
KV Cache Optimization
Block-Union KV Selection
Zero-Copy Paged Attention

Best for: AI Engineer, NLP Engineer, Research Scientist, Machine Learning Engineer, MLOps Engineer, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.