SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning

2026-01-02 · Source: AMD ROCm Blogs · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, medium

Summary

AMD introduced SparK, a training-free, plug-and-play method for KV cache compression in large language models (LLMs), detailed in a January 2, 2026 blog post by Huanxuan Liao et al. SparK addresses the KV cache bottleneck by targeting redundancy at the channel level rather than relying on temporal compression alone. It employs a "prune-and-recover" strategy: it identifies and prunes irrelevant KV entries at the channel level during prefill, then dynamically restores them during attention score computation. This approach reduces KV cache storage by over 30% compared with traditional eviction methods while maintaining or improving model accuracy; even at an 80% pruning ratio, performance degradation stays below 5%. SparK is co-designed with the AMD ROCm™ software stack and demonstrated on AMD Instinct™ MI250 Accelerators, showing robust performance with LLaMA-3-8B-Instruct on LongBench.

Key takeaway

For AI engineers optimizing LLM inference on AMD Instinct™ GPUs, SparK offers a training-free way to cut KV cache memory. Consider integrating it into your workflow, especially for long-context scenarios: it reduces KV cache storage by over 30% relative to traditional eviction methods, requires no retraining, and is compatible with existing quantization and token-eviction techniques. This allows longer sequences to be processed within current memory budgets while preserving model accuracy.

Key insights

SparK uses query-aware unstructured channel pruning with dynamic recovery to compress LLM KV caches.

Principles

Channel saliency varies across queries and positions.
Unstructured sparsity can be dynamically recovered.

Method

SparK computes channel-wise saliency scores and applies unstructured pruning during prefill. During decoding, it reconstructs the pruned channels by sampling from cached distributions, so that attention scores are still computed over the full set of channels.
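
The prune-and-recover flow described above can be illustrated with a toy, single-head sketch in plain Python. It is not AMD's implementation: the saliency proxy |q_d · k_{t,d}|, the per-channel mean/std statistics, and every function name here (channel_saliency, prune_keys, recover_keys, attention_scores) are assumptions made for illustration, and a single query vector stands in for whatever queries the real method observes during prefill.

```python
import math
import random


def channel_saliency(query, keys):
    """Per-token, per-channel saliency, using |q_d * k_{t,d}| as an assumed proxy."""
    return [[abs(q * k) for q, k in zip(query, row)] for row in keys]


def prune_keys(query, keys, keep):
    """Keep the `keep` most salient channels of each token (unstructured pruning)."""
    dim = len(query)
    # Cache per-channel mean and std over tokens so pruned entries can be refilled.
    stats = []
    for d in range(dim):
        col = [row[d] for row in keys]
        mean = sum(col) / len(col)
        var = sum((x - mean) ** 2 for x in col) / len(col)
        stats.append((mean, math.sqrt(var)))
    pruned = []
    for row, sal in zip(keys, channel_saliency(query, keys)):
        top = sorted(range(dim), key=lambda d: sal[d], reverse=True)[:keep]
        pruned.append({d: row[d] for d in top})  # sparse row: channel -> value
    return pruned, stats


def recover_keys(pruned, stats, rng=None):
    """Rebuild dense rows; pruned channels come from the cached statistics.

    With `rng` set, values are sampled from N(mean, std); otherwise the mean is used.
    """
    dense = []
    for sparse_row in pruned:
        row = []
        for d, (mean, std) in enumerate(stats):
            if d in sparse_row:
                row.append(sparse_row[d])  # kept channels are exact
            else:
                row.append(rng.gauss(mean, std) if rng else mean)
        dense.append(row)
    return dense


def attention_scores(query, keys):
    """Scaled dot-product scores (pre-softmax) of one query against all keys."""
    scale = 1.0 / math.sqrt(len(query))
    return [scale * sum(q * k for q, k in zip(query, row)) for row in keys]


if __name__ == "__main__":
    rng = random.Random(0)
    keys = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]
    query = [rng.gauss(0, 1) for _ in range(8)]
    pruned, stats = prune_keys(query, keys, keep=4)  # 50% channel pruning
    approx = attention_scores(query, recover_keys(pruned, stats))
    exact = attention_scores(query, keys)
    print([round(a - e, 3) for a, e in zip(approx, exact)])
```

Because the kept channels differ from token to token, each row is stored as a sparse channel-to-value map; a real kernel would need a compact mask or index format, which this sketch does not attempt to model.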

In practice

Integrates with existing quantization and token-eviction methods.
Reduces KV cache storage by over 30%.
Keeps performance degradation below 5% even at an 80% pruning ratio.
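
To give a rough sense of what a 30% storage reduction means for context length, the following back-of-the-envelope sketch computes KV cache size and the number of tokens that fit in a fixed budget. The model shape (32 layers, 8 KV heads, head dimension 128, fp16) is taken from the public Llama 3 8B configuration, not from the article. The 30% saving is applied here to a dense fp16 cache purely as an illustration, whereas the source reports it relative to eviction baselines, and the overhead of storing sparsity masks or indices is ignored. The function names are hypothetical.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Dense KV cache size in bytes; the factor 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


def max_tokens(budget_bytes, layers, kv_heads, head_dim,
               bytes_per_value=2, saving=0.0):
    """Tokens that fit in `budget_bytes` if storage shrinks by `saving` (0..1)."""
    per_token = kv_cache_bytes(layers, kv_heads, head_dim, 1, bytes_per_value)
    return int(budget_bytes / (per_token * (1.0 - saving)))


if __name__ == "__main__":
    shape = dict(layers=32, kv_heads=8, head_dim=128)  # assumed Llama 3 8B shape
    budget = 4 * 1024 ** 3  # hypothetical 4 GiB budget for the KV cache
    print(max_tokens(budget, **shape))              # 32768
    print(max_tokens(budget, **shape, saving=0.30)) # 46811
```

Under these assumptions, the same 4 GiB budget holds roughly 43% more tokens once storage drops by 30%, since capacity scales with 1 / (1 - saving).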

Topics

KV Cache Compression
Unstructured Sparsity
Large Language Models
Channel Pruning
AMD Instinct GPUs

Best for: AI Engineer, NLP Engineer, AI Scientist, AI Researcher, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.