I wrote a paper on HoloKV: Using CDMA Phase-Shifting to achieve O(N/k) KV-Cache Compression. Looking for Triton/CUDA collaborators.

2026-05-12 · Source: Machine Learning ML & Generative AI News · Field: Technology & Digital — Artificial Intelligence & Machine Learning, AI Hardware Acceleration · Depth: Expert, quick

Summary

Independent researcher Sami Al-Rfou has published an open research draft for HoloKV, a geometric approach to compressing the KV-cache of long-context Large Language Models (LLMs). HoloKV targets the "Memory Wall" by multiplexing k tokens into a single physical memory slot and separating them with deterministic +1/-1 orthogonal phase keys inspired by CDMA telecommunications, which reduces cache size from O(N) to O(N/k). Three mechanisms keep the superposed signal usable: Variance Normalization with a sqrt(k) penalty to prevent Softmax entropy collapse, a Strict Even-Boundary Rule to preserve the 2D rotary commutative math of RoPE (used in Llama/Qwen), and LoRA Denoising via Knowledge Distillation to filter out background static. The math has so far been validated only in a PyTorch simulator; realizing the projected 75%+ VRAM savings requires a custom SRAM Active Accumulation Buffer implemented in OpenAI Triton or CUDA to avoid Read-Modify-Write penalties.
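One way to picture the multiplexing step is the sketch below. It is an illustration under assumptions, not the HoloKV implementation: the exact key construction is not described in this summary, so the sketch assumes per-dimension +1/-1 keys taken from a Sylvester-Hadamard matrix, and the names `hadamard`, `multiplex`, `unbind`, and `dot` are hypothetical.

```python
def hadamard(n):
    """Sylvester construction: n x n matrix of +1/-1 with orthogonal rows (n a power of 2)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    rows = [[1]]
    while len(rows) < n:
        rows = [r + r for r in rows] + [r + [-x for x in r] for r in rows]
    return rows


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def multiplex(tokens, keys):
    """Superpose k token vectors into one slot: slot[d] = sum_i keys[i][d] * tokens[i][d]."""
    dim = len(tokens[0])
    return [sum(key[d] * tok[d] for tok, key in zip(tokens, keys)) for d in range(dim)]


def unbind(slot, key):
    """Re-apply one key. key[d] ** 2 == 1, so this gives that token plus crosstalk."""
    return [s * c for s, c in zip(slot, key)]


# Toy example (assumed sizes): k = 4 tokens sharing one 8-dimensional slot.
keys = hadamard(8)[:4]
tokens = [
    [1, 2, 3, 4, 5, 6, 7, 8],
    [2, 0, -1, 3, 1, 1, 0, 2],
    [0, 1, 0, -2, 4, 0, 3, 1],
    [5, -1, 2, 0, 0, 2, 1, 1],
]
slot = multiplex(tokens, keys)
estimate = unbind(slot, keys[0])
# Whatever is left after subtracting the true token is crosstalk from the other three.
crosstalk = [e - t for e, t in zip(estimate, tokens[0])]
```

Because one d-dimensional slot cannot store k independent d-dimensional vectors exactly, `unbind` returns the wanted token plus sign-flipped copies of the others. In this sketch that residue plays the role of the background static that the denoising step is meant to filter.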

Key takeaway

For AI engineers and research scientists working on LLM memory constraints, HoloKV is a proposed alternative to quantization or token eviction. If your team is exploring KV-cache compression, consider reviewing the HoloKV paper and contributing to its hardware kernel development. The kernel is the missing piece between the simulator and the stated goal: significant VRAM savings and longer context windows without reasoning degradation.

Key insights

HoloKV uses CDMA-inspired phase-shifting to multiplex k LLM KV-cache tokens into each memory slot, targeting O(N/k) cache size and a projected 75%+ VRAM saving.
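To put the memory claim in numbers, the arithmetic below sizes a KV-cache with and without k-way multiplexing. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit elements) is an illustrative assumption roughly matching an 8B-class model with grouped-query attention, and `kv_cache_bytes` is a hypothetical helper. It counts only the K and V tensors and ignores any overhead from an accumulation buffer or adapter weights.

```python
import math


def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2, k=1):
    """Size of a K+V cache when every physical slot holds k multiplexed tokens."""
    slots = math.ceil(n_tokens / k)
    per_slot = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2 = keys and values
    return slots * per_slot


baseline = kv_cache_bytes(131072, 32, 8, 128)      # one token per slot
packed = kv_cache_bytes(131072, 32, 8, 128, k=4)   # four tokens per slot
print(baseline / 2**30, packed / 2**30, 1 - packed / baseline)  # 16.0 4.0 0.75
```

Under these assumptions a 131072-token context drops from 16 GiB to 4 GiB, so the projected 75% saving corresponds to k = 4 and larger k pushes the saving higher.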

Principles

Multiplex k tokens into one slot
Use orthogonal phase keys for separation
Preserve RoPE math with boundary rules
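The RoPE principle can be checked numerically. The sketch below assumes, as an interpretation rather than a statement of the draft's rule, that the boundary rule means a phase key may only change sign between rotary pairs and never inside one. It also assumes the adjacent-pair layout (dimensions 2m and 2m+1 rotate together); some implementations pair dimension i with i + d/2 instead, in which case the same argument applies to that pairing. The function names are hypothetical.

```python
import math


def rope_rotate(vec, angles):
    """Rotate each adjacent pair (vec[2m], vec[2m+1]) by angles[m], as 2D rotary embeddings do."""
    out = []
    for m, theta in enumerate(angles):
        x, y = vec[2 * m], vec[2 * m + 1]
        c, s = math.cos(theta), math.sin(theta)
        out += [x * c - y * s, x * s + y * c]
    return out


def apply_key(vec, key):
    """Multiply each dimension by its +1/-1 phase."""
    return [v * k for v, k in zip(vec, key)]


def commutes(vec, key, angles, tol=1e-12):
    """True if keying then rotating equals rotating then keying."""
    a = rope_rotate(apply_key(vec, key), angles)
    b = apply_key(rope_rotate(vec, angles), key)
    return all(abs(p - q) <= tol for p, q in zip(a, b))


vec = [1.0, 2.0, 3.0, 4.0]
angles = [0.3, 1.1]
print(commutes(vec, [1, 1, -1, -1], angles))  # sign shared within each pair
print(commutes(vec, [1, -1, 1, -1], angles))  # sign flips inside each pair
```

A sign shared by both members of a pair is a scalar multiple of the identity on that 2D plane and so commutes with any rotation. A sign that differs within a pair is a reflection, which does not commute with rotation, and the positional information would be corrupted.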

Method

HoloKV stacks tokens using orthogonal phase keys, applying variance normalization, a strict even-boundary rule for RoPE preservation, and LoRA denoising via knowledge distillation to filter noise.
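The role of the sqrt(k) factor can be seen in a small simulation. The reading assumed here is that stacking k independent unit-variance values raises the variance of each slot coordinate to about k, which inflates attention logits and sharpens the softmax, and that scaling by 1/sqrt(k) restores unit variance. Whether the draft applies the factor at exactly this point is not stated in this summary, and `superposed_variance` is a hypothetical name.

```python
import random
import statistics


def superposed_variance(k, n_samples=20000, seed=0, normalize=False):
    """Empirical variance of one slot coordinate holding k signed unit-variance values."""
    rng = random.Random(seed)
    scale = k ** -0.5 if normalize else 1.0
    samples = []
    for _ in range(n_samples):
        # Each stacked token contributes a random +1/-1 phase times a N(0, 1) value.
        total = sum(rng.choice((-1, 1)) * rng.gauss(0.0, 1.0) for _ in range(k))
        samples.append(scale * total)
    return statistics.pvariance(samples)


raw = superposed_variance(4)                      # close to k = 4
scaled = superposed_variance(4, normalize=True)   # close to 1
```

The sketch only covers the variance bookkeeping. It does not model the crosstalk itself, which the method handles separately through LoRA denoising.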

In practice

Implement SRAM Active Accumulation Buffer
Utilize Triton or CUDA for kernel development
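Why an accumulation buffer matters can be shown with a toy operation count. The code below is a pure-Python model, not Triton or CUDA: `SlotMemory` stands in for off-chip KV storage, a local list stands in for the on-chip SRAM buffer, and all names are hypothetical. It assumes the incoming token vectors already have their phase keys applied.

```python
class SlotMemory:
    """Toy stand-in for the off-chip KV store that counts slot reads and writes."""

    def __init__(self):
        self.slots, self.reads, self.writes = {}, 0, 0

    def read(self, idx, dim):
        self.reads += 1
        return list(self.slots.get(idx, [0.0] * dim))

    def write(self, idx, vec):
        self.writes += 1
        self.slots[idx] = list(vec)


def append_naive(mem, signed_tokens, k):
    """Read-modify-write the shared slot for every incoming token."""
    for t, vec in enumerate(signed_tokens):
        slot = mem.read(t // k, len(vec))
        mem.write(t // k, [a + b for a, b in zip(slot, vec)])


def append_buffered(mem, signed_tokens, k):
    """Accumulate k tokens in a local buffer, then write the finished slot once."""
    buf = None
    for t, vec in enumerate(signed_tokens):
        buf = list(vec) if buf is None else [a + b for a, b in zip(buf, vec)]
        if (t + 1) % k == 0 or t == len(signed_tokens) - 1:
            mem.write(t // k, buf)
            buf = None


stream = [[float(i), float(-i)] for i in range(8)]
naive, buffered = SlotMemory(), SlotMemory()
append_naive(naive, stream, k=4)
append_buffered(buffered, stream, k=4)
print(naive.reads, naive.writes, buffered.reads, buffered.writes)  # 8 8 0 2
```

Both paths produce identical slots, but the buffered path touches off-chip memory once per slot instead of twice per token. A real kernel would also have to let attention read the partially filled buffer during decoding, which this model leaves out.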

Topics

HoloKV
KV-Cache Compression
CDMA Phase-Shifting
Long-Context LLMs
RoPE

Code references

0sami0/HoloKV

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.