Recent Developments in LLM Architectures: KV Sharing, mHC, and Compressed Attention
Summary
Recent open-weight large language model (LLM) releases, namely Gemma 4, Laguna XS.2, ZAYA1-8B, and DeepSeek V4, focus heavily on long-context efficiency through architectural changes. The Gemma 4 E2B and E4B models introduce KV sharing across layers, which roughly halves the KV cache (a saving of about 2.7 GB for E2B at a 128K-token context), and per-layer embeddings (PLE), which add representational capacity without significantly increasing the compute cost of the transformer stack. Laguna XS.2 uses layer-wise attention budgeting: the number of query heads varies by layer type (e.g., 6 for full-attention layers, 8 for sliding-window attention layers) so that attention capacity is spent where it is most useful. ZAYA1-8B features Compressed Convolutional Attention (CCA), which performs attention directly in a compressed latent space with convolutional mixing, reducing both KV cache size and attention FLOPs. DeepSeek V4 integrates Manifold-Constrained Hyper-Connections (mHC), which widen the residual pathways for greater expressiveness, and a hybrid of Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA), which compresses aggressively along the sequence dimension and significantly reduces inference FLOPs and KV cache size at 1M-token contexts.
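To make the KV-sharing saving concrete, here is a back-of-the-envelope sketch of how the cache shrinks when some layers reuse the keys and values of an earlier layer. The function name `kv_cache_bytes` and every configuration number (layer count, KV heads, head dimension, number of sharing layers) are illustrative assumptions, not Gemma 4's actual configuration. The sketch also ignores sliding-window layers, which would cap the cached length.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, num_shared_layers=0):
    """Approximate KV cache size in bytes for a single sequence.

    Layers counted in num_shared_layers reuse the K/V tensors of an
    earlier layer, so they add nothing to the cache.
    """
    if not 0 <= num_shared_layers <= num_layers:
        raise ValueError("num_shared_layers must be between 0 and num_layers")
    layers_with_own_kv = num_layers - num_shared_layers
    # The factor 2 accounts for storing both keys and values.
    return (2 * layers_with_own_kv * num_kv_heads * head_dim
            * seq_len * bytes_per_elem)


# Hypothetical configuration, chosen only to keep the arithmetic easy.
cfg = dict(num_layers=32, num_kv_heads=4, head_dim=128, seq_len=128 * 1024)

baseline = kv_cache_bytes(**cfg)                     # every layer caches K/V
shared = kv_cache_bytes(**cfg, num_shared_layers=16)  # half the layers reuse K/V

print(f"baseline: {baseline / 2**30:.1f} GiB")  # baseline: 8.0 GiB
print(f"shared:   {shared / 2**30:.1f} GiB")    # shared:   4.0 GiB
```

Because the cache grows linearly with the number of layers that store their own K/V, letting half of them share halves the cache, which matches the "approximately half" figure in the summary.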
Key takeaway
For AI engineers and research scientists optimizing LLM inference for long contexts, these releases signal a shift toward specialized, more complex architectures. Cross-layer KV sharing, per-layer embeddings, and compressed attention mechanisms (CCA, CSA/HCA) are worth evaluating as ways to cut memory footprint and compute cost significantly. Expect more complex code in exchange, but these targeted optimizations are crucial for scaling LLMs to agentic workflows and reasoning tasks.
Key insights
LLM architectures are evolving with complex, targeted tweaks to optimize long-context efficiency and reduce computational overhead.
Principles
- Reduce KV cache size for longer contexts.
- Optimize attention capacity layer-wise.
- Compress attention operations in latent space.
Method
Combine cross-layer KV sharing and per-layer embeddings (Gemma 4), layer-wise attention budgeting (Laguna XS.2), Compressed Convolutional Attention (ZAYA1-8B), and Manifold-Constrained Hyper-Connections with CSA/HCA (DeepSeek V4) to improve LLM long-context efficiency and capacity.
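As a rough illustration of widened residual pathways, the sketch below keeps several parallel residual streams and mixes them with a constrained matrix before adding a layer's output. The summary does not say which constraint mHC uses. The doubly stochastic mixing matrix (obtained with Sinkhorn normalization) is an assumption made for this example, and all names (`sinkhorn`, `widened_residual_step`, `read_w`, `write_w`) are hypothetical.

```python
import numpy as np


def sinkhorn(logits, n_iters=50):
    """Push a square matrix toward doubly stochastic form.

    Alternately normalizes rows and columns of exp(logits) so that
    both end up summing to (approximately) 1.
    """
    m = np.exp(logits - logits.max())
    for _ in range(n_iters):
        m = m / m.sum(axis=1, keepdims=True)  # rows sum to 1
        m = m / m.sum(axis=0, keepdims=True)  # columns sum to 1
    return m


def widened_residual_step(streams, mix_logits, read_w, write_w, layer_fn):
    """One layer applied to n parallel residual streams.

    streams: (n, T, d). read_w and write_w: (n,) weights for reading the
    layer input from the streams and writing its output back.
    """
    mix = sinkhorn(mix_logits)                         # (n, n)
    layer_in = np.tensordot(read_w, streams, axes=1)   # (T, d)
    layer_out = layer_fn(layer_in)                     # (T, d)
    mixed = np.tensordot(mix, streams, axes=1)         # (n, T, d)
    return mixed + write_w[:, None, None] * layer_out


rng = np.random.default_rng(0)
n, T, d = 4, 5, 16  # hypothetical sizes
streams = rng.normal(size=(n, T, d))
mix_logits = 0.5 * rng.normal(size=(n, n))
read_w = np.full(n, 1.0 / n)
write_w = np.ones(n)

new_streams = widened_residual_step(streams, mix_logits, read_w, write_w, np.tanh)
print(new_streams.shape)  # (4, 5, 16)
```

Under this assumed constraint the mixing step only redistributes signal between streams: the sum over streams is unchanged by the mix, so the widened residual path cannot amplify or shrink the total on its own.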
In practice
- Use KV sharing to save about 2.7 GB of KV cache in Gemma 4 E2B at a 128K-token context.
- Apply per-layer query-head budgeting in attention.
- Employ CSA/HCA for 1M-token context efficiency.
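Per-layer query-head budgeting can be expressed as a small configuration sketch. The head counts (6 for full attention, 8 for sliding-window attention) come from the summary above. The layer pattern, KV-head count, head dimension, model width, and the names `AttnLayerConfig` and `build_layers` are assumptions made for illustration, not Laguna XS.2's actual configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttnLayerConfig:
    kind: str          # "full" or "sliding"
    num_q_heads: int   # query-head budget for this layer
    num_kv_heads: int
    head_dim: int

    def __post_init__(self):
        # Grouped-query attention needs an equal number of query heads per KV head.
        if self.num_q_heads % self.num_kv_heads != 0:
            raise ValueError("num_q_heads must be a multiple of num_kv_heads")

    def q_proj_params(self, d_model):
        """Number of weights in this layer's query projection."""
        return d_model * self.num_q_heads * self.head_dim


# Query-head budget per layer type, as quoted in the summary.
Q_HEADS_BY_KIND = {"full": 6, "sliding": 8}


def build_layers(pattern, num_kv_heads=2, head_dim=64):
    """Turn a list of layer kinds into per-layer attention configs."""
    return [AttnLayerConfig(kind, Q_HEADS_BY_KIND[kind], num_kv_heads, head_dim)
            for kind in pattern]


# Hypothetical pattern: three sliding-window layers per full-attention layer.
pattern = ["sliding", "sliding", "sliding", "full"] * 2
layers = build_layers(pattern)

total_q_heads = sum(layer.num_q_heads for layer in layers)
print(total_q_heads)  # 60
```

Keeping the budget in a per-kind table makes it easy to compare alternatives, since the query projection size and attention compute of each layer follow directly from its `num_q_heads`.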
Topics
- KV Cache Optimization
- Per-Layer Embeddings
- Layer-wise Attention Budgeting
- Compressed Convolutional Attention
- Manifold-Constrained Hyper-Connections
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Ahead of AI.