Recent Developments in LLM Architectures: KV Sharing, mHC, and Compressed Attention

2025-07-19 · Source: Ahead of AI · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

Recent open-weight large language model (LLM) releases, specifically Gemma 4, Laguna XS.2, ZAYA1-8B, and DeepSeek V4, share a focus on improving long-context efficiency through architectural modifications. The Gemma 4 E2B and E4B models introduce KV sharing across layers, which cuts KV cache size by roughly half (saving about 2.7 GB for E2B at a 128K-token context), and per-layer embeddings (PLE), which add representational capacity without significantly increasing the compute cost of the transformer stack. Laguna XS.2 uses layer-wise attention budgeting: the number of query heads varies by layer type (e.g., 6 for full-attention layers, 8 for sliding-window-attention layers) so that attention capacity goes where it is cheapest. ZAYA1-8B features Compressed Convolutional Attention (CCA), which performs attention directly in a compressed latent space with convolutional mixing, reducing both KV cache size and attention FLOPs. DeepSeek V4 combines Manifold-Constrained Hyper-Connections (mHC), which widen the residual pathways for greater expressiveness, with a hybrid of Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA), which compress aggressively along the sequence dimension to reduce inference FLOPs and KV cache size at 1M-token contexts.
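To make the "roughly half" claim concrete, here is a minimal back-of-the-envelope sketch of how cross-layer KV sharing changes cache size. The function name and every configuration value are illustrative assumptions, not the published Gemma 4 E2B settings, so the printed numbers will not match the 2.7 GB figure above. The sketch also treats every layer as caching the full sequence, which overstates the cache for sliding-window layers.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, n_shared_layers=0):
    """Estimate the KV cache size in bytes for a single sequence.

    Layers that reuse another layer's keys and values store nothing
    of their own, so only the remaining layers contribute.
    """
    if not 0 <= n_shared_layers < n_layers:
        raise ValueError("n_shared_layers must be in [0, n_layers)")
    layers_with_own_kv = n_layers - n_shared_layers
    # The factor 2 accounts for storing both keys and values.
    return (2 * layers_with_own_kv * n_kv_heads * head_dim
            * seq_len * bytes_per_elem)


# Hypothetical configuration (assumption), bf16 cache at a 128K context.
cfg = dict(n_layers=32, n_kv_heads=2, head_dim=256, seq_len=128 * 1024)

baseline = kv_cache_bytes(**cfg)
shared = kv_cache_bytes(**cfg, n_shared_layers=16)

print(f"baseline: {baseline / 2**30:.2f} GiB")
print(f"shared:   {shared / 2**30:.2f} GiB")
print(f"saved:    {(baseline - shared) / 2**30:.2f} GiB")
```

Because cache size is linear in the number of layers that own their keys and values, letting half of the layers reuse existing entries halves the cache regardless of the other settings.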

Key takeaway

For AI engineers and research scientists optimizing LLM inference for long contexts, these releases signal a shift toward specialized, more complex designs. Consider evaluating cross-layer KV sharing, per-layer embeddings, or compressed attention mechanisms (CCA, CSA/HCA) to reduce memory footprint and compute cost. These targeted optimizations increase code complexity, but they matter for scaling LLMs to long-context agentic workflows and reasoning tasks.

Key insights

LLM architectures are evolving through targeted, increasingly complex changes that improve long-context efficiency and reduce compute overhead.

Principles

Reduce KV cache size for longer contexts.
Optimize attention capacity layer-wise.
Compress attention operations in latent space.

Method

Implement KV sharing, per-layer embeddings, layer-wise attention budgeting, compressed convolutional attention, and manifold-constrained hyper-connections to enhance LLM long-context efficiency and capacity.
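As one way to picture cross-layer KV sharing, the sketch below builds a map from each layer to the layer whose cached keys and values it reads. It assumes that the last few layers reuse the cache of the most recent earlier layer of the same attention type; that rule, the function name, and the layer pattern are illustrative assumptions rather than a confirmed description of Gemma 4.

```python
def build_kv_source_map(layer_types, n_shared_layers):
    """Map each layer index to the layer whose KV cache it reads.

    layer_types: list of "full" or "sliding", one entry per layer.
    The last n_shared_layers layers compute no keys or values and
    instead reuse the cache of the most recent earlier layer of the
    same attention type (an assumed sharing rule).
    """
    n_layers = len(layer_types)
    if not 0 <= n_shared_layers < n_layers:
        raise ValueError("n_shared_layers must be in [0, n_layers)")
    first_shared = n_layers - n_shared_layers
    source = {}
    last_owner = {}  # attention type -> latest layer that owns a cache
    for idx, kind in enumerate(layer_types):
        if idx < first_shared:
            source[idx] = idx  # this layer computes and caches its own KV
            last_owner[kind] = idx
        elif kind in last_owner:
            source[idx] = last_owner[kind]  # reuse an existing cache
        else:
            raise ValueError(f"no earlier '{kind}' layer to share from")
    return source


# Hypothetical 8-layer pattern: three sliding-window layers, then one full.
layer_types = ["sliding", "sliding", "sliding", "full"] * 2
kv_source = build_kv_source_map(layer_types, n_shared_layers=4)

for idx, src in kv_source.items():
    role = "own KV" if idx == src else f"reuses layer {src}"
    print(f"layer {idx} ({layer_types[idx]}): {role}")
```

In this toy setup only four of the eight layers hold cache entries, and the sharing layers also skip their key and value projections at inference time.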

In practice

Use KV sharing to save about 2.7 GB of KV cache in Gemma 4 E2B at a 128K-token context.
Apply per-layer query-head budgeting in attention.
Employ CSA/HCA for 1M-token context efficiency.
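The motivation for per-layer query-head budgeting can be illustrated by counting attention work per layer type. The rough sketch below counts multiply-accumulates for the score and value-mixing steps of one causal attention layer. The head counts (6 and 8) come from the summary above, while the head dimension, window size, and function name are hypothetical.

```python
def attn_macs(n_q_heads, head_dim, seq_len, window=None):
    """Approximate multiply-accumulates for QK^T plus the weighted sum
    over V in one causal attention layer (projections excluded)."""
    if window is None:
        # Full causal attention: query i attends to i + 1 keys.
        pairs = seq_len * (seq_len + 1) // 2
    else:
        w = min(window, seq_len)
        # The first w queries see 1..w keys; the rest see exactly w keys.
        pairs = w * (w + 1) // 2 + (seq_len - w) * w
    # One head_dim-sized dot product for scores and one for mixing V.
    return 2 * n_q_heads * head_dim * pairs


seq_len = 128 * 1024  # hypothetical context length
head_dim = 128        # hypothetical head dimension
window = 512          # hypothetical sliding-window size

full = attn_macs(6, head_dim, seq_len)                  # full-attention layer
sliding = attn_macs(8, head_dim, seq_len, window=window)

print(f"full attention, 6 heads: {full:.3e} MACs")
print(f"sliding window, 8 heads: {sliding:.3e} MACs")
print(f"full / sliding ratio:    {full / sliding:.1f}x")
```

Under these assumptions a full-attention layer costs far more than a sliding-window layer at long context even with fewer query heads, which is one plausible reason to spend extra heads on the cheaper sliding-window layers.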

Topics

KV Cache Optimization
Per-Layer Embeddings
Layer-wise Attention Budgeting
Compressed Convolutional Attention
Manifold-Constrained Hyper-Connections

Code references

rasbt/LLMs-from-scratch

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Ahead of AI.