Do Transformers Need Three Projections? Systematic Study of QKV Variants

2026-06-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Emerging Technologies & Innovation · Depth: Expert, extended

Summary

This study systematically evaluates three projection sharing constraints for Transformer attention: Q-K=V (shared key-value), Q=K-V (shared query-key), and Q=K=V (single projection). Researchers found that these variants perform comparably to or better than the standard QKV transformer across synthetic tasks, vision (MNIST, CIFAR, TinyImageNet, anomaly detection, segmentation), and language modeling (300M and 1.2B parameter models on 10B tokens). Specifically, Q-K=V projection sharing achieved a 50% KV cache reduction with only a 3.1% perplexity degradation in language modeling. This method is complementary to head sharing (GQA/MQA), with Q-K=V combined with GQA-4 yielding an 87.5% cache reduction and Q-K=V + MQA achieving 96.9% reduction, enabling practical on-device inference. The Q-K=V variant preserves quality because keys and values can occupy similar representational spaces.
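The reduction figures above follow from simple counting, sketched below. The 16-query-head configuration is an assumption (the summary does not state the head count); it is chosen because it reproduces the quoted 87.5% and 96.9% figures when "GQA-4" is read as 4 key-value heads. The function name `kv_cache_reduction` is hypothetical.

```python
def kv_cache_reduction(num_query_heads: int, num_kv_heads: int, share_kv: bool) -> float:
    """Fraction of the KV cache saved relative to standard multi-head attention.

    The baseline caches two tensors (K and V) for each of num_query_heads heads.
    Head sharing (GQA/MQA) lowers the number of cached heads; Q-K=V projection
    sharing lowers the tensors cached per head from two to one.
    """
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be divisible by num_kv_heads")
    tensors_per_head = 1 if share_kv else 2
    baseline = 2 * num_query_heads
    actual = tensors_per_head * num_kv_heads
    return 1.0 - actual / baseline


# Assumption: 16 query heads; this value is not given in the summary.
HEADS = 16
configs = {
    "Q-K=V": kv_cache_reduction(HEADS, 16, share_kv=True),
    "Q-K=V + GQA (4 KV heads)": kv_cache_reduction(HEADS, 4, share_kv=True),
    "Q-K=V + MQA": kv_cache_reduction(HEADS, 1, share_kv=True),
}
for name, saved in configs.items():
    print(f"{name}: {saved:.1%}")
```

Under that assumption the two savings multiply: halving the tensors per head and dividing the cached heads by 4 or by 16 leaves 1/8 or 1/32 of the baseline cache.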

Key takeaway

AI architects designing efficient Transformer deployments should consider projection sharing, particularly the Q-K=V variant. It halves the KV cache at a cost of 3.1% perplexity degradation, which matters most in memory-constrained environments. Combining Q-K=V with Multi-Query Attention (MQA) raises the cache reduction to 96.9%, which the study presents as enabling practical on-device inference. Evaluate these combined strategies against the memory and throughput needs of your specific application.

Key insights

Tying Transformer attention projections, especially keys and values (Q-K=V), halves the KV cache at a cost of 3.1% perplexity in the language-modeling experiments.

Principles

Q-K=V preserves quality because keys and values can occupy similar representational spaces.
Projection sharing is complementary to head sharing.
Cache reduction, not parameter reduction, drives practical benefits.

Method

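Evaluate the Q=K-V, Q-K=V, and Q=K=V projection-sharing variants against the standard QKV baseline. In non-causal tasks, tying queries and keys makes the pre-softmax attention scores symmetric, so optionally add 2D positional encodings to restore asymmetry. Combine projection sharing with GQA/MQA for compound cache savings.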
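The variants can be sketched as weight tying in a single non-causal attention head. This is a minimal NumPy illustration, not the authors' implementation: the helper names (`make_projections`, `attention`), the toy sizes, and the initialization are all assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def make_projections(d_model, d_head, variant, rng):
    """Return (W_q, W_k, W_v); tied projections are the same array."""
    def new():
        return rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)

    if variant == "QKV":      # standard: three independent projections
        return new(), new(), new()
    if variant == "Q-K=V":    # shared key-value projection
        w_q, w_kv = new(), new()
        return w_q, w_kv, w_kv
    if variant == "Q=K-V":    # shared query-key projection
        w_qk, w_v = new(), new()
        return w_qk, w_qk, w_v
    if variant == "Q=K=V":    # single projection for all three roles
        w = new()
        return w, w, w
    raise ValueError(f"unknown variant: {variant}")


def attention(x, w_q, w_k, w_v):
    """Single-head, non-causal attention; returns (output, pre-softmax scores)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v, scores


rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))  # 5 tokens, d_model = 8 (toy sizes)
results = {}
for variant in ("QKV", "Q-K=V", "Q=K-V", "Q=K=V"):
    w_q, w_k, w_v = make_projections(8, 4, variant, rng)
    out, scores = attention(x, w_q, w_k, w_v)
    # Record the output shape and whether the score matrix is symmetric.
    results[variant] = (out.shape, bool(np.allclose(scores, scores.T)))
    print(variant, results[variant])
```

Whenever queries and keys share a projection, the score matrix equals its own transpose, which is the symmetry that the positional encodings mentioned above are meant to break.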

In practice

Implement Q-K=V for 50% KV cache reduction.
Combine Q-K=V with MQA for 96.9% cache reduction.
Use 2D positional encodings to break the symmetry of shared query-key attention in non-causal vision tasks.
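At decode time, the Q-K=V saving comes from caching one tensor per token instead of two. The sketch below shows a single head under that scheme; the class name `SharedKVCache`, the toy dimensions, and the random stand-in hidden states are hypothetical, and a production cache would preallocate memory rather than concatenate on every step.

```python
import numpy as np


class SharedKVCache:
    """Decode-time cache for one head under Q-K=V: one array serves as both K and V."""

    def __init__(self, d_head):
        self.kv = np.zeros((0, d_head))

    def append(self, kv_t):
        # Store the single shared key/value vector for the newest token.
        self.kv = np.concatenate([self.kv, kv_t[None, :]], axis=0)

    def attend(self, q_t):
        # The cached rows act as keys for the scores and as values for the output.
        scores = self.kv @ q_t / np.sqrt(q_t.shape[-1])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ self.kv


rng = np.random.default_rng(0)
d_model, d_head, steps = 8, 4, 6
w_q = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
w_kv = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)

cache = SharedKVCache(d_head)
for _ in range(steps):
    x_t = rng.standard_normal(d_model)  # stand-in for the current token's hidden state
    cache.append(x_t @ w_kv)
    out = cache.attend(x_t @ w_q)

shared_floats = cache.kv.size           # one tensor: steps * d_head
baseline_floats = 2 * steps * d_head    # separate K and V tensors
print(shared_floats, baseline_floats)
```

The cache holds 24 floats here where separate key and value tensors would hold 48, matching the 50% reduction stated above.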

Topics

Transformer Architecture
QKV Projections
KV Cache Optimization
Grouped Query Attention
Multi-Query Attention
On-device Inference

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.