Do Transformers Need Three Projections? Systematic Study of QKV Variants
Summary
This study systematically evaluates three projection-sharing constraints for Transformer attention: Q-K=V (shared key-value), Q=K-V (shared query-key), and Q=K=V (a single projection). The authors report that these variants perform comparably to, or better than, the standard QKV Transformer across synthetic tasks, vision (MNIST, CIFAR, TinyImageNet, anomaly detection, segmentation), and language modeling (300M- and 1.2B-parameter models on 10B tokens). In language modeling, Q-K=V sharing cuts the KV cache by 50% at a cost of only 3.1% in perplexity. Projection sharing is complementary to head sharing (GQA/MQA): Q-K=V combined with GQA-4 yields an 87.5% cache reduction, and Q-K=V combined with MQA reaches 96.9%, enabling practical on-device inference. The Q-K=V variant preserves quality because keys and values can occupy similar representational spaces.
Key takeaway
For AI Architects designing efficient Transformer deployments, consider integrating projection sharing, particularly the Q-K=V variant. This approach offers a 50% KV cache reduction with only 3.1% perplexity degradation, crucial for resource-constrained environments. Combining Q-K=V with Multi-Query Attention (MQA) can achieve up to 96.9% cache reduction, making billion-parameter models viable for on-device inference. Evaluate these combined strategies to optimize memory and throughput for your specific application.
Key insights
Sharing Transformer attention projections, in particular tying keys to values (Q-K=V), halves the KV cache at a cost of 3.1% in language-modeling perplexity.
Principles
- Q-K=V preserves quality because keys and values can occupy similar representational spaces.
- Projection sharing is complementary to head sharing.
- Cache reduction, not parameter reduction, drives practical benefits.
Method
Evaluate Q=K-V, Q-K=V, and Q=K=V projection sharing variants, optionally adding 2D positional encodings for asymmetry in non-causal tasks. Combine with GQA/MQA for compound gains.
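The three sharing constraints can be illustrated with a minimal single-head, non-causal attention sketch in plain Python. This is not the authors' implementation: the function names, the matrix sizes, and the absence of masking, multiple heads, and an output projection are all assumptions made for illustration. It also shows why shared query-key variants may need an extra source of asymmetry: when Q and K come from the same projection, the score matrix is symmetric.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def shared_attention(x, w_a, w_b, variant):
    """Hypothetical helper: attention with at most two projection matrices.

    "Q-K=V": Q = x @ w_a, K = V = x @ w_b   (shared key-value)
    "Q=K-V": Q = K = x @ w_a, V = x @ w_b   (shared query-key)
    "Q=K=V": Q = K = V = x @ w_a            (single projection, w_b unused)
    Returns (output, pre-softmax scores).
    """
    a = matmul(x, w_a)
    b = matmul(x, w_b)
    if variant == "Q-K=V":
        q, k, v = a, b, b
    elif variant == "Q=K-V":
        q, k, v = a, a, b
    elif variant == "Q=K=V":
        q, k, v = a, a, a
    else:
        raise ValueError(f"unknown variant: {variant}")
    scale = math.sqrt(len(q[0]))
    scores = [[s / scale for s in row] for row in matmul(q, transpose(k))]
    return matmul(softmax_rows(scores), v), scores


def is_symmetric(m, tol=1e-9):
    n = len(m)
    return all(abs(m[i][j] - m[j][i]) <= tol for i in range(n) for j in range(n))


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
x = rand_matrix(5, 8, rng)      # 5 tokens, model width 8 (illustrative sizes)
w_a = rand_matrix(8, 8, rng)
w_b = rand_matrix(8, 8, rng)

for name in ("Q-K=V", "Q=K-V", "Q=K=V"):
    out, scores = shared_attention(x, w_a, w_b, name)
    print(name, "symmetric scores:", is_symmetric(scores))
```

In the Q-K=V case, an autoregressive decoder would only need to cache one tensor per token, since the same projection serves as both key and value.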
In practice
- Implement Q-K=V for 50% KV cache reduction.
- Combine Q-K=V with MQA for 96.9% cache reduction.
- In non-causal vision tasks, add 2D positional encodings to restore asymmetry when queries and keys share a projection.
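The cache-reduction percentages quoted in this summary follow from multiplying two independent factors: the fraction of key-value heads that remain after head sharing, and a factor of one half when keys and values share a single cached tensor. The sketch below reproduces the arithmetic. The 16-head baseline and the reading of GQA-4 as four key-value heads are assumptions inferred from the reported numbers, not details stated in the summary, and the function names are hypothetical.

```python
from fractions import Fraction


def kv_cache_fraction(num_heads, num_kv_heads, share_kv):
    """Fraction of the baseline KV cache that remains.

    Baseline: separate K and V tensors cached for each of num_heads heads.
    """
    head_factor = Fraction(num_kv_heads, num_heads)              # GQA/MQA head sharing
    proj_factor = Fraction(1, 2) if share_kv else Fraction(1)    # Q-K=V stores one tensor, not two
    return head_factor * proj_factor


def reduction_percent(fraction):
    return float((1 - fraction) * 100)


NUM_HEADS = 16  # assumption: chosen because it reproduces the reported figures

configs = {
    "Q-K=V": (NUM_HEADS, True),
    "Q-K=V + GQA-4": (4, True),   # assumption: GQA-4 read as 4 key-value heads
    "Q-K=V + MQA": (1, True),
}

for name, (kv_heads, share) in configs.items():
    remaining = kv_cache_fraction(NUM_HEADS, kv_heads, share)
    print(f"{name}: {reduction_percent(remaining):.1f}% cache reduction")
```

Under these assumptions the factors are 1/2, 1/8, and 1/32, which is why projection sharing and head sharing compound rather than overlap.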
Topics
- Transformer Architecture
- QKV Projections
- KV Cache Optimization
- Grouped Query Attention
- Multi-Query Attention
- On-device Inference
Code references
- anushamadan02/Do-Transformers-Need-3-Projections
Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.