Do Transformers Need Three Projections? Systematic Study of QKV Variants

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Emerging Technologies & Innovation · Depth: Expert, extended

Summary

This study systematically evaluates three projection-sharing constraints for Transformer attention: Q-K=V (separate query, one shared key-value projection), Q=K-V (one shared query-key projection, separate value), and Q=K=V (a single projection for all three). The authors report that these variants perform comparably to, or better than, the standard QKV Transformer across synthetic tasks, vision (MNIST, CIFAR, TinyImageNet, anomaly detection, segmentation), and language modeling (300M- and 1.2B-parameter models on 10B tokens). In language modeling, Q-K=V halves the KV cache with only a 3.1% perplexity degradation. The method is complementary to head sharing (GQA/MQA): Q-K=V combined with GQA-4 yields an 87.5% cache reduction, and Q-K=V combined with MQA yields 96.9%, which the study presents as enabling practical on-device inference. The explanation given for why Q-K=V preserves quality is that keys and values can occupy similar representational spaces.
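The cache-reduction percentages above follow from simple counting, which the sketch below reproduces. It is a back-of-the-envelope calculation, not the study's code: the function names are hypothetical, and the head count of 16 is an assumption chosen because it reproduces the reported 87.5% and 96.9% figures (the summary does not state the head count). The baseline is standard multi-head attention, which caches one K and one V tensor per head.

```python
def kv_cache_fraction(num_heads: int, num_kv_heads: int, share_kv: bool) -> float:
    """Fraction of the baseline multi-head KV cache that remains.

    Baseline: 2 cached tensors (K and V) for each of `num_heads` heads.
    Head sharing (GQA/MQA) keeps only `num_kv_heads` cached heads.
    K=V sharing caches one tensor per head instead of two.
    """
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be divisible by num_kv_heads")
    tensors_per_head = 1 if share_kv else 2
    return (tensors_per_head * num_kv_heads) / (2 * num_heads)


def reduction_percent(num_heads: int, num_kv_heads: int, share_kv: bool) -> float:
    """KV cache reduction relative to the baseline, in percent."""
    return 100.0 * (1.0 - kv_cache_fraction(num_heads, num_kv_heads, share_kv))


NUM_HEADS = 16  # assumption: not stated in the summary

configs = {
    "Q-K=V": (NUM_HEADS, True),      # every head keeps its own shared K=V tensor
    "Q-K=V + GQA-4": (4, True),      # 4 cached KV heads (16 query heads, groups of 4)
    "Q-K=V + MQA": (1, True),        # a single cached KV head
}

for name, (num_kv_heads, share_kv) in configs.items():
    pct = reduction_percent(NUM_HEADS, num_kv_heads, share_kv)
    print(f"{name}: {pct:.1f}% KV cache reduction")
```

Under these assumptions the two savings multiply: head sharing shrinks the number of cached heads, and K=V sharing halves what each remaining head stores.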

Key takeaway

For AI architects designing efficient Transformer deployments, projection sharing is worth considering, particularly the Q-K=V variant. It offers a 50% KV cache reduction with only 3.1% perplexity degradation, which matters most in memory-constrained environments. Combining Q-K=V with Multi-Query Attention (MQA) can reach a 96.9% cache reduction, which the study presents as making billion-parameter models viable for on-device inference. Evaluate these combined strategies against the memory and throughput requirements of your specific application.

Key insights

Sharing attention projections, especially keys and values (Q-K=V), cuts the KV cache in half with only a 3.1% perplexity degradation in language modeling.

Method

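Evaluate the Q=K-V, Q-K=V, and Q=K=V projection-sharing variants against standard QKV. When queries and keys share a projection, the pre-softmax score matrix is symmetric, so non-causal tasks can optionally add 2D positional encodings to reintroduce asymmetry. Combine projection sharing with GQA/MQA for compound KV cache savings.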
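The three sharing constraints amount to tying weight matrices, which a minimal single-head NumPy sketch can make concrete. This is an illustration under stated assumptions, not the study's implementation: the function names (`make_weights`, `attention`) are hypothetical, there is no output projection, multi-head split, or positional encoding, and the weights are random rather than trained.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def make_weights(d_model, variant, rng):
    """Return (w_q, w_k, w_v), tying matrices according to `variant`."""
    def new():
        return rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

    if variant == "QKV":        # standard: three independent projections
        return new(), new(), new()
    if variant == "Q-K=V":      # shared key-value projection
        w_q, w_kv = new(), new()
        return w_q, w_kv, w_kv
    if variant == "Q=K-V":      # shared query-key projection
        w_qk, w_v = new(), new()
        return w_qk, w_qk, w_v
    if variant == "Q=K=V":      # single projection for all three
        w = new()
        return w, w, w
    raise ValueError(f"unknown variant: {variant}")


def attention(x, w_q, w_k, w_v, causal=False):
    """Single-head scaled dot-product attention over x of shape (seq, d_model).

    Returns the attention output and the pre-softmax score matrix.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        # Mask future positions; the mask itself breaks score symmetry.
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v, scores


rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))  # 5 tokens, d_model = 8 (arbitrary sizes)

for variant in ("QKV", "Q-K=V", "Q=K-V", "Q=K=V"):
    w_q, w_k, w_v = make_weights(8, variant, rng)
    out, scores = attention(x, w_q, w_k, w_v)
    symmetric = np.allclose(scores, scores.T)
    print(f"{variant}: output shape {out.shape}, symmetric scores: {symmetric}")
```

With Q-K=V the key and value tensors are the same array, so an inference cache would only need to store one of them. With a shared query-key projection the non-causal scores come out symmetric, which is the asymmetry issue the positional encodings are meant to address.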

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.