Dual Dimensionality for Local and Global Attention
Summary
Dual Dimensionality for Local and Global Attention introduces Distance-Adaptive Representation (DAR), an approach for Transformer models that challenges the conventional practice of giving keys and values in the KV cache the same dimensionality at every position. DAR hypothesizes that local tokens, which are critical for predicting the immediate output, need full-dimensional representations, while distant tokens, which serve as long-range memory, can make do with reduced-dimensional ones, such as 1/4 of the original dimensionality. The method keeps full-dimensional representations inside a local context window and assigns lower-dimensional ones beyond it. In pretraining at scales from 70M to 410M parameters, and in supervised fine-tuning of a 1B-scale model, DAR closely matched full-dimensional baselines. In contrast, uniformly reducing dimensionality at all token positions degraded performance. These findings suggest a direction for attention architectures that allocate representational capacity adaptively, potentially reducing KV cache size during inference.
Key takeaway
For Machine Learning Engineers optimizing Transformer inference, Distance-Adaptive Representation (DAR) offers a promising strategy for reducing KV cache memory while closely matching full-dimensional performance. Consider dual dimensionality, where tokens inside a local window keep their full representation and tokens beyond it use reduced dimensions (e.g., 1/4). The approach was tested in pretraining from 70M to 410M parameters and in supervised fine-tuning of a 1B-scale model. It challenges uniform key/value sizing and could lower inference memory costs for large language models.
Key insights
Distance-Adaptive Representation (DAR) uses dual dimensionality for local and global attention, closely matching full-dimensional baselines while potentially reducing KV cache size.
Principles
- Local tokens require richer representations.
- Distant tokens suffice with lower dimensions.
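- A single uniform KV dimensionality is not required: distance-dependent sizing closely matched full-dimensional baselines, whereas uniform reduction degraded performance.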
Method
DAR preserves full-dimensional representations within a local context window, assigning reduced-dimensional representations (e.g., 1/4) to tokens beyond that window.
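The assignment rule described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the function name `dar_key_dims`, the causal (left-to-right) setting, and the exact boundary convention (a key counts as local when its distance to the query is smaller than `window`) are hypothetical choices, since the summary does not specify them.

```python
def dar_key_dims(query_pos: int, window: int, d_full: int, reduction: int = 4) -> list[int]:
    """Return the representation width assumed for each key position 0..query_pos.

    Hypothetical helper: keys closer than `window` positions to the query keep
    d_full dimensions; all earlier keys get d_full // reduction dimensions.
    """
    if query_pos < 0 or window < 1:
        raise ValueError("query_pos must be >= 0 and window must be >= 1")
    if reduction < 1 or d_full % reduction != 0:
        raise ValueError("d_full must be divisible by reduction")

    d_reduced = d_full // reduction
    dims = []
    for key_pos in range(query_pos + 1):  # causal: only positions up to the query
        distance = query_pos - key_pos
        dims.append(d_full if distance < window else d_reduced)
    return dims


# Query at position 5, local window of 3 tokens, 64-dim heads, 1/4 reduction.
print(dar_key_dims(query_pos=5, window=3, d_full=64))
# [16, 16, 16, 64, 64, 64]
```

In this sketch the three most recent positions (including the query's own) stay at 64 dimensions and everything older drops to 16. How the reduced representations are actually produced and scored against queries is not described in this summary.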
In practice
- Reduce KV cache size during inference.
- Design attention with adaptive capacity.
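To get a rough sense of the possible cache saving, the following back-of-the-envelope sketch counts stored key and value elements per head and per layer. The function name, the sequence length of 4096, the window of 512, and the head dimension of 128 are illustrative assumptions, not figures from the article. It also assumes that an inference cache really stores distant tokens at the reduced width, which the summary presents only as a potential benefit.

```python
def kv_cache_elements(seq_len: int, window: int, d_full: int, reduction: int = 4) -> tuple[int, int]:
    """Return (uniform_full, dual_dim) element counts for one head in one layer.

    Hypothetical estimate: the most recent `window` tokens are stored at
    d_full, all older tokens at d_full // reduction. The factor 2 covers
    keys and values.
    """
    if seq_len < 0 or window < 1 or reduction < 1 or d_full % reduction != 0:
        raise ValueError("invalid sizes")

    local_tokens = min(seq_len, window)
    distant_tokens = max(seq_len - window, 0)

    uniform_full = 2 * seq_len * d_full
    dual_dim = 2 * (local_tokens * d_full + distant_tokens * (d_full // reduction))
    return uniform_full, dual_dim


full, dual = kv_cache_elements(seq_len=4096, window=512, d_full=128)
print(full, dual, dual / full)
# 1048576 360448 0.34375
```

Under these assumed sizes the dual-dimensional cache holds roughly a third of the elements of the uniform one, and the ratio approaches 1/4 as the sequence grows relative to the window. The count assumes each cached token is stored at exactly one width at any time. The summary does not describe how a token moves from the full to the reduced representation once it leaves the local window.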
Topics
- Distance-Adaptive Representation
- Transformer Attention
- KV Cache Optimization
- Model Inference Efficiency
- Natural Language Processing
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.