Dual Dimensionality for Local and Global Attention

Source: Artificial Intelligence · Field: Technology & Digital, Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Dual Dimensionality for Local and Global Attention introduces Distance-Adaptive Representation (DAR), an approach for Transformer models that challenges the conventional practice of giving every key and value in the KV cache the same dimensionality. DAR hypothesizes that local tokens, which are critical for predicting the immediate output, require full-dimensional representations, while distant tokens, which serve as long-range memory, can be represented with fewer dimensions, such as 1/4 of the original. The method therefore preserves full-dimensional representations within a local context window and assigns lower dimensions beyond it. Across multiple pretraining scales, from 70M to 410M parameters, and during supervised fine-tuning on a 1B-scale model, DAR closely matched the performance of full-dimensional baselines. In contrast, uniformly reducing dimensionality across all token positions degraded performance. These findings suggest a new direction for designing attention architectures that allocate representational capacity adaptively, potentially reducing KV cache size during inference.

Key takeaway

For machine learning engineers optimizing Transformer inference, Distance-Adaptive Representation (DAR) offers a promising strategy for reducing KV cache memory with little loss in performance. Consider dual dimensionality, where tokens inside a local context window keep full-dimensional representations and distant tokens use reduced dimensions (e.g., 1/4). In experiments spanning pretraining from 70M to 410M parameters and supervised fine-tuning at the 1B scale, this approach closely matched full-dimensional baselines. It challenges uniform key/value sizing and could lower KV cache size, and with it inference memory cost, for large language models.
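To give a rough sense of the potential saving, the sketch below counts the scalars stored for keys and values of a single attention head. It is a back-of-the-envelope estimate, not a measurement from the article: the function name `kv_cache_entries`, the sequence length, head dimension, and window size are all hypothetical, and it assumes the reduced dimensionality translates directly into smaller stored keys and values.

```python
def kv_cache_entries(seq_len, head_dim, window=None, reduced_dim=None):
    """Count stored scalars (keys plus values) for one attention head.

    Hypothetical helper: with no window/reduced_dim it models a uniform
    full-dimensional cache; otherwise the most recent `window` tokens keep
    `head_dim` and all older tokens are stored with `reduced_dim`.
    """
    if window is None or reduced_dim is None:
        return 2 * seq_len * head_dim
    local = min(seq_len, window)   # tokens inside the local window
    distant = seq_len - local      # tokens beyond the window
    return 2 * (local * head_dim + distant * reduced_dim)


# Illustrative numbers only (not taken from the article).
full = kv_cache_entries(seq_len=8192, head_dim=64)
dar = kv_cache_entries(seq_len=8192, head_dim=64, window=1024, reduced_dim=16)
print(full, dar, dar / full)  # 1048576 360448 0.34375
```

Under these assumed settings the cache shrinks to roughly a third of its uniform size. The ratio approaches 1/4 as the sequence grows relative to the window, and there is no saving at all when the sequence fits inside the window.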

Key insights

Distance-Adaptive Representation (DAR) uses two dimensionalities, full for local tokens and reduced for distant ones, and closely matches full-dimensional baselines while potentially shrinking the KV cache.

Principles

Method

DAR preserves full-dimensional representations within a local context window, assigning reduced-dimensional representations (e.g., 1/4) to tokens beyond that window.
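The assignment rule described above can be sketched as a function that maps each cached position to a representation width. This is a minimal illustration under stated assumptions, not the article's implementation: the name `dar_key_dims`, the causal setting, and the exact boundary convention (a token is local when its distance to the query is strictly less than the window) are hypothetical choices. How reduced-dimensional keys and values are produced and attended over is not covered here.

```python
def dar_key_dims(query_pos, window, full_dim, reduced_dim):
    """Return the width used for each cached position 0..query_pos.

    Assumption: causal attention, and a cached token counts as local
    when (query_pos - key_pos) < window.
    """
    dims = []
    for key_pos in range(query_pos + 1):
        distance = query_pos - key_pos
        dims.append(full_dim if distance < window else reduced_dim)
    return dims


# Query at position 7 with a window of 4: positions 4..7 are local.
print(dar_key_dims(query_pos=7, window=4, full_dim=64, reduced_dim=16))
# [16, 16, 16, 16, 64, 64, 64, 64]
```

Note that the boundary moves with the query: a token that is local for one decoding step becomes distant once it falls out of the window.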

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.