We’ve Been Doing Attention Wrong (2-Line Fix)
Summary
Exclusive Self Attention (XSA) is a novel modification to the standard Transformer attention mechanism, addressing the "attention similarity bias" where attention outputs redundantly align with a token's own value vector. Standard attention mixes contextual and pointwise signals, forcing it to choose between modeling relationships and pointwise features. XSA resolves this by applying an orthogonal projection to the attention output, removing the component aligned with the token's self-value vector. This allows attention to focus exclusively on gathering contextual information from other tokens. The implementation requires only two lines of code, introduces minimal computational overhead, and consistently improves training and validation loss across model sizes (0.7B, 1.4B, 2.7B parameters). XSA also shows improved performance on downstream language understanding benchmarks, with benefits increasing for larger models and longer sequence lengths (up to 16,000 tokens), and demonstrates robustness across various learning rates.
Key takeaway
For AI engineers optimizing Transformer models, XSA offers a straightforward, low-cost way to improve performance. Adding two lines of code lets attention layers focus purely on contextual information from other tokens, which consistently lowers training and validation loss and improves downstream task performance, especially for larger models and longer sequences. The gains are reported to hold across a range of learning rates.
Key insights
Exclusive Self Attention (XSA) improves Transformer performance by eliminating redundant self-value encoding in attention outputs.
Principles
- Attention should prioritize contextual information.
- Orthogonal projection can isolate contextual signals.
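The projection principle can be written out as a short worked equation. This is a sketch under assumed notation: the symbols y_i (attention output at token i), v_i (that token's own value vector), and z_i (the XSA output) are labels chosen here for illustration and do not come from the original article.

```latex
\hat{v}_i = \frac{v_i}{\lVert v_i \rVert},
\qquad
z_i = y_i - \left( y_i^{\top} \hat{v}_i \right) \hat{v}_i

% Check that the self-aligned component is gone (using \hat{v}_i^{\top} \hat{v}_i = 1):
z_i^{\top} \hat{v}_i
  = y_i^{\top} \hat{v}_i - \left( y_i^{\top} \hat{v}_i \right) \hat{v}_i^{\top} \hat{v}_i
  = 0
```

Whatever remains in z_i is orthogonal to the token's own value vector, so under this reading it can only carry information contributed by other tokens.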
Method
XSA computes standard multi-head attention, then normalizes each value vector to unit length, and finally subtracts the attention output's projection onto the normalized self-value vector.
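The three steps above can be sketched in NumPy for a single attention head. This is a minimal illustration, not the author's implementation: the function names `causal_attention` and `exclusive_self_attention`, the causal mask, and the `eps` stabilizer are all assumptions made for this example. The two lines marked "XSA" are the only additions to standard attention.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def causal_attention(q, k, v):
    """Standard single-head attention; q, k, v have shape (seq_len, head_dim).

    The causal mask is an assumption (language-model setting).
    """
    seq_len, head_dim = q.shape
    scores = q @ k.T / np.sqrt(head_dim)
    # Mask out future positions (strictly above the diagonal).
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores, axis=-1) @ v


def exclusive_self_attention(q, k, v, eps=1e-8):
    """Attention output with the self-value component projected out."""
    y = causal_attention(q, k, v)
    # XSA line 1: normalize each token's own value vector to unit length.
    v_hat = v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)
    # XSA line 2: subtract the projection of the output onto that vector.
    return y - (y * v_hat).sum(axis=-1, keepdims=True) * v_hat


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((8, 16)) for _ in range(3))
    z = exclusive_self_attention(q, k, v)
    # Each output row should be (numerically) orthogonal to its own value vector.
    print(np.abs((z * v).sum(axis=-1)).max())
```

In a multi-head layer the same two lines would be applied per head before the output projection. Note that under a causal mask the first token attends only to itself, so its XSA output is close to zero in this sketch.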
In practice
- Add the two-line projection step after the attention output in existing Transformer attention layers.
- Apply XSA to long-sequence models, where the reported gains are larger.
Topics
- Standard Attention
- Exclusive Self Attention
- Orthogonal Projection
- Transformer Architecture
- Attention Similarity Bias
Best for: AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Jia-Bin Huang.