Elastic Attention Cores for Scalable Vision Transformers [R]
Summary
A new research paper introduces "Elastic Attention Cores" as an alternative building block for Vision Transformers (ViTs). It targets the cost of dense self-attention, which grows as N^2 in the number of patch tokens N and becomes expensive at high resolutions. The architecture uses a core-periphery block-sparse attention structure whose cost scales as 2NC + C^2 for C core tokens. The model is trained with nested dropout, so inference cost can be adjusted elastically at test time. It reports accuracy competitive with DINOv3 on both dense prediction and classification tasks, and remains stable across resolutions from 256 to 1024. A notable emergent behavior is that core-dense attention patterns evolve from isotropic (spherical) in early layers to increasingly semantically aligned in deeper layers.
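To make the scaling claim concrete, the sketch below counts query-key pairs for dense attention (N^2) and for the 2NC + C^2 expression quoted above, across the 256 to 1024 resolution range. The patch size of 16 and the core count of 64 are illustrative assumptions, not values taken from the paper, and the function names are hypothetical.

```python
def dense_pairs(n_tokens: int) -> int:
    """Query-key pairs scored by dense self-attention over N patch tokens."""
    return n_tokens * n_tokens


def core_periphery_pairs(n_tokens: int, n_core: int) -> int:
    """Query-key pairs under the 2NC + C^2 scaling quoted in the summary."""
    return 2 * n_tokens * n_core + n_core * n_core


PATCH_SIZE = 16  # assumed for illustration; not stated in the summary
N_CORE = 64      # assumed for illustration; not stated in the summary

for resolution in (256, 512, 1024):
    # Square image split into non-overlapping square patches.
    n = (resolution // PATCH_SIZE) ** 2
    dense = dense_pairs(n)
    sparse = core_periphery_pairs(n, N_CORE)
    print(f"{resolution}px: N={n}, dense={dense}, "
          f"core-periphery={sparse}, ratio={dense / sparse:.1f}x")
```

Under these assumed values the gap widens with resolution, because the dense count grows quadratically in N while the core-periphery count grows linearly in N for a fixed C.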
Key takeaway
For research scientists developing or deploying Vision Transformers, Elastic Attention Cores are worth considering as a way to improve scalability and reduce computational cost at higher resolutions. The approach reports competitive accuracy and stable performance across resolutions, and its internal structure is interpretable: attention patterns become semantically aligned in deeper layers, which may simplify model analysis and debugging.
Key insights
Elastic Attention Cores offer a scalable, efficient alternative to dense self-attention in Vision Transformers.
Principles
- Core-periphery attention reduces computational complexity.
- Nested dropout enables dynamic inference cost adjustment.
- Bottleneck communication encourages semantic compression.
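The nested dropout principle can be sketched as follows: core tokens are kept in a fixed order, and dropping always removes a suffix, so any prefix of the cores remains a usable set. This is a minimal sketch under assumptions. The uniform sampling of the prefix length, the function names, and the string placeholders standing in for learned core embeddings are all hypothetical; the summary does not say how the paper samples the truncation point.

```python
import random


def sample_active_cores(n_core: int, rng: random.Random) -> int:
    """Training: draw how many leading core tokens stay active (at least one).

    Uniform sampling is an assumption made for this sketch.
    """
    return rng.randint(1, n_core)


def truncate_cores(core_tokens: list, k: int) -> list:
    """Keep the first k core tokens and drop every later one."""
    if not 1 <= k <= len(core_tokens):
        raise ValueError("k must be between 1 and the number of core tokens")
    return core_tokens[:k]


rng = random.Random(0)
cores = [f"core_{i}" for i in range(8)]  # placeholders for learned embeddings

# Training step: a random prefix length is drawn, so earlier cores are
# active more often than later ones.
k_train = sample_active_cores(len(cores), rng)
train_cores = truncate_cores(cores, k_train)

# Inference: the prefix length is picked to fit a compute budget.
cheap_cores = truncate_cores(cores, 2)
full_cores = truncate_cores(cores, 8)
```

Because every training step uses a prefix, choosing a smaller prefix at test time stays within what the model saw during training, which is what makes the inference cost adjustable.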
Method
The proposed method uses a core-periphery block-sparse attention structure with 2NC + C^2 scaling, is trained with nested dropout so that inference cost can be adjusted elastically, and initializes core tokens from a normal distribution.
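One layout consistent with the 2NC + C^2 count is a mask in which core tokens attend to everything, patch tokens attend only to core tokens, and patch-to-patch attention is blocked. The sketch below builds that mask and checks the count. This reading of the three terms is an assumption, as are the token ordering (cores first) and the function names; the paper's actual block structure may differ.

```python
def core_periphery_mask(n_patches: int, n_core: int) -> list:
    """Boolean mask over [core tokens, then patch tokens].

    mask[q][k] is True when query q may attend to key k.
    """
    total = n_core + n_patches
    mask = []
    for q in range(total):
        row = []
        for k in range(total):
            q_is_core = q < n_core
            k_is_core = k < n_core
            # Any pair that touches a core token is allowed;
            # patch-to-patch pairs are blocked.
            row.append(q_is_core or k_is_core)
        mask.append(row)
    return mask


def count_allowed(mask: list) -> int:
    """Number of query-key pairs the mask permits."""
    return sum(sum(row) for row in mask)


N, C = 6, 2  # tiny illustrative sizes
mask = core_periphery_mask(N, C)
assert count_allowed(mask) == 2 * N * C + C * C
```

The allowed pairs are the full (N + C)^2 grid minus the N^2 patch-to-patch block, which leaves exactly 2NC + C^2.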
In practice
- Achieve competitive accuracy against DINOv3.
- Maintain performance across resolutions (256-1024).
- Reduce ViT inference cost at high resolutions.
Topics
- Elastic Attention Cores
- Vision Transformers
- Block-sparse Attention
- Nested Dropout
- Self-Attention Scaling
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.