Elastic Attention Cores for Scalable Vision Transformers [R]

· Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, short

Summary

A new research paper introduces "Elastic Attention Cores" as an alternative building block for Vision Transformers (ViTs). It targets the cost of dense self-attention, which grows as N^2 in the number of tokens N and becomes expensive at high resolutions. The architecture uses a core-periphery block-sparse attention structure whose cost scales as 2NC + C^2 for C core tokens. The model is trained with nested dropout, so inference cost can be adjusted elastically at test time. It reports dense-prediction and classification accuracy competitive with DINOv3 and remains stable across resolutions from 256 to 1024. A notable emergent behavior is that core-dense attention patterns evolve from isotropic (spherical) in early layers to increasingly semantically aligned in deeper layers.
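To make the scaling claim concrete, the short sketch below counts query-key pairs under the two cost formulas quoted above. The patch size of 16 and the core count of 64 are illustrative assumptions, not values taken from the paper, and the function names are hypothetical.

```python
def dense_pairs(n: int) -> int:
    """Query-key pairs scored by dense self-attention over n tokens (N^2)."""
    return n * n


def core_periphery_pairs(n: int, c: int) -> int:
    """Pairs scored under the 2NC + C^2 formula for n tokens and c cores."""
    return 2 * n * c + c * c


if __name__ == "__main__":
    patch = 16  # assumed patch size; not stated in the summary
    cores = 64  # assumed core count; not stated in the summary
    for res in (256, 512, 1024):
        n = (res // patch) ** 2  # tokens for a square res x res input
        d, s = dense_pairs(n), core_periphery_pairs(n, cores)
        print(f"res={res} N={n} dense={d} core-periphery={s} ratio={d / s:.1f}")
```

Under these assumptions, dense cost grows with the square of the token count while the core-periphery cost grows linearly in N for a fixed C, so the gap widens as resolution increases.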

Key takeaway

Research scientists developing or deploying Vision Transformers should consider Elastic Attention Cores when the cost of dense attention limits work at higher resolutions. The approach reports competitive accuracy and stable performance across resolutions, and its attention patterns become semantically aligned in deeper layers, an interpretable structure that may simplify model analysis and debugging.

Key insights

Elastic Attention Cores offer a scalable, efficient alternative to dense self-attention in Vision Transformers.

Method

The method uses a core-periphery block-sparse attention structure with 2NC + C^2 scaling, is trained with nested dropout so that inference cost can be adjusted elastically, and initializes its core tokens from a normal distribution.
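The following is a minimal NumPy sketch of one way to read that description, not the authors' implementation. It assumes that core tokens attend to every token while periphery (patch) tokens attend only to cores, which is one pattern that yields exactly 2NC + C^2 scored pairs. It also assumes nested dropout keeps a uniformly sampled prefix of the ordered cores. All function names are hypothetical, and the attention is computed as a dense masked reference for clarity rather than efficiency.

```python
import numpy as np


def core_periphery_mask(n: int, c: int) -> np.ndarray:
    """Boolean (c+n, c+n) mask; True where a query (row) may attend to a key (column).

    Tokens 0..c-1 are cores; the remaining n are periphery (patch) tokens.
    """
    t = c + n
    mask = np.zeros((t, t), dtype=bool)
    mask[:c, :] = True   # cores attend to cores and periphery: C^2 + NC pairs
    mask[c:, :c] = True  # periphery attends to cores only: NC pairs
    return mask


def sample_core_count(max_cores: int, rng: np.random.Generator) -> int:
    """Nested-dropout style truncation: keep the first k cores (uniform k is an assumption)."""
    return int(rng.integers(1, max_cores + 1))


def masked_attention(x: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Single-head attention with identity projections, for illustration only."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = np.where(mask, scores, -np.inf)      # blocked pairs get zero weight
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, max_c, dim = 16, 8, 4
    cores = rng.normal(size=(max_c, dim))    # core tokens drawn from a normal distribution
    patches = rng.normal(size=(n, dim))
    k = sample_core_count(max_c, rng)        # elastic choice of how many cores to keep
    x = np.concatenate([cores[:k], patches])
    out = masked_attention(x, core_periphery_mask(n, k))
    print(k, out.shape)
```

Because every training step sees a prefix of the cores, a model trained this way could in principle be run with fewer cores at test time, which is how the sketch interprets the elastic inference cost.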

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.