Elastic Attention Cores for Scalable Vision Transformers [R]

2026-05-13 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, short

Summary

A new research paper introduces "Elastic Attention Cores" as an alternative building block for Vision Transformers (ViTs), targeting the cost of dense self-attention, which grows as N^2 in the number of tokens N and becomes expensive at high resolutions. The architecture uses a core-periphery block-sparse attention structure whose cost scales as 2NC + C^2 for C core tokens. The model is trained with nested dropout, which allows inference cost to be adjusted elastically at test time. It reports dense-prediction and classification accuracy competitive with DINOv3 and remains stable across resolutions from 256 to 1024. One notable emergent behavior is that core-dense attention patterns shift from isotropic (spherical) in early layers to increasingly semantically aligned in deeper layers.

Key takeaway

For research scientists developing or deploying Vision Transformers, Elastic Attention Cores are worth evaluating as a way to reduce attention cost at higher resolutions. The approach reports competitive accuracy and stable performance across resolutions, and its internal structure is interpretable: attention patterns become semantically aligned in deeper layers, which may simplify model analysis and debugging.

Key insights

Elastic Attention Cores offer a scalable, efficient alternative to dense self-attention in Vision Transformers.

Principles

Core-periphery attention reduces computational complexity.
Nested dropout enables dynamic inference cost adjustment.
Bottleneck communication encourages semantic compression.
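
To make the first principle concrete, the short sketch below counts query-key pairs under dense attention (N^2) and under the core-periphery scaling quoted in the summary (2NC + C^2). The function names and the values N = 1024 and C = 64 are illustrative assumptions, not figures from the paper.

```python
def dense_pairs(n: int) -> int:
    """Query-key pairs for dense self-attention over n tokens."""
    return n * n


def core_periphery_pairs(n: int, c: int) -> int:
    """Query-key pairs under the 2NC + C^2 scaling quoted in the summary."""
    return 2 * n * c + c * c


# Illustrative sizes only (assumed, not taken from the paper).
n, c = 1024, 64
dense = dense_pairs(n)
sparse = core_periphery_pairs(n, c)
print(dense, sparse, round(dense / sparse, 2))  # 1048576 135168 7.76
```

Because the sparse count is linear in N for fixed C, the gap to dense attention widens as resolution grows.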

Method

The proposed method uses a core-periphery block-sparse attention structure with 2NC + C^2 scaling, is trained via nested dropout so that inference cost can be adjusted elastically, and initializes core tokens from a normal distribution.
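
A minimal NumPy sketch of these three ingredients follows. It is not the authors' implementation: the mask layout (core tokens attend to everything, patch tokens attend only to cores) is inferred from the 2NC + C^2 cost, the uniform prefix sampling is one simple form of nested dropout, and the names, token ordering, embedding width, and standard deviation of 0.02 are all assumptions made for illustration.

```python
import numpy as np


def core_periphery_mask(num_patches: int, num_cores: int) -> np.ndarray:
    """Boolean mask, True where a query (row) may attend to a key (column).

    Assumed token order: [core tokens, patch tokens].
    """
    total = num_cores + num_patches
    mask = np.zeros((total, total), dtype=bool)
    mask[:num_cores, :] = True            # core queries: C^2 + NC pairs
    mask[num_cores:, :num_cores] = True   # patch queries see cores only: NC pairs
    return mask


def sample_active_cores(max_cores: int, rng: np.random.Generator) -> int:
    """Nested-dropout step: pick how many leading cores stay active.

    Uniform sampling is an assumption; the paper may use another distribution.
    """
    return int(rng.integers(1, max_cores + 1))  # upper bound is exclusive


def init_core_tokens(max_cores: int, dim: int, rng: np.random.Generator,
                     std: float = 0.02) -> np.ndarray:
    """Draw core tokens from a normal distribution (std is hypothetical)."""
    return rng.normal(0.0, std, size=(max_cores, dim))


rng = np.random.default_rng(0)
cores = init_core_tokens(max_cores=64, dim=384, rng=rng)
k = sample_active_cores(64, rng)
active = cores[:k]  # keep a prefix, drop the tail
mask = core_periphery_mask(num_patches=256, num_cores=k)
print(k, active.shape, int(mask.sum()))
```

Keeping a prefix rather than a random subset is what makes the scheme nested: any smaller core count used at test time is a configuration that was also seen during training.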

In practice

Achieve competitive accuracy against DINOv3.
Maintain performance across resolutions (256-1024).
Reduce ViT inference cost at high resolutions.
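
As a sketch of how the elastic cost adjustment might be used at deployment, the snippet below picks the largest core count that fits a fixed attention budget at each resolution by inverting 2NC + C^2 <= B. The patch size of 16, the budget of 600,000 pairs, the cap of 256 trained cores, and the helper names are hypothetical choices for illustration; the source does not specify them.

```python
import math


def patch_tokens(resolution: int, patch_size: int = 16) -> int:
    """Number of patch tokens for a square image (patch size is assumed)."""
    side = resolution // patch_size
    return side * side


def max_cores_under_budget(n: int, budget: int, max_cores: int) -> int:
    """Largest C with 2*n*C + C^2 <= budget, capped at max_cores.

    Uses C^2 + 2nC <= B  <=>  (C + n)^2 <= B + n^2, solved exactly
    with an integer square root.
    """
    c = math.isqrt(budget + n * n) - n
    return max(0, min(max_cores, c))


budget = 600_000   # hypothetical budget in query-key pairs
trained_max = 256  # hypothetical largest core count seen in training
for res in (256, 512, 1024):
    n = patch_tokens(res)
    c = max_cores_under_budget(n, budget, trained_max)
    print(res, n, c, 2 * n * c + c * c)
```

Under these assumptions the budget only becomes binding at the highest resolution, where the selected core count falls below the trained maximum.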

Topics

Elastic Attention Cores
Vision Transformers
Block-sparse Attention
Nested Dropout
Self-Attention Scaling

Code references

alansong1322/VECA

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.