A Unifying View of Attention Sinks: Two Algorithms, Two Solutions
Summary
Attention sinks, where attention in a softmax transformer concentrates on a single token, can reflect two distinct computational mechanisms: adaptive nop and broadcast. In adaptive nop, a head suppresses its update by routing attention to a null token. In broadcast, a head aggregates global information and redistributes it. The two mechanisms leave different traces: nop sinks have negligible value norms, while broadcast sinks induce low-rank outputs. Proposed interventions implicitly target one mechanism each, with gating assuming nop and registers assuming broadcast. Diagnostics applied to pretrained vision transformers show that both mechanisms are present, with sinks shifting from the CLS token in early layers to patch tokens in deeper layers. Register tokens, although designed for broadcast, are repurposed for nop. Neither intervention alone is therefore sufficient, and combining gating with registers offers complementary gains in stability and performance.
Key takeaway
AI scientists and machine learning engineers optimizing transformer models should diagnose the underlying attention sink mechanism (adaptive nop or broadcast) before applying an intervention. The proposed diagnostics identify the mechanism, which matters because gating and registers are each mechanism-specific. For vision transformer architectures, consider combining gating with registers rather than relying on either alone, since the two offer complementary gains in stability and performance.
Key insights
Visually similar attention sink patterns in transformers can hide two distinct underlying computational mechanisms.
Principles
- An attention sink can implement either an adaptive nop or a broadcast.
- Each mechanism leaves distinct computational traces.
- Effective intervention requires understanding the mechanism.
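The different traces follow from how a single head computes its output. The short derivation below is a sketch in notation introduced here for illustration, not the original article's: a_{ij} are the attention weights, v_j the value vectors, o_i the head output for query i, and s the index of the sink token.

```latex
\begin{align*}
o_i &= \sum_j a_{ij}\, v_j, \qquad \sum_j a_{ij} = 1 \\
a_{is} \approx 1 \;&\Rightarrow\; o_i \approx v_s
  && \text{(queries that attend to the sink)} \\
\text{adaptive nop: } \lVert v_s \rVert \approx 0 \;&\Rightarrow\; o_i \approx 0
  && \text{(the head adds almost nothing)} \\
\text{broadcast: } \lVert v_s \rVert \not\approx 0 \;&\Rightarrow\;
  O = [o_1, \dots, o_n]^\top \approx \mathbf{1}\, v_s^\top
  && \text{(output close to rank one)}
\end{align*}
```

In both cases the attention map looks the same, which is why the value norm and the rank of the output, rather than the attention pattern, distinguish the two.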
Method
The work formalizes the trace each mechanism leaves on synthetic tasks (negligible value norms for nop, low-rank outputs for broadcast) and derives from these traces practical diagnostics for identifying which mechanism a given attention sink implements.
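The two traces described above can be checked with a few lines of NumPy. This is a minimal sketch, not the article's actual diagnostics: the function name `sink_diagnostics`, the `sink_threshold` default, the median-based norm ratio, and the entropy-based effective rank are all assumptions chosen for illustration.

```python
import numpy as np


def sink_diagnostics(attn, values, sink_threshold=0.5):
    """Compute illustrative sink diagnostics for one attention head.

    attn:   (n_queries, n_keys) attention weights, each row summing to 1.
    values: (n_keys, d) value vectors of the same head.
    The threshold and both statistics are hypothetical choices.
    """
    attn = np.asarray(attn, dtype=float)
    values = np.asarray(values, dtype=float)

    # Sink candidate: the key that receives the most attention on average.
    mean_attn = attn.mean(axis=0)
    sink = int(mean_attn.argmax())
    has_sink = bool(mean_attn[sink] >= sink_threshold)

    # Nop trace: sink value norm relative to the median norm of other tokens.
    norms = np.linalg.norm(values, axis=1)
    other_norms = np.delete(norms, sink)
    value_norm_ratio = float(norms[sink] / (np.median(other_norms) + 1e-12))

    # Broadcast trace: effective rank of the head output, measured as the
    # exponential of the entropy of the normalized singular values.
    out = attn @ values
    s = np.linalg.svd(out, compute_uv=False)
    if s.sum() < 1e-12:
        effective_rank = 0.0
    else:
        p = s / s.sum()
        p = p[p > 0]
        effective_rank = float(np.exp(-np.sum(p * np.log(p))))

    return {
        "sink_index": sink,
        "has_sink": has_sink,
        "value_norm_ratio": value_norm_ratio,
        "effective_rank": effective_rank,
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 16, 8
    attn = np.full((n, n), 0.02 / (n - 1))
    attn[:, 0] = 0.98  # every query attends almost entirely to token 0
    values = rng.normal(size=(n, d))
    print(sink_diagnostics(attn, values))
```

Under these assumptions, a sink with a value-norm ratio near zero suggests nop, while a sink with an ordinary value norm and an effective rank near one suggests broadcast. Any cutoffs would need to be calibrated per model.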
In practice
- Apply diagnostics to identify nop vs. broadcast sinks.
- Combine gating and registers for complementary gains.
- Track how sinks shift from the CLS token in early layers to patch tokens in deeper layers.
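To make the combination of gating and registers concrete, the sketch below shows one way a single attention head could use both. It assumes a sigmoid output gate computed from the token input and register tokens appended to the sequence and discarded afterwards. The function `gated_attention_with_registers`, the weight names, and this particular gating form are illustrative assumptions, not the configuration evaluated in the original article.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gated_attention_with_registers(x, registers, Wq, Wk, Wv, Wg):
    """Single-head attention with appended registers and an output gate.

    x:         (n, d) input tokens.
    registers: (r, d) extra register tokens (learned in a real model).
    Wq, Wk, Wv, Wg: (d, d_head) projections; Wg produces the gate.
    All names are hypothetical and chosen for this sketch.
    """
    n = x.shape[0]
    # Registers join the sequence as extra tokens that any query can attend to.
    tokens = np.concatenate([x, registers], axis=0)

    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    out = attn @ v

    # Elementwise sigmoid gate in (0, 1): a gate near 0 lets the head skip
    # its update without routing attention to a sink token.
    gate = 1.0 / (1.0 + np.exp(-(tokens @ Wg)))
    gated = gate * out

    # Register outputs are dropped; only the original tokens are returned.
    return gated[:n], attn
```

The design intent in this sketch is that the gate gives the head an explicit way to nop, while the registers provide dedicated slots for global information, so neither role has to be taken over by CLS or patch tokens.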
Topics
- Attention Sinks
- Softmax Transformers
- Adaptive Nop
- Broadcast Mechanism
- Gating & Registers
- Model Interpretability
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.