A Unifying View of Attention Sinks: Two Algorithms, Two Solutions

· Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Attention sinks, where attention in softmax transformers concentrates on a single token, can implement either of two distinct computational mechanisms: adaptive nop and broadcast. In adaptive nop, a head suppresses its own update by routing attention to a null token; in broadcast, a head aggregates global information and redistributes it. The two mechanisms leave different traces: nop sinks have negligible value norms, while broadcast sinks induce low-rank outputs. Proposed interventions each implicitly target one mechanism, with gating assuming nop and registers assuming broadcast. Diagnostics applied to pretrained vision transformers show that both mechanisms occur, with sinks shifting from CLS tokens in early layers to patch tokens in deeper layers. Register tokens, although designed for broadcast, are repurposed for nop, which indicates that neither intervention alone is sufficient; combining gating with registers offers complementary gains in stability and performance.

Key takeaway

AI scientists and machine learning engineers optimizing transformer models should diagnose which mechanism underlies an attention sink (adaptive nop or broadcast) before applying an intervention, because gating and registers are mechanism-specific. The proposed diagnostics identify the mechanism. For vision transformer architectures, consider combining gating with registers rather than relying on either alone, since the two offer complementary gains in stability and performance.
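
To make the combination concrete, here is a minimal single-head sketch in NumPy. The summary does not specify how gating or registers are implemented, so everything here is an assumption for illustration: the function name `gated_attention_with_registers`, the gate projection `Wg`, and the choice of an elementwise sigmoid gate computed from each query token and applied to the head output. Registers are modeled simply as extra tokens prepended to the sequence and dropped from the output.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_attention_with_registers(x, registers, Wq, Wk, Wv, Wg):
    """Hypothetical single-head attention with register tokens and an output gate.

    x:          (n, d) input tokens
    registers:  (r, d) extra tokens prepended to the sequence
    Wq, Wk, Wv: (d, d_head) projections
    Wg:         (d, d_head) gate projection (assumed form, not from the source)
    Returns the gated outputs for the n input tokens and the full attention map.
    """
    r = registers.shape[0]
    tokens = np.concatenate([registers, x], axis=0)   # (r + n, d)
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # rows sum to 1
    out = attn @ v                                    # ungated head output
    gate = 1.0 / (1.0 + np.exp(-(tokens @ Wg)))       # sigmoid, values in (0, 1)
    gated = gate * out                                # a gate near 0 is an explicit no-op
    return gated[r:], attn                            # drop register positions
```

The design intent of the sketch: the gate gives a head a direct way to suppress its update, so it need not route attention to a null token, while the registers provide dedicated slots that can hold shared, global information.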

Key insights

Visually similar attention sink patterns in transformers can hide two distinct underlying computational mechanisms.

Method

The method formalizes, on synthetic tasks, the distinct trace each mechanism leaves (negligible value norms for nop, low-rank outputs for broadcast) and turns those traces into practical diagnostics for identifying which mechanism underlies a given attention sink.
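
The two traces described above can be sketched as simple per-head measurements. This is an illustrative approximation, not the article's exact diagnostics: the function names, the use of the median value norm as a reference, and the top-singular-value energy ratio as a proxy for "low-rank output" are all assumptions, and `make_sink_example` is a hypothetical helper that builds toy data.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sink_diagnostics(attn, values, eps=1e-12):
    """attn: (n, n) row-stochastic attention map; values: (n, d) value vectors."""
    # Sink candidate: the key position receiving the most attention on average.
    sink = int(attn.mean(axis=0).argmax())
    # Nop trace: value norm of the sink relative to the median of the other tokens.
    norms = np.linalg.norm(values, axis=1)
    rel_value_norm = norms[sink] / (np.median(np.delete(norms, sink)) + eps)
    # Broadcast trace: share of output energy captured by the top singular value.
    s = np.linalg.svd(attn @ values, compute_uv=False)
    top1_energy = s[0] ** 2 / (np.sum(s ** 2) + eps)
    return {
        "sink": sink,
        "sink_mass": float(attn[:, sink].mean()),
        "rel_value_norm": float(rel_value_norm),
        "top1_energy": float(top1_energy),
    }

def make_sink_example(kind, n=16, d=8, seed=0):
    """Hypothetical toy data: every query attends strongly to token 0."""
    rng = np.random.default_rng(seed)
    logits = rng.normal(size=(n, n))
    logits[:, 0] += 10.0                      # force a sink at position 0
    values = rng.normal(size=(n, d))
    if kind == "nop":
        values[0] = 1e-3 * values[0]          # near-zero value: attending there changes little
    else:
        values[0] = np.ones(d)                # informative value copied to every query
    return softmax(logits), values

nop = sink_diagnostics(*make_sink_example("nop"))
broadcast = sink_diagnostics(*make_sink_example("broadcast"))
```

On this toy data, `nop["rel_value_norm"]` comes out far below 1, while `broadcast["top1_energy"]` is close to 1. Any cutoff used to separate the two cases on a real model would need to be chosen empirically.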

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.