The Spike, the Sparse and the Sink: Anatomy of Massive Activations and Attention Sinks

Source: Artificial Intelligence · Field: Technology & Digital / Artificial Intelligence & Machine Learning · Depth: Advanced, quick

Summary

A new study investigates two prevalent phenomena in Transformer language models: massive activations and attention sinks. In a massive activation, a small subset of tokens shows extreme outlier values in a few specific channels. In an attention sink, certain tokens attract a disproportionate share of attention mass regardless of their semantic content. Previous research noted that the two frequently co-occur and involve the same tokens, but their functional roles and causal relationship were not well understood. Through systematic experiments, the authors show that this co-occurrence is primarily an architectural artifact of contemporary Transformer designs. Massive activations act globally: they create nearly constant hidden representations that persist across layers and effectively function as implicit model parameters. Attention sinks, by contrast, act locally: they modulate attention outputs across heads and bias individual heads toward short-range dependencies. The authors identify the pre-norm configuration as the critical design choice that enables the co-occurrence; ablating it decouples the two phenomena.
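To make the two definitions concrete, here is a minimal NumPy sketch that flags both phenomena on synthetic data. It is not the authors' measurement code. The function names, the 100x-median threshold, the choice of the first token as the sink, and the injected spike and logit offset are all illustrative assumptions.

```python
import numpy as np

def massive_activation_mask(hidden, ratio=100.0):
    """hidden: (tokens, channels). Flag entries whose magnitude exceeds
    `ratio` times the median magnitude (threshold is an assumption)."""
    mag = np.abs(hidden)
    return mag > ratio * np.median(mag)

def sink_mass(attn, key_index=0):
    """attn: (heads, queries, keys) with rows summing to 1.
    Returns, per head, the mean attention placed on `key_index`."""
    return attn[:, :, key_index].mean(axis=1)

rng = np.random.default_rng(0)

# Synthetic hidden states with one injected outlier (token 0, channel 3).
hidden = rng.normal(size=(8, 16))
hidden[0, 3] = 2500.0
spike_positions = np.argwhere(massive_activation_mask(hidden))

# Synthetic causal attention where every query favours key 0.
logits = rng.normal(size=(2, 8, 8))
logits[:, :, 0] += 8.0  # hypothetical sink bias
causal = np.tril(np.ones((8, 8), dtype=bool))
logits = np.where(causal, logits, -np.inf)
attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
mass = sink_mass(attn)

print(spike_positions)  # (token, channel) pairs flagged as massive
print(mass)             # per-head share of attention on token 0
```

On a real model, `hidden` and `attn` would come from the layer's residual stream and attention weights; a head with a sink shows most of its attention mass on a single token whatever the input.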

Key takeaway

For research scientists optimizing Transformer architectures, the distinct roles of massive activations and attention sinks matter, because design choices, particularly the pre-norm configuration, directly influence both phenomena. Consider ablating pre-norm to decouple the two effects, which may improve model interpretability or efficiency.
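As a rough illustration of why the normalization placement matters, the toy sketch below contrasts a pre-norm residual block with a post-norm one. It is not the study's ablation: the summary does not say what replaces pre-norm, so post-norm is used here as an assumed contrast, and the gain-free RMS normalization and `toy_sublayer` are hypothetical stand-ins for a real attention or MLP sublayer.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # RMS normalization without a learned gain (assumed for simplicity).
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def pre_norm_block(x, sublayer):
    # Only the sublayer input is normalized; the residual stream x
    # is carried forward untouched.
    return x + sublayer(rms_norm(x))

def post_norm_block(x, sublayer):
    # The sum of residual and sublayer output is normalized every layer.
    return rms_norm(x + sublayer(x))

def toy_sublayer(h):
    # Hypothetical bounded update standing in for attention or an MLP.
    return 0.1 * np.tanh(h)

# One token with a single extreme channel in its residual stream.
x = np.ones((1, 16))
x[0, 3] = 1000.0

pre, post = x.copy(), x.copy()
for _ in range(4):
    pre = pre_norm_block(pre, toy_sublayer)
    post = post_norm_block(post, toy_sublayer)

pre_peak = np.abs(pre).max()    # spike survives in the residual stream
post_peak = np.abs(post).max()  # bounded by sqrt(channels) after normalization
print(pre_peak, post_peak)
```

In this toy setting the pre-norm residual stream carries the outlier through every layer, while post-norm rescales it each time. Whether a given real architecture behaves this way depends on details the sketch omits.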

Key insights

Massive activations and attention sinks in Transformers serve distinct functions, with their co-occurrence driven by pre-norm architecture.

Method

Systematic experiments were used to analyze the functional roles and causal relationships of massive activations and attention sinks in Transformer models.

Best for: Research Scientist, AI Researcher, AI Scientist, Deep Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.