Is Attention sink without Positional Encoding unavoidable? [D]

· Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, medium

Summary

A developer pre-training small Transformer-based models (Encoder-Decoder and cross-attention memory-only variants) consistently observes "vertical hot lines" in attention heatmaps when Positional Encoding (PE) is removed from self-attention or cross-attention. The pattern means that every query vector attends to the same key tokens, so attention is not dynamic or query-conditioned. It persists despite regularization intended to spread attention out and does not resolve over tens of thousands of training steps, while the loss initially appears unaffected. The developer questions whether PE is needed in cross-attention at all, given that encoder and decoder hidden states already carry positional information from their respective self-attention layers. Suggested fixes such as QKNorm and SoftPick did not resolve the problem.
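The "vertical hot line" pattern can be checked numerically rather than by eye. The following is a minimal sketch, not the developer's own tooling: the function name `sink_score` and both metrics are hypothetical choices made for illustration. It assumes a single head's attention matrix of shape (queries, keys) whose rows sum to 1.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sink_score(attn):
    """Summarize how 'vertical' an attention map is.

    attn: array of shape (num_queries, num_keys), rows sum to 1.
    Returns (top_key_mass, query_agreement):
      top_key_mass    - largest average attention any single key receives
      query_agreement - fraction of queries sharing the most common argmax key
    Both metrics are illustrative assumptions, not a standard definition.
    """
    col_mass = attn.mean(axis=0)                      # mean attention per key
    top_key_mass = float(col_mass.max())
    top_keys = attn.argmax(axis=1)                    # favourite key per query
    counts = np.bincount(top_keys, minlength=attn.shape[1])
    query_agreement = float(counts.max() / attn.shape[0])
    return top_key_mass, query_agreement

rng = np.random.default_rng(0)

# Sink-like map: one key (index 3) gets a large query-independent bias.
bias = np.zeros(8)
bias[3] = 20.0
sink_like = softmax(rng.normal(size=(8, 8)) + bias)

# Query-conditioned map: each query prefers a different key.
diagonal = softmax(20.0 * np.eye(8))

print(sink_score(sink_like))
print(sink_score(diagonal))
```

A query agreement close to 1 together with a high top-key mass corresponds to a vertical line in the heatmap. Agreement near 1 / num_keys means queries are choosing different keys.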

Key takeaway

For AI Engineers and Research Scientists developing Transformer models for text generation: "vertical hot lines" in attention heatmaps point to an "attention sink", and in the setup described here they appeared whenever Positional Encoding (PE) was removed. Check that PE is applied consistently across both self-attention and cross-attention layers, because without it the model may fail to form dynamic token-to-token relationships and instead place attention arbitrarily on common words.
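To make the role of position concrete, here is a minimal sketch of cross-attention weights computed with and without PE added to the queries and keys. It assumes additive sinusoidal PE and raw content vectors. The source does not say which PE scheme the developer used, and the helper names are hypothetical. It shows only that two keys with identical content are indistinguishable to every query unless position is injected. It does not settle whether hidden states that already passed through PE-aware self-attention carry enough positional signal.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sinusoidal_pe(length, dim):
    # Standard sinusoidal encoding; dim is assumed to be even.
    pos = np.arange(length)[:, None]
    i = np.arange(0, dim, 2)[None, :]
    angles = pos / (10000 ** (i / dim))
    pe = np.zeros((length, dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def cross_attention_weights(q, k):
    # Scaled dot-product attention weights, shape (num_queries, num_keys).
    return softmax(q @ k.T / np.sqrt(q.shape[-1]))

rng = np.random.default_rng(0)
dim = 16
queries = rng.normal(size=(5, dim))   # stand-in for decoder-side states
keys = rng.normal(size=(6, dim))      # stand-in for encoder-side states
keys[4] = keys[1]                     # same content at two positions

# Without PE: positions 1 and 4 receive identical attention from every query.
attn_no_pe = cross_attention_weights(queries, keys)

# With PE on both sides: the two positions become distinguishable.
attn_pe = cross_attention_weights(
    queries + sinusoidal_pe(5, dim),
    keys + sinusoidal_pe(6, dim),
)

print(np.allclose(attn_no_pe[:, 1], attn_no_pe[:, 4]))
print(np.allclose(attn_pe[:, 1], attn_pe[:, 4]))
```

Rotary or relative schemes would be wired in differently (typically inside the attention layer rather than by addition). The additive form is used here only because it is the shortest to write down.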

Key insights

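In the reported experiments, removing Positional Encoding from self-attention or cross-attention led to "attention sink" behavior that QKNorm, SoftPick, and attention-spreading regularization did not fix.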

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.