Where does Absolute Position come from in decoder-only Transformers?

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, extended

Summary

RoPE-trained decoder-only Transformers exhibit absolute position leakage in their attention patterns, even though RoPE is designed to encode only relative offsets. The study identifies two architectural sources: the causal mask, whose per-query softmax denominator depends on the absolute query position, and the residual stream, where position-varying activations flow into queries and keys. The position-0 trajectory is a closed dynamical system driven by the initial token embedding (under causal masking, position 0 attends only to itself), and it contributes significantly to the residual-stream component: replacing the BOS embedding reduces this component by 40% at early queries in Llama-3.2-3B. The balance between the two components varies across architectures: NTK scaling (Qwen) suppresses the residual-stream component, sliding-window attention (Mistral) allows it to accumulate with depth, and standard RoPE (Llama) sits in between. Attention sinks act as token-anchored stabilizers that pass on a deterministic fingerprint of the position-0 token.
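
The causal-mask source can be illustrated with a toy example. The following is a minimal NumPy sketch, not the paper's code: the helper names `rope` and `causal_softmax`, the sequence length, and the head dimension are all assumptions chosen for illustration. It places the same query and key content at every position, so the pre-softmax RoPE logits depend only on the offset between query and key, and then shows that the causally masked softmax weights still change with the absolute query position because each query normalizes over a different number of keys.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate consecutive feature pairs of x (seq, dim) by position-dependent angles."""
    dim = x.shape[-1]
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)   # (dim/2,)
    angles = positions[:, None] * inv_freq[None, :]     # (seq, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def causal_softmax(logits):
    """Row-wise softmax where query i may only attend to keys 0..i."""
    seq = logits.shape[0]
    mask = np.tril(np.ones((seq, seq), dtype=bool))
    masked = np.where(mask, logits, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(masked)
    return w / w.sum(axis=-1, keepdims=True)

# Toy setup (assumed values): identical content at every position.
seq, dim = 8, 16
rng = np.random.default_rng(0)
q = np.tile(rng.standard_normal(dim), (seq, 1))
k = np.tile(rng.standard_normal(dim), (seq, 1))
pos = np.arange(seq, dtype=float)

logits = rope(q, pos) @ rope(k, pos).T
weights = causal_softmax(logits)

# Pre-softmax logits are shift-invariant: entry (m, n) depends only on m - n.
print(np.allclose(logits[1:, 1:], logits[:-1, :-1]))
# Post-softmax self-attention weight (offset 0) still varies with query position.
print(np.diag(weights))
```

In this sketch the logit at offset 0 is the same for every query, yet its softmax weight is exactly 1 at position 0 and smaller afterwards, because the denominator sums over one more key at each step. This isolates only the causal-mask effect; it does not model the residual-stream component.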

Key takeaway

For AI Scientists and Machine Learning Engineers optimizing Transformer architectures, understanding absolute position leakage is crucial. When designing or fine-tuning models with RoPE, investigate the interplay between causal masking and the residual stream, particularly the influence of the position-0 token. Consider architectural choices such as NTK scaling to mitigate residual-stream leakage, especially when precise relative positioning is paramount or when you need to control the stability of attention sinks.
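
For readers unfamiliar with NTK scaling, the sketch below shows one commonly used variant, the "NTK-aware" base adjustment, in which the RoPE base is multiplied by `scale ** (dim / (dim - 2))`. This is an assumption for illustration: the summary does not say which variant or which settings the studied models use, and the function names and the values of `dim`, `base`, and `scale` are hypothetical. The sketch only shows how the adjustment reshapes the rotation frequencies; it does not measure leakage.

```python
import numpy as np

def rope_inv_freq(dim, base=10000.0):
    """Per-pair rotation frequencies used by RoPE for a head of size dim."""
    return base ** (-np.arange(0, dim, 2) / dim)

def ntk_scaled_base(base, scale, dim):
    """NTK-aware base adjustment (assumed variant, not taken from the paper)."""
    return base * scale ** (dim / (dim - 2))

# Hypothetical settings chosen only for illustration.
dim, base, scale = 64, 10000.0, 4.0
plain = rope_inv_freq(dim, base)
scaled = rope_inv_freq(dim, ntk_scaled_base(base, scale, dim))
ratio = scaled / plain

# The highest frequency is left unchanged, the lowest is divided by `scale`,
# and the frequencies in between are stretched progressively.
print(ratio[0], ratio[-1])
```

Under these assumptions the fastest-rotating pair keeps its frequency while the slowest-rotating pair is slowed by exactly the scale factor, so short-range offsets stay well resolved while long-range ones are compressed.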

Key insights

RoPE-trained Transformers leak absolute position via causal masks and position-0 residual stream dynamics.

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.