Where does Absolute Position come from in decoder-only Transformers?
Summary
RoPE-trained decoder-only Transformers exhibit absolute position awareness even though Rotary Position Embeddings (RoPE) encode only relative offsets. This study traces the leakage to two architectural elements: the causal mask and the residual stream. Under the causal mask, the per-query softmax denominator inherently depends on the absolute position of the query. In the residual stream, the activation at position 0 functions as a closed dynamical system (under a causal mask, position 0 can attend only to itself), and downstream "sink-reading" attention heads extract its trajectory. The balance between these components varies across architectures: NTK scaling diminishes the residual-stream effect, sliding-window attention strengthens its accumulation with depth, and standard RoPE falls in between. Notably, replacing the "BOS" embedding before the forward pass eliminates 40% of the residual-stream component for early queries. Attention sinks therefore act as token-anchored stabilizers, forwarding a deterministic fingerprint of the token at position 0.
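The premise that RoPE by itself carries only relative offsets can be checked numerically. The sketch below is a minimal illustration, not the study's code: it assumes the common pairwise-rotation form of RoPE with base 10000, and the names `rope` and `score` are hypothetical helpers introduced here.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs of x by angles pos * theta_i (assumed standard RoPE form)."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)  # one frequency per coordinate pair
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

def score(m, n):
    """Attention logit between a query at position m and a key at position n."""
    return float(rope(q, m) @ rope(k, n))

# Same offset (3), very different absolute positions: the logits agree.
print(np.isclose(score(5, 2), score(105, 102)))
```

Because the logits are identical for equal offsets, any absolute-position signal in such a model has to enter through something other than the rotation itself, which is the question the study addresses.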
Key takeaway
If you are debugging unexpected positional biases in a RoPE-trained decoder-only Transformer, keep in mind that absolute position leaks in through the causal mask and the residual stream. To reduce this implicit positional dependence, consider manipulating the "BOS" embedding, which the study reports removes 40% of the residual-stream component for early queries. Architectural choices such as NTK scaling or sliding-window attention also significantly influence how absolute position information accumulates.
Key insights
RoPE-trained Transformers pick up absolute position information through two channels: the causal mask and the attention sinks carried in the residual stream.
Principles
- Causal masks introduce absolute position via softmax denominators.
- Residual streams propagate the initial token's state via attention sinks.
- Architectural scaling impacts position information accumulation.
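The first principle above can be illustrated with a toy case. This is a hedged sketch, not the study's experiment: it assumes all attention logits are zero (so they carry no content or position signal), and `causal_attention_weights` is a hypothetical helper written for this example.

```python
import numpy as np

def causal_attention_weights(scores):
    """Row-wise softmax over scores, with keys after the query position masked out."""
    T = scores.shape[0]
    visible = np.tril(np.ones((T, T), dtype=bool))      # query i sees keys 0..i
    masked = np.where(visible, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(masked)
    return w / w.sum(axis=-1, keepdims=True)

T = 6
scores = np.zeros((T, T))  # assumption: logits contain no positional signal at all
w = causal_attention_weights(scores)

# Query i normalises over i + 1 visible keys, so each weight is 1 / (i + 1).
weight_on_first_token = w[:, 0]
print(weight_on_first_token)
```

Even with position-free logits, the weight each query places on the first token shrinks as 1 / (i + 1), so the attention output is a function of the query's absolute index. This is only the simplest version of the effect; real logits are not uniform.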
In practice
- Replacing the "BOS" embedding reduces position leakage for early queries.
- NTK scaling mitigates residual-stream position influence.
- Sliding-window attention increases depth-wise position accumulation.
Topics
- RoPE
- Decoder-only Transformers
- Positional Encoding
- Causal Mask
- Residual Stream
- Attention Sinks
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.