Where does Absolute Position come from in decoder-only Transformers?

2026-06-04 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

RoPE-trained decoder-only Transformers exhibit absolute position awareness even though Rotary Position Embeddings (RoPE) encode only relative offsets. The study traces this leakage to two architectural elements. The first is the causal mask: the per-query softmax denominator depends on the absolute query position. The second is the residual stream: the activation at position 0 behaves as a closed dynamical system (under the causal mask it can attend only to itself), and downstream "sink-reading" attention heads extract its trajectory. The balance between the two varies across architectures: NTK scaling diminishes the residual-stream effect, sliding-window attention increases its accumulation with depth, and standard RoPE falls in between. Replacing the "BOS" embedding before the forward pass eliminates 40% of the residual-stream component for early queries. Attention sinks therefore act as token-anchored stabilizers that forward a deterministic fingerprint of the token at position 0.
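The claim that RoPE encodes only relative offsets can be checked numerically. The following is a minimal sketch, not the study's code: the function names `rope_rotate` and `rope_score`, the toy vectors, and the base of 10000 are assumptions chosen for illustration. It shows that the attention score between a rotated query and key is unchanged when both positions are shifted by the same amount.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles (RoPE)."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Pair i/2 uses frequency base^(-i/dim), the usual RoPE schedule.
        theta = pos * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def rope_score(q, k, q_pos, k_pos):
    """Dot product of the rotated query and key (one attention logit)."""
    rq = rope_rotate(q, q_pos)
    rk = rope_rotate(k, k_pos)
    return sum(a * b for a, b in zip(rq, rk))

# Hypothetical toy vectors, head dimension 4.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

score_near = rope_score(q, k, q_pos=5, k_pos=2)      # offset 3, early in the sequence
score_far = rope_score(q, k, q_pos=105, k_pos=102)   # offset 3, shifted by 100
score_other = rope_score(q, k, q_pos=5, k_pos=4)     # offset 1
```

Here `score_near` and `score_far` agree up to floating-point error, while `score_other` differs, so the logits alone carry no absolute position. Any absolute signal has to enter elsewhere, which is where the causal mask and the residual stream come in.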

Key takeaway

If you are a machine learning engineer debugging unexpected positional biases in a RoPE-trained decoder-only Transformer, the key point is that absolute position leaks in through the causal mask and the residual stream. To reduce this implicit positional dependence, consider replacing the "BOS" embedding before the forward pass, which the study reports removes 40% of the residual-stream component for early queries. Architectural choices matter as well: NTK scaling diminishes the residual-stream effect, while sliding-window attention increases its accumulation with depth.

Key insights

RoPE-trained decoder-only Transformers pick up absolute position information from two sources: the causal mask and the attention sinks that read the residual stream at position 0.

Principles

Causal masks introduce absolute position via softmax denominators.
Residual streams propagate initial token state via attention sinks.
Architectural choices such as NTK scaling and sliding-window attention change how much position information accumulates.
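The first principle can be made concrete with a toy attention row. The sketch below is an illustration under simplifying assumptions, not the study's experiment: all logits are set to zero so the scores carry no positional signal at all, and the helper name `causal_softmax_row` is hypothetical. Even so, the causal mask makes the softmax denominator grow with the query position, and a head that reads a distinctive value at position 0 can recover that position.

```python
import math

def causal_softmax_row(logits, query_pos):
    """Softmax over keys 0..query_pos only (the causal mask for one query)."""
    visible = logits[: query_pos + 1]
    peak = max(visible)
    exps = [math.exp(x - peak) for x in visible]
    denom = sum(exps)  # number of terms depends on the absolute query position
    return [e / denom for e in exps]

seq_len = 8
flat_logits = [0.0] * seq_len  # position-free scores: every key looks identical

# Weight each query assigns to key 0; with flat logits this is 1 / (i + 1).
first_key_weight = [causal_softmax_row(flat_logits, i)[0] for i in range(seq_len)]

# Values: 1.0 at position 0 and 0.0 elsewhere, so the head output equals that weight.
values = [1.0] + [0.0] * (seq_len - 1)
head_output = [
    sum(w * v for w, v in zip(causal_softmax_row(flat_logits, i), values))
    for i in range(seq_len)
]

# Inverting 1 / (i + 1) reads the absolute position back out of the head output.
decoded_pos = [round(1.0 / out) - 1 for out in head_output]
```

In this toy setting `decoded_pos` reproduces the query indices exactly. Real models have non-flat logits, so the relationship is noisier, but the dependence of the denominator on the number of visible keys remains.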

In practice

Replacing the "BOS" embedding before the forward pass removes 40% of the residual-stream component for early queries.
NTK scaling mitigates residual-stream position influence.
Sliding-window attention increases depth-wise position accumulation.
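One way to build intuition for why depth matters under sliding-window attention is to track how far information from position 0 can travel per layer. This is a hedged sketch of the general receptive-field property of stacked windowed attention; it is not the study's analysis, and the summary does not say this is the mechanism behind its finding. The function name `reachable_from_zero` and the sizes used are assumptions.

```python
def reachable_from_zero(seq_len, window, num_layers):
    """Farthest position that can depend on position 0 after each layer.

    Each query at position t attends to keys in [t - window + 1, t].
    """
    reach = [i == 0 for i in range(seq_len)]
    history = []
    for _ in range(num_layers):
        new_reach = []
        for t in range(seq_len):
            lo = max(0, t - window + 1)
            # Position t depends on position 0 if any visible key already does.
            new_reach.append(any(reach[lo : t + 1]))
        reach = new_reach
        history.append(max(i for i, r in enumerate(reach) if r))
    return history

# Hypothetical sizes: 32 tokens, window of 4, 5 layers.
farthest = reachable_from_zero(seq_len=32, window=4, num_layers=5)
```

With a window of 4, the reach grows by 3 positions per layer, so queries far from the start can only see position 0 indirectly, through activations that earlier layers have written into the residual stream.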

Topics

RoPE
Decoder-only Transformers
Positional Encoding
Causal Mask
Residual Stream
Attention Sinks

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.