RoVE: Rotary Value Embeddings Attention for Relative Position-dependent Value Pathways

2026-06-09 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, quick

Summary

RoVE (Rotary Value Embeddings) is a parameter-free modification that addresses a limitation of Rotary Position Embeddings (RoPE): RoPE makes attention scores depend on relative position but leaves the value pathway position-blind. RoVE makes values position-sensitive by rotating them together with the keys, which effectively turns RoPE attention into an attentive convolution. This perspective unifies formulations that were developed independently in computer vision, robotics, and modern LLM architectures. In evaluations on GPT-2 models trained at 124M and 354M parameters, RoVE shows consistent gains over RoPE in few-shot in-context learning, out-of-distribution perplexity, and long-context retrieval, with the largest benefits on tasks that demand long-range aggregation.

Key takeaway

For machine learning engineers optimizing large language models for long-context tasks, RoVE offers a parameter-free upgrade to existing RoPE implementations. Consider integrating it where you need gains in few-shot in-context learning, out-of-distribution perplexity, or long-context retrieval, particularly in applications that depend on robust long-range aggregation. The attentive-convolution view also ties together formulations from computer vision, robotics, and LLM architectures, which suggests the enhancement is broadly applicable.

Key insights

RoVE enhances RoPE by making attention values position-sensitive, improving LLM performance on long-range tasks.

Principles

Value pathways can be position-sensitive.
Rotating values with keys enables position-sensitivity.
Attentive convolution unifies diverse architectures.
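
The principles above can be written out as a short derivation. This is a sketch under stated assumptions rather than the paper's own notation: $R_m$ stands for the block-diagonal RoPE rotation at position $m$ (so $R_i^\top R_j = R_{j-i}$), $a_{ij}$ for the softmax attention weights, and the counter-rotation $R_i^\top$ applied to the output is an assumption, since the summary only states that values are rotated with keys. That counter-rotation is what would make the value pathway depend on the offset $j - i$ alone.

```latex
\begin{align}
s_{ij} &= (R_i q_i)^\top (R_j k_j) = q_i^\top R_{j-i}\, k_j
  && \text{(RoPE scores: relative)} \\
o_i^{\mathrm{RoPE}} &= \sum_j a_{ij}\, v_j
  && \text{(RoPE values: position-blind)} \\
o_i^{\mathrm{RoVE}} &= R_i^\top \sum_j a_{ij}\, R_j v_j = \sum_j a_{ij}\, R_{j-i}\, v_j
  && \text{(rotated values: relative)}
\end{align}
```

Under these assumptions the last line has the shape of a convolution whose kernel $R_{j-i}$ depends only on the offset and is weighted by data-dependent attention, which is one way to read the term "attentive convolution".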

Method

RoVE modifies RoPE by rotating value embeddings concurrently with key embeddings, making the value pathway position-sensitive without adding parameters. This transforms RoPE attention into attentive convolution.
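
The method can be illustrated with a minimal, dependency-free sketch. Several things here are assumptions and not taken from the source: the function names (`rotate`, `dot`, `attention`), the `rotate_values` and `start` parameters, the standard RoPE frequency schedule with base 10000, and above all the counter-rotation of the output by the query position, which the summary does not spell out. Each 2-D pair of the head dimension is stored as one complex number so that a rotation is a single multiplication. The code favors clarity over speed and is not the authors' implementation.

```python
import cmath
import math


def rotate(vec, pos, base=10000.0):
    """RoPE-style rotation of a vector stored as complex pairs.

    Each complex entry is one 2-D pair of the head dimension; pair k is
    rotated by the angle pos * base**(-k / len(vec)). A negative pos
    applies the inverse rotation.
    """
    n = len(vec)
    return [z * cmath.exp(1j * pos * base ** (-k / n)) for k, z in enumerate(vec)]


def dot(a, b):
    """Real inner product of two complex-pair vectors."""
    return sum((x.conjugate() * y).real for x, y in zip(a, b))


def attention(q, k, v, rotate_values, start=0):
    """Causal single-head attention with RoPE applied to queries and keys.

    q, k, v: lists (one entry per token) of complex-pair vectors.
    rotate_values: if True, also rotate each value by its own position and
        counter-rotate the aggregated output by the query position
        (ASSUMPTION: the counter-rotation is not stated in the summary).
    start: position index of the first token (hypothetical helper argument,
        used only to check shift invariance).
    """
    T = len(q)
    scale = 1.0 / math.sqrt(2 * len(q[0]))  # head dim = 2 * number of pairs
    q_rot = [rotate(q[i], start + i) for i in range(T)]
    k_rot = [rotate(k[j], start + j) for j in range(T)]
    v_in = [rotate(v[j], start + j) for j in range(T)] if rotate_values else v

    out = []
    for i in range(T):
        # Scores over the causal window j <= i, then a stable softmax.
        scores = [dot(q_rot[i], k_rot[j]) * scale for j in range(i + 1)]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Weighted sum of (possibly rotated) values, one pair at a time.
        agg = [
            sum(weights[j] / total * v_in[j][d] for j in range(i + 1))
            for d in range(len(v[0]))
        ]
        out.append(rotate(agg, -(start + i)) if rotate_values else agg)
    return out
```

If every token carries the same value vector, `attention(..., rotate_values=False)` returns that vector at every position, because a plain weighted average cannot see where the values came from. With `rotate_values=True` the output changes with how the attention weights are spread over offsets, while shifting all positions by a constant `start` leaves it unchanged.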

In practice

Improve GPT-2 performance on long contexts.
Enhance few-shot in-context learning.
Lower out-of-distribution perplexity.

Topics

RoVE
Rotary Position Embeddings
Attention Mechanisms
Large Language Models
In-context Learning
Long-Context Retrieval

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.