RayPE: Ray-Space Positional Encoding for 3D-Aware Video Generation

2026-06-25 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, quick

Summary

RayPE is a positional-encoding extension for video diffusion transformers (DiTs), which by default have no explicit awareness of 3D scene structure. It additively injects per-token 6D Plücker ray coordinates into the queries and keys of self-attention, motivated by an analogy between the Plücker reciprocal product and the dot product in Transformer attention. To stay stable across widely varying camera-translation scales, it decouples ray direction from moment magnitude, gates the encoding with a learned function of the log-magnitude, and applies RMSNorm. The module adds less than 0.1% parameters to a pretrained video DiT and is zero-initialized, so it should leave the pretrained model's outputs unchanged at the start of fine-tuning. Trained on a four-dataset mixture, it is reported to improve camera controllability, cross-frame 3D consistency, and overall video quality.

Key takeaway

For machine learning engineers building 3D-aware video generation models, RayPE offers a direct way to inject camera-ray geometry into transformer attention. It adds less than 0.1% parameters and is reported to improve cross-frame 3D consistency and camera controllability. Consider it when a video diffusion transformer needs more coherent, camera-controllable output.

Key insights

RayPE integrates 3D camera ray geometry into video diffusion transformer self-attention for enhanced consistency.

Principles

Geometric ray relations are bilinear, like attention dot products.
Adding per-token 6D Plücker coordinates to queries and keys makes attention geometry-aware.
Decouple ray direction from moment magnitude for stability.
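
The bilinear analogy in the principles above can be made concrete with a small NumPy sketch. It assumes the standard Plücker convention (unit direction d, moment m = o × d) and the standard reciprocal product d1·m2 + d2·m1, which is zero when two rays are coplanar. The function names plucker, flip, and reciprocal_product are illustrative, not taken from the paper, and reading the "flip" as swapping the direction and moment halves is an assumption.

```python
import numpy as np

def plucker(origin, direction):
    """6D Plücker coordinates (d, m) of a ray, with m = origin x d."""
    d = direction / np.linalg.norm(direction, axis=-1, keepdims=True)
    m = np.cross(origin, d)
    return np.concatenate([d, m], axis=-1)

def flip(p):
    """Swap the direction and moment halves: (d, m) -> (m, d)."""
    return np.concatenate([p[..., 3:], p[..., :3]], axis=-1)

def reciprocal_product(p1, p2):
    """d1.m2 + d2.m1, written as a plain dot product with a flipped vector."""
    return np.sum(p1 * flip(p2), axis=-1)

# Ray a runs along the x-axis; ray b crosses it at (1, 0, 0); ray c is
# ray b lifted by one unit in z, so it is skew to ray a.
a = plucker(np.array([0.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0]))
b = plucker(np.array([1.0, -1.0, 0.0]), np.array([0.0, 1.0, 0.0]))
c = plucker(np.array([1.0, -1.0, 1.0]), np.array([0.0, 1.0, 0.0]))
```

Here reciprocal_product(a, b) vanishes because the two rays intersect, while reciprocal_product(a, c) does not. Because the product is an ordinary dot product once one argument is flipped, it has the same bilinear form as a query-key score.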

Method

RayPE adds 6D Plücker coordinates to the queries and keys of self-attention, using a query/key flip arrangement. It gates the encoding with a learned log-magnitude function and applies RMSNorm for stability across varied camera-translation scales.
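
A minimal NumPy sketch of how these pieces could fit together follows. It is not the paper's implementation: the sigmoid gate with scalar parameters a and b, the choice to flip only the key-side features, the projection matrices w_q and w_k, and all function names are assumptions made for illustration.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned scale: divide by the root mean square."""
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def flip(x):
    """Swap the direction and moment halves of a 6D vector."""
    return np.concatenate([x[..., 3:], x[..., :3]], axis=-1)

def ray_features(plucker, a, b, eps=1e-6):
    """Decouple moment direction from magnitude, gate by log-magnitude, normalize.

    plucker: (..., 6) array laid out as (direction, moment).
    a, b: parameters of a hypothetical gate sigmoid(a * log|m| + b).
    """
    d, m = plucker[..., :3], plucker[..., 3:]
    mag = np.linalg.norm(m, axis=-1, keepdims=True)
    m_hat = m / (mag + eps)                      # unit moment, scale removed
    gate = 1.0 / (1.0 + np.exp(-(a * np.log(mag + eps) + b)))
    return rms_norm(np.concatenate([d, gate * m_hat], axis=-1))

def inject(q, k, plucker, w_q, w_k, a=1.0, b=0.0):
    """Additively inject projected ray features into queries and keys.

    w_q, w_k: (6, head_dim) projections; zero-initializing them makes the
    module an identity on q and k before any training.
    """
    feats = ray_features(plucker, a, b)
    return q + feats @ w_q, k + flip(feats) @ w_k
```

With zero-initialized w_q and w_k, inject returns q and k unchanged, which matches the zero-initialization described in the summary. Because the moment is normalized before gating and the result passes through RMSNorm, the features stay bounded even when camera translations differ by orders of magnitude.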

In practice

Improve 3D consistency in generated videos.
Enhance camera controllability in DiT models.
Extend pretrained video DiTs with minimal parameter overhead.

Topics

RayPE
Video Diffusion Transformers
Positional Encoding
3D Video Generation
Plücker Coordinates
Camera Geometry

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.