RayPE: Ray-Space Positional Encoding for 3D-Aware Video Generation

· Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, quick

Summary

RayPE is a positional-encoding extension for video diffusion transformers (DiTs), which otherwise lack explicit awareness of 3D scene structure. It additively injects per-token 6D Plücker ray coordinates into the queries and keys of self-attention, drawing on an analogy between the Plücker reciprocal product and the attention dot product. To stay stable across widely varying camera-translation scales, it decouples ray direction from moment magnitude, gates the encoding with a learned log-magnitude function, and applies RMSNorm. The module adds less than 0.1% to the parameter count of a pretrained video DiT and is zero-initialized. Trained on a four-dataset mixture, it is reported to improve camera controllability, cross-frame 3D consistency, and overall video quality.
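The analogy between the Plücker reciprocal product and a dot product can be checked numerically. The sketch below is illustrative only: it assumes the common convention m = o × d for the moment of a ray with origin o and unit direction d, and the helper names (`plucker`, `reciprocal_product`, `flip`) are hypothetical, not taken from the paper. It shows that the reciprocal product d1·m2 + d2·m1 equals an ordinary dot product once the direction and moment halves of one ray are swapped, which is one plausible reading of how a query/key flip lets attention scores carry this geometric quantity.

```python
import numpy as np

def plucker(origin, direction):
    """6D Pluecker coordinates (d, m) of a ray, assuming m = o x d."""
    d = direction / np.linalg.norm(direction)
    m = np.cross(origin, d)
    return np.concatenate([d, m])

def reciprocal_product(p1, p2):
    """Pluecker reciprocal product: d1 . m2 + d2 . m1 (zero for coplanar lines)."""
    d1, m1 = p1[:3], p1[3:]
    d2, m2 = p2[:3], p2[3:]
    return float(d1 @ m2 + d2 @ m1)

def flip(p):
    """Swap the direction and moment halves: (d, m) -> (m, d)."""
    return np.concatenate([p[3:], p[:3]])

# ray_a and ray_b lie in the plane z = 0 and meet at (1, 1, 0).
ray_a = plucker(np.array([0.0, 1.0, 0.0]), np.array([1.0, 0.0, 0.0]))
ray_b = plucker(np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]))
# ray_c lies in the plane z = 1 and is skew to ray_a.
ray_c = plucker(np.array([0.0, 0.0, 1.0]), np.array([0.0, 1.0, 0.0]))

print(reciprocal_product(ray_a, ray_b))  # intersecting rays
print(reciprocal_product(ray_a, ray_c))  # skew rays
print(float(ray_a @ flip(ray_c)))        # same value via a plain dot product
```

Intersecting (coplanar) rays give a reciprocal product of zero and skew rays give a non-zero value, so the quantity encodes how two rays relate in 3D, and the flipped dot product reproduces it exactly.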

Key takeaway

For machine learning engineers building 3D-aware video generation models, RayPE offers a direct way to inject camera-ray geometry into transformer attention. At under 0.1% additional parameters, it is reported to improve cross-frame 3D consistency and camera controllability, which makes it a low-cost candidate for video diffusion transformers that need more coherent, controllable 3D output.

Key insights

RayPE integrates 3D camera-ray geometry into the self-attention of video diffusion transformers to improve cross-frame 3D consistency.

Principles

Method

RayPE additively injects 6D Plücker coordinates into the queries and keys of self-attention, using a query/key flip arrangement. It gates the encoding with a learned log-magnitude function and applies RMSNorm for stability across varied camera-translation scales.
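The stabilization steps named above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the class `RayPESketch`, the weights `w_feat`, `w_gate`, and `w_out`, the sigmoid gate, the use of log(1 + |m|) as the log-magnitude, and the choice to zero-initialize the output projection are all hypothetical placements chosen to match the summary's description.

```python
import numpy as np

def rms_norm(x, eps=1e-8):
    """RMSNorm without a learned scale: divide by the root mean square."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def split_ray(d, m, eps=1e-8):
    """Decouple direction from moment magnitude (assumed formulation)."""
    mag = np.linalg.norm(m, axis=-1, keepdims=True)
    unit_feats = np.concatenate([d, m / (mag + eps)], axis=-1)  # scale-free part
    return unit_feats, np.log1p(mag)                            # log-magnitude

class RayPESketch:
    """Hypothetical ray encoding added to queries (keys would be analogous)."""

    def __init__(self, head_dim, rng):
        self.w_feat = rng.normal(scale=0.02, size=(6, head_dim))
        self.w_gate = rng.normal(scale=0.02, size=(1, head_dim))
        self.w_out = np.zeros((head_dim, head_dim))  # assumed zero-init site

    def encode(self, d, m):
        feats, log_mag = split_ray(d, m)
        gate = 1.0 / (1.0 + np.exp(-(log_mag @ self.w_gate)))  # assumed sigmoid gate
        return rms_norm(gate * (feats @ self.w_feat))

    def __call__(self, q, d, m):
        # Additive injection; with w_out at zero the queries pass through unchanged.
        return q + self.encode(d, m) @ self.w_out

rng = np.random.default_rng(0)
pe = RayPESketch(head_dim=8, rng=rng)
q = rng.normal(size=(4, 8))                     # 4 tokens, one attention head
d = rng.normal(size=(4, 3))
d /= np.linalg.norm(d, axis=-1, keepdims=True)  # unit ray directions
m = np.cross(rng.normal(size=(4, 3)), d)        # moments from random origins

q_out = pe(q, d, m)
enc_near = pe.encode(d, m)
enc_far = pe.encode(d, 1000.0 * m)              # same rays, 1000x translation scale
```

In this sketch the zero-initialized projection means the pretrained attention is unchanged at the start of fine-tuning, and the RMSNorm keeps the encoding at roughly unit scale whether camera translations are small or 1000 times larger.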

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.