RayPE: Ray-Space Positional Encoding for 3D-Aware Video Generation
Summary
RayPE is a positional-encoding extension for video diffusion transformers (DiTs), which otherwise lack awareness of 3D scene structure. It additively injects per-token 6D Plücker ray coordinates into the queries and keys of self-attention, drawing on an analogy between the Plücker reciprocal product and the attention dot product. To stay stable across diverse camera-translation scales, it decouples ray direction from moment magnitude, gates the encoding with a learned log-magnitude function, and applies RMSNorm. The module is zero-initialized and adds less than 0.1% to the parameter count of a pretrained video DiT. Trained on a four-dataset mixture, it improves camera controllability, cross-frame 3D consistency, and overall video quality.
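As a rough illustration of what a per-token 6D Plücker coordinate can look like, the sketch below computes the Plücker coordinates (d, m) of the ray through one pixel of a pinhole camera, with unit direction d and moment m = o × d. The pinhole model, the function name `plucker_ray`, and all parameter names are assumptions made for illustration; they are not taken from the original article.

```python
import math

def cross(a, b):
    """3D cross product a x b."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def normalize(v):
    """Scale v to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def matvec(M, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]

def plucker_ray(u, v, fx, fy, cx, cy, R_c2w, origin):
    """Hypothetical helper: 6D Pluecker coordinates of the ray through pixel (u, v).

    R_c2w is a 3x3 camera-to-world rotation and origin is the camera
    center in world coordinates (assumed conventions).
    """
    d_cam = [(u - cx) / fx, (v - cy) / fy, 1.0]   # back-project the pixel
    d = normalize(matvec(R_c2w, d_cam))           # unit ray direction in world space
    m = cross(origin, d)                          # moment m = o x d
    return d + m                                  # concatenate into a 6-vector

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
intrinsics = dict(fx=50.0, fy=50.0, cx=32.0, cy=32.0)

# Ray through the principal point, camera one unit along +x.
ray_near = plucker_ray(32.0, 32.0, R_c2w=identity, origin=[1.0, 0.0, 0.0], **intrinsics)
# Same ray direction, camera twice as far from the world origin.
ray_far = plucker_ray(32.0, 32.0, R_c2w=identity, origin=[2.0, 0.0, 0.0], **intrinsics)
```

The direction half stays unit length while the moment half scales with the camera's translation (here it doubles), which is the kind of scale mismatch that decoupling direction from moment magnitude is meant to handle.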
Key takeaway
For machine learning engineers building 3D-aware video generation models, RayPE offers a direct way to inject camera-ray geometry into transformer attention. At a cost of less than 0.1% additional parameters, it can improve cross-frame 3D consistency and camera controllability. Consider integrating RayPE into a video diffusion transformer when the generated video needs to be more coherent and controllable in 3D.
Key insights
RayPE integrates 3D camera ray geometry into video diffusion transformer self-attention for enhanced consistency.
Principles
- Geometric relations between rays, such as the Plücker reciprocal product, are bilinear, like the attention dot product.
- Adding 6D Plücker coordinates to queries and keys makes attention aware of ray geometry.
- Decouple ray direction from moment magnitude for stability.
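The bilinear analogy in the principles above can be checked numerically. For two rays with Plücker coordinates (d1, m1) and (d2, m2), the reciprocal product is d1·m2 + d2·m1, and it is zero exactly when the rays are coplanar (intersecting or parallel). The sketch below shows that this equals an ordinary dot product once the direction and moment halves of one vector are swapped. The helper names are hypothetical, and the swap is only meant to illustrate the identity, not to reproduce the article's exact query/key arrangement.

```python
def cross(a, b):
    """3D cross product a x b."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def plucker(origin, direction):
    """6D Pluecker coordinates (d, m) with moment m = o x d."""
    return list(direction) + cross(origin, direction)

def flip(p):
    """Swap the direction and moment halves: (d, m) -> (m, d)."""
    return p[3:] + p[:3]

def reciprocal_product(p1, p2):
    """Pluecker reciprocal product d1.m2 + d2.m1."""
    d1, m1 = p1[:3], p1[3:]
    d2, m2 = p2[:3], p2[3:]
    return dot(d1, m2) + dot(d2, m1)

# Ray a runs along the x-axis; ray b crosses it at the point (1, 0, 0).
ray_a = plucker([0.0, 0.0, 0.0], [1.0, 0.0, 0.0])
ray_b = plucker([1.0, -1.0, 0.0], [0.0, 1.0, 0.0])
# Ray c is ray b lifted one unit in z, so it is skew to ray a.
ray_c = plucker([1.0, -1.0, 1.0], [0.0, 1.0, 0.0])

meets = reciprocal_product(ray_a, ray_b)   # intersecting rays
skew = reciprocal_product(ray_a, ray_c)    # skew rays
as_dot = dot(ray_a, flip(ray_c))           # same value as a plain dot product
```

Since attention scores are dot products between queries and keys, a flipped copy of the ray coordinates on one side lets a standard dot product carry this geometric signal.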
Method
RayPE adds 6D Plücker coordinates to the queries and keys of self-attention, using a query/key flip arrangement. It gates the encoding with a learned log-magnitude function and applies RMSNorm for stability across varied camera-translation scales.
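A minimal sketch of how these pieces might fit together is given below. The article states only that the encoding is gated by a learned log-magnitude function, passed through RMSNorm, and zero-initialized. The specific gate (a sigmoid of an affine function of the log-magnitude with hypothetical scalars `a` and `b`), the position of the normalization, and the linear projection `W` are all assumptions chosen for illustration.

```python
import math

def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned scale: divide by the root mean square."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def scale_stable_feature(plucker6, a=1.0, b=0.0, eps=1e-6):
    """Assumed recipe: unit direction, unit moment direction, log-magnitude gate."""
    d, m = plucker6[:3], plucker6[3:]
    mag = math.sqrt(sum(v * v for v in m))
    m_dir = [v / (mag + eps) for v in m]          # moment direction, magnitude removed
    # Hypothetical gate: sigmoid of an affine function of log|m|.
    gate = 1.0 / (1.0 + math.exp(-(a * math.log(mag + eps) + b)))
    return rms_norm(d + [gate * v for v in m_dir])

def inject(q, feature, W):
    """Additive injection: q + W @ feature, with W of shape (len(q), 6)."""
    return [qi + sum(w * f for w, f in zip(row, feature)) for qi, row in zip(q, W)]

# The same ray seen with small and large camera translation.
feat_small = scale_stable_feature([0.0, 0.0, 1.0, 0.0, -1.0, 0.0])
feat_large = scale_stable_feature([0.0, 0.0, 1.0, 0.0, -1000.0, 0.0])

# With a zero-initialized projection, the query is left unchanged.
query = [0.5, -0.2, 0.1, 0.3]
W_zero = [[0.0] * 6 for _ in query]
query_out = inject(query, feat_small, W_zero)
```

In this sketch the feature values stay bounded even when the moment magnitude grows by a factor of 1000, and the zero-initialized projection makes the injection a no-op at the start of fine-tuning, so the pretrained model's attention is initially unchanged.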
In practice
- Improve 3D consistency in generated videos.
- Enhance camera controllability in DiT models.
- Extend pretrained video DiTs with minimal parameter overhead.
Topics
- RayPE
- Video Diffusion Transformers
- Positional Encoding
- 3D Video Generation
- Plücker Coordinates
- Camera Geometry
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.