OmniDirector: General Multi-Shot Camera Cloning without Cross-Paired Data

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Gaming & Interactive Media · Depth: Expert, extended

Summary

OmniDirector is a video generation framework for general multi-shot camera cloning that does not require cross-paired data. It introduces the "camera grid," a visual representation that encodes camera parameters as videos of grid motion rendered in an empty 3D scene, so that diverse camera motions in single-shot and multi-shot sequences are handled in one format. The model integrates with multimodal diffusion transformers and is trained on a million-scale dataset of camera grid-video pairs built from 1.8M internet videos resized to 480p, for 10k steps at a learning rate of 5e-5 and a batch size of 64. OmniDirector also includes a hierarchical Prompt Expansion (PE) Agent that combines camera motion, character, and action signals at inference time. Reported results include a 39.3% improvement in translation precision (T-Pre) over CamCloneMaster and generalization to complex cinematographic techniques such as the Hitchcock zoom.
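For readers unfamiliar with the Hitchcock (dolly) zoom mentioned above, the sketch below shows why it is a demanding test of camera control: the camera translates while the focal length changes to compensate, so the subject keeps its size on screen. It uses only the standard pinhole relation (projected size is proportional to focal length divided by distance). The function names and numbers are illustrative assumptions, not taken from the paper.

```python
def dolly_zoom_focals(f0, d0, distances):
    """Focal length per frame that keeps a subject's on-screen size fixed.

    Pinhole model: size_on_screen = focal * real_size / distance,
    so holding size constant requires focal = f0 * distance / d0.
    """
    return [f0 * d / d0 for d in distances]


def projected_height(real_height, focal, distance):
    """On-screen height (pixels) of a subject under the pinhole model."""
    return focal * real_height / distance


# Hypothetical shot: the camera dollies in from 4 m to 2 m
# while the lens zooms out to compensate.
distances = [4.0, 3.0, 2.0]
focals = dolly_zoom_focals(f0=800.0, d0=4.0, distances=distances)
heights = [projected_height(1.7, f, d) for f, d in zip(focals, distances)]
```

The subject's height stays constant across frames while the background perspective changes, which is the effect a camera-cloning model has to reproduce from translation and focal length together.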

Key takeaway

For AI engineers building video generation systems, OmniDirector offers a way to control camera motion precisely across multiple shots. Consider adopting its "camera grid" representation and hierarchical Prompt Expansion Agent to work around the scarcity of cross-paired data and to gain finer control over complex camera trajectories and shot transitions, with less content leakage than existing methods. The approach aims at more intuitive, director-level control in generative models.

Key insights

OmniDirector uses a visual camera grid and hierarchical prompt expansion to enable precise multi-shot camera cloning in video generation.

Principles

Method

OmniDirector extracts camera parameters from reference videos, renders them as a "camera grid" video in an empty 3D scene, and trains a multimodal diffusion transformer on million-scale grid-video pairs. At inference, a hierarchical Prompt Expansion Agent merges the camera motion, character, and action signals.
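To make the camera grid idea concrete, here is a minimal sketch of projecting a floor-plane grid through a per-frame camera pose. It is not the authors' renderer: the axis convention, the yaw-only rotation, the point-based grid (a real renderer would draw lines), and every function name are assumptions chosen for illustration. It shows only that camera parameters alone, with no scene content, determine how the grid moves on screen, and that a multi-shot sequence is the same representation with a jump in pose at the cut.

```python
import math


def ground_grid(half_extent=4, spacing=1.0, floor_y=1.5):
    """Lattice points on a floor plane (assumed axes: x right, y down, z forward)."""
    n = int(half_extent / spacing)
    return [(i * spacing, floor_y, j * spacing)
            for i in range(-n, n + 1)
            for j in range(-n, n + 1)]


def world_to_camera(point, cam_pos, yaw):
    """Express a world point in the frame of a camera at cam_pos,
    rotated by yaw (radians) about the vertical axis."""
    dx = point[0] - cam_pos[0]
    dy = point[1] - cam_pos[1]
    dz = point[2] - cam_pos[2]
    c, s = math.cos(yaw), math.sin(yaw)
    # Dot products with the camera's right and forward directions.
    return (c * dx - s * dz, dy, s * dx + c * dz)


def project(p_cam, focal, width=640, height=480):
    """Pinhole projection to pixels; None if behind the camera or off-screen."""
    x, y, z = p_cam
    if z <= 1e-6:
        return None
    u = focal * x / z + width / 2
    v = focal * y / z + height / 2
    if 0 <= u < width and 0 <= v < height:
        return (u, v)
    return None


def render_grid_frames(trajectory, grid=None, width=640, height=480):
    """trajectory: one (cam_pos, yaw, focal) tuple per frame.
    Returns the visible 2D grid points for each frame."""
    grid = grid if grid is not None else ground_grid()
    frames = []
    for cam_pos, yaw, focal in trajectory:
        pts = []
        for p in grid:
            uv = project(world_to_camera(p, cam_pos, yaw), focal, width, height)
            if uv is not None:
                pts.append(uv)
        frames.append(pts)
    return frames


# Hypothetical two-shot sequence: a forward push, then a hard cut to a pan.
shot_a = [((0.0, 0.0, -6.0 + 0.5 * k), 0.0, 500.0) for k in range(4)]
shot_b = [((2.0, 0.0, -6.0), math.radians(-10 - 5 * k), 500.0) for k in range(4)]
frames = render_grid_frames(shot_a + shot_b)
```

Because the grid carries no appearance information from the reference video, conditioning on it rather than on the reference frames is one plausible reading of how the representation limits content leakage.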

Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.