OmniDirector: General Multi-Shot Camera Cloning without Cross-Paired Data

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Gaming & Interactive Media · Depth: Expert, extended

Summary

OmniDirector is a video generation framework for general multi-shot camera cloning that does not require cross-paired data. It introduces the "camera grid," a visual representation that encodes camera parameters as videos of a grid moving within an empty 3D scene, so that diverse camera motions can be handled uniformly across single-shot and multi-shot sequences. The model integrates with multimodal diffusion transformers and is trained on a million-scale camera grid-video dataset (1.8M internet videos resized to 480p; 10k steps, learning rate 5e-5, batch size 64). At inference, a hierarchical Prompt Expansion (PE) Agent combines camera motion, character, and action signals coherently. Reported experiments show a 39.3% improvement in translation precision (T-Pre) over CamCloneMaster and generalization to complex cinematographic techniques such as the Hitchcock zoom.

Key takeaway

For AI engineers building video generation systems, OmniDirector offers a way to control camera motion precisely across multiple shots. Consider adopting its "camera grid" representation and hierarchical Prompt Expansion Agent to work around the scarcity of cross-paired data and to control complex camera trajectories and shot transitions, with less content leakage than existing methods. The approach supports more intuitive, director-level control in generative models.

Key insights

OmniDirector uses a visual camera grid and hierarchical prompt expansion to enable precise multi-shot camera cloning in video generation.

Principles

Decouple camera motion from content using an empty 3D scene grid.
Integrate diverse control signals hierarchically for semantic coherence.
Apply adaptive Classifier-Free Guidance for global spatial structure.
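
As a rough illustration of the guidance principle, the sketch below shows the standard classifier-free guidance combination together with a timestep-dependent scale. The summary does not say how OmniDirector adapts its guidance, so `adaptive_scale` and its default values are hypothetical and chosen only to show the idea of guiding more strongly while the global layout is still forming.

```python
import numpy as np

def cfg_combine(pred_uncond, pred_cond, scale):
    """Standard classifier-free guidance.

    Moves the prediction from the unconditional output toward (and, for
    scale > 1, past) the conditional output.
    """
    return pred_uncond + scale * (pred_cond - pred_uncond)

def adaptive_scale(t, scale_early=7.5, scale_late=3.0):
    """HYPOTHETICAL schedule, not taken from the paper.

    t is a normalized timestep in [0, 1], where t = 1 is the noisiest step.
    The scale is interpolated linearly between scale_late and scale_early.
    """
    t = float(np.clip(t, 0.0, 1.0))
    return scale_late + t * (scale_early - scale_late)

# Toy predictions standing in for a model's two forward passes.
uncond = np.zeros((2, 2))
cond = np.ones((2, 2))
guided_early = cfg_combine(uncond, cond, adaptive_scale(1.0))
guided_late = cfg_combine(uncond, cond, adaptive_scale(0.0))
```

With scale 1 the combination reduces to the conditional prediction, and with scale 0 it reduces to the unconditional one, which makes the function easy to sanity-check before plugging in a real schedule.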

Method

OmniDirector extracts camera parameters from reference videos, renders them as a "camera grid" video in an empty 3D scene, and trains a multimodal diffusion transformer on million-scale grid-video pairs. At inference, a hierarchical Prompt Expansion Agent integrates the camera motion, character, and action signals.
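
The rendering step can be pictured with a minimal pinhole-projection sketch: a fixed grid of 3D points is projected through each frame's camera pose, producing a content-free video whose motion depends only on the camera. This is an assumption-laden illustration, not the paper's renderer. The ground-plane grid layout, the function names, the point-splat rasterization, and the 854-pixel width are all hypothetical choices.

```python
import numpy as np

def make_grid_points(extent=5.0, n=11, height=1.5):
    """Points of a flat grid on the plane y = height (camera convention: y points down)."""
    ticks = np.linspace(-extent, extent, n)
    xs, zs = np.meshgrid(ticks, ticks)
    ys = np.full_like(xs, height)
    return np.stack([xs, ys, zs], axis=-1).reshape(-1, 3)

def project(points_world, R, t, fx, fy, cx, cy):
    """Pinhole projection. R, t map world to camera: x_cam = R @ x_world + t."""
    pts_cam = points_world @ R.T + t
    z = pts_cam[:, 2]
    in_front = z > 1e-6
    safe_z = np.where(in_front, z, 1.0)  # avoid dividing by zero for culled points
    u = fx * pts_cam[:, 0] / safe_z + cx
    v = fy * pts_cam[:, 1] / safe_z + cy
    return np.stack([u, v], axis=-1), in_front

def render_grid_video(poses, fx, fy, height=480, width=854, grid=None):
    """Splat the projected grid points into one binary frame per camera pose."""
    grid = make_grid_points() if grid is None else grid
    cx, cy = width / 2.0, height / 2.0
    frames = np.zeros((len(poses), height, width), dtype=np.uint8)
    for i, (R, t) in enumerate(poses):
        pix, in_front = project(grid, R, t, fx, fy, cx, cy)
        u = np.round(pix[:, 0]).astype(int)
        v = np.round(pix[:, 1]).astype(int)
        keep = in_front & (u >= 0) & (u < width) & (v >= 0) & (v < height)
        frames[i, v[keep], u[keep]] = 255
    return frames

# Illustrative trajectory: the camera dollies forward 2 units along +z.
# With R = I and camera center C, the translation is t = -C.
poses = [(np.eye(3), -np.array([0.0, 0.0, z])) for z in np.linspace(0.0, 2.0, 5)]
video = render_grid_video(poses, fx=300.0, fy=300.0)
```

Because the scene contains nothing but the grid, the resulting frames carry camera information without any appearance content, which is the decoupling the method relies on. A production renderer would draw grid lines rather than isolated points.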

In practice

Generate camera motion from raw RGB or Canny edge videos.
Clone complex effects like fisheye distortion or dolly zoom.
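
To make the dolly zoom (Hitchcock zoom) concrete, the sketch below computes the focal length needed at each camera distance so that the subject keeps a constant on-screen size while the background perspective changes. It assumes an ideal pinhole camera, where projected size is proportional to focal length divided by distance. The function names and numeric values are illustrative and do not come from the paper.

```python
def dolly_zoom_focal_lengths(f0, d0, distances):
    """Focal length per frame that keeps the subject's on-screen size fixed.

    Under a pinhole model, projected size is proportional to f / d, so
    holding f / d equal to f0 / d0 gives f = f0 * d / d0.
    """
    if d0 <= 0 or any(d <= 0 for d in distances):
        raise ValueError("distances must be positive")
    return [f0 * d / d0 for d in distances]

def projected_height(f, d, subject_height):
    """Image-plane height of a subject at distance d (same units as f)."""
    return f * subject_height / d

# Illustrative values only: the camera pulls back from 2 m to 6 m.
distances = [2.0, 3.0, 4.0, 5.0, 6.0]
focals = dolly_zoom_focal_lengths(f0=35.0, d0=2.0, distances=distances)
sizes = [projected_height(f, d, subject_height=1.7) for f, d in zip(focals, distances)]
```

Every entry in `sizes` is the same up to floating-point rounding, which is the defining property of the effect: the camera translates and the intrinsics change together, so a cloning method has to capture both.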

Topics

Video Generation
Camera Control
Diffusion Transformers
Multi-Shot Video
Camera Grid
Prompt Engineering

Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.