Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting

2026-04-23 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

The "Reshoot-Anything" model introduces a self-supervised framework that addresses the scarcity of paired multi-view data for reshooting dynamic videos. It builds pseudo multi-view training triplets from internet-scale monocular videos: each video is cropped along two distinct, smooth random-walk trajectories, one yielding the source view and the other the target view. The third element, a geometric anchor, is synthesized by forward-warping the source's first frame with a dense tracking field, which simulates the distorted point-cloud inputs seen at inference. Because the two crops move independently, the model must learn 4D spatiotemporal structure: to reconstruct the target, it has to route and re-project source textures across different times and viewpoints. At inference, a minimally adapted diffusion transformer is conditioned on an anchor derived from a 4D point cloud, and the summary reports high temporal consistency, robust camera control, and high-fidelity novel view synthesis for complex dynamic scenes.

Key takeaway

For research scientists developing video synthesis or editing tools, this self-supervised approach offers a viable path to overcoming multi-view data scarcity. Consider generating pseudo multi-view data from existing monocular datasets to train robust models; this may reduce reliance on costly paired capture setups and speed up development of advanced video manipulation capabilities.

Key insights

Self-supervised learning from monocular videos enables robust video reshooting by generating pseudo multi-view data.

Principles

Leverage internet-scale monocular videos for training.
Synthesize geometric anchors to simulate inference inputs.
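
To make the second principle concrete, here is a minimal NumPy sketch of forward-warping a first frame with a dense tracking field. It is an illustration, not the paper's implementation: the function name `forward_warp_first_frame`, the `(T, H, W, 2)` track layout with `(x, y)` pixel positions, and the nearest-pixel splatting are all assumptions. Pixels that no track lands on stay empty, which gives the kind of holes and distortions a rendered point cloud would show.

```python
import numpy as np


def forward_warp_first_frame(first_frame, tracks):
    """Splat first-frame pixels to their tracked positions (hypothetical helper).

    first_frame: (H, W, C) image.
    tracks:      (T, H, W, 2) array; tracks[t, y, x] is the (x, y) position
                 at time t of the pixel that started at (x, y) in frame 0.
                 This layout is an assumption made for the sketch.
    Returns the anchor video (T, H, W, C) and a boolean hit mask (T, H, W).
    """
    h, w, c = first_frame.shape
    num_frames = tracks.shape[0]
    anchor = np.zeros((num_frames, h, w, c), dtype=first_frame.dtype)
    mask = np.zeros((num_frames, h, w), dtype=bool)
    colors = first_frame.reshape(-1, c)

    for i in range(num_frames):
        # Round each tracked position to the nearest pixel centre.
        xy = np.rint(tracks[i].reshape(-1, 2)).astype(int)
        x, y = xy[:, 0], xy[:, 1]
        inside = (x >= 0) & (x < w) & (y >= 0) & (y < h)
        # Pixels that collide on the same target are resolved arbitrarily
        # here; a fuller version would order them by depth or confidence.
        anchor[i, y[inside], x[inside]] = colors[inside]
        mask[i, y[inside], x[inside]] = True
    return anchor, mask


if __name__ == "__main__":
    h, w = 4, 5
    frame = np.random.default_rng(0).random((h, w, 3))
    ys, xs = np.mgrid[0:h, 0:w]
    identity = np.stack([xs, ys], axis=-1).astype(float)
    # Frame 0 keeps every pixel in place; frame 1 shifts everything right by 1.
    tracks = np.stack([identity, identity + np.array([1.0, 0.0])])
    anchor, mask = forward_warp_first_frame(frame, tracks)
    print(anchor.shape, mask[1].sum())
```

The hit mask is returned alongside the anchor so that a downstream model could tell empty regions from genuinely black pixels; whether the original method passes such a mask is not stated in this summary.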

Method

Generate pseudo multi-view training triplets from single videos using random-walk crop trajectories for source/target views and forward-warped first frames for anchors.
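
The source and target half of the triplet can be sketched as follows. This is a hedged illustration under stated assumptions: the helper names (`smooth_random_walk`, `crop_along`, `make_source_target`), the Gaussian step size, and the moving-average smoothing are choices made for the example, not details taken from the paper. The anchor would be built separately from the source's first frame.

```python
import numpy as np


def smooth_random_walk(num_frames, max_offset, step_std=2.0, smooth=9, rng=None):
    """Return (num_frames, 2) integer (y, x) crop offsets within [0, max_offset].

    A Gaussian random walk is smoothed with a moving average of odd length
    `smooth`; both the step size and the smoother are assumptions.
    """
    if smooth % 2 == 0:
        raise ValueError("smooth must be odd so the path keeps its length")
    rng = np.random.default_rng() if rng is None else rng
    max_offset = np.asarray(max_offset, dtype=float)  # (max_y, max_x)
    start = rng.uniform(0.0, max_offset)
    steps = rng.normal(0.0, step_std, size=(num_frames, 2))
    steps[0] = 0.0
    path = start + np.cumsum(steps, axis=0)

    # Moving-average smoothing with edge padding keeps num_frames samples.
    pad = smooth // 2
    kernel = np.ones(smooth) / smooth
    padded = np.pad(path, ((pad, pad), (0, 0)), mode="edge")
    path = np.stack(
        [np.convolve(padded[:, k], kernel, mode="valid") for k in range(2)],
        axis=1,
    )
    return np.clip(np.rint(path), 0, max_offset).astype(int)


def crop_along(video, offsets, crop_hw):
    """Crop every frame of a (T, H, W, C) video at its own (y, x) offset."""
    ch, cw = crop_hw
    return np.stack(
        [frame[y:y + ch, x:x + cw] for frame, (y, x) in zip(video, offsets)]
    )


def make_source_target(video, crop_hw, rng=None):
    """Draw two independent crop trajectories from one monocular video."""
    rng = np.random.default_rng() if rng is None else rng
    num_frames, h, w = video.shape[:3]
    ch, cw = crop_hw
    max_offset = (h - ch, w - cw)
    src_off = smooth_random_walk(num_frames, max_offset, rng=rng)
    tgt_off = smooth_random_walk(num_frames, max_offset, rng=rng)
    source = crop_along(video, src_off, crop_hw)
    target = crop_along(video, tgt_off, crop_hw)
    return source, target, src_off, tgt_off


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.random((16, 64, 96, 3))  # stand-in for a decoded clip
    source, target, src_off, tgt_off = make_source_target(video, (32, 48), rng=rng)
    print(source.shape, target.shape)
```

Because the two trajectories are sampled independently, a given target pixel usually appears at a different location, or in a different frame, of the source crop. That mismatch is what pushes a model trained on such pairs to re-project content rather than copy it.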

In practice

Utilize monocular video for multi-view training.
Employ 4D point-cloud anchors for inference.
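
For the inference-side anchor, one common way to turn a point cloud into an image is a pinhole projection with a nearest-point depth test, applied once per frame. The sketch below assumes that setup; the summary does not specify the paper's renderer, and the function name `render_point_cloud`, the world-to-camera convention `X_cam = R @ X + t`, and the one-pixel splat are all illustrative assumptions.

```python
import numpy as np


def render_point_cloud(points, colors, K, R, t, hw):
    """Project coloured 3D points into a target camera (hypothetical helper).

    points: (N, 3) world coordinates for one time step.
    colors: (N, C) per-point colours.
    K:      (3, 3) pinhole intrinsics.
    R, t:   world-to-camera rotation (3, 3) and translation (3,).
    hw:     (H, W) output size.
    Returns the rendered anchor (H, W, C) and a boolean coverage mask (H, W).
    """
    h, w = hw
    cam = points @ R.T + t
    z = cam[:, 2]
    front = z > 1e-6  # drop points behind the camera
    cam, z, col = cam[front], z[front], colors[front]

    uv = cam @ K.T
    u = np.rint(uv[:, 0] / z).astype(int)
    v = np.rint(uv[:, 1] / z).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z, col = u[inside], v[inside], z[inside], col[inside]

    # Depth test: sort by pixel index, then by depth, and keep the first
    # (nearest) point per pixel so the assignment has no duplicate indices.
    flat = v * w + u
    order = np.lexsort((z, flat))
    _, first = np.unique(flat[order], return_index=True)
    keep = order[first]

    image = np.zeros((h, w, colors.shape[1]), dtype=colors.dtype)
    mask = np.zeros((h, w), dtype=bool)
    image[v[keep], u[keep]] = col[keep]
    mask[v[keep], u[keep]] = True
    return image, mask


if __name__ == "__main__":
    K = np.array([[10.0, 0.0, 2.0], [0.0, 10.0, 2.0], [0.0, 0.0, 1.0]])
    points = np.array([[0.0, 0.0, 2.0], [0.0, 0.0, 1.0], [0.0, 0.0, -1.0]])
    colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
    image, mask = render_point_cloud(points, colors, K, np.eye(3), np.zeros(3), (5, 5))
    print(image[2, 2], mask.sum())
```

Calling this once per time step with that step's points and the desired camera pose yields an anchor video. Its holes and occlusion errors are the distortions that the forward-warped training anchors are meant to imitate.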

Topics

Self-Supervised Learning
Video Reshooting
Novel View Synthesis
4D Spatiotemporal Structures
Diffusion Transformer

Best for: Research Scientist, AI Scientist, Computer Vision Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.