Toward More Controllable AI Video Editing: An Early Research Exploration at Netflix
Summary
Netflix has introduced two early research models, Vera and VOID, that aim to give editors more control over AI video editing by addressing common problems such as unintended alterations and unnatural physics.
Vera is a layered video diffusion model that generates specific edit layers and alpha mattes, preserving the original footage outside the edited regions. It was trained on a custom 486k-frame dataset at 832x480 resolution and uses a Mixture-of-Transformers architecture, with 1.3B and 14B parameter variants. Vera outperforms existing baselines in content preservation, according to quantitative metrics and a human preference study involving 19 creative reviewers.
VOID is a video inpainting model designed for physically plausible deletion of objects and their interactions. It uses a two-pass inference pipeline with a VLM-based reasoning component to identify causally affected regions, and it is trained on synthetic counterfactual video pairs. VOID maintains consistent scene dynamics and perceptual realism better than six baselines and received 64.8% preference in a user study with 25 reviewers.
Key takeaway
For creative technologists and computer vision engineers evaluating AI tools for professional video post-production, Netflix's Vera and VOID models point toward more controllable editing. Consider these layered diffusion and physically plausible inpainting approaches to avoid unintended alterations and maintain scene integrity. The research suggests prioritizing models that isolate edits and simulate realistic physics, which could reduce manual rework and increase creative control in your workflows.
Key insights
Netflix's Vera and VOID models advance controllable AI video editing by isolating changes and ensuring physical plausibility.
Principles
- AI should serve creative intent, protecting and expanding choice.
- Isolate changes to preserve source footage integrity.
- Ensure physical plausibility in object removal and scene reconstruction.
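The second principle can be checked numerically. The sketch below is one possible content-preservation measure, not the metric used in the Netflix study, which this summary does not name. It assumes NumPy arrays for the source video, the edited video, and an alpha matte; the function name `preservation_error` and the `threshold` parameter are hypothetical.

```python
import numpy as np

def preservation_error(source, edited, alpha, threshold=0.0):
    """Mean absolute pixel difference over regions the matte marks as untouched.

    source, edited: float arrays of shape (T, H, W, 3) with values in [0, 1]
    alpha:          float array of shape (T, H, W); 0 means "not edited"
    threshold:      alpha values at or below this count as untouched (assumption)
    """
    untouched = alpha <= threshold                 # boolean mask, shape (T, H, W)
    if not untouched.any():
        return 0.0                                 # nothing to preserve
    diff = np.abs(edited - source).mean(axis=-1)   # per-pixel error, shape (T, H, W)
    return float(diff[untouched].mean())
```

A value of 0.0 means every pixel outside the matte is identical to the source; any larger value indicates that the edit leaked into footage it was supposed to leave alone.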
Method
Vera uses layered diffusion with a Mixture-of-Transformers architecture to produce isolated edits. VOID employs a two-pass, VLM-guided pipeline for physically plausible object removal, with refinement using flow-warped noise.
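To make the layered idea concrete, here is a minimal sketch of standard alpha compositing, which is how an edit layer and its alpha matte are typically combined with source footage. Whether Vera's pipeline composites in exactly this way is an assumption; the function name `composite` and the array shapes are illustrative.

```python
import numpy as np

def composite(source, layer, alpha):
    """Blend an edit layer over source footage using an alpha matte.

    source, layer: float arrays of shape (T, H, W, 3)
    alpha:         float array of shape (T, H, W) with values in [0, 1]
    """
    a = alpha[..., None]                   # add a channel axis so it broadcasts over RGB
    return a * layer + (1.0 - a) * source  # alpha 0 keeps source, alpha 1 takes the layer
```

Because the blend reduces to the source wherever the matte is zero, the original pixels outside the edited region pass through unchanged, which is the preservation property the layered approach is built around.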
In practice
- Add objects or change backgrounds without altering the source footage.
- Remove objects while preserving scene physics.
- Reconstruct scenes as if objects were never present.
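The removal items above imply that the region to regenerate can be larger than the object itself, since shadows, reflections, or objects it pushes are also affected. The sketch below shows one way to represent that, assuming both the object and the causally affected regions are available as boolean masks. How VOID actually encodes these regions is not described in this summary, so the function `removal_mask` and its inputs are hypothetical.

```python
import numpy as np

def removal_mask(object_mask, affected_mask):
    """Union of the object to delete and the regions it causally affects.

    object_mask, affected_mask: boolean arrays of shape (T, H, W)
    Returns a boolean array of the same shape marking everything to regenerate.
    """
    if object_mask.shape != affected_mask.shape:
        raise ValueError("masks must have the same shape")
    return np.logical_or(object_mask, affected_mask)
```

An inpainting model given only the object mask would leave a shadow with nothing casting it; widening the mask to the union gives the model room to reconstruct the scene as if the object had never been there.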
Topics
- AI Video Editing
- Diffusion Models
- Video Inpainting
- Content Preservation
- Physical Plausibility
- Netflix Research
Best for: Research Scientist, AI Scientist, Computer Vision Engineer, Creative Technologist
Editorial summary, takeaway, and curation by AIssential. Original article published by Netflix TechBlog - Medium.