Toward More Controllable AI Video Editing: An Early Research Exploration at Netflix

Source: Netflix TechBlog - Medium · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation, Computer Vision & Video Processing) · Depth: Expert, long

Summary

Netflix has introduced two early research models, Vera and VOID, to improve control in AI video editing. Both address common problems with generative edits, such as unintended alterations and unnatural physics.

Vera is a layered video diffusion model. It generates specific edit layers and alpha mattes, which preserves the original footage outside the edited regions. It was trained on a custom 486k-frame dataset at 832x480 resolution and uses a Mixture-of-Transformers architecture, with 1.3B- and 14B-parameter variants. Vera outperforms existing baselines in content preservation, as measured by quantitative metrics and a human preference study involving 19 creative reviewers.

VOID is a video inpainting model designed for physically plausible deletion of objects and their interactions. It uses a two-pass inference pipeline with a VLM-based reasoning component that identifies causally affected regions, and it is trained on synthetic counterfactual video pairs. VOID maintains consistent scene dynamics and perceptual realism better than six baselines, receiving 64.8% preference in a user study with 25 reviewers.

Key takeaway

For Creative Technologists or Computer Vision Engineers evaluating AI tools for professional video post-production, Netflix's Vera and VOID models point toward more controllable editing, though both are early research. Consider layered diffusion and physically plausible inpainting approaches when you need to avoid unintended alterations and maintain scene integrity. The research suggests prioritizing models that isolate edits and produce physically realistic results, which could reduce manual rework and improve creative control in your workflows.

Key insights

Netflix's Vera and VOID models advance controllable AI video editing by isolating changes and ensuring physical plausibility.

Method

Vera uses layered diffusion with a Mixture-of-Transformers architecture to produce isolated edits. VOID employs a two-pass, VLM-guided pipeline for physically plausible object removal, with refinement using flow-warped noise.
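To make the layered idea concrete, here is a minimal sketch of standard alpha compositing, the usual way an edit layer and its alpha matte are blended over original footage. It is an illustration only: the summary does not say how Vera's layers are combined downstream, and the function name `composite`, the `Frame` type, and the single-channel frame layout are all assumptions made for this example.

```python
from typing import List

# Hypothetical layout: one grayscale frame as rows of pixel values in [0, 1].
Frame = List[List[float]]


def composite(original: Frame, edit_layer: Frame, alpha: Frame) -> Frame:
    """Blend an edit layer over the original frame with a per-pixel alpha matte.

    Standard "over" compositing: out = alpha * edit + (1 - alpha) * original.
    Where alpha is 0 the original pixel passes through unchanged.
    """
    out: Frame = []
    for orig_row, edit_row, alpha_row in zip(original, edit_layer, alpha):
        out.append([
            a * e + (1.0 - a) * o
            for o, e, a in zip(orig_row, edit_row, alpha_row)
        ])
    return out


# Toy 2x2 frame: the left column is outside the edit (alpha = 0).
original = [[0.2, 0.4], [0.6, 0.8]]
edit_layer = [[1.0, 1.0], [1.0, 1.0]]
alpha = [[0.0, 1.0], [0.0, 0.5]]

result = composite(original, edit_layer, alpha)
```

In this toy case the left-column pixels equal the original values exactly, which is the property the layered approach is meant to guarantee: regions with a zero matte are never touched by the generated content.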

Best for: Research Scientist, AI Scientist, Computer Vision Engineer, Creative Technologist

Editorial summary, takeaway, and curation by AIssential. Original article published by Netflix TechBlog - Medium.