Aurora: Unified Video Editing with a Tool-Using Agent

2026-05-18 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Emerging Technologies & Innovation · Depth: Expert, quick

Summary

Aurora is an agentic video editing framework that pairs a tool-augmented vision-language model (VLM) agent with a unified video diffusion transformer. It targets underspecified user requests: the agent maps raw input to a structured edit plan, resolving textual and visual ambiguities before any video is generated. Existing video editing models handle tasks such as replacement, removal, and style transfer with a single diffusion transformer, but they typically require pre-processed inputs. Aurora's VLM agent is trained on supervised data for comprehensive edit planning and reference-image selection, plus preference pairs that refine its tool use and instructions. Evaluated on the new AgentEdit-Bench and two other video editing benchmarks, the framework improves over instruction-only baselines, and the VLM agent transfers to other frozen video editing models.

Key takeaway

For research scientists developing video editing tools, Aurora shows one way to handle real-world, underspecified user requests. Consider placing a tool-augmented VLM agent in front of your diffusion-based editing pipeline, so that ambiguities are resolved before generation and users no longer need to supply pre-processed inputs. In the reported evaluation, this setup outperformed instruction-only baselines, and the agent transferred to other frozen video editing models.

Key insights

Aurora uses a VLM agent to resolve underspecified user requests before they reach a unified video diffusion transformer.

Principles

Agentic VLM resolves input underspecification.
Unified diffusion transformers handle diverse edits.
Supervised data trains robust edit planning.

Method

Aurora trains a VLM agent on supervised data for edit planning and reference-image selection, plus preference pairs that refine tool use and instructions. The trained agent maps raw user requests into structured edit plans for a unified video diffusion transformer.
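
To make the idea of a structured edit plan concrete, here is a minimal Python sketch. It is illustrative only: the summary does not describe Aurora's actual plan schema, so the class names, the fields, and the rule that a replacement needs a reference image are all assumptions made for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class EditTask(Enum):
    # Task types named in the summary; the real system may support others.
    REPLACEMENT = "replacement"
    REMOVAL = "removal"
    STYLE_TRANSFER = "style_transfer"


@dataclass
class EditPlan:
    """Hypothetical structured edit plan (not Aurora's actual schema)."""
    task: EditTask
    instruction: str                       # disambiguated text instruction
    target: Optional[str] = None           # object or region to edit
    reference_image: Optional[str] = None  # path picked by reference selection


def unresolved_fields(plan: EditPlan) -> List[str]:
    """List the fields an agent would still need to resolve before generation.

    The requirements per task are assumptions chosen for illustration.
    """
    missing = []
    if plan.task in (EditTask.REPLACEMENT, EditTask.REMOVAL) and not plan.target:
        missing.append("target")
    if plan.task is EditTask.REPLACEMENT and not plan.reference_image:
        missing.append("reference_image")
    return missing
```

In this sketch, an agent would keep calling tools or refining the instruction until unresolved_fields returns an empty list, and only then hand the plan to the editing model.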

In practice

Integrate VLM agents for complex user inputs.
Use preference pairs for tool use refinement.
Develop benchmarks for underspecified tasks.
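
As a rough illustration of what a preference pair for tool use could look like, the sketch below builds pairs from scored candidate tool-call sequences. The summary does not specify how Aurora constructs its pairs or which training objective consumes them, so the record layout, the scoring, and the tool names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PreferencePair:
    """Hypothetical record: one request, a preferred and a rejected tool trace."""
    request: str
    chosen: List[str]
    rejected: List[str]


def build_pairs(
    request: str, candidates: List[Tuple[List[str], float]]
) -> List[PreferencePair]:
    """Pair the best-scored tool trace against every strictly lower-scored one.

    `candidates` holds (tool_calls, score) tuples; where the scores come from
    (human raters, an automatic judge) is left open in this sketch.
    """
    if len(candidates) < 2:
        return []
    # sorted() is stable, so ties keep their original relative order.
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    best_calls, best_score = ranked[0]
    return [
        PreferencePair(request, best_calls, calls)
        for calls, score in ranked[1:]
        if score < best_score  # ties carry no preference signal
    ]
```

Pairs of this shape are the usual input to preference-based fine-tuning methods; which method Aurora uses is not stated in the summary.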

Topics

Aurora Framework
Video Editing
Tool-Using Agents
Vision-Language Models
Diffusion Transformers

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.