Co-Director: Agentic Generative Video Storytelling

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Content Creation & Production) · Depth: Expert, extended

Summary

Co-Director is a hierarchical multi-agent framework for automating high-fidelity video storytelling. It targets two failure modes common in existing agentic pipelines: semantic drift and cascading failures. The framework formalizes video storytelling as a global optimization problem, using a Multi-Armed Bandit (MAB) for top-down creative direction and a local multimodal self-refinement loop to enforce sequence-level consistency and mitigate identity drift. In doing so, the system balances exploring new narrative options against exploiting creative configurations that have already proven effective. For evaluation, the researchers introduce GenAd-Bench, a dataset of 400 fictional product scenarios for personalized advertising, together with a multimodal LLM-based evaluation suite covering Visual Asset Fidelity, Demographic Alignment, Visual Quality, and Marketing Appeal. In the reported experiments, Co-Director significantly outperforms state-of-the-art baselines, and the authors present the approach as generalizable to broader cinematic narratives.

Key takeaway

For Computer Vision Engineers and Research Scientists developing generative video systems, Co-Director's hierarchical multi-agent architecture and MAB-driven global optimization offer a framework for achieving narrative coherence and visual consistency. Consider integrating similar top-down creative steering and iterative self-refinement loops to counter semantic drift and cascading failures, especially when building systems for complex, constraint-driven applications such as advertising or cinematic content.
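To make the idea of an iterative self-refinement loop concrete, here is a minimal sketch in Python. It is not Co-Director's implementation. The names `self_refine`, `Critique`, `generate`, `critique`, `threshold`, and `max_iters` are hypothetical, the 0-to-1 score scale is an assumption, and the stub functions stand in for a real video generator and a multimodal LLM critic.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Critique:
    score: float   # assumed scale: 0.0 (unusable) to 1.0 (fully consistent)
    feedback: str  # free-text notes the generator can condition on


def self_refine(
    prompt: str,
    generate: Callable[[str, str], str],
    critique: Callable[[str], Critique],
    threshold: float = 0.8,
    max_iters: int = 3,
) -> Tuple[str, List[float]]:
    """Regenerate an artifact until the critic's score reaches the
    threshold or the iteration budget is spent; return the best one."""
    if max_iters < 1:
        raise ValueError("max_iters must be at least 1")
    feedback = ""
    best_artifact, best_score = "", float("-inf")
    history: List[float] = []
    for _ in range(max_iters):
        artifact = generate(prompt, feedback)
        result = critique(artifact)
        history.append(result.score)
        if result.score > best_score:
            best_artifact, best_score = artifact, result.score
        if result.score >= threshold:
            break  # good enough: stop spending generation budget
        feedback = result.feedback  # feed the critique into the next attempt
    return best_artifact, history


# Hypothetical stubs: a real system would call a video model and an MLLM here.
def stub_generate(prompt: str, feedback: str) -> str:
    return prompt if not feedback else f"{prompt} | fix: {feedback}"


def stub_critique(artifact: str) -> Critique:
    if "fix:" in artifact:
        return Critique(score=0.9, feedback="")
    return Critique(score=0.5, feedback="keep the protagonist's jacket red")


artifact, scores = self_refine("shot 3: hero enters cafe", stub_generate, stub_critique)
print(artifact)
print(scores)
```

Keeping the best-scoring artifact rather than the last one is a design choice in this sketch: it prevents a late regeneration that scores worse from replacing an earlier, better attempt, which is one simple way to limit cascading failures.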

Key insights

Co-Director uses hierarchical multi-agent optimization and MAB-driven steering for coherent, high-fidelity video storytelling.

Method

Co-Director employs a hierarchical multi-agent system. An Orchestrator Agent uses a MAB for global creative direction, Pre-Production and Production Agents generate content, and a Post-Production Agent handles final assembly. Local self-refinement loops driven by a multimodal LLM (MLLM) correct intermediate artifacts.
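The summary does not say which bandit algorithm the Orchestrator Agent uses, so the following sketch swaps in UCB1, a standard bandit strategy, purely for illustration. The arm names (creative styles), the `run_orchestrator` helper, and the simulated reward are all hypothetical. In a real pipeline the reward would presumably come from an evaluator's score of the generated video.

```python
import math
import random


class UCB1Bandit:
    """UCB1 over a fixed set of arms (here: hypothetical creative configurations)."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = {arm: 0 for arm in self.arms}
        self.means = {arm: 0.0 for arm in self.arms}

    def select(self):
        # Try every arm once before applying the confidence bound.
        for arm in self.arms:
            if self.counts[arm] == 0:
                return arm
        total = sum(self.counts.values())
        # Mean reward (exploitation) plus an uncertainty bonus (exploration).
        return max(
            self.arms,
            key=lambda a: self.means[a]
            + math.sqrt(2 * math.log(total) / self.counts[a]),
        )

    def update(self, arm, reward):
        # Incremental running mean; rewards are assumed to lie in [0, 1].
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


def run_orchestrator(arms, score_fn, rounds):
    """Pick a configuration, score the result, and update, for a fixed budget."""
    bandit = UCB1Bandit(arms)
    for _ in range(rounds):
        arm = bandit.select()
        bandit.update(arm, score_fn(arm))
    return bandit


# Simulated environment: each style has a hidden success probability (made up).
rng = random.Random(0)
hidden_quality = {"humorous": 0.4, "cinematic": 0.7, "documentary": 0.5}


def simulated_score(arm):
    return 1.0 if rng.random() < hidden_quality[arm] else 0.0


bandit = run_orchestrator(hidden_quality, simulated_score, rounds=500)
print(bandit.counts)  # how often each configuration was tried
```

The uncertainty bonus shrinks as an arm is tried more often, so the loop keeps sampling weaker configurations occasionally while concentrating most of the budget on the one that scores best. This is the exploration and exploitation balance the summary refers to.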

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.