EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation
Summary
EntityBench is a benchmark for evaluating multi-shot video generation systems on one specific challenge: keeping characters, objects, and locations consistent across long visual narratives. It comprises 140 episodes and 2,491 shots derived from real narrative media, with explicit per-shot entity schedules. The benchmark groups consistency challenges into easy, medium, and hard tiers, with sequences of up to 50 shots, up to 13 cross-shot characters, 8 cross-shot locations, and 22 cross-shot objects, and recurrence gaps of up to 48 shots. Its three-pillar evaluation suite assesses intra-shot quality, prompt-following alignment, and cross-shot consistency, and includes a fidelity gate so that only accurate entity appearances are scored. The accompanying baseline, EntityMem, is a memory-augmented generation system that stores verified per-entity visual references in a persistent memory bank. Experiments indicate that existing methods lose consistency as recurrence distance grows, while EntityMem significantly improves character fidelity (Cohen's d = +2.33) and presence.
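For readers unfamiliar with the effect size quoted above, the following is a minimal sketch of Cohen's d, assuming the common pooled-standard-deviation form. The summary does not say which variant the authors used, and the scores below are made-up illustrative values, not results from the paper.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(treatment, control):
    """Cohen's d with a pooled sample standard deviation (assumed variant)."""
    n1, n2 = len(treatment), len(control)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = (
        (n1 - 1) * variance(treatment) + (n2 - 1) * variance(control)
    ) / (n1 + n2 - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_var)


# Hypothetical per-episode fidelity scores, for illustration only.
with_memory = [0.8, 0.9, 1.0]
without_memory = [0.5, 0.6, 0.7]
effect = cohens_d(with_memory, without_memory)
```

A positive value means the first group scores higher on average, measured in units of the pooled standard deviation.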
Key takeaway
Research scientists developing multi-shot video generation models should integrate EntityBench into their evaluation pipelines to rigorously test cross-shot entity consistency. Models are likely to benefit from explicit per-entity memory mechanisms, similar to EntityMem, which address the observed loss of consistency over longer recurrence distances and improve character fidelity.
Key insights
Maintaining entity consistency across long multi-shot video generation remains a significant challenge for current methods.
Principles
- Explicit entity schedules improve evaluation.
- Memory-augmented generation enhances consistency.
- Cross-shot consistency degrades with distance.
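The relationship between an entity schedule and recurrence distance can be sketched as follows. This is an illustration under stated assumptions: the schedule format (one set of entity names per shot) and the gap definition (difference between the shot indices of consecutive appearances) are hypothetical and may not match EntityBench's actual schema or metric.

```python
from collections import defaultdict


def recurrence_gaps(schedule):
    """Return, per entity, the gaps in shots between consecutive appearances.

    `schedule[i]` is the set of entity names expected in shot i
    (a hypothetical format, not EntityBench's real schema).
    """
    last_seen = {}
    gaps = defaultdict(list)
    for shot_idx, entities in enumerate(schedule):
        for name in entities:
            if name in last_seen:
                gaps[name].append(shot_idx - last_seen[name])
            last_seen[name] = shot_idx
    return dict(gaps)


def max_recurrence_gap(schedule):
    """Largest gap across all entities, or 0 if nothing recurs."""
    gaps = recurrence_gaps(schedule)
    return max((g for per_entity in gaps.values() for g in per_entity), default=0)


# Toy four-shot episode with two characters and one location.
schedule = [
    {"alice", "kitchen"},
    {"bob"},
    {"alice", "bob"},
    {"kitchen", "alice"},
]
gaps = recurrence_gaps(schedule)
longest = max_recurrence_gap(schedule)
```

Bucketing a consistency score by these gap values is one way to check whether a model degrades as entities go longer without appearing.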
Method
EntityMem uses a persistent memory bank to store verified per-entity visual references before generation, improving cross-shot consistency in multi-shot video generation.
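The memory-bank idea described above can be sketched in a few lines. This is not EntityMem's implementation: the class name, the score-threshold verification step, and the file-name references are all assumptions made for illustration, and a real system would store images or embeddings and pass them to a generator as conditioning.

```python
from dataclasses import dataclass, field


@dataclass
class EntityMemoryBank:
    """Toy persistent store of verified per-entity visual references."""

    min_score: float = 0.8  # hypothetical verification threshold
    _refs: dict = field(default_factory=dict)

    def add(self, entity, reference, score):
        """Store a reference only if it passes verification; return whether stored."""
        if score < self.min_score:
            return False
        self._refs.setdefault(entity, []).append(reference)
        return True

    def lookup(self, entities):
        """Return stored references for the entities scheduled in a shot."""
        return {e: list(self._refs[e]) for e in entities if e in self._refs}


bank = EntityMemoryBank()
bank.add("alice", "alice_ref_0.png", score=0.93)
bank.add("alice", "alice_blurry.png", score=0.41)  # rejected: below threshold
bank.add("kitchen", "kitchen_ref_0.png", score=0.88)

# Before generating a shot that schedules alice and bob, fetch what is known.
conditioning = bank.lookup(["alice", "bob"])
```

Here `bob` has no verified reference yet, so only `alice` is returned. Keeping unverified references out of the bank is what prevents a poor early rendering from propagating to later shots.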
In practice
- Use EntityBench for multi-shot video evaluation.
- Implement memory banks for entity consistency.
- Prioritize long-range consistency in models.
Topics
- Multi-Shot Video Generation
- Entity-Consistent Generation
- EntityBench Benchmark
- EntityMem System
- Cross-Shot Consistency
Best for: Research Scientist, AI Scientist, Computer Vision Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.