Advancing Narrative Long Video Generation via Training-Free Identity-Aware Memory

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

IAMFlow is a novel, training-free identity-aware memory framework designed to improve long-term consistency and mitigate memory degradation in autoregressive video generation. It addresses issues like identity drift, character duplication, and attribute loss that arise from evolving prompts and shifting entity references. The framework utilizes a Large Language Model (LLM) to extract entities and visual attributes from prompts, assigning unique global IDs for identity-aware memory. Concurrently, a Vision-Language Model (VLM) asynchronously verifies and refines these attributes from rendered frames, enabling explicit entity tracking. IAMFlow also incorporates an inference acceleration pipeline, including asynchronous visual verification, adaptive prompt transition, and model quantization, achieving faster generation than existing baselines. The authors introduce NarraStream-Bench, a new benchmark with 324 multi-prompt scripts and a three-dimensional evaluation protocol, on which IAMFlow outperforms the strongest baseline by 2.56 points and achieves a 1.39× speedup in 60-second multi-prompt settings.

Key takeaway

For research scientists developing narrative video generation systems, IAMFlow demonstrates that explicit identity tracking via LLM-VLM integration significantly improves long-term consistency and reduces identity drift without requiring retraining. You should consider adopting similar training-free, identity-aware memory frameworks and asynchronous verification pipelines to enhance the fidelity and efficiency of your generative models, especially when dealing with complex, evolving prompts.

Key insights

IAMFlow uses LLMs and VLMs for explicit identity tracking to enhance long-term consistency in narrative video generation.

Principles

Method

An LLM extracts entities and assigns global IDs from prompts, while a VLM asynchronously verifies attributes from frames, enabling explicit entity tracking for consistent video generation.
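The flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class and function names (`IdentityMemory`, `generate`, `extract_stub`, `verify_stub`) are hypothetical, the LLM and VLM calls are replaced by hard-coded stubs, and the rule that visually verified attributes override prompt-derived ones is an assumption. It only shows the bookkeeping: shifting mentions resolve to one global ID, and verification runs in a background thread so it does not block the next chunk.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class Entity:
    global_id: int
    aliases: set = field(default_factory=set)
    attributes: dict = field(default_factory=dict)


class IdentityMemory:
    """Hypothetical registry mapping every mention of an entity to one global ID."""

    def __init__(self):
        self.entities = {}      # global_id -> Entity
        self.alias_to_id = {}   # lowercased mention -> global_id

    def register(self, mentions, attributes):
        keys = [m.lower() for m in mentions]
        # Reuse the ID of the first already-known mention, else allocate a new one.
        gid = next((self.alias_to_id[k] for k in keys if k in self.alias_to_id), None)
        if gid is None:
            gid = len(self.entities)
            self.entities[gid] = Entity(gid)
        entity = self.entities[gid]
        for k in keys:
            entity.aliases.add(k)
            self.alias_to_id[k] = gid
        # Prompt-derived attributes fill gaps but never overwrite stored values.
        for name, value in attributes.items():
            entity.attributes.setdefault(name, value)
        return gid

    def refine(self, gid, observed):
        # Assumption: attributes verified from frames take precedence.
        self.entities[gid].attributes.update(observed)


SCRIPT = [
    "A woman in a red coat walks her dog through the park.",
    "She sits on a bench while the dog sleeps.",
]


def extract_stub(prompt):
    """Stand-in for the LLM: returns (mentions, attributes) per entity.
    Coreference ("she" -> "woman") is assumed to be resolved by the LLM."""
    table = {
        SCRIPT[0]: [(["woman"], {"coat": "red"}), (["dog"], {})],
        SCRIPT[1]: [(["she", "woman"], {}), (["dog"], {})],
    }
    return table[prompt]


def verify_stub(frame, gid):
    """Stand-in for the VLM: returns attributes 'observed' in a rendered frame."""
    observed = {0: {"hair": "black"}, 1: {"fur": "brown"}}
    return observed.get(gid, {})


def generate(prompts, extract, verify):
    memory = IdentityMemory()
    pending = deque()  # (global_id, future) in submission order
    with ThreadPoolExecutor(max_workers=1) as pool:
        for index, prompt in enumerate(prompts):
            # Fold in verifications that already finished, in FIFO order.
            while pending and pending[0][1].done():
                gid, future = pending.popleft()
                memory.refine(gid, future.result())
            ids = [memory.register(m, a) for m, a in extract(prompt)]
            frame = f"frame-{index}"  # placeholder for the rendered chunk
            for gid in ids:
                # Verification is queued and does not block the next prompt.
                pending.append((gid, pool.submit(verify, frame, gid)))
        # Drain whatever is still outstanding once generation ends.
        for gid, future in pending:
            memory.refine(gid, future.result())
    return memory


memory = generate(SCRIPT, extract_stub, verify_stub)
for entity in memory.entities.values():
    print(entity.global_id, sorted(entity.aliases), entity.attributes)
```

In this sketch "she" in the second prompt resolves to the same global ID as "woman" from the first, which is the behavior that prevents a second character from being spawned. A real system would also need to handle pronouns that later refer to a different entity, which this alias table does not attempt.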

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.