ARGUS: Stacked Multi-View Identity Mosaic Injection for Subject-Preserving Video Generation

Source: Artificial Intelligence · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Computer Vision & Pattern Recognition · Depth: Expert, quick

Summary

Argus is a Wan-based framework for subject-preserving video generation that addresses the limitations of a single static identity reference. It introduces Stacked Multi-View Identity Mosaic Injection (SMII), which converts identity evidence selected by a multimodal large language model (MLLM) from images or videos into a 3×3 stacked mosaic. The mosaic is synchronized with diffusion time and injected as read-only memory into Wan's token space, yielding a dynamic identity distribution. The framework also includes an MLLM Identity Director for conflict resolution, no-cross-pair counterfactual training, Temporal Identity Annealing, and Adaptive Self-Likeness Guidance, none of which require paired subject-video supervision. Argus reports state-of-the-art results on OpenS2V-Eval Human-Domain: 64.38 Total Score, 71.86 FaceSim, 51.62 NexusScore, and 79.14 NaturalScore. It also introduces a new benchmark, HardID-Celeb, on which it improves YawScore by 12.60 points and OccScore by 15.10 points, which the authors take as evidence for the effectiveness of dynamic identity memory.
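To make the mosaic idea concrete, here is a minimal sketch of tiling nine identity crops into a single 3×3 image. It is an illustration under assumptions, not the Argus implementation: the summary does not say how the tiles are ordered, resized, or "stacked", so the row-major layout and the hypothetical helper name `build_mosaic` are placeholders, and the MLLM selection step is replaced by dummy arrays.

```python
import numpy as np

def build_mosaic(views, grid=3):
    """Tile grid*grid equally sized H x W x C crops into one image.

    Assumption: row-major tile order. The actual SMII layout is not
    described in the summary.
    """
    if len(views) != grid * grid:
        raise ValueError(f"expected {grid * grid} views, got {len(views)}")
    shape = views[0].shape
    if any(v.shape != shape for v in views):
        raise ValueError("all views must share the same shape")
    # Concatenate each group of `grid` crops along width, then stack rows along height.
    rows = [
        np.concatenate(views[r * grid:(r + 1) * grid], axis=1)
        for r in range(grid)
    ]
    return np.concatenate(rows, axis=0)

# Dummy stand-ins for nine selected identity crops (4 x 4 RGB, filled with their index).
views = [np.full((4, 4, 3), i, dtype=np.uint8) for i in range(9)]
mosaic = build_mosaic(views)
print(mosaic.shape)
```

In a real pipeline the crops would presumably come from face or subject detection on the reference images or video frames, and would be resized to a common resolution before tiling.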

Key takeaway

Computer vision engineers building subject-preserving video generation models should move beyond static identity references. Dynamic, multi-view identity representations such as Argus's SMII help keep a subject recognizable across diverse conditions. Consider adding counterfactual self-supervision, and adopting metrics such as YawScore and OccScore, to evaluate and improve robustness to challenging viewpoint changes and occlusions.

Key insights

Dynamic identity memory and counterfactual self-supervision make subject-preserving video generation more robust: Argus reports gains of 12.60 YawScore points and 15.10 OccScore points on HardID-Celeb.

Principles

Method

SMII converts MLLM-selected identity evidence into a 3×3 stacked mosaic and injects it into Wan's token space as dynamic, negative-time, read-only memory.
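One plausible reading of "negative-time read-only memory" can be sketched with plain attention: identity tokens receive temporal indices before the first video frame, video tokens attend to them, and the identity tokens themselves are never updated. The sketch below is an assumption-laden illustration, not Wan's or Argus's code: it uses single-head attention with identity Q/K/V projections, and the helpers `attend_with_readonly_memory` and `time_indices` are hypothetical names.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend_with_readonly_memory(video_tokens, memory_tokens):
    """Video tokens query [memory; video]; memory is returned unchanged.

    Assumption: no learned projections, single head, no positional encoding.
    """
    d = video_tokens.shape[-1]
    context = np.concatenate([memory_tokens, video_tokens], axis=0)
    scores = video_tokens @ context.T / np.sqrt(d)  # queries come from video only
    weights = softmax(scores, axis=-1)
    updated_video = weights @ context
    return updated_video, memory_tokens  # memory is read, never written

def time_indices(num_memory, num_frames):
    """Assumed indexing: memory occupies times -num_memory..-1, video 0..num_frames-1."""
    return np.arange(-num_memory, 0), np.arange(num_frames)

rng = np.random.default_rng(0)
video = rng.normal(size=(6, 8))    # 6 video tokens, dimension 8
memory = rng.normal(size=(9, 8))   # 9 identity tokens, e.g. one per mosaic tile
new_video, same_memory = attend_with_readonly_memory(video, memory)
mem_t, vid_t = time_indices(len(memory), 4)
print(new_video.shape, mem_t, vid_t)
```

The design point this illustrates is the asymmetry: memory tokens contribute keys and values but issue no queries, so the generated frames can draw on identity evidence without overwriting it.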

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.