DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation
Summary
DreamID-Omni is a unified framework for controllable human-centric audio-video generation. It combines three tasks that are typically treated as separate objectives: reference-based audio-video generation (R2AV), video editing (RV2AV), and audio-driven video animation (RA2V). The framework introduces a Symmetric Conditional Diffusion Transformer that incorporates heterogeneous conditioning signals through a symmetric conditional injection scheme. To overcome identity-timbre binding failures and speaker confusion in multi-person scenarios, DreamID-Omni employs a Dual-Level Disentanglement strategy: Synchronized RoPE at the signal level and Structured Captions at the semantic level. A Multi-Task Progressive Training scheme regularizes strongly constrained tasks with weakly constrained generative priors, which prevents overfitting and harmonizes the objectives. In the reported experiments, DreamID-Omni performs strongly across video, audio, and audio-visual consistency, surpassing leading proprietary commercial models.
Key takeaway
For AI scientists and computer vision engineers developing human-centric generative models, DreamID-Omni shows that a single model can cover reference-based generation, video editing, and audio-driven animation. Its dual-level disentanglement and multi-task training scheme target control over identity and timbre in multi-person scenarios, and the reported results exceed those of methods built for each task in isolation. Consider adopting its architectural principles to improve consistency and reduce speaker confusion in audio-visual synthesis projects, especially those involving complex human interactions.
Key insights
DreamID-Omni unifies human-centric audio-video generation with disentangled control over identity and timbre.
Principles
- Integrate heterogeneous conditioning signals symmetrically.
- Disentangle identity and timbre at dual levels.
- Regularize strongly constrained tasks with weakly constrained priors.
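The third principle can be pictured as a task-sampling schedule. The sketch below is only an illustration under stated assumptions: the summary does not say which tasks count as weakly or strongly constrained, so treating R2AV as the weak task and RV2AV and RA2V as the strong ones is a guess, and the function names, step counts, and the 0.5 cap are all hypothetical rather than taken from the paper.

```python
import random

# Assumption: R2AV is the weakly constrained task; RV2AV and RA2V are the
# strongly constrained ones. The source does not state this grouping.
WEAK_TASKS = ["R2AV"]
STRONG_TASKS = ["RV2AV", "RA2V"]


def task_probabilities(step, warmup_steps=1000, ramp_steps=4000, max_strong=0.5):
    """Return a sampling probability per task for a given training step.

    Hypothetical schedule: only the weak task is trained during warmup, then
    the total probability of the strong tasks rises linearly to `max_strong`,
    so the weak task always stays in the mix as a regularizer.
    """
    if step < warmup_steps:
        strong_total = 0.0
    else:
        progress = min(1.0, (step - warmup_steps) / ramp_steps)
        strong_total = max_strong * progress
    probs = {task: (1.0 - strong_total) / len(WEAK_TASKS) for task in WEAK_TASKS}
    probs.update({task: strong_total / len(STRONG_TASKS) for task in STRONG_TASKS})
    return probs


def sample_task(step, rng):
    """Draw one task name for this step according to the schedule."""
    probs = task_probabilities(step)
    tasks = list(probs)
    return rng.choices(tasks, weights=[probs[t] for t in tasks], k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 3000, 5000):
        print(step, task_probabilities(step), sample_task(step, rng))
```

Keeping a nonzero share for the weak task throughout training is one simple way to express "regularize with a generative prior"; a staged curriculum with hard phase boundaries would be an equally plausible reading.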
Method
DreamID-Omni uses a Symmetric Conditional Diffusion Transformer, Dual-Level Disentanglement (Synchronized RoPE, Structured Captions), and Multi-Task Progressive Training for unified human-centric audio-video generation.
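To make the signal-level idea more concrete, here is a minimal sketch of rotary position embedding (RoPE) with shared position indices. It assumes that "synchronized" means the identity tokens and timbre tokens of the same person receive the same rotary position; the summary does not spell out the mechanism, so `synchronized_positions`, `slot_stride`, and the index layout are hypothetical.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Apply standard rotary position embedding to an even-length vector."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Each consecutive pair of features is rotated by a position-dependent angle.
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def synchronized_positions(num_speakers, slot_stride=16):
    """Hypothetical index assignment: speaker k's identity and timbre tokens
    both sit at position k * slot_stride, so each pair is co-located while
    different speakers are spaced apart."""
    return {
        k: {"identity": k * slot_stride, "timbre": k * slot_stride}
        for k in range(num_speakers)
    }


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    pos = synchronized_positions(num_speakers=2)
    q = [1.0, 0.0, 0.5, 0.5]  # toy identity-token query
    k = [1.0, 0.0, 0.5, 0.5]  # toy timbre-token key

    # Same speaker: equal positions, so the attention logit is unchanged.
    same = dot(rope_rotate(q, pos[1]["identity"]), rope_rotate(k, pos[1]["timbre"]))
    # Different speakers: a relative offset rotates the key away from the query.
    cross = dot(rope_rotate(q, pos[0]["identity"]), rope_rotate(k, pos[1]["timbre"]))
    print(same, cross)
```

Because RoPE attention logits depend only on the relative offset between positions, giving a face and its voice the same index is one way to make the matching pair look "adjacent" to the attention layers while mismatched pairs do not.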
In practice
- Generate reference-based audio-video (R2AV).
- Perform video editing (RV2AV).
- Animate video via audio (RA2V).
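The three tasks above differ mainly in which conditioning signals are supplied. The following sketch reads the task names literally (R = references, V = source video, A = driving audio) and dispatches on the inputs that are present. The `Conditions` class, its field names, and `infer_task` are hypothetical illustrations, not the framework's actual interface.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Conditions:
    """Hypothetical bundle of conditioning inputs for one generation request."""
    caption: str
    reference_images: List[str] = field(default_factory=list)  # identity references
    reference_voices: List[str] = field(default_factory=list)  # timbre references
    source_video: Optional[str] = None   # present only when editing a video
    driving_audio: Optional[str] = None  # present only when animating from audio


def infer_task(cond):
    """Map the supplied inputs to one of the three task names."""
    if cond.source_video is not None and cond.driving_audio is not None:
        raise ValueError("supply either a source video or driving audio, not both")
    if cond.source_video is not None:
        return "RV2AV"  # references + video -> edited audio-video
    if cond.driving_audio is not None:
        return "RA2V"   # references + audio -> video
    return "R2AV"       # references only -> audio-video


if __name__ == "__main__":
    base = dict(caption="two people talking",
                reference_images=["a.png", "b.png"],
                reference_voices=["a.wav", "b.wav"])
    print(infer_task(Conditions(**base)))
    print(infer_task(Conditions(**base, source_video="clip.mp4")))
    print(infer_task(Conditions(**base, driving_audio="speech.wav")))
```

Representing all three tasks with one input structure mirrors the unification idea: the model sees the same set of slots every time, and a task is just a pattern of filled and empty slots.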
Topics
- Human-Centric Audio-Video Generation
- Diffusion Transformers
- Identity Disentanglement
- Multi-Task Learning
- Generative AI
Best for: Computer Vision Engineer, AI Scientist, Research Scientist, AI Researcher, AI Engineer, Deep Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.