DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Computer Vision & Pattern Recognition) · Depth: Expert, quick

Summary

DreamID-Omni is a unified framework for controllable human-centric audio-video generation. It brings together three tasks that are usually treated as separate objectives: reference-based audio-video generation (R2AV), video editing (RV2AV), and audio-driven video animation (RA2V). At its core is a Symmetric Conditional Diffusion Transformer, which incorporates the different conditioning signals through a symmetric conditional injection scheme. To address identity-timbre binding failures and speaker confusion in multi-person scenarios, DreamID-Omni applies a Dual-Level Disentanglement strategy: Synchronized RoPE at the signal level and Structured Captions at the semantic level. A Multi-Task Progressive Training scheme then uses weakly constrained generative priors to regularize the strongly constrained tasks, which prevents overfitting and harmonizes the objectives. In the reported experiments, DreamID-Omni performs strongly across video quality, audio quality, and audio-visual consistency, surpassing leading proprietary commercial models.
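As a rough illustration of how the three tasks differ only in which conditions are supplied, here is a minimal sketch. The input/output assignments are read off the task acronyms, and the names `TaskSpec`, `TASKS`, and `condition_mask` are hypothetical; none of this is taken from the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """Which signals a task consumes and which modalities it generates."""
    inputs: tuple
    outputs: tuple


# Assumption: "reference" stands for the identity/timbre reference material.
TASKS = {
    "R2AV": TaskSpec(inputs=("reference",), outputs=("audio", "video")),
    "RV2AV": TaskSpec(inputs=("reference", "video"), outputs=("audio", "video")),
    "RA2V": TaskSpec(inputs=("reference", "audio"), outputs=("video",)),
}

ALL_CONDITIONS = ("reference", "video", "audio")


def condition_mask(task):
    """Return, per condition type, whether the given task provides it."""
    spec = TASKS[task]
    return {name: name in spec.inputs for name in ALL_CONDITIONS}


for name in TASKS:
    print(name, condition_mask(name))
```

A single model that accepts such a mask can, in principle, serve all three tasks by treating absent conditions as empty inputs; how DreamID-Omni actually injects them is not detailed in this summary.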

Key takeaway

For AI scientists and computer vision engineers building human-centric generative models, DreamID-Omni shows that reference-based generation, video editing, and audio-driven animation can be handled by one model instead of three isolated ones. Its dual-level disentanglement targets identity-timbre binding failures and speaker confusion in multi-person scenes, while multi-task progressive training keeps the strongly constrained tasks from overfitting. Consider borrowing these architectural principles for audio-visual synthesis projects where consistency across several interacting speakers matters.

Key insights

DreamID-Omni unifies human-centric audio-video generation with disentangled control over identity and timbre.

Principles

Method

DreamID-Omni uses a Symmetric Conditional Diffusion Transformer, Dual-Level Disentanglement (Synchronized RoPE, Structured Captions), and Multi-Task Progressive Training for unified human-centric audio-video generation.
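To make the signal-level idea more concrete, the sketch below shows one way rotary position embeddings (RoPE) could be synchronized across modalities: audio and video tokens are placed on a shared timeline so that tokens occurring at the same moment receive the same rotary phase. This summary does not give the paper's exact formulation, so the shared-timeline mapping, the token rates, and all function names here are assumptions for illustration only.

```python
import math


def shared_time_positions(num_tokens, tokens_per_second):
    """Map token indices of one modality onto a common timeline (seconds)."""
    return [i / tokens_per_second for i in range(num_tokens)]


def rope_rotate(vec, position, base=10000.0):
    """Apply standard rotary position embedding to an even-length vector."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Pair i//2 rotates with frequency base^(-i/dim).
        angle = position * base ** (-i / dim)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical rates: 4 video latent frames/s and 20 audio latent tokens/s.
video_pos = shared_time_positions(num_tokens=8, tokens_per_second=4.0)
audio_pos = shared_time_positions(num_tokens=40, tokens_per_second=20.0)

q_video = [0.3, -1.2, 0.7, 0.5]   # toy query from a video token
k_audio = [1.1, 0.4, -0.6, 0.9]   # toy key from an audio token

# Video token 4 and audio token 20 both sit at t = 1.0 s, so their
# relative rotary phase is zero and the attention logit is unchanged.
aligned = dot(rope_rotate(q_video, video_pos[4]),
              rope_rotate(k_audio, audio_pos[20]))
print(abs(aligned - dot(q_video, k_audio)) < 1e-9)
```

Because RoPE attention scores depend only on the position difference, sharing one timeline lets the model tell "same moment" from "different moment" across audio and video, which is plausibly what helps bind each voice to the right face.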

Best for: Computer Vision Engineer, AI Scientist, Research Scientist, AI Researcher, AI Engineer, Deep Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.