Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models
Summary
Researchers at Sony Group Corporation and Sony AI introduce MMHNet, a multimodal hierarchical network for long-form video-to-audio (LV2A) generation. The model targets length generalization: although it is trained only on short, 8-second clips, it generates high-quality, contextually aligned audio for videos longer than 5 minutes. MMHNet combines a hierarchical design with a non-causal Mamba-2 architecture to avoid two weaknesses of Transformer-based models, namely their dependence on positional embeddings and their cost on long sequences. Hierarchical token routing and dynamic chunking align the multimodal inputs (video, text, audio) efficiently and reduce computational complexity. In experiments on the UnAV100 and LongVale datasets, MMHNet outperforms existing state-of-the-art methods such as LoVA and MMAudio on multimodal alignment (IB-score) and temporal synchronization (DeSync), while generating 500 seconds of audio in approximately 60 seconds on an H100 GPU (roughly 8 times faster than real time).
Key takeaway
Research scientists developing video-to-audio generation systems should consider adopting MMHNet's architectural principles. Its use of non-causal Mamba-2 and hierarchical routing allows for robust length generalization, producing high-quality audio for videos exceeding 5 minutes after training only on short clips. The approach reports better quality and efficiency than Transformer-based baselines, making it a strong candidate for applications that require extended audio synthesis.
Key insights
MMHNet enables long-form video-to-audio generation by combining hierarchical networks with non-causal Mamba-2, overcoming Transformer limitations.
Principles
- Positional embeddings hinder length generalization in Transformers.
- Non-causal Mamba-2 supports efficient long-sequence processing.
- Hierarchical routing reduces token redundancy for multimodal alignment.
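The first two principles can be illustrated with a toy sketch in Python. It is not Mamba-2: a scalar linear recurrence stands in for a state-space layer, and the names `scan`, `bidirectional_scan`, `position_table`, and `TRAIN_LEN` are hypothetical. Combining a forward and a backward pass is one common way to make a recurrence non-causal; the summary does not say how MMHNet does it.

```python
# Toy illustration only: a scalar recurrence standing in for a state-space layer.

def scan(xs, decay=0.5):
    """Causal linear recurrence: state_t = decay * state_{t-1} + x_t."""
    state, out = 0.0, []
    for x in xs:
        state = decay * state + x
        out.append(state)
    return out


def bidirectional_scan(xs, decay=0.5):
    """Non-causal variant (assumed construction): forward pass plus backward pass."""
    fwd = scan(xs, decay)
    bwd = scan(xs[::-1], decay)[::-1]
    # Both passes include the current input, so subtract it once.
    return [f + b - x for f, b, x in zip(fwd, bwd, xs)]


# A learned absolute position table only covers the training length.
TRAIN_LEN = 4  # hypothetical training sequence length
position_table = [0.1 * i for i in range(TRAIN_LEN)]


def add_absolute_positions(xs):
    """Fails with IndexError as soon as the input is longer than TRAIN_LEN."""
    return [x + position_table[i] for i, x in enumerate(xs)]


if __name__ == "__main__":
    long_input = [1.0] * 100  # far longer than TRAIN_LEN
    print(len(bidirectional_scan(long_input)))  # the recurrence has no length limit
    try:
        add_absolute_positions(long_input)
    except IndexError:
        print("position table exhausted beyond the training length")
```

The recurrence carries a fixed-size state and never indexes by position, so the same parameters apply to any sequence length. For example, `bidirectional_scan([0.0, 0.0, 1.0])` returns `[0.25, 0.5, 1.0]`: the first output already depends on the last input, which is what non-causal means here.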
Method
MMHNet trains a multimodal Mamba-2 architecture with flow matching. Temporal and multimodal routing layers use cosine similarity to select key tokens, and dynamic chunking keeps processing efficient.
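A minimal sketch of similarity-based routing follows, under stated assumptions. The summary does not give the routing rule, so the boundary score `0.5 * (1 - cosine)` between adjacent tokens, the `threshold` parameter, and the choice to keep the first token of each chunk as its key token are all illustrative guesses, and every function name is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def boundary_scores(tokens):
    """Score in [0, 1] per token: high when a token differs from its predecessor."""
    scores = [1.0]  # the first token always opens a chunk
    for prev, cur in zip(tokens, tokens[1:]):
        scores.append(0.5 * (1.0 - cosine(prev, cur)))
    return scores


def dynamic_chunks(tokens, threshold=0.5):
    """Start a new chunk wherever the boundary score reaches the threshold (assumed rule)."""
    chunks = []
    for index, score in enumerate(boundary_scores(tokens)):
        if score >= threshold or not chunks:
            chunks.append([index])
        else:
            chunks[-1].append(index)
    return chunks


def select_key_tokens(tokens, threshold=0.5):
    """Keep one representative index per chunk: here, the token that opened it."""
    return [chunk[0] for chunk in dynamic_chunks(tokens, threshold)]


if __name__ == "__main__":
    # Five 2-D "tokens": two near the x-axis, two near the y-axis, one back on the x-axis.
    tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.0]]
    print(dynamic_chunks(tokens, threshold=0.25))
    print(select_key_tokens(tokens, threshold=0.25))
```

With these inputs and `threshold=0.25`, near-duplicate neighbours fall into the same chunk and only three of the five tokens are passed on, which is the sense in which routing reduces token redundancy. Chunk sizes follow the content instead of a fixed stride.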
In practice
- Train V2A models on short clips for long-form inference.
- Utilize non-causal Mamba-2 to avoid positional embedding issues.
- Implement hierarchical routing to manage token redundancy.
Topics
- Long-form Video-to-Audio
- Length Generalization
- MMHNet
- Non-Causal Mamba-2
- Hierarchical Networks
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.