Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models

2026-04-16 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, extended

Summary

Researchers at Sony Group Corporation and Sony AI introduce MMHNet, a multimodal hierarchical network for long-form video-to-audio (LV2A) generation. The model targets length generalization: it is trained only on short, 8-second clips, yet generates high-quality, contextually aligned audio for videos more than 5 minutes long. MMHNet combines a hierarchical design with a non-causal Mamba-2 architecture to sidestep a weakness of Transformer-based models, whose positional embeddings generalize poorly to sequences longer than those seen in training. The framework uses hierarchical token routing and dynamic chunking to align multimodal inputs (video, text, audio) efficiently and to reduce computational cost. In experiments on the UnAV100 and LongVALE datasets, MMHNet outperforms existing state-of-the-art methods such as LoVA and MMAudio on multimodal alignment (IB-score) and temporal synchronization (DeSync), while generating 500 seconds of audio in approximately 60 seconds on an H100 GPU.
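The figures quoted in the summary can be put in proportion with a few lines of arithmetic. The sketch below uses only the numbers stated above (8-second training clips, 500 seconds of audio, roughly 60 seconds of generation time); the variable names are illustrative.

```python
# Numbers taken from the summary; variable names are illustrative only.
train_clip_s = 8      # length of each training clip, in seconds
generated_s = 500     # length of generated audio, in seconds
wall_clock_s = 60     # approximate generation time on an H100, in seconds

# How far beyond the training length the model is asked to extrapolate.
length_ratio = generated_s / train_clip_s

# Seconds of audio produced per second of compute.
realtime_factor = generated_s / wall_clock_s

# Generated length expressed in minutes.
generated_min = generated_s / 60

print(f"extrapolation ratio: {length_ratio:.1f}x")
print(f"real-time factor:    {realtime_factor:.1f}x")
print(f"generated length:    {generated_min:.1f} min")
```

On these numbers the model extrapolates to sequences 62.5 times longer than its training clips and runs at roughly eight times real time.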

Key takeaway

Research scientists building video-to-audio generation systems should consider adopting MMHNet's architectural principles. Non-causal Mamba-2 combined with hierarchical routing gives robust length generalization: a model trained on short clips produces high-quality audio for videos exceeding 5 minutes. The reported gains in both quality and speed over Transformer-based models make the approach a strong candidate for applications that need extended audio synthesis.

Key insights

MMHNet enables long-form video-to-audio generation by combining hierarchical networks with non-causal Mamba-2, overcoming the length-generalization limits of Transformers.

Principles

Positional embeddings hinder length generalization in Transformers.
Non-causal Mamba-2 supports efficient long-sequence processing.
Hierarchical routing reduces token redundancy for multimodal alignment.
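The first two principles can be illustrated with a toy recurrence. The sketch below is not Mamba-2 (which uses learned, input-dependent parameters); it is a fixed-decay linear scan run in both directions, which is one common way to make a recurrent model non-causal. Whether MMHNet builds its non-causal variant this way is not stated here, so treat `scan`, `bidirectional_scan`, and the `decay` parameter as hypothetical. The point is that no position index enters the computation, so the same function applies unchanged to 8 steps or 500.

```python
import numpy as np

def scan(x, decay):
    """Causal linear recurrence: h[t] = decay * h[t-1] + x[t]."""
    h = np.zeros_like(x[0])
    out = np.empty_like(x)
    for t in range(len(x)):
        h = decay * h + x[t]
        out[t] = h
    return out

def bidirectional_scan(x, decay=0.9):
    """Non-causal mixing: every step sees both past and future inputs."""
    x = np.asarray(x, dtype=float)
    forward = scan(x, decay)
    backward = scan(x[::-1], decay)[::-1]
    # x[t] is included in both passes, so subtract it once.
    return forward + backward - x

# The same parameters handle a short and a long sequence; there is no
# positional embedding table whose size depends on sequence length.
short_out = bidirectional_scan(np.ones(8), decay=0.5)
long_out = bidirectional_scan(np.ones(500), decay=0.5)
print(short_out.shape, long_out.shape)
```

Each output is a sum of inputs weighted by `decay ** |t - s|`, which depends only on relative distance. A learned absolute positional embedding, by contrast, has no entries for positions beyond the training length.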

Method

MMHNet trains a multimodal Mamba-2 architecture with flow matching. Temporal and multimodal routing layers use cosine similarity to select key tokens, and dynamic chunking keeps processing efficient.
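For readers unfamiliar with flow matching, the following is a minimal sketch of its common straight-line variant: training pairs are built by interpolating between a noise sample and a data sample, and sampling integrates a velocity field with Euler steps. The exact path and solver used by MMHNet are not given here, so the linear path, the function names, and the closed-form `oracle_velocity` (standing in for a trained network) are all assumptions made for illustration.

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Point on the straight path from noise x0 to data x1, plus its velocity."""
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0          # constant velocity along a straight path
    return x_t, v_target

def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

# Toy check with a single known target instead of a trained model.
target = np.array([1.0, 2.0, 3.0, 4.0])
noise = np.zeros(4)

def oracle_velocity(x, t):
    # Hypothetical stand-in for the network: points straight at the target.
    return (target - x) / (1.0 - t)

x_t, v_target = flow_matching_pair(noise, target, t=0.25)
sample = euler_sample(oracle_velocity, noise, steps=10)
print(np.round(sample, 6))
```

In training, a network would be regressed onto `v_target` given `x_t`, `t`, and the conditioning inputs (here, video and text features); at inference the learned network replaces `oracle_velocity`.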

In practice

Train V2A models on short clips for long-form inference.
Use non-causal Mamba-2 to avoid the length-generalization problems of positional embeddings.
Implement hierarchical routing to manage token redundancy.
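One plausible reading of cosine-similarity routing for redundancy reduction is sketched below: a token is kept as a "key token" when it differs enough from its predecessor, so runs of near-identical tokens collapse to their first element. This is an assumption about the mechanism, not MMHNet's documented design; the real layers presumably use learned projections, and the function names and the `threshold` value are hypothetical.

```python
import numpy as np

def boundary_probs(tokens, eps=1e-8):
    """p[t] = 0.5 * (1 - cos(tokens[t], tokens[t-1])); p[0] is set to 1."""
    unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + eps)
    cos = np.sum(unit[1:] * unit[:-1], axis=1)
    return np.concatenate([[1.0], 0.5 * (1.0 - cos)])

def select_key_tokens(tokens, threshold=0.25):
    """Keep tokens that start a new chunk, i.e. differ from their predecessor."""
    p = boundary_probs(tokens)
    idx = np.flatnonzero(p >= threshold)
    return tokens[idx], idx

# Six toy tokens: three copies of a, two copies of b, one c.
a, b, c = [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]
tokens = np.array([a, a, a, b, b, c])

kept, idx = select_key_tokens(tokens)
print(idx)  # [0 3 5]
```

Because the number of kept tokens depends on content rather than on a fixed stride, static stretches of video contribute few tokens while rapidly changing stretches contribute many, which is the intuition behind dynamic chunking.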

Topics

Long-form Video-to-Audio
Length Generalization
MMHNet
Non-Causal Mamba-2
Hierarchical Networks

Code references

black-forest-labs/flux

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.