AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation
Summary
AVGen-Bench is a task-driven benchmark for evaluating Text-to-Audio-Video (T2AV) generation models, designed to address the fragmentation of existing evaluation methods. It features high-quality prompts categorized into 11 real-world scenarios. The benchmark introduces a multi-granular evaluation framework that integrates specialist models with Multimodal Large Language Models (MLLMs) to assess T2AV generation at levels ranging from perceptual quality to fine-grained semantic control. Initial evaluations using AVGen-Bench reveal a significant disparity between the strong audio-visual aesthetics of current T2AV models and their weak semantic reliability. Specific weaknesses include consistent failures in text rendering, speech coherence, and physical reasoning, along with a universal inability to control musical pitch.
Key takeaway
If you develop or deploy Text-to-Audio-Video (T2AV) generation models, consider integrating AVGen-Bench into your evaluation pipeline. The benchmark helps you identify specific weaknesses in semantic reliability, such as text rendering, speech coherence, and musical pitch control, so that model improvements go beyond aesthetic quality.
Key insights
AVGen-Bench offers a multi-granular evaluation framework for T2AV models, revealing a gap between aesthetics and semantic control.
Principles
- T2AV evaluation needs multi-granular assessment.
- Combine specialist models with MLLMs for comprehensive T2AV evaluation.
Method
AVGen-Bench uses 11 real-world prompt categories and a multi-granular framework combining specialist models with MLLMs to evaluate T2AV generation for perceptual quality and semantic controllability.
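As a rough illustration of what combining the two kinds of evaluator could look like, the sketch below averages specialist-model scores (perceptual quality) and MLLM-judged scores (semantic control) per prompt category and reports the gap between them. It is not AVGen-Bench's actual implementation: the class `SampleScores`, the function `summarize`, the metric names, the [0, 1] score range, and the unweighted averaging are all assumptions made for this example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class SampleScores:
    """Scores for one generated clip (hypothetical structure, not from the paper)."""
    category: str     # one of the benchmark's prompt categories
    perceptual: dict  # specialist-model scores, assumed to lie in [0, 1]
    semantic: dict    # MLLM-judged scores, assumed to lie in [0, 1]


def summarize(samples):
    """Average perceptual and semantic scores per category and report their gap."""
    by_category = {}
    for sample in samples:
        by_category.setdefault(sample.category, []).append(sample)

    report = {}
    for category, group in by_category.items():
        # Unweighted mean over metrics, then over clips (an assumption).
        perceptual = mean(mean(s.perceptual.values()) for s in group)
        semantic = mean(mean(s.semantic.values()) for s in group)
        report[category] = {
            "perceptual": perceptual,
            "semantic": semantic,
            # A large positive gap means the clips look and sound good
            # but do not follow the prompt reliably.
            "gap": perceptual - semantic,
        }
    return report


if __name__ == "__main__":
    # Illustrative numbers only, not results from the benchmark.
    clips = [
        SampleScores("music", {"audio": 0.8, "visual": 1.0}, {"pitch": 0.2, "text": 0.4}),
        SampleScores("music", {"audio": 0.6, "visual": 0.8}, {"pitch": 0.0, "text": 0.2}),
    ]
    for category, scores in summarize(clips).items():
        print(category, scores)
```

Reporting the gap per category, rather than one overall number, keeps the kind of disparity the summary describes visible: a model can score well on aesthetics while still failing on semantic control in specific scenarios.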
In practice
- Use AVGen-Bench for T2AV model comparison.
- Focus T2AV development on semantic reliability.
Topics
- Text-to-Audio-Video Generation
- AVGen-Bench
- Multimodal Large Language Models
- Benchmark Evaluation
- Semantic Controllability
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.