AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation

2026-04-09 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, quick

Summary

AVGen-Bench is a new task-driven benchmark for evaluating Text-to-Audio-Video (T2AV) generation models, designed to address the fragmentation of existing evaluation methods. It features high-quality prompts categorized into 11 real-world scenarios. The benchmark introduces a multi-granular evaluation framework that integrates specialist models with Multimodal Large Language Models (MLLMs) to assess T2AV generation at every level, from perceptual quality to fine-grained semantic control. Initial evaluations using AVGen-Bench reveal a significant disparity between the strong audio-visual aesthetics of current T2AV models and their weak semantic reliability. Specific weaknesses include consistent failures in text rendering, speech coherence, and physical reasoning, as well as a universal inability to control musical pitch.

Key takeaway

Research scientists developing or deploying Text-to-Audio-Video (T2AV) generation models should integrate AVGen-Bench into their evaluation pipelines. The benchmark helps identify specific weaknesses in semantic reliability, such as text rendering, speech coherence, and musical pitch control, and so guides model improvements beyond aesthetic quality alone.

Key insights

AVGen-Bench offers a multi-granular evaluation framework for T2AV models, revealing a gap between aesthetics and semantic control.

Principles

T2AV evaluation needs multi-granular assessment.
Combine specialist models with MLLMs for comprehensive T2AV evaluation.

Method

AVGen-Bench uses 11 real-world prompt categories and a multi-granular framework combining specialist models with MLLMs to evaluate T2AV generation for perceptual quality and semantic controllability.
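
As a rough illustration of what a multi-granular report could look like, the sketch below averages scores per prompt category and per granularity, then reports the gap between perceptual and semantic scores. It is a minimal sketch under stated assumptions, not AVGen-Bench's actual code: the `Score` record, the `summarize` function, the two granularity labels, the `[0, 1]` score range, and the category name are all hypothetical, and the numbers are placeholders rather than benchmark results.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Score:
    """One evaluation result for one generated clip (hypothetical schema)."""
    category: str     # prompt scenario label (assumed name)
    granularity: str  # "perceptual" or "semantic" (assumed two-level split)
    source: str       # "specialist" or "mllm": which kind of judge produced it
    value: float      # assumed to be normalized to [0, 1]


def summarize(scores):
    """Average scores per (category, granularity) and compute the gap.

    Returns {category: {"perceptual": float|None, "semantic": float|None,
    "gap": float|None}}, where gap = perceptual - semantic.
    """
    buckets = defaultdict(list)
    for s in scores:
        if not 0.0 <= s.value <= 1.0:
            raise ValueError(f"score out of range: {s}")
        buckets[(s.category, s.granularity)].append(s.value)

    report = {}
    for category in sorted({cat for cat, _ in buckets}):
        perceptual = buckets.get((category, "perceptual"))
        semantic = buckets.get((category, "semantic"))
        row = {
            "perceptual": mean(perceptual) if perceptual else None,
            "semantic": mean(semantic) if semantic else None,
        }
        # The gap is only defined when both granularities were measured.
        if row["perceptual"] is not None and row["semantic"] is not None:
            row["gap"] = row["perceptual"] - row["semantic"]
        else:
            row["gap"] = None
        report[category] = row
    return report


if __name__ == "__main__":
    # Placeholder values for illustration only, not results from the paper.
    demo = [
        Score("music", "perceptual", "specialist", 0.8),
        Score("music", "perceptual", "specialist", 0.6),
        Score("music", "semantic", "mllm", 0.2),
    ]
    print(summarize(demo))
```

Keeping the judge type in the `source` field makes it possible to later split the report by specialist versus MLLM scores without changing the record format.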

In practice

Use AVGen-Bench for T2AV model comparison.
Focus T2AV development on semantic reliability.

Topics

Text-to-Audio-Video Generation
AVGen-Bench
Multimodal Large Language Models
Benchmark Evaluation
Semantic Controllability

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.