VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents
Summary
VideoFDB is presented as the first benchmark for evaluating full-duplex audio-visual-to-audio-visual (AV2AV) conversational agents, filling a gap left by existing benchmarks that assess only speech. It comprises 237 dyadic clips from real-world video calls covering 11 nonverbal conversational dynamics, a taxonomy that separates perception behaviors from generation behaviors, and a rubric-based LM-as-judge framework for assessing nonverbal conversational quality. An analysis of open- and closed-source vision-speech agents on VideoFDB revealed systematic failures, including "captioning collapse" and "visual-stream ignorance." Current systems mainly use vision for explicit visual question answering rather than for the streaming joint audiovisual grounding that natural conversation requires. Cascaded speech-to-avatar systems were also found to be inherently unable to produce full-duplex nonverbal cues.
Key takeaway
For AI scientists and machine learning engineers developing conversational agents, this benchmark highlights critical shortcomings in current full-duplex audio-visual capabilities. Prioritize systems that achieve streaming joint audiovisual grounding rather than explicit visual question answering alone. If the goal is natural, full-duplex nonverbal cues, avoid cascaded speech-to-avatar architectures, which the analysis found inherently unable to produce them.
Key insights
VideoFDB is the first benchmark to evaluate full-duplex audio-visual conversational agents, revealing current systems' limitations in nonverbal grounding.
Principles
- Natural conversation is full-duplex and audio-visual.
- Perception and generation behaviors require distinct evaluation.
- Current agents fail at streaming joint audiovisual grounding.
Method
VideoFDB uses 237 dyadic clips with 11 nonverbal dynamics, a perception/generation taxonomy, and a rubric-based LM-as-judge framework to assess AV2AV conversational quality.
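As a rough illustration of how rubric scores from an LM judge might be rolled up along the perception/generation split, here is a minimal Python sketch. It is not the benchmark's actual implementation: the `Clip` fields, the 1-5 score range, the `evaluate` function, and the `stub_judge` placeholder are all assumptions made for this example, and the dynamic names are placeholders rather than entries from the real taxonomy.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List

# Assumed rubric range; the summary does not state the real scale.
MIN_SCORE, MAX_SCORE = 1, 5


@dataclass(frozen=True)
class Clip:
    clip_id: str
    dynamic: str  # placeholder label for one nonverbal dynamic
    side: str     # "perception" or "generation" (taxonomy split)


# A judge maps (clip, agent response) to an integer rubric score.
# In a real setup this would wrap a call to a language model.
Judge = Callable[[Clip, str], int]


def evaluate(clips: List[Clip], responses: Dict[str, str], judge: Judge) -> Dict[str, float]:
    """Return the mean rubric score for each side of the taxonomy."""
    scores_by_side: Dict[str, List[int]] = defaultdict(list)
    for clip in clips:
        score = judge(clip, responses[clip.clip_id])
        if not MIN_SCORE <= score <= MAX_SCORE:
            raise ValueError(f"score {score} out of range for clip {clip.clip_id}")
        scores_by_side[clip.side].append(score)
    return {side: mean(scores) for side, scores in scores_by_side.items()}


def stub_judge(clip: Clip, response: str) -> int:
    """Hypothetical stand-in for an LM judge: empty responses get the minimum score."""
    return MIN_SCORE if not response.strip() else 4


if __name__ == "__main__":
    clips = [
        Clip("c1", "dynamic_a", "perception"),
        Clip("c2", "dynamic_b", "generation"),
    ]
    responses = {"c1": "agent reacted to the cue", "c2": ""}
    print(evaluate(clips, responses, stub_judge))
```

Reporting perception and generation separately, rather than as one pooled score, mirrors the principle that the two kinds of behavior require distinct evaluation.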
In practice
- Use VideoFDB to benchmark AV2AV agent performance.
- Focus development on streaming audiovisual grounding.
- Avoid cascaded speech-to-avatar architectures when full-duplex nonverbal cues are required.
Topics
- VideoFDB
- Full-Duplex Conversation
- Multimodal Conversational Agents
- Audio-Visual Benchmarking
- Nonverbal Communication
- LM-as-Judge Evaluation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.