VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

2026-05-28 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Human-Computer Interaction · Depth: Expert, quick

Summary

VideoFDB is presented as the first benchmark for evaluating full-duplex audio-visual-to-audio-visual (AV2AV) conversational agents, filling a gap left by existing benchmarks that assess speech only. It comprises 237 dyadic clips from real-world video calls covering 11 nonverbal conversational dynamics. It also provides a taxonomy that separates perception behaviors from generation behaviors, and a rubric-based LM-as-judge framework for assessing nonverbal conversational quality. Evaluating both open- and closed-source vision-speech agents on VideoFDB revealed systematic failures, including "captioning collapse" and "visual-stream ignorance." Current systems use vision mainly for explicit visual question answering rather than for the streaming joint audiovisual grounding that natural conversation requires. Cascaded speech-to-avatar systems were also found to be inherently unable to produce full-duplex nonverbal cues.

Key takeaway

For AI scientists and machine learning engineers building conversational agents, this benchmark exposes critical shortcomings in current full-duplex audio-visual capabilities. Prioritize systems that ground audio and video jointly and in a streaming fashion, rather than treating vision as a channel for explicit visual question answering. If the goal is natural, full-duplex nonverbal cues, avoid cascaded speech-to-avatar architectures, which the analysis found inherently unable to produce them.

Key insights

VideoFDB is the first benchmark to evaluate full-duplex audio-visual conversational agents, revealing current systems' limitations in nonverbal grounding.

Principles

Natural conversation is full-duplex and audio-visual.
Perception and generation behaviors require distinct evaluation.
Current agents fail at streaming joint audiovisual grounding.

Method

VideoFDB uses 237 dyadic clips with 11 nonverbal dynamics, a perception/generation taxonomy, and a rubric-based LM-as-judge framework to assess AV2AV conversational quality.
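
As a rough illustration of how judge scores might be rolled up along the perception/generation split, the sketch below averages per-clip rubric scores first per dynamic and then per taxonomy category. Everything in it is an assumption made for illustration: the `JudgedClip` record, the `aggregate_by_category` helper, the placeholder dynamic labels, and the 0-to-1 score scale are hypothetical and are not taken from VideoFDB's actual rubric or tooling.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

CATEGORIES = ("perception", "generation")


@dataclass(frozen=True)
class JudgedClip:
    """One judged clip (hypothetical record layout, not the benchmark's schema)."""
    clip_id: str
    dynamic: str     # placeholder label; the 11 dynamics are not named here
    category: str    # "perception" or "generation"
    score: float     # assumed judge rubric score in [0, 1]


def aggregate_by_category(results):
    """Average scores per dynamic, then macro-average dynamics per category."""
    grouped = defaultdict(lambda: defaultdict(list))
    for r in results:
        if r.category not in CATEGORIES:
            raise ValueError(f"unknown category: {r.category}")
        grouped[r.category][r.dynamic].append(r.score)

    report = {}
    for category, dynamics in grouped.items():
        per_dynamic = {name: mean(scores) for name, scores in dynamics.items()}
        report[category] = {
            "per_dynamic": per_dynamic,
            # Macro-average so a dynamic with many clips does not dominate.
            "macro_avg": mean(list(per_dynamic.values())),
        }
    return report


if __name__ == "__main__":
    demo = [
        JudgedClip("clip-1", "dynamic_a", "perception", 1.0),
        JudgedClip("clip-2", "dynamic_a", "perception", 0.0),
        JudgedClip("clip-3", "dynamic_b", "perception", 1.0),
        JudgedClip("clip-4", "dynamic_c", "generation", 0.5),
    ]
    for category, stats in aggregate_by_category(demo).items():
        print(category, stats)
```

Reporting the two categories separately, rather than as one pooled number, keeps a system that perceives cues well but cannot generate them from looking average overall.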

In practice

Use VideoFDB to benchmark AV2AV agent performance.
Focus development on streaming audiovisual grounding.
Avoid cascaded speech-to-avatar architectures when full-duplex nonverbal cues are required.

Topics

VideoFDB
Full-Duplex Conversation
Multimodal Conversational Agents
Audio-Visual Benchmarking
Nonverbal Communication
LM-as-Judge Evaluation

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.