Can Agents Read the Room? Benchmarking Visual Social Intelligence in Multimodal Simulation
Summary
AgentViSS is a new benchmark for evaluating visual social intelligence in multimodal social simulation, a capability that existing text-based social-agent benchmarks do not cover. It tests whether multimodal agents can use visual cues such as facial expressions, posture, and gaze to guide their interactions. The benchmark comprises 240 scenarios, 585 role instances, and 2,340 role-task instances, and pairs aligned textual-visual evidence with structured role profiles. It defines four role-level tasks: expression, characteristic, interaction regulation, and interaction outcome. Initial evaluations covered seven recent multimodal large language models (MLLMs) under both verbalized-vision and direct-vision settings. Role-specific expression and conflict handling are nearing saturation, while interaction regulation and visually grounded outcome achievement remain substantially harder for these models. The code and dataset are publicly available.
Key takeaway
If you are an AI scientist or machine learning engineer developing multimodal agents for social interaction, prioritize improving your models' visual social intelligence beyond basic role enactment. Current MLLMs likely struggle with complex interaction regulation and with achieving visually grounded outcomes, as the AgentViSS results show. Focus research on agents' ability to interpret and respond to subtle visual cues when managing an interaction. Consider using the AgentViSS benchmark to guide model development and to track progress in these harder areas.
Key insights
Multimodal agents struggle with complex visual social cues for interaction regulation and outcome achievement.
Principles
- Social interaction requires both language and visual signals.
- Text-based benchmarks miss visual social intelligence.
- Local role enactment differs from interaction management.
Method
AgentViSS evaluates visual social intelligence using 240 scenarios, 585 role instances, and 2,340 role-task instances. It combines textual-visual evidence, role profiles, and four tasks: expression, characteristic, interaction regulation, and interaction outcome.
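The reported counts are consistent with each role instance being paired with each of the four tasks (585 × 4 = 2,340). The minimal Python sketch below illustrates that pairing. It is an inference from the numbers above, not the benchmark's actual data format: the class name, field names, and integer role identifiers are hypothetical.

```python
from dataclasses import dataclass
from itertools import product

# The four role-level tasks named in the summary.
TASKS = (
    "expression",
    "characteristic",
    "interaction regulation",
    "interaction outcome",
)


@dataclass(frozen=True)
class RoleTaskInstance:
    """One evaluation unit: a role instance paired with a task (hypothetical schema)."""
    role_id: int  # hypothetical identifier for a role instance
    task: str     # one of TASKS


def build_role_task_instances(num_roles: int) -> list[RoleTaskInstance]:
    """Pair every role instance with every task (assumed, not confirmed by the source)."""
    return [RoleTaskInstance(r, t) for r, t in product(range(num_roles), TASKS)]


instances = build_role_task_instances(585)
print(len(instances))  # 585 role instances x 4 tasks
```

Under this assumption, per-task results can be reported by grouping instances on the `task` field, which matches how the summary contrasts the easier and harder tasks.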
In practice
- Benchmark MLLMs on visual social intelligence.
- Focus MLLM development on interaction regulation.
- Use the AgentViSS dataset for multimodal agent training.
Topics
- AgentViSS Benchmark
- Multimodal LLMs
- Visual Social Intelligence
- Social Simulation
- Agent Benchmarking
- Interaction Regulation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.