Social Norm Reasoning in Multimodal Language Models: An Evaluation
Summary
A new evaluation framework assesses the social norm reasoning capabilities of five Multimodal Large Language Models (MLLMs): GPT-4o, Gemini 2.0 Flash, Qwen-2.5VL (72B), Intern-VL3 (14B), and Meta LLaMa-4 Maverick. Researchers from the University of Otago evaluated these models on 30 text-based and 30 image-based stories, covering five social norms in six variants of adherence or violation each, and compared the models' responses to human ground truth. The MLLMs were more accurate on text-based norm reasoning (average 95.33%) than on image-based reasoning (average 83.58%). GPT-4o achieved the highest accuracy in both modalities (98.75% text, 92.5% image), followed by the freely available Qwen-2.5VL (97.5% text, 85.41% image). All models struggled with complex "metanorms" involving multiple layers of reasoning.
Key takeaway
For AI scientists developing socially intelligent agents, this research indicates that current MLLMs, particularly GPT-4o, reason about norms accurately from text but lose accuracy with visual inputs. Design systems that either prioritize textual context for norm interpretation or strengthen visual processing to handle complex social cues. Reasoning about metanorms remains a significant challenge and may require further research or explicit rule encoding in critical applications.
Key insights
MLLMs demonstrate stronger social norm reasoning in text than images, with GPT-4o leading performance.
Principles
- MLLMs excel at textual inference for social norms.
- Visual understanding of social contexts remains a challenge for MLLMs.
- Complex, multi-layered norms (metanorms) are difficult for MLLMs.
Method
The study evaluated each MLLM on 30 text and 30 image stories spanning five norms and six adherence/violation variants. For each story, the model's answers to eight questions were compared against human ground truth to assess norm reasoning.
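The scoring described in the method can be sketched in a few lines of Python. This is a minimal illustration, not the study's actual pipeline: the norm and variant labels, the `build_story_grid` and `accuracy` helpers, and the answer format are all hypothetical, and the sketch assumes the eight questions apply to every story, giving 240 scored answers per modality.

```python
from itertools import product

# Hypothetical placeholder labels: the summary does not name the five norms
# or the six adherence/violation variants.
NORMS = [f"norm_{i}" for i in range(1, 6)]
VARIANTS = [f"variant_{j}" for j in range(1, 7)]
QUESTIONS_PER_STORY = 8  # assumption: eight questions asked for every story


def build_story_grid():
    """Return one (norm, variant) pair per story: 5 norms x 6 variants."""
    return list(product(NORMS, VARIANTS))


def accuracy(model_answers, ground_truth):
    """Percentage of model answers that match the human ground truth.

    Both arguments map a (norm, variant, question_index) key to an answer.
    """
    if model_answers.keys() != ground_truth.keys():
        raise ValueError("model answers and ground truth cover different items")
    correct = sum(model_answers[key] == ground_truth[key] for key in ground_truth)
    return 100.0 * correct / len(ground_truth)


# Toy data: every ground-truth answer is "yes", and the hypothetical model
# gets three of the 240 answers wrong.
stories = build_story_grid()
ground_truth = {
    (norm, variant, q): "yes"
    for norm, variant in stories
    for q in range(QUESTIONS_PER_STORY)
}
model_answers = dict(ground_truth)
for key in list(ground_truth)[:3]:
    model_answers[key] = "no"

print(len(stories), len(ground_truth))         # prints: 30 240
print(accuracy(model_answers, ground_truth))   # prints: 98.75
```

Under these assumptions, three mismatches out of 240 answers yield 98.75%, which is arithmetically consistent with the top text accuracy reported in the summary. The toy data itself is invented for illustration.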
In practice
- Prioritize text-based inputs for MLLM norm reasoning tasks.
- Consider GPT-4o for high-accuracy, multimodal norm reasoning.
- Use Qwen-2.5VL as a strong open-source alternative.
Topics
- Multimodal Large Language Models
- Social Norm Reasoning
- Multi-Agent Systems
- GPT-4o Performance
- Metanorms
Best for: AI Scientist, Research Scientist, AI Researcher, Machine Learning Engineer, Robotics Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.MA updates on arXiv.org.