Social Norm Reasoning in Multimodal Language Models: An Evaluation

2026-03-05 · Source: cs.MA updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Advanced, extended

Summary

A new evaluation framework assesses the social norm reasoning capabilities of five Multimodal Large Language Models (MLLMs): GPT-4o, Gemini 2.0 Flash, Qwen-2.5VL (72B), Intern-VL3 (14B), and Meta LLaMa-4 Maverick. Researchers from the University of Otago evaluated these models on 30 text-based and 30 image-based stories and compared their responses to human ground truth. In each modality, the 30 stories cover five social norms, each in six variants of adherence or violation. On average, the models reasoned about norms more accurately from text (95.33%) than from images (83.58%), a gap of 11.75 percentage points. GPT-4o achieved the highest accuracy in both modalities (98.75% text, 92.5% image), followed by the free model Qwen-2.5VL (97.5% text, 85.41% image). All models struggled with complex "metanorms" involving multiple layers of reasoning.

Key takeaway

For AI scientists developing socially intelligent agents, this research indicates that current MLLMs, particularly GPT-4o, reason about norms reliably from text but less accurately from visual inputs. Design systems that either prioritize textual context for norm interpretation or strengthen the visual processing used to read social cues. Reasoning about metanorms remains a significant challenge, so critical applications may need further research or explicit rule encoding.

Key insights

MLLMs demonstrate stronger social norm reasoning in text than images, with GPT-4o leading performance.

Principles

MLLMs excel at textual inference for social norms.
Visual understanding of social contexts remains a challenge for MLLMs.
Complex, multi-layered norms (metanorms) are difficult for MLLMs.

Method

The study evaluated five MLLMs on 30 text stories and 30 image stories, covering five norms in six adherence/violation variants each. Each model's responses to eight questions were compared against human ground truth to assess its norm reasoning.
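The scoring step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the norm and variant names are placeholders (the summary does not list them), the answers are toy data, and the sketch assumes that the eight questions were asked of every story, which the summary does not state explicitly.

```python
from itertools import product

# Placeholder labels: the summary does not name the five norms or six variants.
NORMS = [f"norm_{i}" for i in range(1, 6)]
VARIANTS = [f"variant_{j}" for j in range(1, 7)]
QUESTIONS_PER_STORY = 8  # assumption: eight questions asked per story


def build_story_grid():
    """One story per (norm, variant) pair, i.e. 5 x 6 = 30 stories per modality."""
    return list(product(NORMS, VARIANTS))


def accuracy(model_answers, ground_truth):
    """Percentage of (story, question) items where the model matches the human label."""
    if not ground_truth:
        raise ValueError("ground_truth is empty")
    correct = sum(
        1 for key, truth in ground_truth.items() if model_answers.get(key) == truth
    )
    return 100.0 * correct / len(ground_truth)


stories = build_story_grid()

# Toy ground truth: every answer is "yes". Real labels would come from human annotators.
ground_truth = {
    (story, q): "yes" for story in stories for q in range(QUESTIONS_PER_STORY)
}

# Toy model output: identical to the ground truth except for three answers.
model_answers = dict(ground_truth)
for key in list(ground_truth)[:3]:
    model_answers[key] = "no"

print(len(stories), len(ground_truth), accuracy(model_answers, ground_truth))
```

Under the per-story assumption, each modality yields 240 scored items, and three mismatches give 98.75%. That happens to equal the text accuracy reported for GPT-4o, which is consistent with this reading of the design but does not confirm it.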

In practice

Prioritize text-based inputs for MLLM norm reasoning tasks.
Consider GPT-4o for high-accuracy, multimodal norm reasoning.
Use Qwen-2.5VL as a strong free alternative (97.5% text, 85.41% image).

Topics

Multimodal Large Language Models
Social Norm Reasoning
Multi-Agent Systems
GPT-4o Performance
Metanorms

Best for: AI Scientist, Research Scientist, AI Researcher, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.MA updates on arXiv.org.