SOTOPIA-TOM: Evaluating Information Management in Multi-Agent Interaction with Theory of Mind

Source: cs.MA updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems) · Depth: Expert, extended

Summary

Sotopia-ToM is a new benchmarking framework for evaluating how well Large Language Model (LLM) agents manage information asymmetry and privacy in multi-party interactions. Its environment supports both public and private communication channels and includes 160 human-reviewed scenarios across eight industry sectors, each involving 3 to 5 agents with partitioned private knowledge and channel-dependent sharing policies. A multi-dimensional evaluation suite assesses information sharing, detail seeking, coordination efficiency, and privacy protection, and aggregates them into a composite InfoMgmt metric. Empirical results across 6 LLM backbones and several prompting strategies (vanilla, CoT-privacy, and ToM-based interventions) show that even the largest high-reasoning model, GPT-5, achieves only a 62% InfoMgmt score, which points to persistent deficiencies in information seeking and privacy-aware decision-making. ToM-based interventions such as ToM-Coach consistently improve the coordination-privacy balance. On GPT-4o, for example, ToM-Coach reduces critical privacy violations from 9.9% to 2.2% and raises the InfoMgmt score from 15% to 40%.

Key takeaway

For research scientists and CTOs developing multi-agent LLM systems, this work indicates that current models, even GPT-5, significantly underperform in complex information management and privacy-aware coordination. You should prioritize research into strategic disclosure planning and advanced inquiry mechanisms, as these remain critical bottlenecks. Consider integrating ToM-based reasoning, specifically ToM-Coach or ToM-Belief, to enhance privacy protection and overall coordination, but be aware that fundamental limitations persist, particularly in proactive information seeking.

Key insights

LLM agents struggle with information management and privacy in multi-party interactions, even with advanced ToM prompting.

Method

Sotopia-ToM combines three components: a multi-stage pipeline that generates 160 human-reviewed scenarios, an N-agent simulator with public and private channels, and a four-metric evaluation suite (DA, IA, CPV, EFF) whose results are aggregated into an InfoMgmt score.
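To make the public/private channel idea concrete, here is a minimal Python sketch of an N-agent message room with a per-fact sharing policy. It is an illustration only, not the Sotopia-ToM implementation: the names `Fact`, `Message`, `Room`, and `find_leaks`, the policy format (a set of agents allowed to learn each fact), and the hand-tagged `fact_ids` on messages are all assumptions made for this example. The summary does not describe how the benchmark itself detects disclosures or computes its metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    fact_id: str
    owner: str
    allowed: frozenset  # agents permitted to learn this fact (hypothetical policy format)


@dataclass(frozen=True)
class Message:
    sender: str
    recipients: frozenset
    text: str
    fact_ids: tuple = ()  # facts the message reveals, tagged by hand in this sketch


class Room:
    """Toy N-agent room: a message with no explicit recipients is public."""

    def __init__(self, agents):
        self.agents = frozenset(agents)
        self.log = []

    def send(self, sender, text, to=None, fact_ids=()):
        # Public channel: everyone except the sender. Private channel: only `to`.
        recipients = self.agents - {sender} if to is None else frozenset(to)
        if sender not in self.agents or not recipients <= self.agents:
            raise ValueError("unknown agent")
        self.log.append(Message(sender, recipients, text, tuple(fact_ids)))

    def visible_to(self, agent):
        # An agent sees only messages it sent or was addressed in.
        return [m for m in self.log if agent == m.sender or agent in m.recipients]


def find_leaks(room, facts):
    """Return (fact_id, sender, exposed_agents) for each policy-violating message."""
    by_id = {f.fact_id: f for f in facts}
    leaks = []
    for m in room.log:
        for fid in m.fact_ids:
            fact = by_id[fid]
            exposed = m.recipients - fact.allowed - {fact.owner}
            if exposed:
                leaks.append((fid, m.sender, sorted(exposed)))
    return leaks


# Usage: alice may share the budget with bob, but not with carol.
room = Room(["alice", "bob", "carol"])
budget = Fact("budget", owner="alice", allowed=frozenset({"bob"}))

room.send("alice", "The budget cap is confidential.", to=["bob"], fact_ids=["budget"])
leaks_after_private = find_leaks(room, [budget])

room.send("alice", "FYI everyone, here is the budget cap.", fact_ids=["budget"])
leaks_after_public = find_leaks(room, [budget])
```

In this toy setup the private message to bob respects the policy, while the later public broadcast exposes the fact to carol. A real evaluator would need to infer which facts a free-text message reveals rather than rely on explicit tags.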

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.MA updates on arXiv.org.