SocialGrid: A Benchmark for Planning and Social Reasoning in Embodied Multi-Agent Systems
Summary
SocialGrid is a new embodied multi-agent environment, inspired by the game Among Us, designed to evaluate Large Language Model (LLM) agents on planning, task execution, and social reasoning. Initial evaluations using SocialGrid show that even the most powerful open model, GPT-OSS-120B, achieves less than 60% accuracy in task completion and planning, frequently exhibiting repetitive behaviors or navigation failures. To assess social intelligence without confounding navigation issues, SocialGrid includes an optional Planning Oracle. While this oracle improves task completion, LLM agents still struggle significantly with social reasoning: they detect deception at near-random chance and rely on superficial heuristics rather than accumulating evidence. The platform offers automatic failure analysis, fine-grained metrics, and a competitive leaderboard based on Elo ratings from adversarial league play.
Key takeaway
Research scientists developing embodied LLM agents should prioritize improving core planning and navigation capabilities before expecting robust social intelligence. Current agents detect deception at near-random rates even with planning assistance, which indicates a need to move beyond shallow heuristics. Use SocialGrid's fine-grained metrics and failure analysis to diagnose specific weaknesses in an agent's social reasoning and task execution.
Key insights
LLM agents struggle with planning, task execution, and social reasoning in embodied multi-agent environments.
Principles
- Poor navigation confounds social intelligence evaluation.
- Social reasoning remains a bottleneck for LLMs.
- LLMs rely on shallow heuristics for deception detection.
Method
SocialGrid evaluates LLM agents in an Among Us-inspired environment, offering a Planning Oracle to isolate social reasoning deficits and using Elo ratings for leaderboard competition.
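To make the Elo-based leaderboard concrete, here is a minimal sketch of the standard two-player Elo update. It is an illustration only: this summary does not describe how SocialGrid turns multi-agent league games into rating updates, so the pairwise framing, the function names, the K-factor of 32, and the 400-point scale are all assumptions taken from conventional Elo practice.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one game.

    score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for an A loss.
    k (assumed 32 here) controls how far a single game moves a rating.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Zero-sum update: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Two equally rated agents; A wins one game.
    print(update_elo(1500.0, 1500.0, 1.0))
```

An upset win against a higher-rated opponent moves both ratings more than an expected win, which is why ratings from repeated adversarial play settle into a ranking.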
In practice
- Use the Planning Oracle to isolate social reasoning from navigation failures.
- Focus on improving deception detection in LLMs.
- Analyze agent failures with SocialGrid's metrics.
Topics
- SocialGrid
- Embodied Multi-Agent Systems
- LLM Agents
- Social Reasoning
- Planning Oracle
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.