SocialGrid: A Benchmark for Planning and Social Reasoning in Embodied Multi-Agent Systems
Summary
SocialGrid is an embodied multi-agent benchmark that evaluates Large Language Models (LLMs) on spatial planning, task execution, and adversarial social reasoning. Inspired by "Among Us," the environment places LLM agents in a gridworld where "Crewmates" complete tasks while trying to identify hidden "Impostors" who sabotage the mission. Evaluations of models ranging from 14B to 120B parameters, including GPT-OSS-120B, Llama3.1-70B, and Qwen3-30B, reveal significant deficits in both planning and social reasoning. Even the strongest open model, GPT-OSS-120B, stays below 60% accuracy in task completion and planning when unassisted, and often gets stuck in repetitive behaviors. To isolate social reasoning from navigation failures, SocialGrid includes an optional Planning Oracle. The oracle improves task completion, but social reasoning remains a bottleneck: agents detect deception at near random chance (around 33%), regardless of model scale or environmental complexity. Analysis shows that agents rely on shallow heuristics rather than accumulating behavioral evidence.
Key takeaway
Research scientists developing embodied LLM agents should prioritize fundamental improvements in spatial planning and robust social reasoning. Current models, even large ones like GPT-OSS-120B, show severe limitations in navigation and deception detection. Scaling alone and shallow heuristics are not enough: enabling agents to integrate spatial and social intelligence for real-world deployment may require new architectural approaches or more advanced reinforcement learning techniques.
Key insights
LLMs struggle with embodied spatial planning and social reasoning, failing to detect deception even with navigation assistance.
Principles
- Spatial planning is a fundamental bottleneck for LLM agents.
- Social reasoning does not improve with model scale.
- Shallow heuristics hinder effective deception detection.
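To make the last principle concrete, the following is a minimal sketch of what accumulating behavioral evidence could look like, as opposed to voting on a single salient observation. It is not the benchmark's method: the event names, the likelihood ratios, and the functions `update_suspicion` and `most_suspicious` are all hypothetical, chosen only for illustration.

```python
import math

# Hypothetical likelihood ratios P(event | impostor) / P(event | crewmate).
# These values are illustrative assumptions, not taken from SocialGrid.
LIKELIHOOD_RATIOS = {
    "completed_task": 0.5,   # more typical of a crewmate
    "near_sabotage": 3.0,    # more typical of an impostor
    "idle": 1.5,             # weakly suspicious
}


def update_suspicion(log_odds, observations, likelihood_ratios):
    """Return new log-odds after adding each observed event's log-likelihood ratio."""
    updated = dict(log_odds)
    for agent, event in observations:
        updated[agent] += math.log(likelihood_ratios[event])
    return updated


def most_suspicious(log_odds):
    """Return the agent with the highest accumulated log-odds of being an impostor."""
    return max(log_odds, key=log_odds.get)


# Start every agent at even odds and fold in observations round by round.
log_odds = {"red": 0.0, "blue": 0.0, "green": 0.0}
rounds = [
    [("red", "completed_task"), ("blue", "idle")],
    [("blue", "near_sabotage"), ("green", "completed_task")],
]
for observations in rounds:
    log_odds = update_suspicion(log_odds, observations, LIKELIHOOD_RATIOS)

print(most_suspicious(log_odds))
```

The point of the sketch is that suspicion is carried across rounds, so several weak signals about the same agent add up instead of being overwritten by the most recent one.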
Method
SocialGrid evaluates LLM agents in a customizable gridworld with task and voting phases, using a Planning Oracle to isolate social reasoning, and provides multi-dimensional metrics and failure analysis.
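The summary does not describe how the Planning Oracle is implemented. As a rough illustration of the idea, a navigation oracle for a gridworld can be sketched with breadth-first search: it hands the agent the first step of a shortest path so that the LLM only has to decide where to go, not how to get there. The grid encoding (`#` for walls, `.` for free cells) and the function name `oracle_next_move` are assumptions made for this example.

```python
from collections import deque


def oracle_next_move(grid, start, goal):
    """Return the first (d_row, d_col) step on a shortest path, or None if unreachable.

    grid is a list of equal-length strings; '#' marks a wall (assumed encoding).
    """
    if start == goal:
        return (0, 0)
    rows, cols = len(grid), len(grid[0])
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk back along parents until reaching the cell adjacent to start.
            while parent[cell] != start:
                cell = parent[cell]
            return (cell[0] - start[0], cell[1] - start[1])
        for d_row, d_col in moves:
            nxt = (cell[0] + d_row, cell[1] + d_col)
            in_bounds = 0 <= nxt[0] < rows and 0 <= nxt[1] < cols
            if in_bounds and grid[nxt[0]][nxt[1]] != "#" and nxt not in parent:
                parent[nxt] = cell
                queue.append(nxt)
    return None


grid = [
    "..#.",
    "#.#.",
    "....",
]
first_step = oracle_next_move(grid, (0, 0), (0, 3))
print(first_step)
```

With an oracle like this in the loop, any remaining failures can be attributed to task selection or social reasoning rather than to pathfinding.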
In practice
- Use a Planning Oracle to bypass LLM navigation deficits.
- Focus on improving behavioral evidence accumulation for social reasoning.
- Implement automated failure analysis for agent diagnostics.
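One failure mode named in the summary is agents getting stuck in repetitive behaviors. A simple automated check for it, sketched here under assumptions, scans the end of an agent's action trace for a short cycle that repeats several times. The function name `find_repeating_cycle`, the action labels, and the default thresholds are hypothetical and do not come from SocialGrid's own failure analysis.

```python
def find_repeating_cycle(actions, max_period=4, min_repeats=3):
    """Return the shortest cycle that repeats at the end of the trace, or None.

    A cycle of length `period` counts as a loop when the last
    `period * min_repeats` actions consist of that cycle repeated back to back.
    """
    for period in range(1, max_period + 1):
        needed = period * min_repeats
        if len(actions) < needed:
            continue
        tail = actions[-needed:]
        cycle = tail[:period]
        if all(tail[i] == cycle[i % period] for i in range(needed)):
            return cycle
    return None


# An agent oscillating between two cells versus one making progress.
stuck_trace = ["up", "right", "left", "right", "left", "right", "left"]
normal_trace = ["up", "right", "down", "down", "left", "up"]

print(find_repeating_cycle(stuck_trace))
print(find_repeating_cycle(normal_trace))
```

Flagging such loops per episode makes it easier to separate navigation failures from social reasoning failures when reading aggregate scores.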
Topics
- SocialGrid Benchmark
- Embodied LLM Agents
- Spatial Planning Deficits
- Social Reasoning Failure
- Deception Detection
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.