SocialGrid: A Benchmark for Planning and Social Reasoning in Embodied Multi-Agent Systems

2026-04-20 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

SocialGrid is a new embodied multi-agent benchmark designed to evaluate Large Language Models (LLMs) on spatial planning, task execution, and adversarial social reasoning. Inspired by "Among Us," the environment places LLM agents in a gridworld where "Crewmates" complete tasks while identifying hidden "Impostors" who sabotage the mission. Evaluations of models ranging from 14B to 120B parameters, including GPT-OSS-120B, Llama3.1-70B, and Qwen3-30B, reveal significant deficits. Even the strongest open model, GPT-OSS-120B, achieves below 60% accuracy in task completion and planning without assistance, often getting stuck in repetitive behaviors. SocialGrid includes an optional Planning Oracle to isolate social reasoning from navigation issues. While this oracle improves task completion, social reasoning remains a bottleneck, with agents performing near random chance (around 33%) in detecting deception, regardless of model scale or environmental complexity. Analysis shows agents rely on shallow heuristics rather than accumulating behavioral evidence.

Key takeaway

Research scientists developing embodied LLM agents should prioritize fundamental improvements in spatial planning and robust social reasoning. Current models, even large ones like GPT-OSS-120B, exhibit severe limitations in navigation and deception detection. Progress will likely require moving beyond simple scaling and shallow heuristics, potentially through new architectural approaches or advanced reinforcement learning techniques, so that agents can integrate spatial and social intelligence well enough for real-world deployment.

Key insights

LLMs struggle with embodied spatial planning and social reasoning, failing to detect deception even with navigation assistance.

Principles

Spatial planning is a fundamental bottleneck for LLM agents.
Social reasoning does not improve with model scale.
Shallow heuristics hinder effective deception detection.

Method

SocialGrid evaluates LLM agents in a customizable gridworld with task and voting phases, using a Planning Oracle to isolate social reasoning, and provides multi-dimensional metrics and failure analysis.
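The summary does not say how the Planning Oracle computes its routes. As a minimal sketch of the idea, assuming a 4-connected grid with walls marked `#` and a breadth-first search (the function name `oracle_path` and the grid encoding are hypothetical, not taken from the SocialGrid code), an oracle could hand the agent a shortest path so that only its social decisions are left to the LLM:

```python
from collections import deque

# Hypothetical grid encoding: "." is a free cell, "#" is a wall.
GRID = [
    "..#.",
    "..#.",
    "....",
]


def oracle_path(grid, start, goal):
    """Return a shortest list of (row, col) cells from start to goal, or None.

    Uses breadth-first search over 4-connected free cells. This is an
    illustrative stand-in, not the benchmark's actual oracle.
    """
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            nxt = (nr, nc)
            if (
                0 <= nr < rows
                and 0 <= nc < cols
                and grid[nr][nc] != "#"
                and nxt not in parent
            ):
                parent[nxt] = cell
                queue.append(nxt)
    return None  # goal is unreachable


path = oracle_path(GRID, (0, 0), (0, 3))
```

In this toy layout the wall forces a detour through the bottom row, so `path` has 8 cells (7 moves). With navigation supplied this way, any remaining errors can be attributed to the agent's reasoning rather than to its movement.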

In practice

Use the optional Planning Oracle to separate social-reasoning failures from navigation failures.
Focus on improving behavioral evidence accumulation for social reasoning.
Implement automated failure analysis for agent diagnostics.
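One way to make "behavioral evidence accumulation" concrete is a running Bayesian suspicion score per agent, updated after every observed event instead of being decided from the latest event alone. The sketch below is an assumption for illustration: the event names, the likelihood ratios, and the functions `update_suspicion` and `posterior` are all hypothetical and do not come from the paper or its repository.

```python
import math

# Hypothetical likelihood ratios: P(event | impostor) / P(event | crewmate).
# Values above 1 raise suspicion, values below 1 lower it.
LIKELIHOOD_RATIO = {
    "completed_task": 0.25,
    "near_sabotage": 3.0,
    "idle": 1.5,
}


def prior_log_odds(agents, num_impostors=1):
    """Uniform prior: each agent is an impostor with probability k / n."""
    p = num_impostors / len(agents)
    return {agent: math.log(p / (1.0 - p)) for agent in agents}


def update_suspicion(log_odds, observations):
    """Add the log likelihood ratio of each (agent, event) observation.

    Each agent is scored independently, which is a simplification: the
    resulting probabilities are not renormalized across agents.
    """
    updated = dict(log_odds)
    for agent, event in observations:
        updated[agent] += math.log(LIKELIHOOD_RATIO[event])
    return updated


def posterior(log_odds):
    """Convert log-odds back to probabilities with the logistic function."""
    return {agent: 1.0 / (1.0 + math.exp(-v)) for agent, v in log_odds.items()}


scores = prior_log_odds(["red", "blue", "green"])
scores = update_suspicion(
    scores,
    [
        ("red", "near_sabotage"),
        ("blue", "completed_task"),
        ("green", "idle"),
        ("red", "near_sabotage"),
    ],
)
beliefs = posterior(scores)
vote = max(beliefs, key=beliefs.get)
```

Here the two separate sightings of `red` near a sabotage compound, so `red` ends up as the vote even though no single event is decisive. A shallow heuristic that reacts only to the most recent observation would not show this cumulative effect.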

Topics

SocialGrid Benchmark
Embodied LLM Agents
Spatial Planning Deficits
Social Reasoning Failure
Deception Detection

Code references

ml-research/SocialGrid

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.