SocialGrid: A Benchmark for Planning and Social Reasoning in Embodied Multi-Agent Systems

Source: Artificial Intelligence · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems) · Depth: Expert, quick

Summary

SocialGrid is a new embodied multi-agent environment, inspired by the game Among Us, designed to evaluate Large Language Model (LLM) agents on planning, task execution, and social reasoning. Initial evaluations show that even the most powerful open model tested, GPT-OSS-120B, achieves less than 60% accuracy in task completion and planning, and frequently falls into repetitive behaviors or navigation failures. To assess social intelligence without the confound of navigation problems, SocialGrid includes an optional Planning Oracle. The oracle improves task completion, but LLM agents still struggle with social reasoning: they detect deception at near-random rates and rely on superficial heuristics rather than accumulating evidence. The platform offers automatic failure analysis, fine-grained metrics, and a competitive leaderboard based on Elo ratings from adversarial league play.

Key takeaway

Research scientists developing embodied LLM agents should prioritize core planning and navigation capabilities before expecting robust social intelligence. Current agents detect deception at near-random rates even with planning assistance, which points to a need to move beyond shallow heuristics. Use SocialGrid's fine-grained metrics and failure analysis to diagnose specific weaknesses in an agent's social reasoning and task execution.

Key insights

LLM agents struggle with planning, task execution, and social reasoning in embodied multi-agent environments.

Principles

Method

SocialGrid evaluates LLM agents in an Among Us-inspired environment, offering a Planning Oracle to isolate social reasoning deficits and using Elo ratings for leaderboard competition.
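
The summary does not say how SocialGrid computes its Elo ratings, so the following is only a minimal sketch of the standard two-player Elo update, to illustrate how a leaderboard of this kind can be derived from match outcomes. The function names, the K-factor of 32, and the 400-point scale are assumptions taken from the conventional Elo formula, not from SocialGrid itself; a multi-agent league may well use a team-based or otherwise modified variant.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one match.

    score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for an A loss.
    k (assumed value 32) controls how far a single result moves a rating.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Two hypothetical agents start level; agent A wins one match.
    a, b = update_elo(1500.0, 1500.0, score_a=1.0)
    print(a, b)  # 1516.0 1484.0
```

Because the expected score depends on the rating gap, a win over a much stronger opponent moves a rating more than a win over a weaker one, which is what makes Elo a reasonable fit for ranking agents across many adversarial matches.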

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.