Do Large Language Models Possess a Theory of Mind? A Comparative Evaluation Using the Strange Stories Paradigm

2026-03-20 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, quick

Summary

This study tests whether five Large Language Models (LLMs) show Theory of Mind (ToM) capabilities, specifically the ability to infer beliefs, intentions, and emotions from text, and compares their performance with that of human controls. It uses an adapted version of the text-based "Strange Stories" paradigm, a tool commonly used in human ToM research, in which models answer questions about story characters' mental states. Results showed a significant performance disparity among the LLMs. Earlier and smaller models were sensitive to the number of inferential cues and vulnerable to distracting information. In contrast, GPT-4o achieved high accuracy and robustness, performing comparably to humans even under the most challenging test conditions, which suggests advanced social-cognitive reasoning capabilities.

Key takeaway

For research scientists evaluating LLM social intelligence, this study indicates that GPT-4o exhibits robust Theory of Mind capabilities, performing on par with humans in text-based scenarios. Consider GPT-4o for applications that demand sophisticated inference of beliefs, intentions, and emotions, and expect earlier and smaller models to struggle with distracting information or limited inferential cues.

Key insights

GPT-4o demonstrates human-comparable Theory of Mind capabilities in text-based evaluations, outperforming earlier and smaller LLMs.

Principles

LLM ToM capabilities vary significantly across models, with earlier and smaller models performing worse.
Robust ToM requires handling irrelevant information and sparse cues.

Method

The study adapted the "Strange Stories" paradigm, a human ToM research tool, to evaluate LLMs' ability to infer character beliefs, intentions, and emotions from text-based scenarios.
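The setup described above can be sketched as a small evaluation harness. This is a minimal illustration under stated assumptions, not the study's actual protocol: the StoryItem fields, the way cues and distractors are appended to the story, and the keyword-based scoring are all hypothetical, and ask_model stands in for whatever function sends a prompt to a model and returns its answer.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StoryItem:
    """One hypothetical test item; field names are illustrative, not the paper's."""
    base_story: str               # core narrative
    cues: List[str]               # sentences hinting at the character's mental state
    distractors: List[str]        # sentences irrelevant to the mental state
    question: str                 # question about the character's mental state
    expected_keywords: List[str]  # crude scoring key (an assumption)


def build_prompt(item: StoryItem, n_cues: int, n_distractors: int) -> str:
    """Assemble a story variant with a chosen number of cues and distractors."""
    parts = [item.base_story] + item.cues[:n_cues] + item.distractors[:n_distractors]
    return " ".join(parts) + "\n\nQuestion: " + item.question


def score_answer(answer: str, item: StoryItem) -> bool:
    """Mark an answer correct if it mentions any expected keyword (a crude stand-in)."""
    text = answer.lower()
    return any(keyword.lower() in text for keyword in item.expected_keywords)


def evaluate(items: List[StoryItem], ask_model: Callable[[str], str],
             n_cues: int, n_distractors: int) -> float:
    """Return the fraction of items answered correctly under one test condition."""
    if not items:
        return 0.0
    correct = sum(
        score_answer(ask_model(build_prompt(item, n_cues, n_distractors)), item)
        for item in items
    )
    return correct / len(items)
```

Calling evaluate with fewer cues and more distractors gives a harder condition, so the same items can be reused to compare a model against itself. Keyword matching is only a placeholder here; a real evaluation would need a scoring scheme that can judge free-text explanations of mental states.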

In practice

Use GPT-4o for tasks requiring complex social-cognitive reasoning.
Be aware of cue sensitivity in smaller LLMs for ToM-related applications.
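One way to check cue sensitivity before deploying a smaller model is to compare its accuracy across test conditions. The sketch below is illustrative only: the condition labels, the record format, and the robustness_gap helper are assumptions made for this example and do not come from the study.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Each record is (condition label, whether the model answered correctly).
Record = Tuple[str, bool]


def accuracy_by_condition(records: Iterable[Record]) -> Dict[str, float]:
    """Compute per-condition accuracy from graded answers."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for condition, is_correct in records:
        totals[condition] += 1
        hits[condition] += int(is_correct)
    return {condition: hits[condition] / totals[condition] for condition in totals}


def robustness_gap(accuracy: Dict[str, float], easy: str, hard: str) -> float:
    """Accuracy lost when moving from the easy condition to the hard one."""
    return accuracy[easy] - accuracy[hard]
```

A gap near zero would be consistent with the robustness reported for GPT-4o, while a large positive gap would match the cue sensitivity reported for earlier and smaller models.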

Topics

Large Language Models
Theory of Mind
GPT-4o
Cognitive Status
Natural Language Processing

Best for: Research Scientist, AI Researcher, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.