Theory of Mind in Action: The Instruction Inference Task in Dynamic Human-Agent Collaboration

2026-04-21 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

Researchers introduce "Instruction Inference," a novel task designed to assess Theory of Mind (ToM) capabilities in AI agents within dynamic, goal-oriented, collaborative environments. They developed "Tomcat," an LLM-based agent, and implemented two variants: Fs-CoT (few-shot chain-of-thought) and CP (commonsense prompt). These variants were realized on GPT-4o, DeepSeek-R1, and Gemma-3-27B. A study with 52 human participants evaluated Tomcat's effectiveness, measuring intent accuracy, action optimality, and planning optimality. The Fs-CoT variant, particularly with GPT-4o and DeepSeek-R1, achieved performance comparable to human participants, demonstrating its potential for human-AI collaboration. The CP variant consistently underperformed humans, and Gemma-3-27B generally lagged behind other LLMs, even with Fs-CoT.

Key takeaway

Research scientists developing collaborative AI agents should consider few-shot chain-of-thought (Fs-CoT) prompting to improve an LLM's ability to infer human intent from ambiguous instructions and generate optimal action plans. In the study, Fs-CoT on GPT-4o or DeepSeek-R1 performed comparably to human participants in dynamic, goal-oriented tasks, while Gemma-3-27B generally lagged even with Fs-CoT. Model capacity is therefore worth weighing when selecting an LLM for complex ToM reasoning.

Key insights

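Few-shot chain-of-thought prompting brings LLM agents on GPT-4o and DeepSeek-R1 to human-comparable performance at interpreting ambiguous instructions in collaborative tasks, whereas the commonsense prompt (CP) variant consistently underperforms humans.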

Principles

ToM is crucial for effective human-AI collaboration.
Structured reasoning exemplars improve LLM intent inference.
Model capacity impacts complex ToM reasoning performance.

Method

The Tomcat framework uses common ground, response generation, and demonstration exemplars (CP or Fs-CoT) to enable LLMs to interpret ambiguous instructions, infer human intent, and generate optimal actions and plans in a gridworld environment.
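As a rough illustration of how the three prompt components named above might fit together, the following Python sketch assembles a prompt from common-ground rules, demonstration exemplars, and a response cue. It is not the Tomcat implementation: the `Exemplar` class, the `build_prompt` function, the prompt layout, and the gridworld rules are all hypothetical. This summary does not describe how the CP variant is actually constructed, so it is approximated here as a plain commonsense hint without worked reasoning.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    """One hypothetical demonstration: instruction, reasoning, resolved answer."""
    instruction: str
    answer: str
    reasoning: str = ""  # only used by the Fs-CoT layout below


def build_prompt(common_ground, exemplars, instruction, variant="fs-cot"):
    """Assemble a prompt string (assumed layout, not the paper's actual prompt).

    variant="fs-cot": include worked exemplars with explicit reasoning.
    variant="cp":     include only a commonsense hint (an approximation).
    """
    if variant not in ("fs-cot", "cp"):
        raise ValueError(f"unknown variant: {variant!r}")

    # 1. Common ground: rules both teammates are assumed to share.
    parts = ["Common ground (rules shared by both teammates):"]
    parts += [f"- {rule}" for rule in common_ground]

    # 2. Demonstration exemplars, or a commonsense hint for the CP sketch.
    if variant == "fs-cot":
        for i, ex in enumerate(exemplars, start=1):
            parts.append("")
            parts.append(f"Example {i}")
            parts.append(f"Instruction: {ex.instruction}")
            parts.append(f"Reasoning: {ex.reasoning}")
            parts.append(f"Answer: {ex.answer}")
    else:
        parts.append("")
        parts.append("Use commonsense about what your teammate most likely wants.")

    # 3. Response generation: cue the model to answer the new instruction.
    parts.append("")
    parts.append("Now respond to the new instruction.")
    parts.append(f"Instruction: {instruction}")
    parts.append("Reasoning:" if variant == "fs-cot" else "Answer:")
    return "\n".join(parts)


# Hypothetical gridworld rules and exemplar, for illustration only.
rules = ["A key opens only the door of the same color."]
demos = [
    Exemplar(
        instruction="Grab the key.",
        reasoning="Only the red door blocks the goal, so the red key is meant.",
        answer="Pick up the red key.",
    )
]
print(build_prompt(rules, demos, "Open it.", variant="fs-cot"))
```

The returned string would then be sent to whichever LLM is being evaluated. Ending the Fs-CoT prompt on "Reasoning:" is one simple way to nudge the model to reason about intent before committing to an action.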

In practice

Use Fs-CoT prompting for LLMs in collaborative AI agents.
Prioritize LLMs like GPT-4o or DeepSeek-R1 for ToM tasks.
State the task rules explicitly in the prompt so the agent and the human share a robust common ground.

Topics

Theory of Mind
Human-AI Collaboration
Large Language Models
Instruction Inference Task
Few-shot Chain-of-Thought (Fs-CoT)

Code references

fardinsaad/Tomcat-LLM

Best for: Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.