Theory of Mind in Action: The Instruction Inference Task in Dynamic Human-Agent Collaboration
Summary
Researchers introduce "Instruction Inference," a novel task designed to assess Theory of Mind (ToM) capabilities in AI agents within dynamic, goal-oriented, collaborative environments. They developed "Tomcat," an LLM-based agent, in two variants: Fs-CoT (few-shot chain-of-thought) and CP (commonsense prompt). Each variant was run on three LLMs: GPT-4o, DeepSeek-R1, and Gemma-3-27B. A study with 52 human participants provided the baseline for evaluating Tomcat on intent accuracy, action optimality, and planning optimality. The Fs-CoT variant, particularly with GPT-4o and DeepSeek-R1, achieved performance comparable to the human participants, which suggests it has potential for human-AI collaboration. The CP variant consistently underperformed humans, and Gemma-3-27B generally lagged behind the other LLMs, even with Fs-CoT.
Key takeaway
Research scientists developing collaborative AI agents should consider few-shot chain-of-thought (Fs-CoT) prompting, which in this study improved an LLM's ability to infer human intent from ambiguous instructions and to generate optimal action plans. With models such as GPT-4o or DeepSeek-R1, the approach reached performance comparable to the human participants in dynamic, goal-oriented tasks. Model capacity also matters when selecting an LLM for complex ToM reasoning: Gemma-3-27B lagged behind even with Fs-CoT.
Key insights
Few-shot chain-of-thought prompting significantly enhances LLM Theory of Mind for collaborative task interpretation.
Principles
- ToM is crucial for effective human-AI collaboration.
- Structured reasoning exemplars improve LLM intent inference.
- Model capacity impacts complex ToM reasoning performance.
Method
The Tomcat framework uses common ground, response generation, and demonstration exemplars (CP or Fs-CoT) to enable LLMs to interpret ambiguous instructions, infer human intent, and generate optimal actions and plans in a gridworld environment.
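The three prompt components named above can be pictured with a minimal sketch. The source does not publish Tomcat's actual prompts, so everything here is an assumption made for illustration: the `Exemplar` fields, the `build_prompt` function, the section labels, and the example rules. In particular, the sketch assumes that the CP variant shows exemplars without a reasoning trace while Fs-CoT includes one; the real difference between the variants may be larger.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    """One demonstration; every field name here is hypothetical."""
    instruction: str
    reasoning: str  # step-by-step rationale, shown only in the Fs-CoT variant
    answer: str


def build_prompt(common_ground, response_format, exemplars, instruction,
                 variant="fs-cot"):
    """Assemble a prompt from common ground, response rules, and exemplars."""
    if variant not in ("fs-cot", "cp"):
        raise ValueError(f"unknown variant: {variant!r}")

    # Common ground: explicit rules both collaborators are assumed to share.
    lines = ["Common ground:"]
    lines += [f"- {rule}" for rule in common_ground]
    # Response generation: what the model is asked to return.
    lines += ["", "Response format:", response_format, "", "Examples:"]

    # Demonstration exemplars: reasoning is included only for Fs-CoT (assumption).
    for ex in exemplars:
        lines.append(f"Instruction: {ex.instruction}")
        if variant == "fs-cot":
            lines.append(f"Reasoning: {ex.reasoning}")
        lines.append(f"Answer: {ex.answer}")
        lines.append("")

    # The new, possibly ambiguous instruction the agent must interpret.
    lines.append(f"Instruction: {instruction}")
    lines.append("Reasoning:" if variant == "fs-cot" else "Answer:")
    return "\n".join(lines)


# Made-up gridworld rules and exemplar, not taken from the paper.
rules = [
    "Both teammates see the same grid.",
    "Doors open only with a key of the same color.",
]
demo = Exemplar(
    instruction="Grab it for me.",
    reasoning="The teammate stands by the red door, so 'it' is likely the red key.",
    answer="intent: fetch red key",
)
prompt = build_prompt(rules, "Reply with intent, next action, and plan.",
                      [demo], "Can you get that one?")
print(prompt)
```

Passing `variant="cp"` produces the same prompt without the reasoning lines, which makes it easy to compare the two styles on the same instruction.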
In practice
- Use Fs-CoT prompting for LLMs in collaborative AI agents.
- Prioritize LLMs like GPT-4o or DeepSeek-R1 for ToM tasks.
- Ensure prompt engineering includes explicit rules for robust common ground.
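When trying these recommendations, it can help to check an agent's plans against a reference. The summary does not define how action or planning optimality was scored, so the following is only one plausible stand-in: a plan counts as optimal if it is valid, reaches the goal, and is as short as the shortest path found by breadth-first search. The grid encoding, move letters, and function names are all hypothetical.

```python
from collections import deque

# Hypothetical encoding: '.' is free, '#' is a wall; moves are U, D, L, R.
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}


def is_free(grid, r, c):
    """True if (r, c) lies inside the grid and is not a wall."""
    return 0 <= r < len(grid) and 0 <= c < len(grid[0]) and grid[r][c] != "#"


def shortest_path_length(grid, start, goal):
    """Minimum number of moves from start to goal via BFS, or None if unreachable."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in MOVES.values():
            nxt = (r + dr, c + dc)
            if is_free(grid, *nxt) and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def plan_is_optimal(grid, start, goal, plan):
    """True if the plan is valid, ends at the goal, and has minimal length."""
    r, c = start
    for move in plan:
        step = MOVES.get(move)
        if step is None:
            return False  # unknown move letter
        r, c = r + step[0], c + step[1]
        if not is_free(grid, r, c):
            return False  # walked into a wall or off the grid
    return (r, c) == goal and len(plan) == shortest_path_length(grid, start, goal)


GRID = ["..#",
        "...",
        "#.."]
print(plan_is_optimal(GRID, (0, 0), (2, 2), ["R", "D", "R", "D"]))
print(plan_is_optimal(GRID, (0, 0), (2, 2), ["R", "L", "R", "D", "R", "D"]))
```

The first call checks a four-move plan and the second a six-move detour to the same goal, so only the first passes under this length-based criterion.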
Topics
- Theory of Mind
- Human-AI Collaboration
- Large Language Models
- Instruction Inference Task
- Few-shot Chain-of-Thought (Fs-CoT)
Best for: Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.