Zero-Shot Goal Recognition with Large Language Models
Summary
A systematic zero-shot evaluation of frontier Large Language Models (LLMs) as goal recognizers on classical PDDL benchmarks reveals uneven competence. The study, conducted by Kin Max Piamolini Gusmão et al., compares LLMs including GPT-4o, GPT-OSS, GPT-5.4, and Qwen 3.5 against a landmark-based approach across four domains: Blocks World, Campus, DriverLog, and Dock Worker Robots. GPT-OSS and GPT-5.4 improve their Recall@1 and Recall@5 scores as the number of observations increases, whereas GPT-4o's performance plateaus, suggesting that it relies on initial world-knowledge priors rather than integrating the evidence it is given. Qwen 3.5 consistently performs poorly. Qualitative analysis highlights common failure modes such as confabulation, overconfidence when observations are sparse, and position bias. Together, these findings position goal recognition as a critical benchmark for LLM planning knowledge.
Key takeaway
Research scientists developing or evaluating LLMs for planning and reasoning tasks should consider goal recognition as a benchmark. The task probes an LLM's ability to integrate sequential evidence rather than merely exploit world knowledge. Prioritize models whose performance improves as observations accumulate, and investigate interventions that mitigate common failure modes such as confabulation and position bias to improve practical applicability.
Key insights
LLM goal recognition competence varies significantly, with some models integrating evidence effectively while others rely on world-knowledge priors.
Principles
- Goal recognition is abductive, aligning with LLM strengths.
- Evidence integration is a key differentiator in LLM performance.
- Zero-shot evaluation reveals inherent LLM capabilities.
Method
The study uses a structured prompt template for zero-shot evaluation of LLMs on PDDL goal recognition problems, comparing Recall@k, Spread, and Accuracy against a landmark-based baseline.
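As a rough illustration of the Recall@k metric named above, the sketch below assumes the common definition: a problem counts as a hit if the true goal appears among the recognizer's top-k ranked candidates, and the score is averaged over problems. The function names, the goal labels, and the toy data are hypothetical, and the study's exact definitions of Recall@k, Spread, and Accuracy may differ.

```python
from typing import List, Sequence, Tuple


def recall_at_k(ranked_goals: Sequence[str], true_goal: str, k: int) -> float:
    """Return 1.0 if the true goal is among the top-k ranked candidates, else 0.0."""
    return 1.0 if true_goal in ranked_goals[:k] else 0.0


def mean_recall_at_k(problems: List[Tuple[Sequence[str], str]], k: int) -> float:
    """Average Recall@k over (ranked_goals, true_goal) pairs."""
    scores = [recall_at_k(ranked, truth, k) for ranked, truth in problems]
    return sum(scores) / len(scores)


# Hypothetical rankings for three recognition problems (illustrative data only).
problems = [
    (["g2", "g1", "g3"], "g2"),  # true goal ranked first
    (["g1", "g3", "g2"], "g3"),  # true goal ranked second
    (["g3", "g2", "g1"], "g1"),  # true goal ranked third
]

for k in (1, 2, 3):
    print(f"Recall@{k} = {mean_recall_at_k(problems, k):.3f}")
```

Computing the same score at several observation counts is one way to see whether a recognizer's ranking improves as evidence accumulates or stays flat.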
In practice
- Use goal recognition as a benchmark for LLM planning.
- Focus on evidence integration for LLM-based recognizers.
- Address confabulation and position bias in LLM outputs.
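One simple way to probe position bias, offered here as an assumption rather than as the study's procedure, is to present the same candidate goals in rotated orders and check how often the recognizer's pick stays the same. In the sketch below, `order_consistency` and both stub recognizers are hypothetical; a real experiment would replace the stubs with a call to an LLM.

```python
from collections import Counter
from typing import Callable, List, Sequence


def order_consistency(
    recognizer: Callable[[List[str]], str], candidates: Sequence[str]
) -> float:
    """Fraction of rotated orderings on which the recognizer returns its most common pick.

    1.0 means the pick never changes with candidate order; 1/len(candidates)
    means it changes on every rotation.
    """
    n = len(candidates)
    picks = []
    for shift in range(n):
        # Rotate so that every candidate occupies every position exactly once.
        rotated = list(candidates[shift:]) + list(candidates[:shift])
        picks.append(recognizer(rotated))
    most_common_count = Counter(picks).most_common(1)[0][1]
    return most_common_count / n


def always_first(options: List[str]) -> str:
    """Stub recognizer with maximal position bias: picks whatever is listed first."""
    return options[0]


def content_based(options: List[str]) -> str:
    """Stub recognizer that ignores order and always picks the same goal."""
    return "g2"


candidates = ["g1", "g2", "g3", "g4"]
print(order_consistency(always_first, candidates))
print(order_consistency(content_based, candidates))
```

A low consistency score suggests that the answer depends on where a goal appears in the prompt rather than on the observations.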
Topics
- Large Language Models
- Goal Recognition
- PDDL Benchmarks
- Zero-Shot Evaluation
- Evidence Integration
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.