Zero-Shot Goal Recognition with Large Language Models

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

A systematic zero-shot evaluation of frontier Large Language Models (LLMs) as goal recognizers on classical PDDL benchmarks reveals uneven competence. The study, by Kin Max Piamolini Gusmão et al., compares GPT-4o, GPT-OSS, GPT-5.4, and Qwen 3.5 against a landmark-based baseline across four domains: Blocks World, Campus, DriverLog, and Dock Worker Robots. GPT-OSS and GPT-5.4 improve their Recall@1 and Recall@5 scores as the number of observations grows, whereas GPT-4o's performance plateaus, which suggests that it relies on initial world-knowledge priors rather than integrating the evidence it is given. Qwen 3.5 performs poorly throughout. Qualitative analysis identifies recurring failure modes: confabulation, overconfidence when observations are sparse, and position bias. Together, these findings position goal recognition as a critical benchmark for LLM planning knowledge.

Key takeaway

Research scientists developing or evaluating LLMs for planning and reasoning tasks should consider goal recognition as a robust benchmark, because it probes whether a model integrates sequential evidence rather than merely exploiting world knowledge. Prioritize models whose performance improves as observations accumulate, and investigate interventions that mitigate failure modes such as confabulation and position bias, which limit practical applicability.

Key insights

LLM goal recognition competence varies significantly, with some models integrating evidence effectively while others rely on world-knowledge priors.

Method

The study uses a structured prompt template for zero-shot evaluation of LLMs on PDDL goal recognition problems. It reports Recall@k, Spread, and Accuracy for each model and compares them against a landmark-based baseline.
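To make the Recall@k metric concrete, here is a minimal sketch. It assumes each recognizer returns a ranked list of candidate goals per problem and uses the common definition of Recall@k: the fraction of problems whose true goal appears among the top k candidates. The function name, the goal labels, and the toy data are hypothetical and are not taken from the study, whose exact metric definitions may differ.

```python
from typing import Sequence


def recall_at_k(
    rankings: Sequence[Sequence[str]],
    true_goals: Sequence[str],
    k: int,
) -> float:
    """Fraction of problems whose true goal is among the top-k ranked candidates.

    Assumed definition (not confirmed by the source): one true goal per problem,
    and `rankings[i]` lists candidate goals for problem i, best first.
    """
    if len(rankings) != len(true_goals):
        raise ValueError("rankings and true_goals must have the same length")
    if not rankings:
        raise ValueError("at least one problem is required")
    if k < 1:
        raise ValueError("k must be a positive integer")

    hits = sum(
        1 for ranked, goal in zip(rankings, true_goals) if goal in ranked[:k]
    )
    return hits / len(rankings)


# Hypothetical toy data: four problems, three candidate goals each.
rankings = [
    ["g2", "g1", "g3"],  # true goal g1 is ranked second
    ["g1", "g3", "g2"],  # true goal g1 is ranked first
    ["g3", "g2", "g1"],  # true goal g1 is ranked third
    ["g2", "g3", "g1"],  # true goal g2 is ranked first
]
true_goals = ["g1", "g1", "g1", "g2"]

print(recall_at_k(rankings, true_goals, 1))  # 0.5
print(recall_at_k(rankings, true_goals, 2))  # 0.75
```

Computing this score separately for each observation level (for example, for problems that reveal few versus many observed actions) is one way to see whether a model's ranking improves as evidence accumulates or stays flat.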

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.