Zero-Shot Goal Recognition with Large Language Models

2026-05-18 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

A systematic zero-shot evaluation of frontier Large Language Models (LLMs) as goal recognizers on classical PDDL benchmarks reveals uneven competence. The study, by Kin Max Piamolini Gusmão et al., compares LLMs including GPT-4o, GPT-OSS, GPT-5.4, and Qwen 3.5 against a landmark-based approach across four domains: Blocks World, Campus, DriverLog, and Dock Worker Robots. GPT-OSS and GPT-5.4 improve their Recall@1 and Recall@5 scores as the number of observations grows, whereas GPT-4o's performance plateaus, which suggests that it relies on initial world-knowledge priors rather than integrating the evidence it is given. Qwen 3.5 performs poorly throughout. Qualitative analysis identifies common failure modes: confabulation, overconfidence when observations are sparse, and position bias. On this basis, the authors position goal recognition as a critical benchmark for LLM planning knowledge.

Key takeaway

Research scientists developing or evaluating LLMs for planning and reasoning tasks should consider goal recognition as a robust benchmark. The task probes an LLM's ability to integrate sequential evidence rather than merely exploit world knowledge. Prioritize models whose performance improves as observations increase, and investigate interventions that mitigate common failure modes such as confabulation and position bias, since these limit practical applicability.

Key insights

LLM goal recognition competence varies significantly, with some models integrating evidence effectively while others rely on world-knowledge priors.

Principles

Goal recognition is abductive, aligning with LLM strengths.
Evidence integration is a key differentiator in LLM performance.
Zero-shot evaluation reveals inherent LLM capabilities.

Method

The study uses a structured prompt template for zero-shot evaluation of LLMs on PDDL goal recognition problems, comparing Recall@k, Spread, and Accuracy against a landmark-based baseline.
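
To make the metrics concrete, here is a minimal Python sketch of how Recall@k and Spread could be computed from a recognizer's outputs. It assumes common definitions from the goal recognition literature (Recall@k as the fraction of problems whose true goal appears among the top k ranked hypotheses, Spread as the mean number of goals returned per problem); the paper's exact definitions are not given in this summary. The function names and the toy data are hypothetical.

```python
from typing import Sequence


def recall_at_k(rankings: Sequence[Sequence[str]],
                true_goals: Sequence[str],
                k: int) -> float:
    """Fraction of problems whose true goal is among the top-k ranked hypotheses.

    Assumed definition; the paper may compute Recall@k differently.
    """
    if len(rankings) != len(true_goals):
        raise ValueError("rankings and true_goals must have the same length")
    if not rankings:
        return 0.0
    hits = sum(1 for ranked, goal in zip(rankings, true_goals)
               if goal in ranked[:k])
    return hits / len(rankings)


def spread(returned_goals: Sequence[Sequence[str]]) -> float:
    """Mean number of goals the recognizer returns per problem (assumed definition)."""
    if not returned_goals:
        return 0.0
    return sum(len(goals) for goals in returned_goals) / len(returned_goals)


# Hypothetical toy data: three problems, each with the true goal "g1".
rankings = [["g1", "g2", "g3"],
            ["g2", "g1", "g3"],
            ["g3", "g2", "g1"]]
true_goals = ["g1", "g1", "g1"]

r1 = recall_at_k(rankings, true_goals, k=1)
r2 = recall_at_k(rankings, true_goals, k=2)
avg_returned = spread([["g1"], ["g1", "g2"], ["g1", "g2", "g3"]])
```

Tracking these values separately at each observation level is one way to see whether a model's scores rise with more evidence or plateau.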

In practice

Use goal recognition as a benchmark for LLM planning.
Focus on evidence integration for LLM-based recognizers.
Address confabulation and position bias in LLM outputs.
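
One way to probe the last point is to re-ask the same problem with the candidate goals shuffled and count where the chosen answer sat in the list, and how often the answer is not a candidate at all. The sketch below is an illustration under stated assumptions, not the study's procedure: `recognize` is a hypothetical callable wrapping whatever LLM call is in use, and the audit function and its return keys are invented for this example.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Sequence


def audit_recognizer(recognize: Callable[[List[str], Sequence[str]], str],
                     candidates: Sequence[str],
                     observations: Sequence[str],
                     trials: int = 20,
                     seed: int = 0) -> Dict[str, object]:
    """Re-run one problem under shuffled candidate orders.

    `recognize` is a hypothetical wrapper: it takes the ordered candidate
    goals plus the observations and returns a single goal string.
    """
    rng = random.Random(seed)
    goal_counts: Counter = Counter()      # which goal was chosen
    position_counts: Counter = Counter()  # list position of the chosen goal
    confabulated = 0                      # answers outside the candidate set

    for _ in range(trials):
        order = list(candidates)
        rng.shuffle(order)
        answer = recognize(order, observations)
        if answer not in candidates:
            confabulated += 1
            continue
        goal_counts[answer] += 1
        position_counts[order.index(answer)] += 1

    return {
        "goal_counts": dict(goal_counts),
        "position_counts": dict(position_counts),
        "confabulated": confabulated,
    }


# Stub recognizers standing in for an LLM (hypothetical).
always_first = lambda order, obs: order[0]          # maximal position bias
invents_goal = lambda order, obs: "made-up-goal"    # always confabulates

biased = audit_recognizer(always_first, ["a", "b", "c"], [], trials=10)
invented = audit_recognizer(invents_goal, ["a", "b", "c"], [], trials=10)
```

A recognizer that tracks the evidence should concentrate its picks on one goal regardless of order, whereas a position-biased one concentrates on one list position, as the `always_first` stub does.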

Topics

Large Language Models
Goal Recognition
PDDL Benchmarks
Zero-Shot Evaluation
Evidence Integration

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.