Can LLM Agents Infer World Models? Evidence from Agentic Automata Learning

Source: Computation and Language · Field: Technology & Digital, Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A study introduces agentic automata learning to assess how well tool-calling LLM agents can uncover hidden environments. In this setup, an agent interacts with an oracle to infer a hidden deterministic finite automaton (DFA) using membership queries (checking whether a string belongs to the target language) and equivalence queries (checking whether a proposed DFA is correct). The research establishes a scalable testbed with controlled complexity and measurable efficiency, allowing comparison against classic automata-learning algorithms. Evaluations of state-of-the-art LLMs show a sharp decline in performance as DFA size increases. Reasoning models significantly outperform non-reasoning models, though trajectory analyses reveal consistent failures in query planning, evidence integration, and hypothesis construction. While current LLM agents demonstrate some capacity for non-trivial interactive discovery, they remain considerably less robust and efficient than established algorithms for this task.

Key takeaway

If you are an AI scientist developing LLM agents for interactive discovery, note that current models scale poorly and use queries inefficiently when inferring complex "world models." Prioritize robust query planning and evidence integration in your agent architectures, since these are the steps where the study observes recurring failures. Classic automata-learning algorithms remain more robust and efficient on this task, so they serve as the baseline to beat once environments grow beyond simple cases.

Key insights

LLM agents can perform interactive discovery but struggle with scalability and efficiency compared to classic automata learning.

Principles

Method

Agentic automata learning involves LLM agents inferring a hidden DFA via oracle interactions using membership and equivalence queries, providing a scalable testbed for evaluating discovery capabilities.
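
The oracle interaction described above can be illustrated with a minimal sketch. This is not the study's implementation: the class names `DFA` and `Oracle`, the query counters, and the example language (strings with an even number of `a`) are assumptions made for illustration. The equivalence query is answered here by a breadth-first search over the product of the two automata, which returns a shortest distinguishing string when the hypothesis is wrong.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class DFA:
    """Deterministic finite automaton with integer states (illustrative)."""
    alphabet: tuple          # e.g. ("a", "b")
    transitions: dict        # (state, symbol) -> next state; must be total
    accepting: frozenset     # set of accepting states
    start: int = 0

    def accepts(self, word: str) -> bool:
        state = self.start
        for symbol in word:
            state = self.transitions[(state, symbol)]
        return state in self.accepting


@dataclass
class Oracle:
    """Hypothetical oracle that hides a target DFA and counts queries."""
    target: DFA
    membership_count: int = 0
    equivalence_count: int = 0

    def membership(self, word: str) -> bool:
        """Membership query: is `word` in the target language?"""
        self.membership_count += 1
        return self.target.accepts(word)

    def equivalence(self, hypothesis: DFA):
        """Equivalence query: None if correct, else a shortest counterexample."""
        self.equivalence_count += 1
        start = (self.target.start, hypothesis.start)
        seen = {start}
        queue = deque([(start, "")])
        while queue:
            (t, h), word = queue.popleft()
            # The two automata disagree on `word` if exactly one accepts it.
            if (t in self.target.accepting) != (h in hypothesis.accepting):
                return word
            for symbol in self.target.alphabet:
                nxt = (self.target.transitions[(t, symbol)],
                       hypothesis.transitions[(h, symbol)])
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, word + symbol))
        return None


# Assumed example target: strings over {a, b} with an even number of 'a'.
hidden = DFA(
    alphabet=("a", "b"),
    transitions={(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1},
    accepting=frozenset({0}),
)
oracle = Oracle(hidden)

# A deliberately wrong one-state hypothesis that accepts every string.
accept_all = DFA(
    alphabet=("a", "b"),
    transitions={(0, "a"): 0, (0, "b"): 0},
    accepting=frozenset({0}),
)
counterexample = oracle.equivalence(accept_all)  # "a"
```

Counting queries on the oracle side, as the two counters do here, is one simple way to make efficiency measurable for any learner, whether an LLM agent or a classic algorithm.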

Best for: AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.