Empowering GUI Agents via Autonomous Experience Exploration and Hindsight Experience Utilization for Task Planning

Source: Artificial Intelligence · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems) · Depth: Expert, quick

Summary

The planning experience exploration and utilization (PEEU) method targets two weaknesses of small open-source Multimodal Large Language Models (MLLMs) in GUI task automation: weak planning and limited cross-website generalization. PEEU explores environments autonomously to collect experiences, then uses hindsight over those experiences to synthesize strictly aligned, high-level training data. A companion framework, the task decomposition hierarchical analysis framework (TDHAF), studies compositional generalization at low, middle, and high task granularities. The analysis finds that mastering low-level atomic skills does not guarantee high-level planning competence, whereas training on high-level tasks yields stronger out-of-distribution (OOD) generalization. In experiments, PEEU's 7B model reaches 30.6% accuracy, outperforming the much larger Qwen2.5-VL-32B. The result underlines the value of constructing hindsight high-level tasks and reusing experience for OOD planning in small MLLMs.

Key takeaway

Machine learning engineers building multimodal web agents on small MLLMs should prioritize methods like PEEU, which use autonomous exploration and hindsight experience to synthesize high-level training data. In the reported experiments, this improves out-of-distribution planning and cross-website generalization enough for a 7B model to outperform the much larger Qwen2.5-VL-32B. Consider integrating such experience-driven learning to improve agent robustness and efficiency.

Key insights

Hindsight experience utilization and autonomous exploration significantly boost planning and generalization in small MLLMs.

Principles

Method

PEEU autonomously explores environments to discover experiences and utilizes hindsight experience to synthesize strictly aligned, high-level training data for MLLM task planning.
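This summary does not spell out the PEEU pipeline, so the following is only a minimal sketch of the general idea of hindsight relabeling, not the authors' implementation. It assumes a toy website modeled as a dictionary of pages and actions. All names (`SITE`, `explore`, `hindsight_relabel`, `summarize`) are hypothetical, and the string-joining `summarize` function stands in for whatever model would write the high-level task in a real system.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    observation: str  # page the agent was on (a screenshot in a real GUI agent)
    action: str       # action the agent took on that page


@dataclass
class TrainingExample:
    task: str          # high-level instruction written after the fact
    steps: List[Step]  # trajectory that actually accomplishes that task


# Hypothetical toy website: page -> {action: next page}.
SITE = {
    "home": {"click search": "results"},
    "results": {"click first item": "product"},
    "product": {"click add to cart": "cart"},
    "cart": {},
}


def explore(start: str, max_steps: int, rng: random.Random) -> List[Step]:
    """Wander through the site without being given any task up front."""
    steps: List[Step] = []
    page = start
    for _ in range(max_steps):
        actions = list(SITE[page])
        if not actions:  # dead end: nothing left to try on this page
            break
        action = rng.choice(actions)
        steps.append(Step(page, action))
        page = SITE[page][action]
    return steps


def hindsight_relabel(
    steps: List[Step], summarize: Callable[[List[Step]], str]
) -> TrainingExample:
    """Derive the task from what was done, so task and trajectory agree."""
    return TrainingExample(task=summarize(steps), steps=steps)


def summarize(steps: List[Step]) -> str:
    # Assumption: a stand-in for a model-generated high-level instruction.
    return "Task achieved by: " + ", then ".join(s.action for s in steps)


trajectory = explore("home", max_steps=5, rng=random.Random(0))
example = hindsight_relabel(trajectory, summarize)
print(example.task)
```

Because the task description is written from the trajectory rather than the other way around, each synthesized example pairs an instruction with steps that are known to satisfy it, which is one plausible reading of "strictly aligned" training data.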

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.