Towards Scalable Lightweight GUI Agents via Multi-role Orchestration

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

The LAMO framework enables lightweight Multimodal Large Language Models (MLLMs) to perform complex Graphical User Interface (GUI) automation on resource-constrained devices. MLLM-powered GUI agents typically face high deployment costs and limited task scalability, especially in multi-agent systems (MAS). LAMO addresses this by combining role-oriented data synthesis with a two-stage training recipe: supervised fine-tuning with Perplexity-Weighted Cross-Entropy optimization for knowledge distillation and visual perception, followed by reinforcement learning for cooperative, role-oriented exploration. The resulting agent, LAMO-3B, is a 3-billion-parameter model designed for task-scalable native GUI automation, supporting both monolithic execution and MAS-style orchestration. LAMO-3B can also serve as a plug-and-play policy executor for advanced planners, which raises its performance ceiling. It has been validated through extensive static and online evaluations.

Key takeaway

For research scientists developing GUI automation solutions, LAMO-3B offers a path to deploying capable MLLM agents on resource-constrained hardware without sacrificing task scalability. Consider pairing LAMO-3B, as a plug-and-play policy executor, with an advanced planner you already use: this raises its performance ceiling and lets it take part in multi-agent systems.
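
To make the planner-plus-executor pairing concrete, here is a minimal sketch of the control loop in Python. It is not LAMO's actual interface: `Action`, `run_episode`, `scripted_planner`, and `echo_executor` are hypothetical names introduced for illustration, and the planner and executor are stand-in functions rather than real models. The only idea taken from the summary is the split of roles, where a planner proposes the next subgoal and a lightweight executor turns it into a GUI action.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    """Hypothetical GUI action record (not LAMO's real action schema)."""
    kind: str       # e.g. "tap" or "type"
    argument: str   # e.g. a target element or text to enter


# Assumed role signatures, for illustration only.
Planner = Callable[[str, List[str]], str]   # (task, completed subgoals) -> next subgoal, "" when done
Executor = Callable[[str, str], Action]     # (subgoal, screen description) -> action


def run_episode(task: str, planner: Planner, executor: Executor,
                observe: Callable[[], str], max_steps: int = 10) -> List[Action]:
    """Alternate between planner and executor until the planner stops or the step budget runs out."""
    history: List[str] = []
    actions: List[Action] = []
    for _ in range(max_steps):
        subgoal = planner(task, history)
        if not subgoal:          # an empty subgoal signals that the task is finished
            break
        actions.append(executor(subgoal, observe()))
        history.append(subgoal)
    return actions


def scripted_planner(steps: List[str]) -> Planner:
    """Stand-in planner that replays a fixed list of subgoals."""
    def plan(task: str, history: List[str]) -> str:
        return steps[len(history)] if len(history) < len(steps) else ""
    return plan


def echo_executor(subgoal: str, screen: str) -> Action:
    """Stand-in executor; a real one would query the lightweight model with the screen."""
    return Action(kind="tap", argument=subgoal)
```

Under this sketch, swapping in a stronger planner only changes the `planner` argument, which is the sense in which the executor is plug-and-play. Monolithic execution would correspond to one model filling both roles.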

Key insights

LAMO enables lightweight MLLMs to perform scalable GUI automation via multi-role orchestration and a two-stage training recipe.

Principles

Method

LAMO uses role-oriented data synthesis, then two-stage training: (i) supervised fine-tuning with Perplexity-Weighted Cross-Entropy, and (ii) reinforcement learning for cooperative exploration.
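
The summary does not give the formula for Perplexity-Weighted Cross-Entropy, so the sketch below is only one plausible reading, not the authors' definition. It assumes each target token's loss is weighted by that token's perplexity (the exponential of its own cross-entropy), normalised over the sequence, so that tokens the model finds harder contribute more. The function names and the `eps` clamp are illustrative choices.

```python
import math
from typing import Sequence


def plain_ce(target_probs: Sequence[float], eps: float = 1e-12) -> float:
    """Standard mean cross-entropy over the probabilities assigned to the target tokens."""
    if not target_probs:
        raise ValueError("need at least one token")
    return sum(-math.log(max(p, eps)) for p in target_probs) / len(target_probs)


def perplexity_weighted_ce(target_probs: Sequence[float], eps: float = 1e-12) -> float:
    """Assumed variant: weight each token loss by its normalised per-token perplexity.

    This is an illustrative guess at the idea, not the formula from the paper.
    """
    if not target_probs:
        raise ValueError("need at least one token")
    losses = [-math.log(max(p, eps)) for p in target_probs]   # per-token cross-entropy
    perplexities = [math.exp(loss) for loss in losses]        # per-token perplexity = 1 / p
    total = sum(perplexities)
    weights = [ppl / total for ppl in perplexities]           # weights sum to 1
    return sum(w * loss for w, loss in zip(weights, losses))
```

When every token is equally likely the weights are uniform and the two losses coincide. When the probabilities differ, as in `[0.9, 0.1]`, the weighted loss is larger because the low-probability token dominates.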

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.