Towards Scalable Lightweight GUI Agents via Multi-role Orchestration
Summary
The LAMO framework enables lightweight Multimodal Large Language Models (MLLMs) to perform complex Graphical User Interface (GUI) automation on resource-constrained devices. Traditional MLLM-powered GUI agents face high deployment costs and limited task scalability, especially in multi-agent systems (MAS). LAMO addresses both problems by combining role-oriented data synthesis with a two-stage training process. The first stage is supervised fine-tuning with Perplexity-Weighted Cross-Entropy optimization, which targets knowledge distillation and visual perception; the second is reinforcement learning for cooperative, role-oriented exploration. The resulting agent, LAMO-3B, is a 3-billion-parameter model designed for task-scalable native GUI automation, and it supports both monolithic execution and MAS-style orchestration. LAMO-3B can also be paired with advanced planners as a plug-and-play policy executor, which raises its performance ceiling. The model has been validated through extensive static and online evaluations.
Key takeaway
For research scientists developing GUI automation solutions, LAMO-3B offers a path to deploying capable MLLM agents on resource-constrained hardware without sacrificing task scalability. Consider integrating LAMO-3B as a plug-and-play policy executor under your existing advanced planner: the pairing raises the agent's performance ceiling, and the same model can also run in MAS-style orchestration.
Key insights
LAMO enables lightweight MLLMs to perform scalable GUI automation via multi-role orchestration and a two-stage training recipe.
Principles
- Role-oriented data synthesis improves task scalability.
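- Two-stage training first distills knowledge and visual perception (supervised fine-tuning), then builds cooperation between roles (reinforcement learning).
- Using LAMO-3B as a plug-and-play policy executor under an advanced planner raises its performance ceiling.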
Method
LAMO uses role-oriented data synthesis, then two-stage training: (i) supervised fine-tuning with Perplexity-Weighted Cross-Entropy, and (ii) reinforcement learning for cooperative exploration.
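The summary does not give the formula behind Perplexity-Weighted Cross-Entropy, so the following is only a minimal sketch of one plausible reading, not the authors' loss. It assumes that each training sequence's mean cross-entropy is weighted by that sequence's perplexity, normalised over the batch, so that harder sequences contribute more. The function names and the weighting scheme are hypothetical.

```python
import math
from typing import List


def sequence_nll(token_probs: List[List[float]], targets: List[int]) -> float:
    """Mean negative log-likelihood of the target tokens in one sequence."""
    nlls = [-math.log(probs[t]) for probs, t in zip(token_probs, targets)]
    return sum(nlls) / len(nlls)


def perplexity_weighted_ce(
    batch_probs: List[List[List[float]]],
    batch_targets: List[List[int]],
) -> float:
    """Hypothetical perplexity-weighted cross-entropy over a batch.

    ASSUMPTION: weight_i = ppl_i / sum_j ppl_j, where ppl_i = exp(mean NLL_i).
    The source does not specify the weighting; this is illustrative only.
    In a real training loop the weights would normally be treated as
    constants (no gradient flowing through them).
    """
    seq_losses = [
        sequence_nll(probs, targets)
        for probs, targets in zip(batch_probs, batch_targets)
    ]
    perplexities = [math.exp(loss) for loss in seq_losses]
    total = sum(perplexities)
    weights = [p / total for p in perplexities]
    return sum(w * loss for w, loss in zip(weights, seq_losses))


# Toy batch: one "easy" sequence (target prob 0.9) and one "hard" one (0.1).
easy = [[0.9, 0.1]]
hard = [[0.1, 0.9]]
weighted = perplexity_weighted_ce([easy, hard], [[0], [0]])
unweighted = (sequence_nll(easy, [0]) + sequence_nll(hard, [0])) / 2
```

Under this assumed scheme the hard sequence dominates the loss, so `weighted` is larger than the plain batch mean `unweighted`.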
In practice
- Deploy LAMO-3B for GUI automation on edge devices.
- Integrate LAMO-3B with existing advanced planners.
- Use role-oriented synthetic data when fine-tuning lightweight MLLMs.
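To make the plug-and-play idea concrete, here is a rough sketch of how a policy executor might sit behind a swappable planner. It does not reflect LAMO-3B's actual interface, which the summary does not describe: `Action`, `run_episode`, the planner and executor signatures, and the toy stand-ins are all hypothetical, and a real executor would call the model on a screenshot rather than echo the subgoal.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    kind: str    # hypothetical action type, e.g. "tap" or "type"
    target: str  # description of the UI element acted upon


# ASSUMED interfaces: a planner turns a task into ordered subgoals;
# an executor turns (subgoal, screen description) into one GUI action.
Planner = Callable[[str], List[str]]
Executor = Callable[[str, str], Action]


def monolithic_planner(task: str) -> List[str]:
    """Monolithic mode: no decomposition, the task is the only subgoal."""
    return [task]


def split_planner(task: str) -> List[str]:
    """Toy stand-in for an advanced planner: split the task on ' then '."""
    return [step.strip() for step in task.split(" then ")]


def toy_executor(subgoal: str, screen: str) -> Action:
    """Toy stand-in for the lightweight policy executor."""
    return Action(kind="tap", target=subgoal)


def run_episode(
    task: str,
    planner: Planner,
    executor: Executor,
    observe: Callable[[], str],
) -> List[Action]:
    """Run one episode: plan once, then execute each subgoal in order."""
    actions = []
    for subgoal in planner(task):
        screen = observe()  # fresh observation before every step
        actions.append(executor(subgoal, screen))
    return actions


def fake_observe() -> str:
    return "home screen"


task = "open settings then enable wifi"
planned = run_episode(task, split_planner, toy_executor, fake_observe)
single = run_episode(task, monolithic_planner, toy_executor, fake_observe)
```

Because the executor only depends on the `Planner` signature, swapping `monolithic_planner` for a stronger planner changes the decomposition without touching the executor, which is the property the plug-and-play framing relies on.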
Topics
- GUI Agents
- Multimodal Large Language Models
- LAMO Framework
- Multi-role Orchestration
- Knowledge Distillation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.