Towards Scalable Lightweight GUI Agents via Multi-role Orchestration
Summary
The LAMO framework introduces LAMO-3B, a scalable, lightweight Multimodal Large Language Model (MLLM) for Graphical User Interface (GUI) automation. It targets two problems that larger models face on resource-constrained devices: high deployment cost and limited task scalability. LAMO-3B, a 3-billion-parameter model, is trained using role-oriented data synthesis and a two-stage process: supervised fine-tuning with Perplexity-Weighted Cross-Entropy optimization for knowledge distillation and visual perception, followed by reinforcement learning for cooperative exploration across roles. As a result, LAMO-3B can function either as a monolithic end-to-end agent or as a plug-and-play policy executor within a multi-agent system (MAS) that orchestrates roles such as Observer, Planner, Allocator, and Executor. Evaluations on static benchmarks (ScreenSpot-Pro, AndroidControl) and online environments (MiniWoB++, AndroidWorld, OSWorld) demonstrate its effectiveness in enhancing GUI-specific knowledge, visual perception, and task scalability, especially when paired with advanced planners like GPT-5 or Gemini-2.5-Pro.
Key takeaway
For research scientists developing GUI automation agents, LAMO-3B offers a way to work within resource constraints while maintaining task scalability. Consider adopting a hybrid planner-executor architecture that uses a lightweight model like LAMO-3B for low-level execution and an advanced MLLM for high-level planning. With this split, agents benefit from continuous planner improvements, which may raise the performance ceiling for complex, long-horizon GUI tasks.
Key insights
LAMO-3B enables scalable, lightweight GUI automation via multi-role orchestration and a two-stage training approach.
Principles
- Lightweight MLLMs can achieve task scalability through role-oriented orchestration.
- Hybrid planner-executor models enhance performance in complex GUI tasks.
- Data synthesis and two-stage training improve GUI-specific knowledge.
Method
LAMO combines role-oriented data synthesis with a two-stage training recipe: (i) supervised fine-tuning with Perplexity-Weighted Cross-Entropy optimization, and (ii) reinforcement learning for role-oriented cooperative exploration.
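This summary does not give the exact form of the Perplexity-Weighted Cross-Entropy objective. The sketch below shows one plausible reading, offered as an assumption rather than the authors' formulation: each sequence's cross-entropy is weighted by its perplexity, normalized over the batch, so that sequences the model finds harder contribute more to the loss. The function names and the batch-level normalization are hypothetical.

```python
import math
from typing import List


def sequence_nll(token_probs: List[float]) -> float:
    """Mean negative log-likelihood of the target tokens in one sequence."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def perplexity_weighted_ce(batch_token_probs: List[List[float]]) -> float:
    """Cross-entropy averaged with perplexity-proportional weights.

    ASSUMPTION: weights are each sequence's perplexity divided by the batch
    total. In a real training loop the weights would typically be treated
    as constants (no gradient flows through them).
    """
    nlls = [sequence_nll(probs) for probs in batch_token_probs]
    perplexities = [math.exp(nll) for nll in nlls]
    total = sum(perplexities)
    weights = [ppl / total for ppl in perplexities]
    return sum(w * nll for w, nll in zip(weights, nlls))


def plain_ce(batch_token_probs: List[List[float]]) -> float:
    """Unweighted mean cross-entropy, for comparison."""
    nlls = [sequence_nll(probs) for probs in batch_token_probs]
    return sum(nlls) / len(nlls)
```

Under this reading, a batch that mixes an easy and a hard sequence yields a weighted loss above the plain mean, because the hard sequence dominates the weights. When all sequences are equally hard, the two losses coincide.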
In practice
- Deploy LAMO-3B as a policy executor with advanced MLLM planners.
- Utilize role-oriented data synthesis for GUI-specific knowledge transfer.
- Train lightweight agents in two stages: SFT first, then RL.
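The planner-executor deployment described above can be sketched as a pipeline of swappable roles. This is a minimal illustration, not LAMO's actual interface: the `Action` type, the `run_step` function, the role signatures, and the stub roles are all hypothetical, and the Allocator's job (choosing an action type per subgoal) is an assumption, since this summary does not define it.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    kind: str    # hypothetical action type, e.g. "click" or "type"
    target: str  # description of what the action operates on


# Each role is a plain callable, so any model can be plugged into any slot.
Observer = Callable[[str], str]               # screenshot -> observation
Planner = Callable[[str, str], List[str]]     # (goal, observation) -> subgoals
Allocator = Callable[[str], str]              # subgoal -> action kind (assumed)
Executor = Callable[[str, str, str], Action]  # (subgoal, kind, observation) -> Action


def run_step(goal: str, screenshot: str, observe: Observer, plan: Planner,
             allocate: Allocator, execute: Executor) -> List[Action]:
    """Run one observe -> plan -> allocate -> execute pass."""
    observation = observe(screenshot)
    actions = []
    for subgoal in plan(goal, observation):
        kind = allocate(subgoal)
        actions.append(execute(subgoal, kind, observation))
    return actions


# Stub roles standing in for real models.
def stub_observe(screenshot: str) -> str:
    return f"screen showing {screenshot}"


def stub_plan(goal: str, observation: str) -> List[str]:
    return ["open search box", f"enter '{goal}'"]


def stub_allocate(subgoal: str) -> str:
    return "type" if subgoal.startswith("enter") else "click"


def stub_execute(subgoal: str, kind: str, observation: str) -> Action:
    return Action(kind=kind, target=subgoal)


actions = run_step("weather", "home.png",
                   stub_observe, stub_plan, stub_allocate, stub_execute)
```

Because the planner is just one argument, a stronger planning model could replace `stub_plan` while the lightweight executor stays unchanged, which is the property the hybrid architecture relies on.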
Topics
- GUI Automation
- Lightweight MLLMs
- Multi-role Orchestration
- Multi-agent Systems
- Supervised Fine-tuning
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.