Towards Scalable Lightweight GUI Agents via Multi-role Orchestration

2025-08-07 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

The LAMO framework introduces LAMO-3B, a lightweight 3-billion-parameter Multimodal Large Language Model (MLLM) for Graphical User Interface (GUI) automation. It addresses the high deployment costs and limited task scalability of larger models on resource-constrained devices. LAMO-3B is trained using role-oriented data synthesis and a two-stage process: supervised fine-tuning with Perplexity-Weighted Cross-Entropy optimization for knowledge distillation and visual perception, followed by reinforcement learning for cooperative exploration across roles. The resulting model can run either as a monolithic end-to-end agent or as a plug-and-play policy executor within a multi-agent system (MAS) that orchestrates roles such as Observer, Planner, Allocator, and Executor. Evaluations on static benchmarks (ScreenSpot-Pro, AndroidControl) and online environments (MiniWoB++, AndroidWorld, OSWorld) demonstrate its effectiveness in enhancing GUI-specific knowledge, visual perception, and task scalability, especially when paired with advanced planners like GPT-5 or Gemini-2.5-Pro.

Key takeaway

For research scientists developing GUI automation agents, LAMO-3B offers a compelling approach to overcome resource constraints while maintaining task scalability. You should consider adopting a hybrid planner-executor architecture, leveraging lightweight models like LAMO-3B for low-level execution and advanced MLLMs for high-level planning. This strategy allows your agents to benefit from continuous planner improvements, potentially raising the performance ceiling for complex, long-horizon GUI tasks.

Key insights

LAMO-3B enables scalable, lightweight GUI automation via multi-role orchestration and a two-stage training approach.

Principles

Lightweight MLLMs can achieve task scalability through role-oriented orchestration.
Hybrid planner-executor models enhance performance in complex GUI tasks.
Data synthesis and two-stage training improve GUI-specific knowledge.

Method

LAMO combines role-oriented data synthesis with a two-stage training recipe: (i) supervised fine-tuning with Perplexity-Weighted Cross-Entropy optimization, and (ii) reinforcement learning for role-oriented cooperative exploration.
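
This summary does not give the formula for Perplexity-Weighted Cross-Entropy, so the sketch below shows only one plausible reading: each training sample's cross-entropy is weighted by its share of the batch's total perplexity, so samples the model finds harder contribute more to the loss. The function names and the normalization scheme are assumptions for illustration, not the authors' definition.

```python
import math
from typing import List, Sequence


def token_nll(probs: Sequence[float]) -> List[float]:
    """Negative log-likelihood of each target token, given the
    probability the model assigned to that token."""
    return [-math.log(p) for p in probs]


def mean_ce(probs: Sequence[float]) -> float:
    """Ordinary per-sample cross-entropy: mean token NLL."""
    nll = token_nll(probs)
    return sum(nll) / len(nll)


def perplexity(probs: Sequence[float]) -> float:
    """Perplexity is the exponential of the mean token NLL."""
    return math.exp(mean_ce(probs))


def perplexity_weighted_ce(batch: Sequence[Sequence[float]]) -> float:
    """ASSUMED weighting: weight_i = ppl_i / sum_j ppl_j.

    batch[i] holds the probabilities the model assigned to the
    correct tokens of sample i.
    """
    ppls = [perplexity(sample) for sample in batch]
    total = sum(ppls)
    weights = [p / total for p in ppls]  # weights sum to 1
    return sum(w * mean_ce(sample) for w, sample in zip(weights, batch))


# Two toy samples: the second is harder (lower token probabilities).
toy_batch = [[0.5, 0.5], [0.25, 0.25]]
weighted = perplexity_weighted_ce(toy_batch)
unweighted = sum(mean_ce(s) for s in toy_batch) / len(toy_batch)
print(f"weighted={weighted:.4f} unweighted={unweighted:.4f}")
```

In a real training loop the weights would typically be computed from model logits and treated as constants (detached from the gradient); that detail is likewise an assumption here.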

In practice

Deploy LAMO-3B as a policy executor with advanced MLLM planners.
Utilize role-oriented data synthesis for GUI-specific knowledge transfer.
Train lightweight agents in two stages: SFT first, then RL.
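
The planner-executor split recommended above can be sketched as a small orchestration loop. This is a minimal illustration, not LAMO's actual interface: the role names come from the summary, but every function, type, and the action string format below are hypothetical, and the Allocator is assumed to simply hand subgoals to the Executor one at a time. The planner and executor are plain callables, so a stronger planner can be swapped in without touching the executor.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    role: str    # which role produced this output
    output: str  # text produced by that role


# Hypothetical role signatures (assumptions for illustration only).
Planner = Callable[[str, str], List[str]]  # (task, observation) -> subgoals
Executor = Callable[[str, str], str]       # (subgoal, observation) -> action


def observe(screen: str) -> str:
    """Observer: turn a raw screen description into an observation."""
    return f"screen shows: {screen}"


def run_episode(task: str, screen: str,
                planner: Planner, executor: Executor) -> List[Step]:
    """Run one Observer -> Planner -> Allocator -> Executor pass."""
    trace: List[Step] = []
    obs = observe(screen)
    trace.append(Step("Observer", obs))
    subgoals = planner(task, obs)
    trace.append(Step("Planner", "; ".join(subgoals)))
    for goal in subgoals:
        # Allocator (assumed behavior): dispatch one subgoal at a time.
        trace.append(Step("Allocator", goal))
        trace.append(Step("Executor", executor(goal, obs)))
    return trace


# Stubs standing in for an advanced MLLM planner and a lightweight executor.
def stub_planner(task: str, obs: str) -> List[str]:
    return ["open settings", "toggle wifi"]


def stub_executor(goal: str, obs: str) -> str:
    return f"CLICK[{goal}]"


trace = run_episode("turn on wifi", "home screen", stub_planner, stub_executor)
for step in trace:
    print(f"{step.role}: {step.output}")
```

Keeping the planner behind a plain callable is what would let an agent built this way benefit from planner upgrades without retraining the executor.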

Topics

GUI Automation
Lightweight MLLMs
Multi-role Orchestration
Multi-agent Systems
Supervised Fine-tuning

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.