OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents

2026-06-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Software Development & Engineering · Depth: Expert, long

Summary

OpenWebRL is an open framework for training visual web agents with online multi-turn reinforcement learning (RL) directly on live websites. It addresses the scalability bottleneck of relying on expensive, static supervised datasets by providing a full training pipeline: scalable live-browser infrastructure, supervised initialization, multimodal context management, trajectory-level success judging, and efficient multi-turn policy optimization. Using this framework, the authors trained OpenWebRL-4B, a 4B-parameter agent, with only 0.4K initialization trajectories and 2.2K open-ended RL training tasks. It achieved 67.0% success on Online-Mind2Web, 64.0% on DeepShop, and 74.1% on WebVoyager, leading open-source performance. The work also systematically studies key design choices for effective online RL in visual web agents and analyzes how RL improves agentic reasoning.

Key takeaway

AI scientists and ML engineers developing visual web agents should consider online multi-turn reinforcement learning as a way around data scalability limits. In practice this means a supervised warm start on a small number of trajectories, robust browser infrastructure, and multimodal context management for dynamic websites. The approach can reduce reliance on expensive, static datasets, making agent development more cost-efficient and reproducible.

Key insights

Online multi-turn reinforcement learning on live websites offers a scalable path for training capable visual web agents.

Principles

Supervised warm-start improves exploration.
Multimodal context management is crucial.
Trajectory-level judging guides policy optimization.

Method

OpenWebRL's method involves a supervised warm start, an agent harness with multi-tool action execution and textual feedback, and a multimodal multi-turn GRPO objective with trajectory-level judging.
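
To make the trajectory-level part of the objective concrete, the following is a minimal sketch of the group-relative advantage computation commonly used in GRPO. It is not taken from the paper: the function names, the binary judge rewards, the epsilon value, and the choice to share one advantage across every turn of a trajectory are all assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize trajectory-level rewards within one group of rollouts.

    All rollouts in `rewards` are assumed to come from the same task.
    Each advantage is (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_turns(advantages: List[float], turns_per_trajectory: List[int]) -> List[List[float]]:
    """Assumption: every turn in a trajectory shares that trajectory's advantage."""
    return [[a] * n for a, n in zip(advantages, turns_per_trajectory)]


# Hypothetical judge outputs for four rollouts of one task (1.0 = success).
judge_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(judge_rewards)
per_turn = broadcast_to_turns(advantages, turns_per_trajectory=[3, 5, 4, 2])
```

When every rollout in a group receives the same reward, all advantages are zero and the group contributes no gradient signal, which is one reason the accuracy of the trajectory-level judge and the difficulty mix of tasks matter for this kind of training.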

In practice

Use a fault-tolerant browser environment.
Implement multi-tool-call interfaces for efficiency.
Compress older visual history into text.
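
Two of the practices above can be sketched in a few lines. This is an illustrative sketch only, not OpenWebRL's implementation: the `Turn` record, `compress_history`, `with_retries`, and their parameters are hypothetical names, and "compressing into text" is approximated here by dropping older screenshots while keeping each turn's textual action and feedback.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


@dataclass
class Turn:
    action: str                  # textual description of the action taken
    feedback: str                # textual feedback returned by the environment
    screenshot: Optional[bytes]  # raw image bytes, or None once dropped


def compress_history(turns: List[Turn], keep_images: int = 2) -> List[Turn]:
    """Keep screenshots only for the most recent `keep_images` turns.

    Older turns keep their text fields, so the policy still sees what
    happened without paying the token cost of every past image.
    """
    cutoff = max(len(turns) - keep_images, 0)
    return [
        Turn(t.action, t.feedback, None if i < cutoff else t.screenshot)
        for i, t in enumerate(turns)
    ]


def with_retries(step: Callable[[], T], attempts: int = 3, delay_s: float = 0.0) -> T:
    """Retry a flaky browser step a fixed number of times before giving up."""
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        try:
            return step()
        except Exception as err:  # assumption: any exception is treated as retryable
            last_error = err
            time.sleep(delay_s)
    raise RuntimeError(f"step failed after {attempts} attempts") from last_error
```

A rollout loop would call `compress_history` before building each prompt and wrap each browser action in `with_retries`, so that a single page timeout does not discard an entire trajectory.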

Topics

Visual Web Agents
Online Reinforcement Learning
Multi-turn RL
Browser Automation
OpenWebRL Framework
GRPO

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.