MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Computer Vision & Pattern Recognition · Depth: Expert, quick

Summary

MolmoWeb introduces a family of fully open multimodal web agents together with MolmoWebMix, a large open dataset intended to make web-agent research reproducible. MolmoWebMix combines over 100K synthetic task trajectories, 30K+ human demonstrations, atomic web-skill trajectories, and GUI perception data such as referring expression grounding and screenshot question answering. The agents, released in 4B and 8B parameter sizes, are instruction-conditioned visual-language action policies: they predict browser actions from the task instruction and webpage screenshots alone, with no access to HTML or accessibility trees. They achieve state-of-the-art results on benchmarks such as WebVoyager, Online-Mind2Web, and DeepShop, outperforming other open-weight models such as Fara-7B and UI-TARS-1.5-7B. The 8B model also surpasses set-of-marks agents built on larger closed models such as GPT-4o, and test-time scaling via parallel rollouts raises pass@4 on WebVoyager to 94.7%.
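To make the pass@4 figure concrete, here is a minimal sketch of how a pass@k score can be computed from parallel rollouts. It assumes pass@k means "at least one of k rollouts solves the task" and uses the standard unbiased combinatorial estimator. The function names and the example counts are hypothetical, and the paper's exact evaluation protocol may differ.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k rollouts, sampled without replacement
    from n recorded rollouts of which c succeeded, is a success."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(success_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over tasks; success_counts[i] is the number of
    successful rollouts (out of n) for task i."""
    if not success_counts:
        raise ValueError("need at least one task")
    return sum(pass_at_k(n, c, k) for c in success_counts) / len(success_counts)


# Hypothetical data: four tasks, four parallel rollouts each.
example_counts = [0, 1, 4, 2]
example_score = benchmark_pass_at_k(example_counts, n=4, k=4)
```

With n equal to k, as in this example, the estimator reduces to checking whether any rollout for a task succeeded. Only the first task has no successful rollout, so three of the four tasks count as solved.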

Key takeaway

For research scientists developing web agents, MolmoWeb offers a fully open foundation whose benchmark results rival those of agents built on proprietary models. Consider using the MolmoWeb agents and the MolmoWebMix dataset as baselines and training data in your own projects to make results reproducible and to speed up work on visual web navigation. The provided model checkpoints and evaluation harness can shorten both development and benchmarking.

Key insights

Releasing web agents and their training data openly makes results reproducible and lets the community build on them for autonomous web interaction.

Method

MolmoWeb agents predict browser actions from the task instruction and webpage screenshots. They are trained on MolmoWebMix, a diverse dataset that combines synthetic trajectories and human demonstrations with GUI perception data.
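The method can be pictured as a simple control loop: take a screenshot, ask the policy for the next action, execute it, and repeat until the policy stops. The sketch below illustrates that loop only. `Action`, `Policy`, `Browser`, `run_episode`, the action names, and the stub classes are all hypothetical and are not MolmoWeb's actual API or action space.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Action:
    kind: str       # illustrative action names: "click", "type", "stop"
    x: float = 0.0  # screen coordinates for pointer actions
    y: float = 0.0
    text: str = ""  # text payload for "type" actions


class Browser(Protocol):
    def screenshot(self) -> bytes: ...
    def execute(self, action: Action) -> None: ...


class Policy(Protocol):
    # The policy sees only the instruction and pixels: no HTML, no accessibility tree.
    def predict(self, instruction: str, screenshot: bytes) -> Action: ...


def run_episode(policy: Policy, browser: Browser, instruction: str,
                max_steps: int = 30) -> list[Action]:
    """Roll out one episode and return the actions the policy emitted."""
    history: list[Action] = []
    for _ in range(max_steps):
        action = policy.predict(instruction, browser.screenshot())
        history.append(action)
        if action.kind == "stop":
            break
        browser.execute(action)
    return history


class FakeBrowser:
    """Stand-in browser that records executed actions instead of rendering pages."""
    def __init__(self) -> None:
        self.executed: list[Action] = []

    def screenshot(self) -> bytes:
        return b"fake-png-bytes"

    def execute(self, action: Action) -> None:
        self.executed.append(action)


class ScriptedPolicy:
    """Stand-in policy that replays a fixed action list; a real policy would run a model."""
    def __init__(self, script: list[Action]) -> None:
        self._script = list(script)
        self._step = 0

    def predict(self, instruction: str, screenshot: bytes) -> Action:
        action = self._script[min(self._step, len(self._script) - 1)]
        self._step += 1
        return action


browser = FakeBrowser()
policy = ScriptedPolicy([
    Action("click", x=120.0, y=48.0),
    Action("type", text="wireless headphones"),
    Action("stop"),
])
history = run_episode(policy, browser, "Find wireless headphones")
```

Keeping the policy interface restricted to an instruction and a screenshot mirrors the screenshot-only design described above. Replacing `ScriptedPolicy` with a model-backed class and `FakeBrowser` with a real browser driver would leave `run_episode` unchanged.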

Best for: Research Scientist, AI Scientist, AI Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.