MolmoWeb: Generating Synthetic Data

· Source: Ai2 · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Data Science & Analytics · Depth: Intermediate, short

Summary

This walkthrough shows how to generate web browsing trajectories with the MolmoWeb inference code repository, using the MolmoWeb model and a headless Chromium browser driven by Playwright. Users define tasks in JSON, giving each an ID, a prompt, and a task type. Running the command then simulates browsing, capturing screenshots, the model's thoughts, and actions such as "go-to" or "mouse click." The output includes a `trajectory.html` visualization, a `trajectory.json` for training, and a `metadata.json`. The demonstration also covers alternative agents, such as the Gemini AXTree agent, which reads a textual accessibility tree (AXTree) representation of each page instead of screenshots and still produces valid trajectories for training. This approach enables synthetic data generation at scale for pre-training or fine-tuning web agents.
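As a rough illustration of how a generated `trajectory.json` might be inspected before training, the sketch below counts actions by type. The schema is an assumption: the source only says a trajectory records thoughts and actions, so the `steps`, `thought`, `action`, and `name` keys and the action names `goto` and `mouse_click` are hypothetical stand-ins, not the repository's actual format.

```python
import json
from collections import Counter


def summarize_trajectory(raw: str) -> dict:
    """Count steps and actions in a trajectory (hypothetical schema)."""
    steps = json.loads(raw)["steps"]  # assumed top-level key
    actions = Counter(step["action"]["name"] for step in steps)
    return {"num_steps": len(steps), "actions": dict(actions)}


# Hand-written sample standing in for a real trajectory.json file.
sample = json.dumps({
    "steps": [
        {"thought": "Open the site.", "action": {"name": "goto"}},
        {"thought": "Open the menu.", "action": {"name": "mouse_click"}},
        {"thought": "Pick the first result.", "action": {"name": "mouse_click"}},
    ]
})

summary = summarize_trajectory(sample)
print(summary)
```

A summary like this can help spot degenerate runs, such as trajectories that consist of a single navigation step, before they reach a training set.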

Key takeaway

For AI engineers building or fine-tuning web agents, this codebase offers a direct way to generate synthetic training data. You can define specific browsing tasks and run them with either a vision-based agent (MolmoWeb) or an accessibility-tree-based agent (Gemini AXTree) to produce realistic interaction trajectories. This reduces the need for manual data collection and shortens development and iteration cycles for web automation models.

Key insights

Generate synthetic web browsing data at scale using LLM or vision-based agents for training.

Method

Define tasks in JSON (ID, prompt, task type), run a command that specifies the environment and the agent (e.g., MolmoWeb or Gemini AXTree), then inspect the generated `trajectory.html` and `trajectory.json`.
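The task-definition step can be sketched as follows. This is a minimal example, not the repository's actual schema: the source only states that each task has an ID, a prompt, and a task type, so the key names `id`, `prompt`, and `task_type`, the helper names, and the sample task values are all hypothetical.

```python
import json


def make_task(task_id: str, prompt: str, task_type: str) -> dict:
    """Build one task entry; the key names are assumptions, not the repo's schema."""
    if not task_id or not prompt.strip():
        raise ValueError("a task needs a non-empty id and prompt")
    return {"id": task_id, "prompt": prompt.strip(), "task_type": task_type}


def tasks_to_json(tasks: list) -> str:
    """Serialize tasks, rejecting duplicate ids so each output stays traceable."""
    ids = [task["id"] for task in tasks]
    if len(ids) != len(set(ids)):
        raise ValueError("task ids must be unique")
    return json.dumps(tasks, indent=2)


# Illustrative tasks; prompts and task types are made up for the example.
tasks = [
    make_task("task-001", "Find the latest post on the Ai2 blog.", "navigation"),
    make_task("task-002", "Search for 'web agents' and open the first result.", "search"),
]

tasks_json = tasks_to_json(tasks)
print(tasks_json)
```

The resulting string can be saved to a file and passed to whatever command the repository documents for launching an agent; that command and its flags are not shown here because the source does not spell them out.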

Best for: AI Engineer, Machine Learning Engineer, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Ai2.