Weblica: Scalable and Reproducible Training Environments for Visual Web Agents
Summary
Weblica (Web Replica) is a framework for building reproducible, scalable training environments for visual web agents, addressing the complexity and constantly changing nature of the live web. It combines two mechanisms. The first is HTTP-level caching, which records and replays real website interactions, capturing stable visual states while preserving interactive behavior. The second is an LLM-based environment synthesis pipeline, which generates interactive web environments grounded in real websites and core web navigation skills. Together, these mechanisms allow Reinforcement Learning (RL) training to scale to thousands of diverse environments and tasks. The resulting model, Weblica-8B, is fine-tuned from the Qwen3-VL family and operates purely on screenshots. It achieves 39.2% pass@1 on Online-Mind2Web with 30 steps, outperforming open-weight baselines of similar size and remaining competitive with API models such as OpenAI computer-use-preview and Gemini computer-use-preview.
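To make the headline metric concrete, here is a minimal sketch of how a pass@1 score under a step limit could be computed. It assumes one attempt per task and treats any run that exceeds the step limit as a failure. The function name, the input format, and that failure rule are illustrative assumptions, not the benchmark's actual evaluation code.

```python
from typing import Iterable, Tuple


def pass_at_1(results: Iterable[Tuple[bool, int]], max_steps: int = 30) -> float:
    """Fraction of tasks solved on a single attempt within the step limit.

    `results` holds one (succeeded, steps_used) pair per task.
    Assumption: a run that uses more than `max_steps` counts as a failure.
    """
    results = list(results)
    if not results:
        raise ValueError("results must contain at least one task")
    passed = sum(1 for ok, steps in results if ok and steps <= max_steps)
    return passed / len(results)
```

With four tasks where two succeed within the limit, one fails, and one succeeds only after exceeding the limit, this returns 0.5.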
Key takeaway
For research scientists developing visual web agents, Weblica addresses two obstacles: scarce training data and unstable live-web environments. Consider integrating HTTP-level caching and LLM-based environment synthesis into your training pipeline to build diverse, reproducible environments and tasks. This strategy supports large-scale RL training, potentially leading to agents that outperform existing open-weight models and approach the capabilities of proprietary API models, even with fewer inference steps.
Key insights
Weblica enables scalable, reproducible web agent training via HTTP caching and LLM-driven synthetic environment generation.
Principles
- Combine real-world data with synthetic generation for diversity.
- HTTP-level caching ensures reproducible web interactions.
- LLM-based synthesis scales environment creation.
Method
Weblica uses HTTP-level caching with automated rule generation to replay web interactions, and an LLM-based pipeline (Claude Code) to synthesize diverse, interactive web environments with specific capabilities, domains, and visual styles.
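The record-and-replay idea behind HTTP-level caching can be sketched as follows. This is a minimal illustration, not Weblica's implementation: the `ReplayCache` class, the `cache_key` function, and the hand-written `VOLATILE_PARAMS` rule are all hypothetical, and the rule stands in for the automated rule generation mentioned above. The injected `fetch` callable is a placeholder for whatever performs the real network request.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical rule: query parameters that change on every visit and should
# not influence which recorded response is replayed.
VOLATILE_PARAMS = frozenset({"_", "ts", "timestamp", "session_id", "cb"})


def cache_key(method: str, url: str) -> Tuple[str, str]:
    """Normalize a request so that equivalent requests share one key."""
    parts = urlsplit(url)
    # Drop volatile parameters and sort the rest so ordering does not matter.
    query = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in VOLATILE_PARAMS
    )
    normalized = urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path or "/", urlencode(query), "")
    )
    return method.upper(), normalized


@dataclass
class ReplayCache:
    """Records responses on first sight and replays them afterwards."""

    fetch: Callable[[str, str], bytes]  # placeholder for the real network call
    record: bool = True                 # set to False for replay-only training runs
    store: Dict[Tuple[str, str], bytes] = field(default_factory=dict)

    def request(self, method: str, url: str) -> bytes:
        key = cache_key(method, url)
        if key in self.store:
            return self.store[key]      # replay: same bytes on every visit
        if not self.record:
            raise KeyError(f"no recorded response for {key}")
        body = self.fetch(method, url)  # record: hit the network once
        self.store[key] = body
        return body
```

In replay-only mode an unrecorded request raises an error instead of silently reaching the live site, which keeps training episodes reproducible.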
In practice
- Use HTTP caching to stabilize dynamic web content for training.
- Employ LLMs to generate diverse, interactive web environments.
- Train visual agents on raw screenshots for better generalization.
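One way to get diversity from LLM-generated environments is to sample combinations of capability, domain, and visual style, then turn each combination into a generation prompt. The sketch below assumes that approach. The `EnvSpec` class, the `sample_specs` function, the prompt wording, and every list value are illustrative placeholders rather than anything taken from Weblica, and the call to the LLM itself is left out.

```python
import itertools
import random
from dataclasses import dataclass
from typing import List

# Illustrative placeholder values, not the lists used by Weblica.
CAPABILITIES = ("form filling", "search and filter", "multi-step checkout")
DOMAINS = ("e-commerce", "travel booking", "job board")
VISUAL_STYLES = ("minimal", "dense dashboard", "dark theme")


@dataclass(frozen=True)
class EnvSpec:
    """One synthetic environment to request from the LLM."""

    capability: str
    domain: str
    visual_style: str

    def to_prompt(self) -> str:
        # Hypothetical prompt template for the synthesis step.
        return (
            f"Build an interactive {self.domain} website in a {self.visual_style} "
            f"visual style. It must exercise this navigation skill: {self.capability}."
        )


def sample_specs(n: int, seed: int = 0) -> List[EnvSpec]:
    """Sample up to n distinct specs; a fixed seed makes the sample reproducible."""
    combos = [
        EnvSpec(c, d, s)
        for c, d, s in itertools.product(CAPABILITIES, DOMAINS, VISUAL_STYLES)
    ]
    rng = random.Random(seed)
    return rng.sample(combos, k=min(n, len(combos)))
```

Seeding the sampler means the same set of environment specs can be regenerated later, which complements the reproducibility that caching gives to real websites.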
Topics
- Weblica Framework
- Visual Web Agents
- HTTP-level Caching
- LLM-based Environment Synthesis
- Reinforcement Learning Training
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.