Scaling Web Agent Training through Automatic Data Generation and Fine-grained Evaluation
Summary
LG AI Research has developed a scalable pipeline that automatically generates high-quality training data for web agents, addressing two bottlenecks: creating the data and evaluating it. The method introduces a constraint-based evaluation framework that assesses task progress at a fine-grained level, so partially successful trajectories can be used as training data rather than discarded, which significantly expands the training set. The pipeline uses few-shot prompted large language models (LLMs) such as Llama 3.1 405B for trajectory generation and smaller models such as Llama 3.3 70B and Gemma 3 27B for evaluation. The approach is validated on a new benchmark, BookingArena, which features complex booking tasks across 20 popular websites. A distilled 24B-parameter student model, Mistral Small 3, trained with LoRA on this data, outperforms open-source alternatives and matches or exceeds commercial systems, despite being considerably smaller than the teacher models.
Key takeaway
For AI Scientists and Research Scientists developing web agents, this work shows that fine-grained, constraint-based evaluation combined with partially successful trajectories can significantly improve model performance and data efficiency. Consider implementing a similar constraint-based evaluation framework to extract more value from generated data. Doing so can enable smaller, more efficient models to compete with or surpass larger commercial systems on complex web tasks such as those in BookingArena.
Key insights
A constraint-based evaluation framework enables scalable, high-quality web agent training by leveraging partially successful trajectories.
Principles
- Fine-grained evaluation improves data utility.
- Partial success trajectories expand training datasets.
- Smaller models can outperform larger ones via distillation.
Method
Tasks and trajectories are generated automatically with few-shot prompted LLMs. A constraint-based evaluator then scores each trajectory, and the high-quality trajectories, including partially successful ones, are used to distill a student model via LoRA fine-tuning.
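The curation step in the Method can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Trajectory` class, the `constraint_score` and `select_for_training` functions, the example constraints, and the 0.5 threshold are all hypothetical, and the per-constraint verdicts are hard-coded here where the pipeline would obtain them from an LLM evaluator.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """One agent attempt at a task (hypothetical structure for illustration)."""
    task: str
    constraints: list[str]   # requirements extracted from the task description
    satisfied: list[bool]    # one verdict per constraint (hard-coded here)


def constraint_score(traj: Trajectory) -> float:
    """Fraction of the task's constraints that the trajectory satisfied."""
    if not traj.constraints:
        return 0.0
    return sum(traj.satisfied) / len(traj.constraints)


def select_for_training(trajectories, min_score=0.5):
    """Keep fully and partially successful trajectories above a threshold."""
    return [t for t in trajectories if constraint_score(t) >= min_score]


# Illustrative data: three attempts at the same booking task.
constraints = ["city=Seoul", "check_in=2025-07-01", "guests=2"]
examples = [
    Trajectory("book hotel", constraints, [True, True, True]),     # full success
    Trajectory("book hotel", constraints, [True, True, False]),    # partial success
    Trajectory("book hotel", constraints, [False, False, False]),  # failure
]

kept = select_for_training(examples)
print([round(constraint_score(t), 2) for t in kept])  # prints [1.0, 0.67]
```

A binary success check would keep only the first trajectory. Scoring per constraint also retains the second, which is how fine-grained evaluation enlarges the usable training set.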
In practice
- Use constraint-based metrics for nuanced agent evaluation.
- Incorporate partial trajectories to scale training data.
- LoRA fine-tuning can be more effective than full fine-tuning.
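For context on the LoRA point, the arithmetic below shows why LoRA trains far fewer parameters than full fine-tuning. It speaks only to parameter count, not to the effectiveness claim above. LoRA replaces the update to a weight matrix of shape d_out x d_in with a product of two low-rank matrices, B (d_out x r) and A (r x d_in). The layer size and rank used here are illustrative assumptions, not values reported in the article.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for one LoRA-adapted weight matrix.

    A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * (d_in + d_out)


def full_finetune_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_in * d_out


# Hypothetical layer size and rank, chosen only for illustration.
d_in = d_out = 4096
rank = 16

lora = lora_trainable_params(d_in, d_out, rank)
full = full_finetune_params(d_in, d_out)
print(lora, full, lora / full)  # prints 131072 16777216 0.0078125
```

Under these assumed sizes, the adapter trains under 1% of the parameters of the matrix it modifies.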
Topics
- Web Agents
- Automatic Data Generation
- Constraint-based Evaluation
- Model Distillation
- BookingArena Benchmark
Best for: AI Scientist, Research Scientist, AI Researcher, AI Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.