Scaling Web Agent Training through Automatic Data Generation and Fine-grained Evaluation

2024-10-22 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

LG AI Research has developed a scalable pipeline that automatically generates high-quality training data for web agents, addressing the cost of creating and evaluating such data. The method introduces a constraint-based evaluation framework that assesses task progress at a fine-grained level, so that partially successful trajectories can be used to expand the training data substantially. The pipeline uses few-shot prompted large language models (LLMs) such as Llama 3.1 405B for trajectory generation and smaller models such as Llama 3.3 70B and Gemma 3 27B for evaluation. The approach is validated on a new benchmark, BookingArena, which features complex booking tasks across 20 popular websites. A distilled 24B-parameter student model, Mistral 3 Small, trained with LoRA on this data, outperforms open-source alternatives and matches or exceeds commercial systems, despite being considerably smaller than the teacher models.

Key takeaway

For AI scientists and research scientists developing web agents, this work shows that fine-grained, constraint-based evaluation, combined with partially successful trajectories, can improve both model performance and data efficiency. Consider adopting a similar constraint-based evaluation framework to extract more value from generated data. Doing so can make it feasible to train smaller, more efficient models that compete with or surpass larger commercial systems on complex web tasks like those in BookingArena.

Key insights

A constraint-based evaluation framework enables scalable, high-quality web agent training by leveraging partially successful trajectories.

Principles

Fine-grained evaluation improves data utility.
Partially successful trajectories expand training datasets.
Smaller models can outperform larger ones via distillation.

Method

The method generates tasks and trajectories automatically with few-shot prompted LLMs, then applies constraint-based evaluation to curate high-quality trajectories, including partially successful ones. The curated data is used to distill a student model via LoRA fine-tuning.
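
The curation step can be illustrated with a minimal sketch. It assumes each task is expressed as a list of independently checkable constraints and that a trajectory is scored by the fraction of constraints its final state satisfies. The names `Constraint`, `Trajectory`, `constraint_score`, `curate`, the example booking fields, and the `min_score` threshold are all hypothetical. The summary does not describe the paper's actual scoring rule, and the paper uses LLM evaluators rather than hand-written checks.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical representation: the form fields an agent ended up with.
State = Dict[str, str]


@dataclass
class Constraint:
    name: str
    check: Callable[[State], bool]  # True if the final state satisfies it


@dataclass
class Trajectory:
    traj_id: str
    final_state: State


def constraint_score(traj: Trajectory, constraints: List[Constraint]) -> float:
    """Fraction of task constraints satisfied by the trajectory's final state."""
    if not constraints:
        return 0.0
    satisfied = sum(1 for c in constraints if c.check(traj.final_state))
    return satisfied / len(constraints)


def curate(trajs: List[Trajectory], constraints: List[Constraint],
           min_score: float = 0.5) -> List[Trajectory]:
    """Keep fully and partially successful trajectories at or above min_score."""
    return [t for t in trajs if constraint_score(t, constraints) >= min_score]


# Illustrative booking task with four constraints (values are made up).
constraints = [
    Constraint("city", lambda s: s.get("city") == "Seoul"),
    Constraint("guests", lambda s: s.get("guests") == "2"),
    Constraint("check_in", lambda s: s.get("check_in") == "2025-03-01"),
    Constraint("check_out", lambda s: s.get("check_out") == "2025-03-04"),
]

trajectories = [
    Trajectory("full", {"city": "Seoul", "guests": "2",
                        "check_in": "2025-03-01", "check_out": "2025-03-04"}),
    Trajectory("partial", {"city": "Seoul", "guests": "2",
                           "check_in": "2025-03-01", "check_out": "2025-03-05"}),
    Trajectory("failed", {"city": "Busan"}),
]

kept = curate(trajectories, constraints, min_score=0.5)
print([t.traj_id for t in kept])  # ['full', 'partial']
```

A binary success metric would keep only the first trajectory. Scoring per constraint also retains the one that got three of four constraints right, which is the sense in which fine-grained evaluation enlarges the usable dataset.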

In practice

Use constraint-based metrics for nuanced agent evaluation.
Incorporate partial trajectories to scale training data.
LoRA fine-tuning can be more effective than full fine-tuning.
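
As a rough illustration of why LoRA is attractive for distilling into a 24B student, the sketch below counts trainable parameters for a single adapted weight matrix. It relies on the standard LoRA formulation, in which a frozen weight of shape d_out x d_in is augmented with a low-rank product B·A, where A is rank x d_in and B is d_out x rank. The layer size and rank are assumed values for illustration, not figures from the paper.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole weight matrix is fine-tuned."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in one LoRA adapter: A (rank x d_in) + B (d_out x rank)."""
    return rank * (d_in + d_out)


# Assumed values for illustration only.
d_in, d_out, rank = 4096, 4096, 16

full = full_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank)

print(full)                       # 16777216
print(lora)                       # 131072
print(f"{lora / full:.2%}")       # 0.78%
```

Under these assumptions the adapter trains under 1% of the parameters of the matrix it modifies. That explains the lower training cost, but it does not by itself establish that LoRA yields better task performance than full fine-tuning.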

Topics

Web Agents
Automatic Data Generation
Constraint-based Evaluation
Model Distillation
BookingArena Benchmark

Code references

browser-use/browser-use

Best for: AI Scientist, Research Scientist, AI Researcher, AI Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.