Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents
Summary
This paper introduces a visual-native agent harness and On-policy Data Evolution (ODE) to address limitations in multimodal deep search agents. The harness unifies nine core tools and uses an "image bank reference protocol" so that tool-returned visual evidence stays reusable across chained operations. ODE is a closed-loop data generator that refines training data (for both supervised fine-tuning and reinforcement learning) according to the evolving capabilities of the agent being trained. In the reported experiments, ODE raises the average score across eight benchmarks from 24.9% to 39.0% for the Qwen3-VL-8B agent and from 30.6% to 41.5% for Qwen3-VL-30B, so the 8B agent already exceeds Gemini-2.5 Pro (37.9%). The framework's components improve both visual evidence gathering and training data quality.
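To make the size of the reported gains concrete, the short Python sketch below recomputes them from the figures quoted in the summary. Only the scores come from the summary; the helper name `gains` and the dictionary layout are illustrative assumptions.

```python
# Average scores (%) across eight benchmarks, as quoted in the summary above.
SCORES = {
    "Qwen3-VL-8B": (24.9, 39.0),   # (before ODE, after ODE)
    "Qwen3-VL-30B": (30.6, 41.5),
}
GEMINI_25_PRO = 37.9


def gains(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent).

    Hypothetical helper written for this summary, not taken from the paper.
    """
    absolute = round(after - before, 1)
    relative = round(100 * (after - before) / before, 1)
    return absolute, relative


if __name__ == "__main__":
    for model, (before, after) in SCORES.items():
        absolute, relative = gains(before, after)
        print(f"{model}: +{absolute} points ({relative}% relative)")
    margin = round(SCORES["Qwen3-VL-8B"][1] - GEMINI_25_PRO, 1)
    print(f"8B margin over Gemini-2.5 Pro: {margin} points")
```

Read this way, the 8B agent gains 14.1 points and the 30B agent gains 10.9 points, while the 8B agent's lead over Gemini-2.5 Pro is 1.1 points.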
Key takeaway
For AI Scientists and Machine Learning Engineers developing multimodal agents: pair a visual-native harness with an image bank so that complex visual reasoning tasks can reuse intermediate images instead of losing them between tool calls. Adopt on-policy data evolution to curate training data dynamically, so that it tracks your agent's current learning frontier. The paper reports that this approach yields higher-quality, more diverse demonstrations and tasks, and improves agent performance on deep search benchmarks.
Key insights
Multimodal agents benefit from reusable visual state and policy-aware data evolution for deep search.
Principles
- Visual evidence should be persistent and reusable.
- Training data generation should adapt to the evolving capabilities of the policy being trained.
Method
The Visual-Native Agent Harness uses an image bank reference protocol for reusable visual state. ODE refines data by synthesizing tasks, executing policy rollouts, analyzing traces with a rubric, and updating generator configurations.
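The summary does not spell out the image bank reference protocol, so the following is only a minimal Python sketch of the general idea: every image, whether supplied by the user or returned by a tool, is stored once under a stable reference, and tools accept and return references rather than raw pixels. The names `ImageBank`, `ImageRecord`, and `crop_tool` are hypothetical, and byte slicing stands in for a real image operation.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ImageRecord:
    ref: str                      # stable handle the agent passes between tools
    data: bytes                   # raw image bytes
    source: str                   # tool or input that produced the image
    parent: Optional[str] = None  # ref of the image this one was derived from


class ImageBank:
    """Hypothetical store that keeps tool-returned images reusable."""

    def __init__(self) -> None:
        self._records: dict[str, ImageRecord] = {}
        self._by_digest: dict[str, str] = {}

    def add(self, data: bytes, source: str, parent: Optional[str] = None) -> str:
        # Identical content maps to the same ref, so repeated tool
        # outputs do not create duplicate entries.
        digest = hashlib.sha256(data).hexdigest()
        if digest in self._by_digest:
            return self._by_digest[digest]
        ref = f"img_{len(self._records) + 1:03d}"
        self._records[ref] = ImageRecord(ref, data, source, parent)
        self._by_digest[digest] = ref
        return ref

    def get(self, ref: str) -> ImageRecord:
        if ref not in self._records:
            raise KeyError(f"unknown image reference: {ref}")
        return self._records[ref]


def crop_tool(bank: ImageBank, ref: str, start: int, end: int) -> str:
    """Toy tool: 'crops' by slicing bytes and registers the result."""
    cropped = bank.get(ref).data[start:end]
    return bank.add(cropped, source="crop_tool", parent=ref)


if __name__ == "__main__":
    bank = ImageBank()
    original = bank.add(b"abcdefgh", source="user_input")
    first = crop_tool(bank, original, 2, 6)   # operates on the original
    second = crop_tool(bank, first, 1, 3)     # chains on a tool output
    print(original, first, second, bank.get(second).parent)
```

Because each tool output gets its own reference and records its parent, a later step can revisit any intermediate image, which is the reuse across chained operations that the summary describes.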
In practice
- Implement an image bank for tool-generated visuals.
- Design data generators with closed-loop feedback.
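The closed-loop feedback mentioned above can be illustrated with a toy Python simulation of the four ODE steps named in the Method section: synthesize tasks, run policy rollouts, score traces with a rubric, and update the generator configuration. Everything here is an assumption made for illustration: the single `difficulty` knob, the scalar `policy_skill`, the solve-rate rubric, and the 0.5 target rate are not taken from the paper.

```python
import random
from dataclasses import dataclass


@dataclass
class GeneratorConfig:
    difficulty: float = 0.5  # hypothetical knob in [0, 1]


@dataclass
class Trace:
    task_difficulty: float
    solved: bool


def clip(x: float) -> float:
    return min(1.0, max(0.0, x))


def synthesize_tasks(cfg: GeneratorConfig, n: int, rng: random.Random) -> list[float]:
    # Toy tasks: each task is just a difficulty value near the configured level.
    return [clip(rng.gauss(cfg.difficulty, 0.1)) for _ in range(n)]


def rollout(task_difficulty: float, policy_skill: float, rng: random.Random) -> Trace:
    # Toy policy: success probability drops as difficulty exceeds skill.
    p_success = clip(0.5 + (policy_skill - task_difficulty))
    return Trace(task_difficulty, rng.random() < p_success)


def score_with_rubric(traces: list[Trace]) -> float:
    # Toy rubric: the fraction of traces that solved their task.
    return sum(t.solved for t in traces) / len(traces)


def update_config(cfg: GeneratorConfig, solve_rate: float,
                  target: float = 0.5, step: float = 0.2) -> GeneratorConfig:
    # Raise difficulty when tasks are too easy, lower it when too hard.
    return GeneratorConfig(clip(cfg.difficulty + step * (solve_rate - target)))


def evolve(cfg: GeneratorConfig, policy_skill: float, rounds: int,
           tasks_per_round: int = 200, seed: int = 0):
    rng = random.Random(seed)
    history = []
    for _ in range(rounds):
        tasks = synthesize_tasks(cfg, tasks_per_round, rng)
        traces = [rollout(t, policy_skill, rng) for t in tasks]
        solve_rate = score_with_rubric(traces)
        cfg = update_config(cfg, solve_rate)
        history.append((solve_rate, cfg.difficulty))
    return cfg, history


if __name__ == "__main__":
    final, history = evolve(GeneratorConfig(0.2), policy_skill=0.9, rounds=30)
    print(f"final difficulty: {final.difficulty:.2f}")
```

In this toy setup the generator starts with tasks that are too easy for the policy and drifts toward the difficulty at which the policy solves about half of them. A real system would replace the scalar knob with richer generator settings and the solve rate with a multi-criterion rubric.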
Topics
- Multimodal Agents
- Deep Search
- On-policy Data Evolution
- Visual-Native Harness
- Image Bank Protocol
- Qwen3-VL
Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.