Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents

2026-06-08 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Data Science & Analytics · Depth: Expert, extended

Summary

This paper introduces a visual-native agent harness and On-policy Data Evolution (ODE) to address limitations in multimodal deep search agents. The harness employs an "image bank reference protocol" to make tool-returned visual evidence persistently reusable across chained operations, unifying nine core tools. ODE is a closed-loop data generator that refines training data (for both supervised fine-tuning and reinforcement learning) based on the evolving capabilities of the agent being trained. In experiments, ODE raises the average score across eight benchmarks from 24.9% to 39.0% for a Qwen3-VL-8B agent (+14.1 points) and from 30.6% to 41.5% for Qwen3-VL-30B (+10.9 points), so the 8B agent surpasses Gemini-2.5 Pro (37.9%). The framework's components enhance visual evidence gathering and data quality.

Key takeaway

For AI Scientists and Machine Learning Engineers developing multimodal agents, integrating a visual-native harness with an image bank is crucial for complex visual reasoning tasks. You should adopt on-policy data evolution to dynamically curate training data, ensuring it aligns with your agent's current learning frontier. This approach yields higher-quality, more diverse demonstrations and tasks, significantly improving agent performance on deep search benchmarks.

Key insights

Multimodal agents benefit from reusable visual state and policy-aware data evolution for deep search.

Principles

Visual evidence should be persistent and reusable.
Training data generation must adapt to policy needs.

Method

The Visual-Native Agent Harness uses an image bank reference protocol for reusable visual state. ODE refines data by synthesizing tasks, executing policy rollouts, analyzing traces with a rubric, and updating generator configurations.
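To make the idea of reusable visual state concrete, here is a minimal Python sketch of an image bank. It is an illustration only and not the paper's implementation: the class name `ImageBank`, the `img_N` reference format, and the toy `crop_tool` and `search_tool` functions are all assumptions introduced here. The point it shows is that every image, whether supplied by the user or returned by a tool, gets a stable reference that later tool calls can take as input.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ImageBank:
    """Hypothetical store that assigns each image a stable reference ID."""

    _images: Dict[str, bytes] = field(default_factory=dict)
    _origin: Dict[str, str] = field(default_factory=dict)

    def add(self, data: bytes, origin: str) -> str:
        # Register an image and return a reference the agent can reuse later.
        ref = f"img_{len(self._images) + 1}"
        self._images[ref] = data
        self._origin[ref] = origin
        return ref

    def get(self, ref: str) -> bytes:
        # Resolve a reference back to image data; unknown references fail loudly.
        if ref not in self._images:
            raise KeyError(f"unknown image reference: {ref}")
        return self._images[ref]

    def origin(self, ref: str) -> str:
        # Report which user input or tool call produced the image.
        return self._origin[ref]


def crop_tool(bank: ImageBank, ref: str, keep: int) -> str:
    """Toy stand-in for a visual tool: 'crops' by truncating the bytes.

    The result is written back to the bank, so it can feed another tool.
    """
    cropped = bank.get(ref)[:keep]
    return bank.add(cropped, origin=f"crop({ref})")


def search_tool(bank: ImageBank, ref: str) -> str:
    """Toy stand-in for an image search tool that accepts a bank reference."""
    return f"results for {ref} ({len(bank.get(ref))} bytes)"


# Chained use: user image -> crop -> search on the cropped result.
bank = ImageBank()
user_ref = bank.add(b"0123456789", origin="user")
crop_ref = crop_tool(bank, user_ref, keep=4)
print(search_tool(bank, crop_ref))  # results for img_2 (4 bytes)
```

In this sketch, tools exchange short references rather than raw image data, which is what lets the output of one operation become the input of the next.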

In practice

Implement an image bank for tool-generated visuals.
Design data generators with closed-loop feedback.
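The closed-loop idea can be sketched as a small Python toy under strong assumptions. The four steps mirror the ones named in the Method section (synthesize tasks, run policy rollouts, score traces with a rubric, update the generator configuration), but everything else is hypothetical: the single `difficulty` knob, the solve-rate rubric, the 0.3 and 0.7 thresholds, and the integer "skill" that stands in for a real policy are placeholders, not details from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GeneratorConfig:
    difficulty: int  # hypothetical knob, e.g. number of reasoning hops


@dataclass
class Trace:
    task_hops: int
    solved: bool


def synthesize_tasks(cfg: GeneratorConfig, n: int) -> List[int]:
    # Step 1: generate n tasks at the currently configured difficulty.
    return [cfg.difficulty] * n


def rollout(policy_skill: int, task_hops: int) -> Trace:
    # Step 2: toy policy that solves a task only if it is within its skill.
    return Trace(task_hops, solved=task_hops <= policy_skill)


def rubric_score(traces: List[Trace]) -> float:
    # Step 3: toy rubric, here just the fraction of solved tasks.
    return sum(t.solved for t in traces) / len(traces)


def update_config(cfg: GeneratorConfig, solve_rate: float,
                  low: float = 0.3, high: float = 0.7) -> GeneratorConfig:
    # Step 4: push difficulty toward the policy's frontier.
    if solve_rate > high:
        return GeneratorConfig(cfg.difficulty + 1)          # too easy
    if solve_rate < low:
        return GeneratorConfig(max(1, cfg.difficulty - 1))  # too hard
    return cfg


def evolve(cfg: GeneratorConfig, policy_skill: int, rounds: int,
           n: int = 4) -> Tuple[GeneratorConfig, List[Tuple[int, float]]]:
    history = []
    for _ in range(rounds):
        tasks = synthesize_tasks(cfg, n)
        traces = [rollout(policy_skill, t) for t in tasks]
        rate = rubric_score(traces)
        history.append((cfg.difficulty, rate))
        cfg = update_config(cfg, rate)
    return cfg, history


final_cfg, history = evolve(GeneratorConfig(difficulty=1), policy_skill=3, rounds=5)
print(history)  # [(1, 1.0), (2, 1.0), (3, 1.0), (4, 0.0), (3, 1.0)]
```

With a fixed toy skill of 3, the generator climbs until tasks become unsolvable and then hovers around that boundary. In a real system the policy would also be retrained between rounds, so the frontier itself would move.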

Topics

Multimodal Agents
Deep Search
On-policy Data Evolution
Visual-Native Harness
Image Bank Protocol
Qwen3-VL

Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.