Active Zero: Self-Evolving Vision-Language Models through Active Environment Exploration

2026-01-28 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, long

Summary

Active-Zero is a framework that improves vision-language models (VLMs) through active environment exploration rather than passive interaction with static image collections. It employs three co-evolving agents: a Searcher that retrieves relevant images from open-world repositories, a Questioner that synthesizes calibrated reasoning tasks, and a Solver that is refined through accuracy rewards. This closed loop produces a self-scaffolding auto-curriculum, so the model constructs its own learning trajectory. Evaluated on Qwen2.5-VL-7B-Instruct across 12 benchmarks, Active-Zero reached an average accuracy of 53.97 on reasoning tasks (a 5.7% improvement) and 59.77 on general understanding (a 3.9% improvement), consistently outperforming existing self-play baselines. The framework also performed robustly on Qwen2.5-VL-3B-Instruct, with Iter2 consistently yielding the strongest results.

Key takeaway

For research scientists developing self-evolving vision-language models, Active-Zero demonstrates that actively curating visual data from open-world environments is crucial for scalable improvement. Consider a multi-agent framework that selects training data dynamically, based on the model's evolving capabilities, instead of relying on static datasets. The reported gains cover both reasoning and general visual understanding.

Key insights

Active environment exploration with co-evolving agents significantly improves VLM reasoning and general understanding.

Principles

Active data curation outperforms passive interaction with static image collections.
Self-scaffolding auto-curricula drive autonomous VLM improvement.
Iterative optimization of specialized agents enhances multimodal reasoning.

Method

Active-Zero uses a tri-agent framework (Searcher, Questioner, Solver) in an iterative self-play cycle. The Searcher retrieves images, the Questioner synthesizes tasks, and the Solver improves via reinforcement learning.
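
The cycle described above can be sketched as a single loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Task` type, the function `self_play_iteration`, and the four callables (`search`, `make_task`, `solve`, `update`) are hypothetical names, and the exact-match reward is only a stand-in for the accuracy reward the summary mentions.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Task:
    """A reasoning task built from one retrieved image (hypothetical type)."""
    image: str      # placeholder identifier for a retrieved image
    question: str
    answer: str     # reference answer used to score the Solver


def self_play_iteration(
    search: Callable[[str], Iterable[str]],       # Searcher: query -> images
    make_task: Callable[[str], Task],             # Questioner: image -> task
    solve: Callable[[Task], str],                 # Solver: task -> prediction
    update: Callable[[Task, str, float], None],   # RL update hook (assumed)
    queries: List[str],
) -> float:
    """Run one Searcher -> Questioner -> Solver pass; return mean accuracy."""
    rewards: List[float] = []
    for query in queries:
        for image in search(query):
            task = make_task(image)
            prediction = solve(task)
            # Assumed accuracy reward: 1.0 on exact match, else 0.0.
            reward = 1.0 if prediction == task.answer else 0.0
            update(task, prediction, reward)
            rewards.append(reward)
    return sum(rewards) / len(rewards) if rewards else 0.0
```

Calling this function repeatedly, with the agents updated between calls, gives the iterative structure; how the Searcher and Questioner themselves are trained is not covered by this sketch.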

In practice

Implement a Searcher to dynamically curate training data.
Design a Questioner to generate multi-step reasoning tasks.
Utilize Group Relative Policy Optimization (GRPO) for Solver training.
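
For the GRPO item, the step that gives the method its name is scoring each sampled answer relative to the other answers drawn for the same prompt. The sketch below shows only that normalization, assuming binary accuracy rewards and a population standard deviation (implementations differ on this choice). The function name and `eps` default are illustrative, and the clipped policy-ratio objective and KL term of full GRPO are omitted.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of answers sampled for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so an answer
    is credited only for doing better than its siblings.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With rewards such as `[1.0, 0.0, 1.0, 0.0]`, correct answers receive positive advantages and incorrect ones negative. When every answer in the group earns the same reward, all advantages are zero and the group contributes no learning signal, which is one reason task difficulty needs to be calibrated to the Solver.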

Topics

Self-Evolving VLMs
Active Environment Exploration
Tri-Agent Architecture
Auto-Curriculum Learning
VLM Self-Play

Best for: Computer Vision Engineer, Research Scientist, AI Researcher, AI Scientist, Deep Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.