DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data

2026-03-01 · Source: cs.LG updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, long

Summary

DR-Venus is a 4B-parameter deep research agent designed for edge-scale deployment and trained exclusively on approximately 10K open data points. It significantly outperforms existing agentic models under 9B parameters on deep research benchmarks such as BrowseComp, BrowseComp-ZH, and xBench-DS-2510, and it narrows the performance gap to much larger 30B-class systems. Training proceeds in two stages: agentic supervised fine-tuning (SFT) with strict data cleaning and long-horizon trajectory resampling, followed by agentic reinforcement learning (RL) using the Information-Gain Policy Optimization (IGPO) algorithm. The RL stage adds turn-level rewards based on information gain, together with format-aware regularization, to increase supervision density and improve credit assignment, both of which matter for small models on long-horizon tasks. The project releases its models, code, and key recipes to support reproducible research.

Key takeaway

For NLP engineers developing edge-scale deep research agents, DR-Venus demonstrates that strong performance is achievable with 4B models and limited open data. Prioritize rigorous data cleaning and long-horizon trajectory resampling during SFT, and use turn-level reinforcement learning with information-gain rewards and format penalties to stabilize tool use and execution. Consider test-time scaling techniques to further improve your agent's capabilities.

Key insights

Small models can achieve strong deep research capabilities when trained on high-quality data that is used effectively.

Principles

Data quality and utilization are critical for small agent training.
Turn-level RL improves long-horizon task reliability.
Test-time scaling can unlock small model potential.

Method

DR-Venus uses two-stage training: SFT with data cleaning and trajectory resampling, then RL with IGPO, which combines turn-level information-gain rewards and format-aware rewards for dense supervision.
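
To make the idea of dense, turn-level supervision concrete, here is a minimal sketch of a per-turn reward that adds an information-gain term and a format penalty. It is an illustration under stated assumptions, not the IGPO implementation from the DR-Venus release: the `Turn` fields, the `answer_score` signal, the `format_penalty` value, and the simple "gain minus penalty" combination are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    """One agent turn. Both fields are hypothetical, for illustration only."""
    answer_score: float  # assumed score in [0, 1] of how well the answer is supported after this turn
    well_formed: bool    # assumed flag: did the turn's tool call follow the expected format?


def turn_level_rewards(
    turns: List[Turn],
    initial_score: float = 0.0,
    format_penalty: float = 0.5,  # assumed value, not taken from the source
) -> List[float]:
    """Return one reward per turn: score gain over the previous turn, minus a format penalty."""
    rewards = []
    previous = initial_score
    for turn in turns:
        gain = turn.answer_score - previous  # information gain attributed to this turn
        penalty = 0.0 if turn.well_formed else format_penalty
        rewards.append(gain - penalty)
        previous = turn.answer_score
    return rewards


trajectory = [
    Turn(answer_score=0.25, well_formed=True),
    Turn(answer_score=0.25, well_formed=False),  # no progress and a malformed call
    Turn(answer_score=0.75, well_formed=True),
]
rewards = turn_level_rewards(trajectory)
print(rewards)
```

Because every turn receives its own signal, a malformed or unproductive step is penalized where it occurs instead of being averaged into a single end-of-trajectory outcome, which is the credit-assignment benefit the summary attributes to turn-level rewards.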

In practice

Oversample long-horizon trajectories for SFT.
Implement turn-level format penalties in RL.
Use LLM judges for information-gain rewards.
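
The first recommendation above, oversampling long-horizon trajectories for SFT, can be sketched as a simple duplication step over the cleaned dataset. This is a rough illustration only: the trajectory layout (a dict with a `turns` list), the `min_turns` threshold, and the `repeat` factor are assumptions, and the actual resampling recipe in the DR-Venus release may differ.

```python
import random
from typing import Dict, List


def resample_long_horizon(
    trajectories: List[Dict],
    min_turns: int = 20,  # assumed threshold for "long-horizon"
    repeat: int = 3,      # assumed oversampling factor
    seed: int = 0,
) -> List[Dict]:
    """Repeat each trajectory with at least `min_turns` turns `repeat` times, then shuffle."""
    resampled = []
    for traj in trajectories:
        copies = repeat if len(traj["turns"]) >= min_turns else 1
        resampled.extend([traj] * copies)
    random.Random(seed).shuffle(resampled)  # seeded so the SFT mix is reproducible
    return resampled


data = [
    {"id": "short", "turns": ["t"] * 5},
    {"id": "long", "turns": ["t"] * 40},
]
mix = resample_long_horizon(data)
counts = {name: sum(t["id"] == name for t in mix) for name in ("short", "long")}
print(counts)
```

Duplicating long trajectories raises their share of the training mix without discarding short ones; the threshold and factor would need tuning against the real distribution of trajectory lengths.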

Topics

DR-Venus
Edge-Scale Deep Research Agents
Small Language Models
Agentic Supervised Fine-Tuning
Information-Gain Policy Optimization

Code references

inclusionAI/DR-Venus

Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.