DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data
Summary
DR-Venus is a 4B-parameter deep research agent designed for edge-scale deployment and trained exclusively on approximately 10K open data points. It significantly outperforms existing agentic models under 9B parameters on deep research benchmarks such as BrowseComp, BrowseComp-ZH, and xBench-DS-2510, and it narrows the performance gap to much larger 30B-class systems. Training proceeds in two stages: agentic supervised fine-tuning (SFT) with strict data cleaning and long-horizon trajectory resampling, followed by agentic reinforcement learning (RL) with the Information-Gain Policy Optimization (IGPO) algorithm. The RL stage uses turn-level rewards based on information gain, together with format-aware regularization, to increase supervision density and improve credit assignment, both of which matter for small models tackling long-horizon tasks. The project releases its models, code, and key recipes to support reproducible research.
Key takeaway
For NLP engineers developing edge-scale deep research agents, DR-Venus demonstrates that strong performance is achievable with 4B models and limited open data. You should prioritize rigorous data cleaning and long-horizon trajectory resampling during SFT, and implement turn-level reinforcement learning with information gain and format penalties to stabilize tool use and execution. Consider exploring test-time scaling techniques to further enhance your agent's capabilities.
Key insights
Small models can achieve strong deep research capabilities when they are trained on high-quality data and that data is used effectively.
Principles
- Data quality and utilization are critical for small agent training.
- Turn-level RL improves long-horizon task reliability.
- Test-time scaling can unlock small model potential.
Method
DR-Venus is trained in two stages: SFT with data cleaning and trajectory resampling, then RL with IGPO, which combines turn-level information-gain rewards with format-aware rewards to provide dense supervision.
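The turn-level reward idea can be illustrated with a minimal sketch. This is not the DR-Venus implementation: it assumes each turn already has a score in [0, 1] for how well the evidence gathered so far supports the correct answer (for example, from a judge model), treats information gain as the change in that score between consecutive turns, and subtracts a fixed penalty from malformed turns. The function name `turn_level_rewards`, its parameters, and the penalty value are hypothetical.

```python
from typing import List


def turn_level_rewards(
    scores: List[float],
    format_ok: List[bool],
    initial_score: float = 0.0,
    format_penalty: float = 0.5,
) -> List[float]:
    """Return one reward per turn (illustrative, not the paper's exact formula).

    scores[t]    : assumed answer-support score after turn t, in [0, 1]
    format_ok[t] : whether turn t followed the expected tool-call format
    """
    if len(scores) != len(format_ok):
        raise ValueError("scores and format_ok must have one entry per turn")

    rewards = []
    previous = initial_score
    for score, ok in zip(scores, format_ok):
        # Information gain: how much this turn moved the score.
        gain = score - previous
        # Format-aware term: malformed turns lose a fixed amount (assumed value).
        rewards.append(gain - (0.0 if ok else format_penalty))
        previous = score
    return rewards


# Three turns: useful search, malformed call that adds nothing, useful search.
example = turn_level_rewards([0.25, 0.25, 0.75], [True, False, True])
```

Because every turn receives its own signal, a long trajectory is not judged solely by its final answer, which is the dense-supervision property the summary describes.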
In practice
- Oversample long-horizon trajectories for SFT.
- Implement turn-level format penalties in RL.
- Use LLM judges for information gain rewards.
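The first item above, oversampling long-horizon trajectories for SFT, could look roughly like the following. The summary does not state the resampling rule DR-Venus uses, so this sketch assumes a simple one: any trajectory with at least `min_turns` turns is repeated `repeat` times in the training set. The trajectory layout (a dict with a `"turns"` list), the function name, and both thresholds are hypothetical.

```python
from typing import Any, Dict, List


def resample_long_horizon(
    trajectories: List[Dict[str, Any]],
    min_turns: int = 20,
    repeat: int = 3,
) -> List[Dict[str, Any]]:
    """Duplicate long trajectories so they appear more often during SFT.

    Each trajectory is assumed to be a dict with a "turns" list; the
    threshold and repeat factor are illustrative, not from the paper.
    """
    if repeat < 1:
        raise ValueError("repeat must be at least 1")

    resampled = []
    for trajectory in trajectories:
        is_long = len(trajectory["turns"]) >= min_turns
        copies = repeat if is_long else 1
        resampled.extend([trajectory] * copies)
    return resampled


# One short (2-turn) and one long (5-turn) trajectory, with a low threshold.
data = [
    {"id": "short", "turns": ["search", "answer"]},
    {"id": "long", "turns": ["search", "open", "search", "open", "answer"]},
]
mixed = resample_long_horizon(data, min_turns=4, repeat=3)
```

In practice the resampled list would be shuffled before training, and a weighted sampler would serve the same purpose without duplicating examples in memory.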
Topics
- DR-Venus
- Edge-Scale Deep Research Agents
- Small Language Models
- Agentic Supervised Fine-Tuning
- Information-Gain Policy Optimization
Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.