DeNovoSWE: Scaling Long-Horizon Environments for Generating Entire Repositories from Scratch
Summary
DeNovoSWE is a large-scale dataset for training LLM-based code agents to generate complete software repositories from high-level documentation. Its 4,818 high-quality instances are constructed automatically by a sandboxed agentic workflow that combines a "divide and conquer" strategy, an iterative critic-repair mechanism, and a difficulty-aware trajectory filtering method that balances data quality and diversity. The dataset addresses the scarcity of verifiable long-horizon software engineering training data. Empirically, fine-tuning Qwen3-30B-A3B on DeNovoSWE raised its score on the BeyondSWE-Doc2Repo benchmark from 5.8% to 47.2%. Qwen3.5-35B-A3B likewise improved from 43.8% to 50.0% on BeyondSWE-Doc2Repo and from 23.5% to 27.1% on NL2RepoBench, indicating that the dataset is effective at improving whole-repository generation.
Key takeaway
AI engineers building LLM-based code agents for complex software engineering should consider DeNovoSWE's approach to data generation. Its automated, sandboxed pipeline and difficulty-aware filtering strategy offer a scalable way to create high-quality, long-horizon training data. Training on such data can substantially improve an agent's ability to generate entire repositories from documentation, as shown by the gains on benchmarks like BeyondSWE-Doc2Repo.
Key insights
Automated, structured data generation with difficulty-aware filtering scales long-horizon software engineering training for LLM agents.
Principles
- "Divide and conquer" simplifies complex tasks.
- Iterative critic-repair refines generated content.
- Difficulty-aware filtering balances quality and diversity.
Method
DeNovoSWE uses a sandboxed multi-agent system with a "divide" phase for capability decomposition and profiling, and a "conquer" phase for iterative draft-critic-repair documentation generation. It includes strict leakage prevention.
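The draft-critic-repair loop in the "conquer" phase can be pictured with a minimal sketch. Everything named here (`Review`, `draft_critic_repair`, the `max_rounds` limit, and the toy section-checking critic) is a hypothetical stand-in for illustration and is not taken from the DeNovoSWE pipeline, where each role would be played by an LLM agent.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Review:
    """Critic output; an empty issue list means the draft is accepted."""
    issues: List[str]


def draft_critic_repair(
    draft: Callable[[], str],
    critique: Callable[[str], Review],
    repair: Callable[[str, List[str]], str],
    max_rounds: int = 3,  # assumed budget, not a value from the paper
) -> Tuple[str, bool]:
    """Draft once, then alternate critique and repair until accepted."""
    doc = draft()
    for _ in range(max_rounds):
        review = critique(doc)
        if not review.issues:
            return doc, True
        doc = repair(doc, review.issues)
    # Budget exhausted: run one last check on the final repaired draft.
    return doc, not critique(doc).issues


# Toy stand-ins: the critic flags missing sections, the repairer adds them.
REQUIRED_SECTIONS = ["## Overview", "## API", "## Usage"]


def toy_draft() -> str:
    return "## Overview\nA small library."


def toy_critique(doc: str) -> Review:
    return Review([s for s in REQUIRED_SECTIONS if s not in doc])


def toy_repair(doc: str, issues: List[str]) -> str:
    return doc + "".join(f"\n{section}\nTODO" for section in issues)


final_doc, accepted = draft_critic_repair(toy_draft, toy_critique, toy_repair)
```

Bounding the loop with a round budget keeps a stubborn draft from consuming unlimited agent calls; a document that is still rejected after the budget can simply be dropped from the dataset.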
In practice
- Fine-tune LLMs for whole-repository generation.
- Use sandboxed agents for data curation.
- Implement dynamic filtering for varied task difficulty.
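One way to read "dynamic filtering for varied task difficulty" is sketched below. This is an assumed scheme, not the paper's actual filtering rule: task difficulty is estimated from the mean test pass rate of the sampled trajectories, and the acceptance threshold is relaxed for tasks that look hard so they are not filtered out entirely. The `Trajectory` type and all threshold values are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Trajectory:
    task_id: str
    test_pass_rate: float  # fraction of verification tests passed, in [0, 1]


def difficulty_aware_filter(
    trajectories: List[Trajectory],
    easy_threshold: float = 1.0,  # assumed: easy tasks must pass everything
    hard_threshold: float = 0.6,  # assumed: hard tasks may pass partially
    hard_cutoff: float = 0.5,     # assumed: mean pass rate below this is "hard"
) -> List[Trajectory]:
    """Keep trajectories whose pass rate meets a per-task threshold."""
    by_task: Dict[str, List[Trajectory]] = defaultdict(list)
    for traj in trajectories:
        by_task[traj.task_id].append(traj)

    kept: List[Trajectory] = []
    for group in by_task.values():
        mean_rate = sum(t.test_pass_rate for t in group) / len(group)
        threshold = hard_threshold if mean_rate < hard_cutoff else easy_threshold
        kept.extend(t for t in group if t.test_pass_rate >= threshold)
    return kept


samples = [
    Trajectory("easy-task", 1.0),
    Trajectory("easy-task", 0.9),
    Trajectory("hard-task", 0.7),
    Trajectory("hard-task", 0.2),
]
kept = difficulty_aware_filter(samples)
```

With a single fixed threshold of 1.0, the hard task would contribute nothing to the training set; the relaxed threshold trades some per-sample quality for coverage of difficult tasks, which is the quality and diversity balance the summary describes.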
Topics
- LLM Code Agents
- Software Engineering
- Dataset Generation
- Whole-Repository Generation
- Long-Horizon Tasks
- Difficulty-Aware Filtering
- Supervised Fine-Tuning
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.