OpenThoughts-Agent: Data Recipes for Agentic Models
Summary
The OpenThoughts-Agent (OT-Agent) project introduces a fully open data curation pipeline for training broadly capable agentic language models, addressing the lack of public knowledge about how to curate data for such models. Unlike existing efforts such as SWE-Smith or Nemotron-Terminal, which target single benchmarks, OT-Agent aims to generalize across diverse agentic tasks. The researchers ran more than 100 controlled ablation experiments to investigate each pipeline stage systematically, yielding insights into the importance of task sources and diversity. They then assembled a 100K-example training set from the pipeline and fine-tuned Qwen3-32B on it, reaching an average accuracy of 44.8% across seven agentic benchmarks. This is a 3.9 percentage point improvement over Nemotron-Terminal-32B, the strongest existing open-data agentic model, which scored 40.9%. The OT-Agent training data also scales well: in compute-controlled comparisons it outperforms alternative open datasets at every training set size. All training sets, the data pipeline, experimental data, and models are publicly released at openthoughts.ai.
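The headline comparison is an absolute gap in percentage points, which is easy to confuse with a relative gain. The short sketch below uses only the two averages quoted in the summary; the variable names are illustrative.

```python
# Average accuracies quoted in the summary (percent, over seven agentic benchmarks).
ot_agent_avg = 44.8
nemotron_terminal_avg = 40.9

# Absolute gap, in percentage points.
gap_pp = round(ot_agent_avg - nemotron_terminal_avg, 1)

# Relative improvement over the baseline, in percent.
relative_gain_pct = round(100 * gap_pp / nemotron_terminal_avg, 1)

print(f"gap: {gap_pp} pp, relative gain: {relative_gain_pct}%")
```

The 3.9-point gap corresponds to roughly a 9.5% relative improvement over the 40.9% baseline.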
Key takeaway
For machine learning engineers building broadly capable agentic models, the OpenThoughts-Agent project offers a validated data curation pipeline aimed at better generalization. Consider integrating the publicly released OT-Agent training sets and pipeline from openthoughts.ai into your development workflow. In the reported experiments, this data gave a 3.9 percentage point improvement in average accuracy over the strongest existing open-data agentic model across seven agentic benchmarks.
Key insights
OpenThoughts-Agent provides an open data pipeline and insights for training agentic models that generalize across diverse tasks.
Principles
- Task source diversity is crucial for agentic model generalization.
- Systematic ablation experiments inform data pipeline optimization.
- Open data pipelines enhance research reproducibility.
Method
The OT-Agent method combines a systematic data curation pipeline, more than 100 controlled ablation experiments that investigate each pipeline stage, and a 100K-example training set assembled from the pipeline and used to fine-tune Qwen3-32B.
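As a rough illustration of what a controlled ablation over pipeline stages can look like, the sketch below enumerates configurations that differ from a baseline in exactly one stage. This is a generic one-factor-at-a-time design, not the authors' experimental setup; the stage names and options are hypothetical placeholders, since the summary does not list them.

```python
# Hypothetical pipeline stages and options. These names are placeholders,
# not the stages or settings studied in OT-Agent.
BASELINE = {"task_source": "source_a", "filtering": "none", "teacher": "teacher_x"}
OPTIONS = {
    "task_source": ["source_a", "source_b", "source_c"],
    "filtering": ["none", "length_filter"],
    "teacher": ["teacher_x", "teacher_y"],
}


def one_factor_ablations(baseline, options):
    """Yield configs that differ from the baseline in exactly one stage."""
    for stage, choices in options.items():
        for choice in choices:
            if choice != baseline[stage]:
                # Copy the baseline and override a single stage.
                yield {**baseline, stage: choice}


configs = list(one_factor_ablations(BASELINE, OPTIONS))
for config in configs:
    print(config)
```

Holding every other stage at its baseline value means any change in downstream accuracy can be attributed to the one stage that was varied.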
In practice
- Fine-tune Qwen3-32B on the 100K-example OT-Agent training set; the reported model reaches 44.8% average accuracy across seven agentic benchmarks.
- Utilize openthoughts.ai resources for agentic model training.
- Prioritize diverse task sources in agentic data curation.
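One simple way to act on the task-source diversity point is to spread a fixed sampling budget evenly over the available sources. The sketch below is a minimal round-robin sampler under that assumption; it is not the OT-Agent pipeline, and the function name, source names, and toy pools are hypothetical.

```python
import random


def sample_across_sources(tasks_by_source, total, seed=0):
    """Draw up to `total` tasks, spreading the draw evenly over sources."""
    rng = random.Random(seed)
    # Shuffle a copy of each pool so the caller's lists are left untouched.
    pools = {}
    for name, tasks in tasks_by_source.items():
        pool = list(tasks)
        rng.shuffle(pool)
        pools[name] = pool

    picked = []
    # Round-robin: take one task per non-empty source until the budget is met
    # or every pool is exhausted.
    while len(picked) < total and any(pools.values()):
        for name in sorted(pools):
            if pools[name] and len(picked) < total:
                picked.append((name, pools[name].pop()))
    return picked


# Hypothetical toy pools; the source names are placeholders.
toy = {
    "source_a": [f"a{i}" for i in range(50)],
    "source_b": [f"b{i}" for i in range(5)],
    "source_c": [f"c{i}" for i in range(20)],
}
sample = sample_across_sources(toy, total=30)
counts = {name: sum(1 for s, _ in sample if s == name) for name in toy}
print(counts)
```

Small sources are used up first and the remaining budget is shared among the larger ones, so no single large source dominates the sample.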
Topics
- Agentic Models
- Data Curation
- Large Language Models
- Model Fine-tuning
- Open Research
- Benchmarking
Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.