Spokes: Optimizing for Diverse Pretraining Data Selection
Summary
SPOKES is a probabilistic diversification framework for pretraining data selection that optimizes data diversity directly, a key factor for model performance under a fixed data budget. The method is based on the G-Vendi score and optimized via exponentiated gradient descent. On a 500k-sample subset, it raises the G-Vendi score by +489 compared to random sampling. Evaluated on the FineWeb and DCLM datasets, SPOKES consistently outperforms existing approaches. A diversity-only application of SPOKES improves average downstream performance by +0.4 points on DCLM and +0.5 points on FineWeb over random sampling. Jointly optimizing for both data quality and diversity yields the strongest results, with gains of +1.5 points on DCLM and +1.4 points on FineWeb, surpassing baselines such as semantic deduplication and quality filtering.
Key takeaway
Machine learning engineers curating pretraining data for large language models should consider SPOKES for improving dataset diversity. Jointly optimizing for data quality and diversity yields reported gains of +1.5 points on DCLM and +1.4 points on FineWeb. The approach goes beyond semantic deduplication and quality filtering alone, offering a more direct way to improve downstream task performance under a fixed data budget.
Key insights
Directly optimizing data diversity, using the G-Vendi score as the objective and exponentiated gradient descent as the optimizer, improves pretraining performance over random sampling and existing selection baselines.
Principles
- Diversity is a property of the selected set as a whole, not of individual samples.
- Direct diversity optimization outperforms proxies.
- Jointly optimizing quality and diversity yields superior results.
Method
A probabilistic diversification framework based on the G-Vendi score, optimized via exponentiated gradient descent, selects diverse data subsets.
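The summary does not give the G-Vendi definition or the exact update rule, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes each sample has a unit-norm feature vector (hypothetical here; for G-Vendi these would presumably be gradient-based features), uses a Vendi-style score (the exponential of the entropy of the weighted covariance spectrum), and runs a multiplicative exponentiated gradient update on per-sample selection probabilities. The function names `weighted_vendi`, `exponentiated_gradient_weights`, and `sample_subset` are illustrative, not taken from the original work.

```python
import numpy as np

def weighted_vendi(features, weights, eps=1e-12):
    """Vendi-style score: exp(entropy) of the weighted covariance eigenvalues.

    Assumes rows of `features` are unit-norm and `weights` lie on the simplex.
    """
    cov = features.T @ (features * weights[:, None])  # sum_i p_i x_i x_i^T
    eigvals = np.clip(np.linalg.eigvalsh(cov), eps, None)
    eigvals = eigvals / eigvals.sum()
    return float(np.exp(-np.sum(eigvals * np.log(eigvals))))

def entropy_gradient(features, weights, eps=1e-12):
    """Gradient of H(C) = -tr(C log C) with respect to each sample weight."""
    cov = features.T @ (features * weights[:, None])
    eigvals, eigvecs = np.linalg.eigh(cov)
    log_cov = (eigvecs * np.log(np.clip(eigvals, eps, None))) @ eigvecs.T
    # dH/dp_i = -x_i^T (log C + I) x_i
    quad = np.einsum("ij,jk,ik->i", features, log_cov, features)
    return -quad - np.sum(features ** 2, axis=1)

def exponentiated_gradient_weights(features, steps=100, lr=0.5):
    """Maximize the spectrum entropy over the simplex with multiplicative updates."""
    n = features.shape[0]
    log_w = np.full(n, -np.log(n))  # start from uniform sampling
    for _ in range(steps):
        grad = entropy_gradient(features, np.exp(log_w))
        log_w = log_w + lr * grad              # multiplicative update in log space
        log_w -= np.logaddexp.reduce(log_w)    # renormalize onto the simplex
    return np.exp(log_w)

def sample_subset(weights, k, seed=0):
    """Draw k distinct indices according to the learned probabilities.

    Requires at least k non-zero weights.
    """
    rng = np.random.default_rng(seed)
    p = weights / weights.sum()
    return rng.choice(len(p), size=k, replace=False, p=p)
```

On a toy pool where one direction is heavily over-represented, the learned probabilities down-weight the redundant samples, so a subset drawn with `sample_subset` covers the rarer directions more often than uniform sampling would. At real pretraining scale the dense eigendecomposition would need to be replaced by something cheaper; this sketch ignores that.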
In practice
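- Apply SPOKES with joint quality and diversity optimization to FineWeb for a reported gain of +1.4 points.
- Apply the same joint setup to DCLM for a reported gain of +1.5 points.
- Combine quality filtering with diversity optimization rather than relying on either alone.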
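How quality and diversity are combined is not described in this summary. One plausible way to sketch it, purely as an assumption, is to add a linear quality term to a diversity objective and run the same kind of exponentiated gradient update. Here `quality` is a hypothetical per-sample score (for example from a quality classifier), `lam` is a hypothetical trade-off weight, and the diversity term is the entropy of the weighted feature covariance spectrum for unit-norm features.

```python
import numpy as np

def joint_eg_weights(features, quality, lam=1.0, steps=100, lr=0.5, eps=1e-12):
    """Exponentiated gradient ascent on: spectrum entropy + lam * expected quality.

    Assumes unit-norm rows in `features`; returns selection probabilities.
    """
    n = features.shape[0]
    log_w = np.full(n, -np.log(n))              # uniform start
    sq_norms = np.sum(features ** 2, axis=1)
    for _ in range(steps):
        w = np.exp(log_w)
        cov = features.T @ (features * w[:, None])
        eigvals, eigvecs = np.linalg.eigh(cov)
        log_cov = (eigvecs * np.log(np.clip(eigvals, eps, None))) @ eigvecs.T
        # Gradient of the diversity term: -x_i^T (log C + I) x_i
        diversity_grad = -np.einsum("ij,jk,ik->i", features, log_cov, features) - sq_norms
        # Gradient of the quality term lam * sum_i p_i q_i is lam * q_i
        log_w = log_w + lr * (diversity_grad + lam * quality)
        log_w -= np.logaddexp.reduce(log_w)     # renormalize onto the simplex
    return np.exp(log_w)
```

With two near-duplicate samples of different quality, this update shifts probability toward the higher-quality one while the diversity term keeps the total mass balanced across distinct directions. Setting `lam=0` recovers a diversity-only selection.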
Topics
- Data Diversity Optimization
- Pretraining Data Selection
- G-Vendi Score
- Exponentiated Gradient Descent
- FineWeb Dataset
- DCLM Dataset
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.