Spokes: Optimizing for Diverse Pretraining Data Selection

Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

SPOKES is a probabilistic diversification framework for pretraining data selection that optimizes data diversity directly, a critical factor for model performance under a fixed data budget. It maximizes the G-Vendi score using exponentiated gradient descent; on a 500k-sample subset, this raises the G-Vendi score by 489 relative to random sampling. Evaluated on the FineWeb and DCLM datasets, SPOKES consistently outperforms existing approaches. Used for diversity alone, it improves average downstream performance over random sampling by 0.4 points on DCLM and 0.5 points on FineWeb. Jointly optimizing for data quality and diversity yields the strongest results, with gains of +1.5 points on DCLM and +1.4 points on FineWeb, surpassing baselines such as semantic deduplication and quality filtering.

Key takeaway

For machine learning engineers selecting pretraining data for large language models: consider integrating SPOKES to increase dataset diversity. In the reported evaluation, jointly optimizing for data quality and diversity produced gains of +1.5 points on DCLM and +1.4 points on FineWeb. The approach goes beyond semantic deduplication and quality filtering as a way to improve downstream task performance under a fixed data budget.

Key insights

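Directly optimizing data diversity, measured by the G-Vendi score and maximized with exponentiated gradient descent, improves downstream performance after pretraining, and the gains are largest when diversity is optimized jointly with data quality.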

Method

A probabilistic diversification framework based on the G-Vendi score, optimized via exponentiated gradient descent, selects diverse data subsets.
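
The idea of optimizing sampling probabilities for diversity can be illustrated with a minimal sketch. It is not the SPOKES implementation: it assumes the diversity score takes the standard Vendi form (the exponential of the Shannon entropy of the eigenvalues of a weighted covariance matrix), that each document is already represented by a unit-norm feature vector, and that the subset is drawn by sampling from the learned probabilities. The function names, the toy data, the step size, and the iteration count are all hypothetical choices made for illustration.

```python
import numpy as np


def weighted_covariance(X, p):
    """p-weighted second-moment matrix of the rows of X (trace 1 for unit-norm rows)."""
    return (X * p[:, None]).T @ X


def vendi_score(X, p=None):
    """exp(Shannon entropy) of the eigenvalues of the weighted covariance.

    With uniform p this matches the Vendi score of the linear kernel X @ X.T.
    """
    n = X.shape[0]
    if p is None:
        p = np.full(n, 1.0 / n)
    eig = np.linalg.eigvalsh(weighted_covariance(X, p))
    eig = eig[eig > 1e-12]  # drop numerically zero eigenvalues
    return float(np.exp(-np.sum(eig * np.log(eig))))


def entropy_gradient(X, p, eps=1e-12):
    """Gradient of the eigenvalue entropy with respect to each probability p_i.

    d/dp_i of -tr(C log C) is -x_i^T (log C + I) x_i, evaluated in C's eigenbasis.
    """
    w, V = np.linalg.eigh(weighted_covariance(X, p))
    log_w = np.log(np.clip(w, eps, None))
    proj = X @ V  # coordinates of every sample in the eigenbasis
    return -(proj ** 2) @ (log_w + 1.0)


def exponentiated_gradient(X, steps=200, lr=0.5):
    """Ascend the entropy on the probability simplex with multiplicative updates."""
    n = X.shape[0]
    p = np.full(n, 1.0 / n)
    for _ in range(steps):
        g = entropy_gradient(X, p)
        p = p * np.exp(lr * (g - g.max()))  # subtracting the max avoids overflow
        p /= p.sum()                        # renormalize back onto the simplex
    return p


def sample_subset(p, k, seed=0):
    """Draw k distinct sample indices according to the learned probabilities."""
    rng = np.random.default_rng(seed)
    return rng.choice(len(p), size=k, replace=False, p=p)


def make_toy_features(seed=0):
    """Hypothetical imbalanced data: 90 points near one axis, 5 near each of two others."""
    rng = np.random.default_rng(seed)
    centers = np.eye(3)
    counts = (90, 5, 5)
    X = np.vstack([centers[k] + 0.05 * rng.standard_normal((c, 3))
                   for k, c in enumerate(counts)])
    return X / np.linalg.norm(X, axis=1, keepdims=True)


if __name__ == "__main__":
    X = make_toy_features()
    p = exponentiated_gradient(X)
    print("uniform score:  ", vendi_score(X))
    print("optimized score:", vendi_score(X, p))
    print("sampled indices:", sample_subset(p, 20))
```

In this toy setup the multiplicative update shifts probability mass away from the over-represented cluster and toward the two rare ones, so the weighted score ends up higher than under uniform sampling. A multiplicative update is convenient here because it keeps every probability positive and needs only a renormalization to stay on the simplex.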

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.