Hubs or Fringes: Pretraining Data Selection via Web Graph Centrality
Summary
WebGraphMix is a lightweight data selection framework for optimizing the composition of pretraining data for large language models. Rather than relying on auxiliary classifiers or labeled data, it computes structural centrality scores over the Common Crawl host-level web graph, which avoids much of the computational overhead of existing methods. The framework varies the proportion of central versus peripheral documents, on the hypothesis that central hosts provide reusable abstractions while peripheral hosts offer specialized, long-tail knowledge. WebGraphMix was integrated into the DataComp-LM pipeline, and models were trained at 400M and 1B parameters on 8B and 28B tokens, respectively. A 1:1 central/peripheral mixture achieved 41.4% average performance across 23 tasks, compared with 39.8% for uniform sampling. Combining the structural scores with document-level quality classifier scores raised the average further to 43.8%. These results suggest that web graph topology contributes a signal for pretraining data curation that is complementary to content-based quality scores.
Key takeaway
For machine learning engineers curating pretraining datasets for large language models, WebGraphMix offers an efficient, label-free way to improve model performance. Balancing central and peripheral data by web graph centrality at a 1:1 ratio reached a reported 41.4% average performance, versus 39.8% for uniform sampling. Consider combining this structural scoring with existing content-based quality filters, which raised the reported average to 43.8%. Because the structural scores require no classifier training, adding them incurs no additional model training overhead.
Key insights
Web graph centrality is an efficient signal for pretraining data selection for LLMs: central and peripheral hosts contribute complementary knowledge.
Principles
- Central web hosts expose reusable abstractions.
- Peripheral web hosts encode specialized knowledge.
- Web graph topology offers a curation signal orthogonal to content-based quality scores.
Method
WebGraphMix computes structural centrality scores on the Common Crawl host-level web graph to vary the proportion of central versus peripheral documents in pretraining mixtures.
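The scoring step can be illustrated with a minimal sketch. The summary does not name the centrality measure, so PageRank is used here purely as an assumption. The host names, the toy edge list, and the top-half cutoff for "central" are likewise hypothetical and not taken from the original work.

```python
from collections import defaultdict

def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a directed host graph given as (src, dst) pairs.

    PageRank is an assumed stand-in for the unspecified centrality measure.
    """
    hosts = sorted({h for edge in edges for h in edge})
    out_links = defaultdict(list)
    for src, dst in edges:
        out_links[src].append(dst)
    n = len(hosts)
    rank = {h: 1.0 / n for h in hosts}
    for _ in range(iterations):
        # Rank held by hosts without out-links is spread uniformly.
        dangling = sum(rank[h] for h in hosts if not out_links[h])
        new_rank = {h: (1.0 - damping) / n + damping * dangling / n for h in hosts}
        for src in hosts:
            targets = out_links[src]
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        rank = new_rank
    return rank

def split_by_centrality(scores, central_fraction=0.5):
    """Label the top `central_fraction` of hosts as central, the rest as peripheral.

    The cutoff rule is a hypothetical choice for illustration.
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    cutoff = max(1, int(len(ordered) * central_fraction))
    return ordered[:cutoff], ordered[cutoff:]

# Toy host-level graph (hypothetical hosts): an edge means "src links to dst".
edges = [
    ("blog.example", "wiki.example"),
    ("forum.example", "wiki.example"),
    ("niche.example", "wiki.example"),
    ("wiki.example", "news.example"),
    ("news.example", "wiki.example"),
    ("blog.example", "news.example"),
]
scores = pagerank(edges)
central, peripheral = split_by_centrality(scores)
```

In this toy graph the heavily linked hosts land in `central` and the hosts with no inbound links land in `peripheral`. At Common Crawl scale a distributed graph library would likely replace the pure-Python loop.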
In practice
- Mix central and peripheral web data at a 1:1 ratio.
- Combine graph centrality with document quality scores.
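The two recommendations above can be sketched as follows. The summary does not describe how centrality and quality scores are combined, so this sketch assumes one simple option: filter each pool by a quality score first, then sample the mixture at the target ratio. The function names, document IDs, and quality values are all hypothetical.

```python
import random

def top_by_quality(docs, quality, keep):
    """Keep the `keep` highest-quality documents from a pool (assumed combination rule)."""
    return sorted(docs, key=quality.get, reverse=True)[:keep]

def sample_mixture(central_docs, peripheral_docs, total, central_share=0.5, seed=0):
    """Sample `total` documents, taking `central_share` of them from the central pool."""
    rng = random.Random(seed)
    n_central = round(total * central_share)
    n_peripheral = total - n_central
    if n_central > len(central_docs) or n_peripheral > len(peripheral_docs):
        raise ValueError("not enough documents in one of the pools")
    return rng.sample(central_docs, n_central) + rng.sample(peripheral_docs, n_peripheral)

# Hypothetical document pools, already split by host centrality.
central_docs = [f"c{i}" for i in range(10)]
peripheral_docs = [f"p{i}" for i in range(10)]

# Hypothetical quality scores in [0, 1), e.g. from an existing classifier.
quality = {doc: i / 10 for i, doc in enumerate(central_docs)}
quality.update({doc: i / 10 for i, doc in enumerate(peripheral_docs)})

# 1:1 mixture by centrality alone.
mix = sample_mixture(central_docs, peripheral_docs, total=8)

# Quality filter within each pool, then the same 1:1 mixture.
good_central = top_by_quality(central_docs, quality, keep=5)
good_peripheral = top_by_quality(peripheral_docs, quality, keep=5)
filtered_mix = sample_mixture(good_central, good_peripheral, total=8)
```

Filtering within each pool keeps the 1:1 ratio intact. Filtering the merged corpus by quality could shift the ratio if quality scores correlate with centrality.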
Topics
- Pretraining Data Selection
- Web Graph Centrality
- Large Language Models
- Common Crawl
- Data Curation
- Graph Topology
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.