Hubs or Fringes: Pretraining Data Selection via Web Graph Centrality

Source: Artificial Intelligence · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics) · Depth: Expert, quick

Summary

WebGraphMix is a lightweight data selection framework for optimizing the composition of pretraining data for large language models. It avoids the computational overhead of existing methods by computing structural centrality scores over the Common Crawl host-level web graph instead of relying on auxiliary classifiers or labeled data. The framework varies the proportion of central versus peripheral documents, on the hypothesis that central hosts provide reusable abstractions while peripheral hosts offer specialized, long-tail knowledge. WebGraphMix was integrated into the DataComp-LM pipeline, and models were trained at 400M and 1B parameters with 8B and 28B tokens, respectively. A 1:1 central/peripheral mixture reached 41.4% average performance across 23 tasks, compared with 39.8% for uniform sampling. Combining the structural scores with document-level quality classifier scores raised performance further, to 43.8%. The results indicate that web graph topology makes a significant contribution to pretraining data curation, orthogonal to content-based quality signals.

Key takeaway

For machine learning engineers curating pretraining datasets for large language models, WebGraphMix offers an efficient way to improve data mixtures: the structural scores themselves require no auxiliary classifier or labeled data. A 1:1 mix of central and peripheral documents reached a reported 41.4% average performance across 23 tasks, versus 39.8% for uniform sampling. Consider combining this structural scoring with existing content-based quality filters, a combination that reached a reported 43.8%.
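As an illustration of combining the two signals, here is a minimal Python sketch. The summary does not say how WebGraphMix merges structural and quality scores, so the rule used here (keep the highest-quality documents within each centrality bucket) is an assumption, and the function name `select_with_quality`, the `quality` field, and the toy data are all hypothetical.

```python
def select_with_quality(docs, central_hosts, per_bucket):
    """Keep the top `per_bucket` docs by quality score in each centrality bucket.

    Assumption: this combination rule is illustrative only; the source does
    not specify how structural and quality scores are combined.
    """
    def top(bucket):
        return sorted(bucket, key=lambda d: d["quality"], reverse=True)[:per_bucket]

    central = [d for d in docs if d["host"] in central_hosts]
    peripheral = [d for d in docs if d["host"] not in central_hosts]
    return top(central) + top(peripheral)


# Toy documents with hypothetical quality-classifier scores in [0, 1].
docs = [
    {"id": 1, "host": "hub.com", "quality": 0.9},
    {"id": 2, "host": "hub.com", "quality": 0.2},
    {"id": 3, "host": "wiki.org", "quality": 0.7},
    {"id": 4, "host": "b.org", "quality": 0.8},
    {"id": 5, "host": "c.org", "quality": 0.1},
    {"id": 6, "host": "d.net", "quality": 0.6},
]
central_hosts = {"hub.com", "wiki.org"}  # assumed output of a centrality step

selected = select_with_quality(docs, central_hosts, per_bucket=2)
selected_ids = [d["id"] for d in selected]
```

Selecting an equal number of documents per bucket keeps the 1:1 central/peripheral balance while letting the quality score decide which documents fill each half.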

Key insights

Web graph centrality is an effective, inexpensive signal for LLM pretraining data selection: central and peripheral hosts contribute complementary knowledge, and the signal is orthogonal to content-based quality scores.

Principles

Method

WebGraphMix computes structural centrality scores on the Common Crawl host-level web graph to vary the proportion of central versus peripheral documents in pretraining mixtures.
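The idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: the summary does not name the centrality measure, so PageRank by power iteration stands in for it, and the host names, the 50% central cutoff, and the helpers `pagerank`, `split_hosts`, and `sample_mixture` are hypothetical.

```python
import random
from collections import defaultdict


def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a directed host graph of (src, dst) pairs."""
    nodes = sorted({host for edge in edges for host in edge})
    out_links = defaultdict(list)
    for src, dst in edges:
        out_links[src].append(dst)
    n = len(nodes)
    rank = {host: 1.0 / n for host in nodes}
    for _ in range(iterations):
        # Rank held by hosts without out-links is spread uniformly.
        dangling = sum(rank[h] for h in nodes if not out_links[h])
        new_rank = {h: (1.0 - damping) / n + damping * dangling / n for h in nodes}
        for src in nodes:
            targets = out_links[src]
            for dst in targets:
                new_rank[dst] += damping * rank[src] / len(targets)
        rank = new_rank
    return rank


def split_hosts(scores, central_fraction=0.5):
    """Return the top-scoring fraction of hosts; the rest count as peripheral."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    cut = max(1, int(len(ordered) * central_fraction))
    return set(ordered[:cut])


def sample_mixture(docs, central_hosts, n_docs, central_ratio=0.5, seed=0):
    """Sample n_docs documents with a target central/peripheral ratio."""
    rng = random.Random(seed)
    central = [d for d in docs if d["host"] in central_hosts]
    peripheral = [d for d in docs if d["host"] not in central_hosts]
    n_central = min(len(central), round(n_docs * central_ratio))
    n_peripheral = min(len(peripheral), n_docs - n_central)
    return rng.sample(central, n_central) + rng.sample(peripheral, n_peripheral)


# Toy host-level link graph (hypothetical hosts).
edges = [
    ("a.org", "hub.com"), ("b.org", "hub.com"), ("c.org", "hub.com"),
    ("hub.com", "wiki.org"), ("a.org", "wiki.org"), ("d.net", "a.org"),
]
scores = pagerank(edges)
central_hosts = split_hosts(scores, central_fraction=0.5)

# Two toy documents per host, then a 1:1 central/peripheral sample of six.
docs = [{"host": h, "doc_id": f"{h}/{i}"} for h in scores for i in range(2)]
mixture = sample_mixture(docs, central_hosts, n_docs=6, central_ratio=0.5)
```

Changing `central_ratio` shifts the mixture toward hubs or toward the fringe, which is the knob the method varies. At Common Crawl scale the centrality step would need a graph-processing library or a distributed job rather than pure Python loops.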

Topics

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.