Hubs or Fringes: Pretraining Data Selection via Web Graph Centrality

2026-06-09 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

WebGraphMix is a lightweight data selection framework for optimizing the pretraining data composition of large language models. Rather than relying on auxiliary classifiers or labeled data, which add computational overhead, it computes structural centrality scores over the Common Crawl host-level web graph. The framework varies the proportion of documents from central versus peripheral hosts, on the hypothesis that central hosts provide reusable abstractions while peripheral hosts offer specialized, long-tail knowledge. The authors integrated WebGraphMix into the DataComp-LM pipeline and trained models at 400M and 1B parameters on 8B and 28B tokens, respectively. A 1:1 central/peripheral mixture reached a 41.4% average score across 23 tasks, compared with 39.8% for uniform sampling. Combining the structural scores with document-level quality classifier scores raised the average further, to 43.8%. The results suggest that web graph topology carries a signal orthogonal to content-based quality measures for pretraining data curation.

Key takeaway

For machine learning engineers curating pretraining datasets for large language models, WebGraphMix offers an efficient, label-free way to improve model performance. Balancing documents from central and peripheral hosts by web graph centrality yielded a reported 41.4% average score with a 1:1 mix, versus 39.8% for uniform sampling. Consider combining this structural scoring with existing content-based quality filters: the reported combination reached 43.8%, and the graph scores themselves require no auxiliary classifier or labeled data.

Key insights

Host-level web graph centrality is an effective, low-cost signal for LLM pretraining data selection: central and peripheral hosts contribute complementary knowledge.

Principles

Central web hosts expose reusable abstractions.
Peripheral web hosts encode specialized knowledge.
Web graph topology offers orthogonal data curation insights.

Method

WebGraphMix computes structural centrality scores on the Common Crawl host-level web graph to vary the proportion of central versus peripheral documents in pretraining mixtures.
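The method above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not say which centrality measure WebGraphMix uses, so PageRank stands in here, and the toy host graph, the function names (`pagerank`, `split_by_centrality`, `mix_documents`), and the parameters (`central_fraction`, `central_share`) are all hypothetical.

```python
import random
from collections import defaultdict


def pagerank(edges, damping=0.85, iters=50):
    """Power-iteration PageRank over a directed host graph of (src, dst) pairs.

    Assumption: PageRank is a stand-in; the source only says "structural
    centrality scores" without naming a measure.
    """
    nodes = sorted({host for edge in edges for host in edge})
    out = defaultdict(list)
    for src, dst in edges:
        out[src].append(dst)
    n = len(nodes)
    rank = {host: 1.0 / n for host in nodes}
    for _ in range(iters):
        # Rank held by hosts with no outlinks is spread uniformly.
        dangling = sum(rank[h] for h in nodes if not out[h])
        new = {h: (1 - damping) / n + damping * dangling / n for h in nodes}
        for src in nodes:
            for dst in out[src]:
                new[dst] += damping * rank[src] / len(out[src])
        rank = new
    return rank


def split_by_centrality(scores, central_fraction=0.5):
    """Label the top-scoring fraction of hosts central, the rest peripheral."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    cut = max(1, int(len(ranked) * central_fraction))
    return set(ranked[:cut]), set(ranked[cut:])


def mix_documents(docs, central_hosts, central_share=0.5, n_samples=4, seed=0):
    """Sample (host, text) documents so that central_share come from central hosts."""
    rng = random.Random(seed)
    central = [d for d in docs if d[0] in central_hosts]
    peripheral = [d for d in docs if d[0] not in central_hosts]
    n_central = min(round(n_samples * central_share), len(central))
    n_peripheral = min(n_samples - n_central, len(peripheral))
    return rng.sample(central, n_central) + rng.sample(peripheral, n_peripheral)


# Toy host-level graph (hypothetical hosts): many hosts link to hub.com.
edges = [
    ("a.org", "hub.com"), ("b.org", "hub.com"), ("c.org", "hub.com"),
    ("hub.com", "wiki.org"), ("d.net", "wiki.org"), ("wiki.org", "hub.com"),
]
docs = [
    ("hub.com", "doc 1"), ("hub.com", "doc 2"), ("hub.com", "doc 3"),
    ("wiki.org", "doc 4"), ("wiki.org", "doc 5"),
    ("a.org", "doc 6"), ("b.org", "doc 7"), ("c.org", "doc 8"), ("d.net", "doc 9"),
]

scores = pagerank(edges)
central_hosts, peripheral_hosts = split_by_centrality(scores, central_fraction=0.5)
mixture = mix_documents(docs, central_hosts, central_share=0.5, n_samples=4)
```

Setting `central_share=0.5` corresponds to the 1:1 central/peripheral mixture described in the summary; other values shift the mixture toward hubs or toward the fringe.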

In practice

Mix documents from central and peripheral hosts at a 1:1 ratio (41.4% average versus 39.8% for uniform sampling).
Combine graph centrality with document quality scores.
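One way the two practices above might fit together is sketched below. The summary does not describe how the structural and quality scores are combined, so this is only an assumed scheme: documents are bucketed by host centrality, and each bucket is filled with its highest-quality documents. The `Doc` fields, `select_mixture`, `centrality_threshold`, and all score values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    host: str
    text: str
    quality: float     # hypothetical document-level classifier score in [0, 1]
    centrality: float  # hypothetical host-level centrality score in [0, 1]


def select_mixture(docs, budget, central_share=0.5, centrality_threshold=0.5):
    """Pick `budget` docs: split by centrality, then rank by quality per bucket.

    Assumption: this bucket-then-rank scheme is illustrative only; the source
    does not specify how the two scores are combined.
    """
    central = [d for d in docs if d.centrality >= centrality_threshold]
    peripheral = [d for d in docs if d.centrality < centrality_threshold]

    def best_first(bucket):
        return sorted(bucket, key=lambda d: d.quality, reverse=True)

    n_central = min(round(budget * central_share), len(central))
    n_peripheral = min(budget - n_central, len(peripheral))
    return best_first(central)[:n_central] + best_first(peripheral)[:n_peripheral]


# Toy corpus with made-up scores.
corpus = [
    Doc("hub.com", "c1", quality=0.9, centrality=0.8),
    Doc("hub.com", "c2", quality=0.2, centrality=0.8),
    Doc("wiki.org", "c3", quality=0.7, centrality=0.6),
    Doc("b.org", "p1", quality=0.8, centrality=0.1),
    Doc("c.org", "p2", quality=0.1, centrality=0.1),
    Doc("d.net", "p3", quality=0.6, centrality=0.2),
]

selected = select_mixture(corpus, budget=4, central_share=0.5)
selected_texts = [d.text for d in selected]
```

Keeping the centrality split separate from the quality ranking means either signal can be swapped out or reweighted without touching the other, which fits the summary's claim that the two are orthogonal.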

Topics

Pretraining Data Selection
Web Graph Centrality
Large Language Models
Common Crawl
Data Curation
Graph Topology

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.