How to Create an LLM Dataset | FineWeb Overview
Summary
Hugging Face's FineWeb project documents a detailed methodology for building large language model pre-training datasets, a process on which little public information exists. The project introduces two English datasets: FineWeb, a 15 trillion token dataset derived from 96 Common Crawl snapshots since 2013, and FineWeb-Edu, a filtered 1.3 trillion token subset. Key steps include extracting text from raw Common Crawl HTML with Trafilatura, applying base filtering for adult content and language, and deduplicating within each dump after an initial attempt at global deduplication proved detrimental. Additional quality filters, including C4-style and custom rules, were then applied. FineWeb-Edu goes further: Llama 3 70B Instruct scored 500,000 samples for educational value, and those scores were used to train a smaller classifier that filters the full dataset. Models trained on FineWeb-Edu outperform those trained on the larger FineWeb and on other open datasets such as RefinedWeb and C4. The project also observes an increase in Common Crawl data quality since 2022, possibly linked to the rise of LLM-generated synthetic data.
Key takeaway
For machine learning engineers building custom LLMs, the FineWeb process offers concrete guidance. Extract text yourself from raw HTML (WARC), and choose a deduplication strategy carefully, because aggressive global deduplication can degrade dataset quality. Train small models to evaluate each change to the dataset, and consider using a powerful LLM such as Llama 3 70B Instruct to bootstrap specialized data filters. As FineWeb-Edu shows, a smaller, higher-quality dataset can yield higher-performing models.
Key insights
High-quality LLM pre-training datasets require meticulous extraction, nuanced deduplication, and model-based evaluation.
Principles
- Extraction quality significantly impacts dataset utility.
- More deduplication is not always better; aggressive global deduplication can lower dataset quality.
- Evaluate dataset quality by training small models on it, not by heuristics alone.
Method
The FineWeb pipeline involves Common Crawl WARC extraction via Trafilatura, base filtering, within-dump MinHash deduplication, and custom quality filters. FineWeb-Edu adds LLM-based scoring (Llama 3 70B Instruct) to train a smaller classifier for educational content selection.
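The within-dump deduplication step can be illustrated with a small sketch. This is not the FineWeb implementation: it is a simplified MinHash with banding that keeps the first document of each near-duplicate group, and the shingle size, number of hashes, band layout, and dump labels are all assumptions chosen for illustration.

```python
import hashlib

# Illustrative parameters (assumptions, not taken from this summary).
NGRAM = 5            # words per shingle
NUM_HASHES = 112     # length of each MinHash signature
BANDS = 14           # signature is split into 14 bands
ROWS = NUM_HASHES // BANDS  # 8 hash values per band


def shingles(text, n=NGRAM):
    """Return the set of word n-grams in a document."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def seeded_hash(seed, shingle):
    """64-bit hash of a shingle; each seed acts as a separate hash function."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text):
    """Minimum hash value over all shingles, for each hash function."""
    doc_shingles = shingles(text)
    return tuple(
        min(seeded_hash(seed, s) for s in doc_shingles)
        for seed in range(NUM_HASHES)
    )


def dedup_within_dump(docs):
    """Drop a document if any band of its signature matches a kept one."""
    seen_bands = set()
    kept = []
    for doc in docs:
        sig = minhash_signature(doc)
        bands = {(b, sig[b * ROWS:(b + 1) * ROWS]) for b in range(BANDS)}
        if bands & seen_bands:
            continue  # near-duplicate of an earlier document in this dump
        seen_bands |= bands
        kept.append(doc)
    return kept


def dedup_per_dump(dumps):
    """Deduplicate each dump independently; nothing is shared across dumps."""
    return {name: dedup_within_dump(docs) for name, docs in dumps.items()}


if __name__ == "__main__":
    page = "the quick brown fox jumps over the lazy dog today"
    dumps = {
        "dump_a": [page, page, "a different page about baking bread at home"],
        "dump_b": [page],
    }
    for name, docs in dedup_per_dump(dumps).items():
        print(name, len(docs))
```

Because the set of seen bands is rebuilt for every dump, a page repeated inside one dump is removed, while the same page appearing in a different dump is kept. A global strategy would instead share that set across all dumps.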
In practice
- Extract text from raw HTML (WARC files) with a tool like Trafilatura.
- Implement within-dump deduplication to preserve valuable older data.
- Use large LLMs to bootstrap smaller, scalable data quality classifiers.
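The last point, bootstrapping a cheap classifier from LLM annotations, can be sketched as follows. Everything here is a toy under stated assumptions: the annotated samples are invented stand-ins for scores a large model would produce, the 0 to 5 scale and the threshold of 3 are assumed values, and the word-averaging model is a deliberately crude substitute for a real learned classifier.

```python
from collections import defaultdict

# Hypothetical LLM annotations: (text, educational score on a 0-5 scale).
# In a real pipeline these would come from prompting a large model.
ANNOTATED = [
    ("photosynthesis converts light energy into chemical energy in plants", 5),
    ("the theorem is proved by induction on the number of vertices", 4),
    ("buy cheap watches now limited offer click here", 0),
    ("click here to win a free prize now", 0),
]


def tokenize(text):
    return text.lower().split()


def train_word_scores(annotated):
    """For each word, average the LLM score of the documents containing it."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for text, score in annotated:
        for word in set(tokenize(text)):
            totals[word] += score
            counts[word] += 1
    return {word: totals[word] / counts[word] for word in totals}


def predict(word_scores, text, default):
    """Predicted score: mean word score, with a default for unseen words."""
    words = tokenize(text)
    if not words:
        return default
    return sum(word_scores.get(w, default) for w in words) / len(words)


def filter_corpus(corpus, word_scores, default, threshold=3.0):
    """Keep only documents whose predicted score reaches the threshold."""
    return [d for d in corpus if predict(word_scores, d, default) >= threshold]


if __name__ == "__main__":
    scores = train_word_scores(ANNOTATED)
    # Unseen words fall back to the mean annotated score.
    mean_score = sum(s for _, s in ANNOTATED) / len(ANNOTATED)
    corpus = ["plants store chemical energy", "click here now"]
    print(filter_corpus(corpus, scores, mean_score))
```

The design point is the cost split: the expensive model labels a small sample once, and the cheap model it trains is the one that runs over the full corpus.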
Topics
- LLM Pre-training
- Dataset Creation
- Common Crawl
- Data Deduplication
- Data Filtering
- FineWeb Dataset
- Synthetic Data
Best for: AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by HuggingFace.