How to Create an LLM Dataset | FineWeb Overview

Source: Hugging Face · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics) · Depth: Advanced, extended

Summary

Hugging Face's FineWeb project documents a detailed methodology for building large language model pre-training datasets, a process for which little public information exists. The project introduces two English datasets: FineWeb, a 15-trillion-token dataset derived from 96 Common Crawl snapshots dating back to 2013, and FineWeb-Edu, a more heavily filtered 1.3-trillion-token subset. Key steps include extracting text from the raw HTML with Trafilatura, applying base filtering to remove adult content and non-English text, and deduplicating within each dump after an initial global deduplication attempt proved detrimental. Additional quality filters, including C4-style and custom rules, were then applied. For FineWeb-Edu, Llama 3 70B Instruct scored 500,000 samples for educational value, and those annotations were used to train a smaller classifier that filters the full dataset. Models trained on the result outperform those trained on the larger FineWeb and on other open datasets such as RefinedWeb and C4. The project also notes an observed increase in Common Crawl data quality since 2022, potentially linked to the rise of LLM-generated synthetic data.

Key takeaway

For machine learning engineers building custom LLMs, the FineWeb process offers several practical lessons. Extract text from the raw Common Crawl WARC data, and choose a deduplication strategy carefully, because aggressive global deduplication can degrade dataset quality. Use small models to evaluate each dataset change iteratively, and consider using a powerful LLM such as Llama 3 70B Instruct to bootstrap specialized data filters. As FineWeb-Edu shows, a smaller, higher-quality dataset can yield higher-performing models.
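The filter-bootstrapping idea can be sketched as a simple threshold over classifier scores. This is a minimal illustration, not the project's code: `filter_educational` and `toy_score` are hypothetical names, the 0 to 5 scale and the cut-off of 3 are assumptions not stated in this summary, and the keyword heuristic merely stands in for a classifier trained on LLM annotations.

```python
from typing import Callable, Iterable, Iterator

# Assumed cut-off on an assumed 0-5 educational-value scale.
EDU_THRESHOLD = 3.0


def filter_educational(
    docs: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = EDU_THRESHOLD,
) -> Iterator[str]:
    """Yield only documents whose score reaches the threshold."""
    for doc in docs:
        if score_fn(doc) >= threshold:
            yield doc


def toy_score(text: str) -> float:
    """Hypothetical stand-in for a trained classifier.

    Counts a few 'textbook-like' keywords; a real pipeline would call
    a model trained on LLM-provided annotations instead.
    """
    keywords = ("theorem", "definition", "example", "exercise")
    hits = sum(word in text.lower() for word in keywords)
    return min(5.0, 1.0 + 2.0 * hits)


if __name__ == "__main__":
    corpus = [
        "Definition and example of a theorem in number theory.",
        "Buy cheap watches now, limited offer!",
    ]
    for kept in filter_educational(corpus, toy_score):
        print(kept)
```

Passing the scorer as a callable keeps the filtering loop independent of whichever model produces the scores, so the heuristic can be swapped for a real classifier without touching the filter.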

Key insights

High-quality LLM pre-training datasets require meticulous extraction, nuanced deduplication, and model-based evaluation.

Method

The FineWeb pipeline involves Common Crawl WARC extraction via Trafilatura, base filtering, within-dump MinHash deduplication, and custom quality filters. FineWeb-Edu adds LLM-based scoring (Llama 3 70B Instruct) to train a smaller classifier for educational content selection.
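To make within-dump MinHash deduplication concrete, here is a rough standard-library sketch. All names (`shingles`, `signature`, `dedup_within_dump`, `dedup_per_dump`) are hypothetical, and the parameters (5-word shingles, 112 hashes split into 14 bands of 8) are assumed values for illustration. It greedily drops any document that shares a band with an already-kept document, which is simpler than the clustering a production pipeline would use.

```python
import hashlib
import re

# Assumed illustrative parameters: 112 hashes = 14 bands x 8 rows.
BANDS, ROWS = 14, 8
NUM_HASHES = BANDS * ROWS


def shingles(text: str, n: int = 5) -> set:
    """Return the set of lowercase word n-grams of a document."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(seed: int, shingle: str) -> int:
    """Seeded 64-bit hash; each seed acts as one MinHash function."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def signature(text: str) -> tuple:
    """MinHash signature: minimum hash per seed over all shingles."""
    grams = shingles(text)
    return tuple(min(_hash(seed, g) for g in grams)
                 for seed in range(NUM_HASHES))


def dedup_within_dump(docs: list) -> list:
    """Keep a document only if no band matches a kept document."""
    seen_bands = set()
    kept = []
    for doc in docs:
        sig = signature(doc)
        keys = {(b, sig[b * ROWS:(b + 1) * ROWS]) for b in range(BANDS)}
        if keys & seen_bands:
            continue  # near-duplicate of something already kept
        seen_bands |= keys
        kept.append(doc)
    return kept


def dedup_per_dump(dumps: dict) -> dict:
    """Deduplicate each dump independently, never across dumps."""
    return {name: dedup_within_dump(docs) for name, docs in dumps.items()}
```

Because the band index is rebuilt for every dump, a document repeated in two different snapshots survives in both, while repeats inside one snapshot are removed. That is the distinction between within-dump and global deduplication.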

Best for: AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by HuggingFace.