How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data
Summary
Hugging Face researchers conducted a systematic study on synthesizing high-quality pretraining data for large language models (LLMs), generating over one trillion tokens to compare rephrasing strategies, generator models, and source data. Their findings indicate that structured output formats such as tables, math problems, FAQs, and tutorials consistently outperform existing synthetic methods and curated web baselines. They also found that scaling the generator model beyond 1 billion parameters yields negligible gains for most rephrasing tasks, and that the choice of original "mix-in" data significantly affects performance. Applying these insights, the team developed FinePhrase, a 486-billion-token open dataset of rephrased web text, which surpasses all current synthetic data baselines while reducing generation costs by up to 30 times. The dataset, prompts, and generation framework are openly available.
Key takeaway
AI engineers and research scientists developing LLMs should focus on crafting diverse, structured rephrasing prompts rather than relying on larger generator models. According to the study, combining synthetic data in pedagogical formats (such as math problems or tables) with high-quality original web text improves model performance and significantly reduces compute costs, enabling more efficient pretraining pipelines.
Key insights
Structured pedagogical formats and diverse outputs, not larger generator models, are the key to high-quality synthetic pretraining data.
Principles
- Rephrasing prompt design is the dominant factor for downstream performance.
- Generator models beyond 1B parameters offer negligible gains for most rephrasing tasks.
- Output diversity is more critical than formatting consistency.
Method
The study systematically evaluates synthetic data generation by ablating three factors: the rephrasing strategy, the generator model's scale and architecture, and the composition of source and mix-in data. Each configuration is validated by training a 1.2B-parameter LLM on 21B tokens.
In practice
- Prioritize structured pedagogical prompts (e.g., math, table, FAQ, tutorial).
- Use smaller, efficient generator models (e.g., 1B-parameter class) for cost-effective synthesis.
- Always mix synthetic data with original web text to preserve natural language understanding (NLU) capabilities.
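The recommendations above can be sketched in a few lines of Python. This is a minimal illustration, not the FinePhrase pipeline: the prompt wordings in PROMPT_TEMPLATES, the helper names build_prompt and mix_documents, the synthetic_fraction parameter, and document-level sampling are all assumptions made for this example. The study's actual prompts and mixing ratios are in the released materials and are not reproduced here.

```python
import random

# Hypothetical prompt templates, one per structured format named above.
# The wording is illustrative only and is NOT taken from the study.
PROMPT_TEMPLATES = {
    "math": "Rewrite the text below as math word problems with worked solutions.\n\nText:\n{document}",
    "table": "Rewrite the text below as a table, followed by a short explanation.\n\nText:\n{document}",
    "faq": "Rewrite the text below as a list of questions and answers.\n\nText:\n{document}",
    "tutorial": "Rewrite the text below as a step-by-step tutorial.\n\nText:\n{document}",
}


def build_prompt(document: str, style: str) -> str:
    """Fill the chosen template with a source document."""
    if style not in PROMPT_TEMPLATES:
        raise ValueError(f"unknown style: {style!r}")
    return PROMPT_TEMPLATES[style].format(document=document)


def mix_documents(synthetic, original, synthetic_fraction, n_samples, seed=0):
    """Sample a training mix of synthetic and original documents.

    Each sample comes from the synthetic pool with probability
    `synthetic_fraction` (an assumed knob, not a value from the study)
    and from the original pool otherwise.
    """
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must be between 0 and 1")
    rng = random.Random(seed)  # seeded so the mix is reproducible
    mixed = []
    for _ in range(n_samples):
        pool = synthetic if rng.random() < synthetic_fraction else original
        mixed.append(rng.choice(pool))
    return mixed
```

In use, a small generator model would be called once per (document, style) pair with the output of build_prompt, and the rephrased outputs would then be passed to mix_documents together with the original web text. Spreading documents across several styles is one simple way to pursue the output diversity the study emphasizes.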
Topics
- Synthetic Data Generation
- LLM Pretraining Data
- Prompt Design Strategies
- Generator Model Scale
- FinePhrase Dataset
Code references
- huggingface/finephrase
- huggingface/nanotron
- huggingface/datatrove
- huggingface/lighteval
- ibm-granite/granite-3.0-language-models
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.